Operating systems evolve all the time, and ESXi is no exception. The easiest way to keep up with the latest and greatest is to perform upgrades. Each upgrade, unfortunately, carries a certain risk that we will be affected by new bugs that were not discovered before release.
I want to present one such case, in which an upgrade can, under certain circumstances, disrupt the availability of the environment.
Let's take as an example the upgrade of an ESXi host to version 6.5 and one of the VMware Knowledge Base articles related to it (KB2151749).
Nowadays 10Gb NICs are used everywhere, as they provide great bandwidth, lower cost, increased scalability, and simplified management.
Upgrading to ESXi version 6.5 can cause a host failure with a PSOD (Purple Screen of Death). The simple combination of these two factors, a 10Gb NIC and ESXi 6.5, can lead to an outage with the following backtrace:
2017-09-16T15:34:30.908Z cpu6:65645)@BlueScreen: #PF Exception 14 in world 65645:HELPER_UPLIN IP 0x41802c496258 addr 0x0 PTEs:0x292379a027;0x2efe54c027;0xbfffffffff001;
2017-09-16T15:34:30.908Z cpu6:65645)Code start: 0x41802c200000 VMK uptime: 4:02:26:10.151
2017-09-16T15:34:30.908Z cpu6:65645)0x4390c369bd00:[0x41802c496258]UplinkTreePackQueueFilters@vmkernel#nover+0x188 stack: 0xe15427000
2017-09-16T15:34:30.909Z cpu6:65645)0x4390c369bd90:[0x41802c49e142]UplinkLB_LoadBalanceCB@vmkernel#nover+0x1e42 stack: 0x1
2017-09-16T15:34:30.909Z cpu6:65645)0x4390c369bf20:[0x41802c4916f2]UplinkAsyncProcessCallsHelperCB@vmkernel#nover+0x116 stack: 0x43048761eac0
2017-09-16T15:34:30.910Z cpu6:65645)0x4390c369bf50:[0x41802c2c9e0d]helpFunc@vmkernel#nover+0x3c5 stack: 0x4300b9b2a050
2017-09-16T15:34:30.910Z cpu6:65645)0x4390c369bfe0:[0x41802c4c91b5]CpuSched_StartWorld@vmkernel#nover+0x99 stack: 0x0
2017-09-16T15:34:30.913Z cpu6:65645)base fs=0x0 gs=0x418041800000 Kgs=0x0
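To compare a PSOD like this with the signature quoted in a KB, it helps to reduce the backtrace to its list of function names. The following is a minimal sketch in Python, assuming the frame layout seen in the log above ([address]Function@module+0xoffset). The names Frame, FRAME_RE and parse_frames are made up for this illustration and are not part of any VMware tool.

```python
import re
from typing import NamedTuple


class Frame(NamedTuple):
    function: str
    module: str
    offset: int


# Matches frames such as:
#   [0x41802c496258]UplinkTreePackQueueFilters@vmkernel#nover+0x188
# The layout is inferred from the backtrace above (an assumption).
FRAME_RE = re.compile(r"\[0x[0-9a-f]+\](\w+)@([\w#]+)\+0x([0-9a-f]+)")


def parse_frames(log_text: str) -> list[Frame]:
    """Return the stack frames found in a PSOD backtrace, in log order."""
    return [
        Frame(m.group(1), m.group(2), int(m.group(3), 16))
        for m in FRAME_RE.finditer(log_text)
    ]


# Two frames copied from the backtrace above.
SAMPLE = (
    "2017-09-16T15:34:30.908Z cpu6:65645)0x4390c369bd00:[0x41802c496258]"
    "UplinkTreePackQueueFilters@vmkernel#nover+0x188 stack: 0xe15427000 "
    "2017-09-16T15:34:30.909Z cpu6:65645)0x4390c369bd90:[0x41802c49e142]"
    "UplinkLB_LoadBalanceCB@vmkernel#nover+0x1e42 stack: 0x1"
)

frames = parse_frames(SAMPLE)
for frame in frames:
    print(f"{frame.function} ({frame.module}) +{frame.offset:#x}")
```

The topmost frame, UplinkTreePackQueueFilters, is usually the most useful search term when looking for a matching KB.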
Why can such a common and simple combination interrupt the availability of the ESXi host?
According to the KB, “This issue occurs because the Netqueue commit phase abruptly stops due to a failure of hardware activation of a Rx queue. As a result, the Internal data structure of the Netqueue layer's could go out of sync causing a host PSOD.”
OK, we can see that NetQueue is the deciding factor of this outage.
10Gb NICs provide lots of benefits, but they also have a downside, which comes from the way they are typically configured: most of the time a NIC is used as a shared resource (it is unlikely that each VM on the host gets its own 10Gb NIC).
This is where a queuing mechanism is needed. NetQueue delivers network traffic to the system in multiple receive queues that can be processed separately, allowing processing to be scaled across multiple CPUs and improving receive-side networking performance. This eliminates the bottleneck, as each vNIC has its own queue.
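The idea can be illustrated with a toy model in Python. This is only a conceptual sketch, not how the VMkernel or the NIC implements NetQueue: it assumes packets are steered by destination MAC address, and the function names steer_to_queues and assign_queues_to_cpus are invented for the example.

```python
from collections import defaultdict, deque


def steer_to_queues(packets):
    """Group (dst_mac, payload) packets into one receive queue per destination MAC."""
    queues = defaultdict(deque)
    for dst_mac, payload in packets:
        queues[dst_mac].append(payload)
    return dict(queues)


def assign_queues_to_cpus(queues, num_cpus):
    """Spread the queues over CPUs round-robin so they can be drained in parallel."""
    return {mac: index % num_cpus for index, mac in enumerate(sorted(queues))}


# Example traffic for two vNICs (MAC addresses are made up).
packets = [
    ("00:50:56:aa:00:01", "pkt-1"),
    ("00:50:56:aa:00:02", "pkt-2"),
    ("00:50:56:aa:00:01", "pkt-3"),
]

queues = steer_to_queues(packets)
cpu_for_queue = assign_queues_to_cpus(queues, num_cpus=2)

for mac, queue in queues.items():
    print(f"vNIC {mac}: {list(queue)} handled by CPU {cpu_for_queue[mac]}")
```

With a single shared queue, one CPU would have to process all three packets in sequence; with one queue per vNIC, the traffic of each vNIC can be handled independently.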
I found a nice article explaining in detail how NetQueue works, and I suggest you read it: "Using VMware NetQueue to virtualize high-bandwidth servers" by George Crump. I want to thank George for such a detailed presentation.
Fixes and workarounds
So let’s go back to our KB. We know the problem, but how can we fix it?
VMware's recommendation is to apply ESXi 6.5 P02 or to use one of the two workarounds described in the KB.
Since neither workaround looks optimal, it is probably time for another update of the hosts, this time to ESXi 6.5 P02.
But can we expect that P02 works perfectly and that no other issues will be encountered? Hard to tell, as problems might arise from a simple combination of components, as we have seen above, or from multiple factors.
There are many KBs about known issues that might affect our environments, but it is hard to discover them in time to apply the resolution.
Runecast Analyzer is a great help in finding such issues described in Knowledge Base articles.
The PSOD described in this article could easily be avoided if we knew about it in advance, that is, if someone could match our environment's configuration against the KB articles and point out where problems might occur.
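To make the idea of matching configurations against KBs concrete, here is a small sketch in Python. It is not how Runecast Analyzer is implemented. The host fields (esxi_version, p02_applied, netqueue_enabled, max_nic_speed_gbps) and the rule condition are assumptions derived only from what this article says about KB2151749.

```python
def kb2151749_applies(host):
    """Hypothetical rule: ESXi 6.5 without P02, NetQueue enabled, 10Gb (or faster) NIC."""
    return (
        host["esxi_version"] == "6.5"
        and not host["p02_applied"]
        and host["netqueue_enabled"]
        and host["max_nic_speed_gbps"] >= 10
    )


# Known-issue rules keyed by KB number; a real catalogue would hold many more.
KNOWN_ISSUES = {"KB2151749": kb2151749_applies}


def find_affected(hosts):
    """Return (host name, KB id) pairs for every rule that matches a host."""
    findings = []
    for host in hosts:
        for kb_id, applies in KNOWN_ISSUES.items():
            if applies(host):
                findings.append((host["name"], kb_id))
    return findings


# Example inventory (made-up hosts).
hosts = [
    {"name": "esx01", "esxi_version": "6.5", "p02_applied": False,
     "netqueue_enabled": True, "max_nic_speed_gbps": 10},
    {"name": "esx02", "esxi_version": "6.5", "p02_applied": True,
     "netqueue_enabled": True, "max_nic_speed_gbps": 10},
    {"name": "esx03", "esxi_version": "6.0", "p02_applied": False,
     "netqueue_enabled": True, "max_nic_speed_gbps": 10},
]

findings = find_affected(hosts)
for name, kb_id in findings:
    print(f"{name} may be affected by {kb_id}")
```

The value of such a check lies in running it before the upgrade window, so that the patch or a workaround can be planned instead of being discovered after a PSOD.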
Runecast provides proactive fault avoidance to minimize the risk of virtualized datacenter downtime and security breaches. The Analyzer is an automated system that correlates VMware vSphere and vSAN configurations and logs with the official VMware repository of known issues, best practices and security auditing rules.
Data Scientist Engineer