The hostd process on your ESXi 5.5 or 6.0 hosts may start crashing repeatedly if you configure a VDS (vSphere Distributed Switch) with 4 or more physical NICs. The problem is documented in VMware KB55845.
The problem affects ESXi 5.5 hosts prior to build 2403361 (ESXi550-201501001) and ESXi 6.0 hosts prior to build 3380124 (Update 1b).
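On the host itself, `vmware -vl` prints the version and build number (e.g. `VMware ESXi 6.0.0 build-2494585`). A hypothetical helper like `is_fixed_build` below (not part of ESXi, just a sketch) compares that build number against the first fixed builds from the KB:

```shell
# Hypothetical helper: given a version string and build number taken
# from `vmware -vl` on the host, check against the first fixed builds
# (5.5: 2403361, 6.0: 3380124).
is_fixed_build() {
  version="$1"
  build="$2"
  case "$version" in
    5.5*) [ "$build" -ge 2403361 ] ;;
    6.0*) [ "$build" -ge 3380124 ] ;;
    *)    return 0 ;;   # other releases are not affected by this issue
  esac
}

# e.g. for "VMware ESXi 6.0.0 build-2494585":
is_fixed_build "6.0.0" 2494585 && echo "patched" || echo "vulnerable"   # prints vulnerable
```

Build 2494585 (6.0 GA) predates 3380124, so the example host above would still be exposed.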
Nowadays, with 10 Gbit adapters common in ESXi host configurations, you might be running a 2 pNIC configuration. However, in larger ESXi hosts, or due to specific design constraints, running virtual distributed switches with 4 or more physical NICs is still quite common.
Hostd crashing sounds like a serious problem to have. Let’s take a look at what hostd does and what the impact could be.
The hostd service (also referred to as vmware-hostd, the hostd management agent, or the host management service) runs directly on the ESXi host and is responsible for many of its operations:
- Creating, migrating, and powering on virtual machines
- Keeping track of registered virtual machines, their status, and the storage volumes visible to the ESXi host
- Communicating with other agents, such as vpxa (used by vCenter) and FDM (used for HA)
- Handling direct connections to the ESXi host, via a GUI client or the CLI
In short, hostd is the main communication channel to the VMkernel.
The good news is that if hostd is down, your VMs will keep running. The bad news is that your ESXi host will become practically unmanageable: all of the operations mentioned above will be unavailable. Considering that a distributed switch is typically used by many ESXi hosts with the same configuration (4 or more NICs), you may face manageability issues across many of your ESXi hosts at once. The affected hosts will disconnect from vCenter and remain unmanageable even if you try connecting to them directly.
If you troubleshoot further, you will spot messages similar to the following in hostd.log:
```
YYYY-MM-DDTHH:MM:SS.SSSZ info hostd[FFE16A80] [Originator@6876 sub=Default] hostd-118622-3073146.txt time=YYYY-MM-DD HH:MM:SS.000
--> Crash Report build=3073146
--> Signal 11 received, si_code 1, si_errno 0
--> Bad access at 0
```
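A quick way to see whether hostd has been crashing is simply to count crash reports in the log. A minimal sketch (on an ESXi host you would point it at `/var/log/hostd.log`; the `count_hostd_crashes` helper name is my own):

```shell
# Sketch: count hostd crash reports in a log file. On an ESXi host
# you would run this against /var/log/hostd.log.
count_hostd_crashes() {
  grep -c "Crash Report" "$1"
}

# Demonstrate with a small sample excerpt (one crash among normal lines):
printf '%s\n' \
  'info hostd[FFE16A80] [Originator@6876 sub=Default] heartbeat' \
  '--> Crash Report build=3073146' \
  '--> Signal 11 received, si_code 1, si_errno 0' > /tmp/hostd-sample.log
count_hostd_crashes /tmp/hostd-sample.log   # prints 1
```

A count that keeps climbing over a short window is a strong hint you are hitting this (or some other) recurring hostd crash.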
The described problem affects fairly early releases of ESXi 5.5 and 6.0. Especially considering that ESXi 5.5 reaches End of General Support in September 2018, chances are you are no longer running 5.5 in your environment, or even 6.0. In any case, VMware's general recommendation is to upgrade to 6.5 or 6.7; to avoid this particular issue, update to at least 6.0 Update 1b.
Just in case, if you have distributed switches with 4 or more NICs, make sure to review your host versions and build numbers again.
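To spot the risky switches, `esxcli network vswitch dvs vmware list` on a host shows each distributed switch along with its uplinks. A small sketch, assuming the output lists uplinks as a comma-separated `Uplinks:` field (the `count_dvs_uplinks` helper is my own, not an esxcli command):

```shell
# Sketch: print the uplink count per distributed switch, parsing the
# assumed "Uplinks: vmnicA,vmnicB,..." field from
# `esxcli network vswitch dvs vmware list` output fed in on stdin.
count_dvs_uplinks() {
  awk -F': *' '/^ *Uplinks:/ { print split($2, a, ",") }'
}

# On the host you would run:
#   esxcli network vswitch dvs vmware list | count_dvs_uplinks
# Demonstrate with a sample line in the assumed format:
echo '   Uplinks: vmnic0,vmnic1,vmnic2,vmnic3' | count_dvs_uplinks   # prints 4
```

Any switch that prints 4 or more on an unpatched 5.5/6.0 host is a candidate for this issue.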
One easy way to verify whether you are affected by this and other issues from the VMware Knowledge Base is to run Runecast Analyzer.