One of the most severe issues in VMware environments is the Purple Screen of Death (PSOD). When this happens, your ESXi host “dies” and takes with it all the VMs and services running on it. Most VMware admins have experienced it at least once and know that the troubleshooting will almost certainly land on them, and that it usually ends with finding the relevant VMware Knowledge Base (KB) article.
At Runecast, we regularly analyze the entire VMware Knowledge Base (kb.vmware.com) which consists of more than 30,000 articles. We also monitor Twitter feeds, blogs and forums to discover any news of critical or newly reported issues.
Runecast are knowledge automation pioneers. Our AI-powered algorithms extract actionable insights from different sources of written human knowledge to proactively make virtualized infrastructures more resilient, secure and efficient.
Our engineers (all of whom are VCAP-DCA certified and vExperts) and advanced systems have analyzed and classified this huge repository of articles, so we are extremely familiar with PSOD issues. There are more than 83 KB articles mentioning PSOD in Runecast Analyzer’s database; here are 5 that stand out:
1. KB2146388 - ESXi host fails with PSOD when using Intel Xeon Processor E5 v4, E7 v4, and D-1500 families
Updated: Aug 21, 2017 | Total Views: 10656
Affects: VMware ESXi - 5.5, 6.0, 6.5
Cause: Known issue with specific CPU families
Fix: Upgrade the system BIOS (firmware) to apply the microcode patch
Issue Description: This issue is the most pervasive, since it affects all the currently supported ESXi versions and a large number of CPU models. Even if you are up to date with the host patches you may not be aware of this issue until it happens. Additionally, the fix involves manual remediation actions. When this issue was first discovered, vendors did their best at bundling the firmware updates, but in the meantime outages were happening.
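As a rough illustration of how you might flag hosts in the affected CPU families from an inventory export, here is a minimal Python sketch. The regular expressions and the sample model strings are assumptions made for this example and are not taken from the KB; always confirm against the KB article and your hardware vendor's advisory.

```python
import re

# Hypothetical patterns for the CPU families named in KB2146388.
# The model-string formats below are assumptions, not taken from the KB.
AFFECTED_PATTERNS = [
    re.compile(r"\bE5-\d{4}[A-Z]*\s+v4\b"),  # Xeon E5 v4 family
    re.compile(r"\bE7-\d{4}[A-Z]*\s+v4\b"),  # Xeon E7 v4 family
    re.compile(r"\bD-15\d{2}[A-Z]*\b"),      # Xeon D-1500 family
]


def is_affected_cpu(model: str) -> bool:
    """Return True if the CPU model string matches one of the families above."""
    return any(pattern.search(model) for pattern in AFFECTED_PATTERNS)


if __name__ == "__main__":
    # Sample inventory; host names and model strings are made up.
    inventory = {
        "esx01": "Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz",
        "esx02": "Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz",
        "esx03": "Intel(R) Xeon(R) CPU D-1541 @ 2.10GHz",
    }
    for host, model in inventory.items():
        status = "check BIOS/microcode" if is_affected_cpu(model) else "not in listed families"
        print(f"{host}: {status}")
```

A match only tells you the host is worth checking; whether it is actually exposed depends on the BIOS and microcode level, which this sketch does not inspect.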
2. KB2151749 - ESXi host fails with PSOD after upgrading to 6.5
Updated: Dec 26, 2017 | Total Views: 7355
Affects: VMware ESXi - 6.5
Cause: Netqueue bug when using 10Gb NIC
Fix: Upgrade host to P02 (ESXi-6.5.0-20171204001-standard)
Issue Description: This was a nasty surprise for anyone who had just upgraded to 6.5. It was fixed in December 2017 with ESXi 6.5 Patch 02, so if you are running that patch level or later, you are safe. The KB also mentions a workaround, but it is not recommended because it can impact performance, as described by our Data Scientist Engineer Ionut Radu in his recent blog article.
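If you track the image profile name of each host, a check against the Patch 02 profile named in the fix above could look like the following Python sketch. It assumes that profile names follow the pattern ESXi-<version>-<YYYYMMDD><sequence>-<suffix> and that a later date stamp within the same release line means a later patch; both are assumptions for illustration, and names that do not fit the pattern are rejected rather than guessed at.

```python
import re

# Image profile that contains the fix, as listed for KB2151749.
FIXED_PROFILE = "ESXi-6.5.0-20171204001-standard"

# Assumed layout: ESXi-<version>-<YYYYMMDD><sequence>-<suffix>
_PROFILE_RE = re.compile(r"^ESXi-(\d+\.\d+\.\d+)-(\d{8})(\d+)-")


def parse_profile(name: str):
    """Split a profile name into (version, date stamp, sequence number)."""
    match = _PROFILE_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognized image profile name: {name!r}")
    return match.group(1), int(match.group(2)), int(match.group(3))


def has_fix(profile: str, fixed: str = FIXED_PROFILE) -> bool:
    """Return True if `profile` is at or after `fixed` in the same release line."""
    version, date, seq = parse_profile(profile)
    fixed_version, fixed_date, fixed_seq = parse_profile(fixed)
    if version != fixed_version:
        raise ValueError(f"{profile!r} is not in the {fixed_version} release line")
    return (date, seq) >= (fixed_date, fixed_seq)


if __name__ == "__main__":
    # The first name is an illustrative older profile, the second is the fix itself.
    for name in ("ESXi-6.5.0-20170702001-standard", FIXED_PROFILE):
        print(name, "->", "has fix" if has_fix(name) else "needs patching")
```

Raising an error for unexpected names is deliberate: a silent "safe" verdict on a profile the parser does not understand would be worse than no check at all.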
3. KB2149592 - ESXi IO connectivity issues or PSOD with VT-d interrupt remapper disabled
Updated: Sep 14, 2017 | Total Views: 2301
Affects: VMware ESXi - 5.5, 6.0, 6.5
Cause: Intel VT-d interrupt remapper bug
Fix: Depending on situation: host patches, BIOS patches or workaround.
Issue Description: There is a lot of confusion around this issue, as several VMware KB articles (KB2147325, KB1030265, KB2149043, KB2149592) recommend different and even conflicting fixes and workarounds, depending on vendor and CPU family. To make sure you are not impacted, you need a thorough understanding of the issue, correlated with data from your environment: ESXi patch level, CPU family and, for HPE servers, even the slot location of PCIe cards. Constantin Ivanov has a great article covering the details of this issue.
4. KB2147271 - PSOD on ESXi while vMotion: PF Exception 14 in world xxxxxxx :vmotionRecvH
Updated: May 24, 2018 | Total Views: 1331
Affects: VMware ESXi - 5.5, 6.0
Cause: Race condition during specific DVFilter operations
Fix: ESXi host patch.
Issue Description: We’ve seen this one in our test environments, where we run many different combinations of vSphere builds. It appears to be caused by a race condition during vMotion in the DVFilterCheckpointGet and DVFilterDestroyFilter functions.
5. KB2150280 - ESXi host fails with purple screen error: "NOT_IMPLEMENTED bora/vmkernel/filesystems/devfs/devfs.c:2655"
Updated: Jul 25, 2017 | Total Views: 651
Affects: VMware ESXi - 6.0
Cause: devfs heap can fill up when there are too many storage devices
Fix: Change host advanced setting
Issue Description: This issue can affect all builds of ESXi 6.0.x regardless of the patch level, since there is no corrective patch in place. The fix is a manual action: changing a host advanced setting on each 6.0 host that has a very large number of storage devices. While this setup is not the most common, I still wanted to include it in this list because it shows how important it is to be proactively prepared.
This list could easily be extended as there are many ESXi build and driver/firmware combinations that can lead to a PSOD.
The best way to deal with a PSOD, or with any other issue for that matter, is to prevent it from happening in the first place. That’s what we do at Runecast, so that this can be achieved in your environment automatically and continuously. We discover the combinations that are known to cause issues and proactively expose them to admins and engineers so they can act before an outage occurs. Register now and download our trial; it takes just 2 minutes to deploy and start getting value!
About the author:
Aylin Sali (Runecast CTO)
Aylin Sali is a virtualization and cloud enthusiast with more than 10 years of IT experience and an overwhelming desire for automation. He is a VCAP DCA & DCD and 5x vExpert.
Aylin is on Twitter as: @V4Virtual