Aylin Sali

TL;DR

The most troublesome aspect of a PSOD is the anxiety it creates and the way it makes you lose trust in your infrastructure. Until you solve the root cause, the thought that it can happen again, or on another server, can keep you up at night.

Use Runecast Analyzer (Free Trial) to check if any of your hosts are affected by conditions that can cause the VMware purple screen of death.

What is PSOD?

PSOD stands for Purple Screen of Diagnostics, often referred to as the Purple Screen of Death, a play on the better-known Blue Screen of Death encountered on Microsoft Windows.

It’s a diagnostic screen displayed by VMware ESXi when the kernel detects a fatal error from which it either cannot safely recover or cannot continue to run without a much higher risk of major data loss.

It shows the memory state at the time of the crash, along with additional details that are important in troubleshooting its cause: ESXi version and build, exception type, register dump, backtrace, server uptime, error messages and information about the core dump (a file generated after the error, containing further diagnostic information).

This screen is visible on the console of the server. To see it, you will need to either be in the datacenter and connect a monitor, or connect remotely using the server’s out-of-band management (iLO, iDRAC, IMM… depending on your vendor).

Example of Purple Screen of Diagnostics

DID YOU KNOW? 
The screen is referred to as either Purple or Pink, but in fact the color is Dark Magenta (RGB: 171, 0, 171 | CMYK: 0.00, 1.00, 0.00, 0.33)

Why does PSOD happen?

The PSOD is a kernel panic. Even though we all know that ESXi is not based on UNIX, the panic implementation fits the UNIX definition. The ESXi kernel (vmkernel) triggers this safety measure in response to an unrecoverable event or error, when continuing to run would pose a high risk to the services and VMs. To put it simply: when the ESXi host feels it has become corrupted, it commits “seppuku” and, while bleeding its purple blood, writes a suicide letter detailing why it did it!

The most common causes for a PSOD are:
1. Hardware failures, mostly RAM or CPU related. They normally throw an “MCE” or “NMI” error.

  • “MCE” - Machine Check Exception, a mechanism within the CPU to detect and report hardware issues. The codes displayed on the purple screen contain important details for identifying the root cause of the issue.
  • “NMI” - Non-Maskable Interrupt, a hardware interrupt that cannot be ignored by the processor. Since an NMI is a very important message about a hardware failure, the default response from ESXi 5.0 onward is to trigger a PSOD; earlier versions just logged the error and continued. As with MCEs, a purple screen caused by an NMI provides codes that are crucial for troubleshooting.

2. Software bugs

3. Misbehaving drivers: bugs in drivers that try to access an incorrect index or a non-existent method (e.g. KB2148123)
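To give a feel for what the MCE codes mentioned above contain, here is a minimal Python sketch that splits a machine-check status value into its flag bits and error codes. The bit positions follow the IA32_MCi_STATUS layout from the Intel Software Developer’s Manual; the function name and the sample value are my own illustrations, not something ESXi prints, and for a real decoding you should follow the VMware KB on MCE output.

```python
# Flag bits of an IA32_MCi_STATUS value (layout per the Intel SDM).
MCI_STATUS_FLAGS = {
    "VAL": 63,    # register holds valid error information
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was NOT corrected by hardware
    "EN": 60,     # reporting of this error was enabled
    "MISCV": 59,  # the MISC register holds extra details
    "ADDRV": 58,  # the ADDR register holds the failing address
    "PCC": 57,    # processor context may be corrupted
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit machine-check status value into flags and codes."""
    flags = [name for name, bit in MCI_STATUS_FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "mca_error_code": status & 0xFFFF,              # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


if __name__ == "__main__":
    # Made-up sample: valid, uncorrected, enabled, context corrupted.
    sample = (1 << 63) | (1 << 61) | (1 << 60) | (1 << 57) | 0x0135
    print(hex(sample), decode_mci_status(sample))
```

An uncorrected error (UC) with corrupted processor context (PCC) is the kind of condition where continuing to run is not safe, which is why it ends in a purple screen.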

DID YOU KNOW? 
You can even manually trigger a PSOD for testing purposes, or if you are just curious to see it happen.
Log in to the ESXi host via DCUI or SSH with a privileged account and run:

vsish -e set /reliability/crashMe/Panic 1

Obviously a test system is recommended, ideally a virtual nested ESXi so you can easily observe the console. Also make sure you finish reading this article to understand the implications of this action and the effect on your test system. 

What’s the impact of PSOD?

When the panic occurs and the host crashes, it terminates all the services running on it together with all the virtual machines it hosts. The VMs are not shut down gracefully, but rather abruptly powered off. If the host is part of a cluster and you’ve configured HA, these VMs will be restarted on the other hosts in the cluster. Besides the outage while the VMs are down, some critical applications like database servers, message queues or backup jobs may be affected by the “dirty” shutdown.

Additionally, all other services provided by the host will be terminated, so if your host is a member of a vSAN cluster, a PSOD will impact vSAN as well.

For me, the most troublesome aspect of a PSOD is the anxiety it creates and the way it makes you lose trust in your infrastructure, at least until you get to the bottom of it. OK, you can recover by rebooting and may have HA or even FT, so the impact may not be devastating… but until you solve the root cause, the thought that this can happen again, or on another server, can keep you up at night.

What to do when PSOD happens?

1. Analyze the purple screen message
One of the most important things to do when you have a PSOD is to take a screenshot. If you are connecting to the console remotely (IMM, iLO, iDRAC, ...), taking a screenshot is easy, but if you have to go to the datacenter, you may need to literally take out your phone and snap a picture of the screen. There’s a lot of useful information about the cause of the crash on that screen.

The purple screen message


2. Contact VMware support
Before you start further investigation and troubleshooting, it is advisable to contact VMware support if you have a support contract. They will be able to assist you with the Root Cause Analysis (RCA) in parallel with your own investigation.

3. Reboot the affected ESXi host
To recover the server you will need to reboot it. I would also advise keeping it in maintenance mode until you have performed the full RCA, identified the cause and fixed it. If you can’t afford to keep it in maintenance mode, at least fine-tune your DRS rules so that only unimportant VMs run on it; if another PSOD hits, the impact will then be minimal.

4. Get the core dump
After the server boots up you should collect the core dump. The core dump, also called vmkernel-zdump, is a file containing logs with information similar to, but more detailed than, what is shown on the purple diagnostic screen, and it will be used in further troubleshooting. Even if the cause of the crash seems obvious from the PSOD message that you analyzed in step 1, it is advisable to confirm it by looking at the logs from the core dump.

Depending on your configuration you may have the core dump in one of these forms:

a. On the scratch partition
b. As a .dump file on one of the host’s datastores
c. As a .dump file on vCenter, through the netdump service

The core dump becomes especially important if the host is configured to reset automatically after a PSOD, in which case you will not get to see the message on screen.

You can copy the dump file out of the ESXi host using SCP and then open it using a text editor (like Notepad++). It contains the contents of the memory at the time of the crash, and its first part contains the messages you saw on the purple screen. VMware support may request the whole file, but you can also extract just the vmkernel log, which is a bit more … digestible:

Error message generated by the purple screen


5. Decipher the error

Troubleshooting and Root Cause Analysis can make one feel like Sherlock Holmes. PSODs can sometimes turn into an Arthur Conan Doyle-inspired story, but in most cases it’s a pretty straightforward process where it will be hard to get to the fifth “why” of the 5 Whys technique.

The most important symptom, and the one you should start with, is the error message generated by the purple screen. Luckily, the number of error messages that can be produced is finite:

Exception Type 0 #DE: Divide Error
Exception Type 1 #DB: Debug Exception
Exception Type 2 NMI: Non-Maskable Interrupt
Exception Type 3 #BP: Breakpoint Exception
Exception Type 4 #OF: Overflow (INTO instruction)
Exception Type 5 #BR: Bounds check (BOUND instruction)
Exception Type 6 #UD: Invalid Opcode
Exception Type 7 #NM: Coprocessor not available
Exception Type 8 #DF: Double Fault
Exception Type 10 #TS: Invalid TSS
Exception Type 11 #NP: Segment Not Present
Exception Type 12 #SS: Stack Segment Fault
Exception Type 13 #GP: General Protection Fault
Exception Type 14 #PF: Page Fault
Exception Type 16 #MF: Coprocessor error
Exception Type 17 #AC: Alignment Check
Exception Type 18 #MC: Machine Check Exception
Exception Type 19 #XF: SIMD Floating-Point Exception
Exception Type 20-31: Reserved
Exception Type 32-255: User-defined (clock scheduler)

Since these exceptions are raised by the CPU, see the Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 1: Basic Architecture, and Volume 3A for more information about them.
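If you go through purple screen messages often, the exception list above can be kept as a small lookup. The following Python sketch is only an illustration under my own assumptions: the function names are hypothetical, and the regular expression covers just the two message shapes shown in this article ("#GP Exception(13) in world ..." and "#PF Exception type 14 in world ..."), so other ESXi builds may word the line differently.

```python
import re

# Exception vectors exactly as listed above: number -> (mnemonic, name).
EXCEPTION_TYPES = {
    0: ("#DE", "Divide Error"),
    1: ("#DB", "Debug Exception"),
    2: ("NMI", "Non-Maskable Interrupt"),
    3: ("#BP", "Breakpoint Exception"),
    4: ("#OF", "Overflow (INTO instruction)"),
    5: ("#BR", "Bounds check (BOUND instruction)"),
    6: ("#UD", "Invalid Opcode"),
    7: ("#NM", "Coprocessor not available"),
    8: ("#DF", "Double Fault"),
    10: ("#TS", "Invalid TSS"),
    11: ("#NP", "Segment Not Present"),
    12: ("#SS", "Stack Segment Fault"),
    13: ("#GP", "General Protection Fault"),
    14: ("#PF", "Page Fault"),
    16: ("#MF", "Coprocessor error"),
    17: ("#AC", "Alignment Check"),
    18: ("#MC", "Machine Check Exception"),
    19: ("#XF", "SIMD Floating-Point Exception"),
}

# Matches both "Exception(13)" and "Exception type 14" (assumed formats).
_EXCEPTION_LINE = re.compile(
    r"#(\w+) Exception\s*(?:type\s*)?\(?(\d+)\)? in world (\d+):(\S+) @ (0x[0-9a-fA-F]+)"
)


def describe_exception(vector: int) -> str:
    """Return a readable name for an exception type number."""
    if vector in EXCEPTION_TYPES:
        mnemonic, name = EXCEPTION_TYPES[vector]
        return f"{mnemonic}: {name}"
    if 20 <= vector <= 31:
        return "Reserved"
    if 32 <= vector <= 255:
        return "User-defined"
    return "Not listed"


def parse_exception_line(line: str):
    """Extract vector, world and address from a PSOD exception line, or None."""
    match = _EXCEPTION_LINE.search(line)
    if match is None:
        return None
    mnemonic, vector, world_id, world_name, address = match.groups()
    return {
        "mnemonic": "#" + mnemonic,
        "vector": int(vector),
        "world_id": int(world_id),
        "world_name": world_name,
        "address": address,
    }


if __name__ == "__main__":
    parsed = parse_exception_line("#GP Exception(13) in world 4130:helper13-0 @ 0x41803399e303")
    print(parsed, describe_exception(parsed["vector"]))
```

The world name and address are worth noting down together with the exception type, since they are the first things you will be asked for when you open a support case.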

The most common cases are covered in separate VMware KB articles. Since those articles are very detailed and well documented, I will just maintain a reference table of such errors here. Use this table as an index for the PSOD errors:

| Example Error | Detailed KB Article |
|---|---|
| LINT1/NMI (motherboard nonmaskable interrupt), undiagnosed | Using hardware NMI facilities to troubleshoot unresponsive hosts (1014767) |
| Panic requested by one or more 3rd party NMI handlers | Using hardware NMI facilities to troubleshoot unresponsive hosts (1014767) |
| COS Error: Oops | Understanding an "Oops" purple diagnostic screen (1006802) |
| Lost Heartbeat | Understanding a "Lost Heartbeat" purple diagnostic screen (1009525) |
| ASSERT bora/vmkernel/main/pframe_int.h:527 | Understanding ASSERT and NOT_IMPLEMENTED purple diagnostic screens (1019956) |
| NOT_IMPLEMENTED /build/mts/release/bora-84374/bora/vmkernel/main/util.c:83 | Understanding ASSERT and NOT_IMPLEMENTED purple diagnostic screens (1019956) |
| Spin count exceeded (iplLock) - possible deadlock | Understanding a "Spin count exceeded" purple diagnostic screen (1020105) |
| PCPU 1 locked up. Failed to ack TLB invalidate | Understanding a Failed to ack TLB invalidate purple diagnostic screen (1020214) |
| #GP Exception(13) in world 4130:helper13-0 @ 0x41803399e303 | Understanding Exception 13 and Exception 14 purple diagnostic screen events (1020181) |
| #PF Exception type 14 in world 136:helper0-0 @ 0x4a8e6e | Understanding Exception 13 and Exception 14 purple diagnostic screen events (1020181) |
| Machine Check Exception: Unable to continue | Decoding Machine Check Exception (MCE) output after a purple screen error (1005184) |
| Hardware (Machine) Error | Decoding Machine Check Exception (MCE) output after a purple screen error (1005184) |
| PCPU: 1 hardware errors seen since boot (1 corrected by hardware) | Decoding Machine Check Exception (MCE) output after a purple screen error (1005184) |
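The index above can also be kept as a simple lookup, so that an error line copied from the purple screen points you straight to the right KB number. This is a rough Python sketch: the substrings are my own picks from the example errors in the table, the matching is a plain case-sensitive substring check, and the function name is hypothetical.

```python
# (substring to look for, VMware KB number) taken from the table above.
# Order matters: the first matching entry wins.
KB_INDEX = [
    ("Machine Check Exception", 1005184),
    ("Hardware (Machine) Error", 1005184),
    ("hardware errors seen since boot", 1005184),
    ("NMI", 1014767),
    ("Oops", 1006802),
    ("Lost Heartbeat", 1009525),
    ("ASSERT", 1019956),
    ("NOT_IMPLEMENTED", 1019956),
    ("Spin count exceeded", 1020105),
    ("Failed to ack TLB invalidate", 1020214),
    ("#GP Exception", 1020181),
    ("#PF Exception", 1020181),
]


def find_kb(message: str):
    """Return the KB number for a PSOD error message, or None if not indexed."""
    for needle, kb_number in KB_INDEX:
        if needle in message:
            return kb_number
    return None


if __name__ == "__main__":
    for line in (
        "PCPU 1 locked up. Failed to ack TLB invalidate",
        "Panic requested by one or more 3rd party NMI handlers",
        "Some message that is not in the table",
    ):
        print(line, "->", find_kb(line))
```

A message that returns None is simply not covered by this index; it does not mean there is no KB article for it.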

How to prevent PSOD?

Most software-related PSODs are resolved by patches, so make sure you are up to date with the latest versions.

Make sure that your servers, together with all their devices and adapters, are on VMware’s Hardware Compatibility List. This will protect you from some unexpected hardware-related issues, and it will also ensure that VMware support is able to help you in case of a PSOD.

As described above in “Why does PSOD happen?”, misbehaving drivers are also a frequent cause of PSODs. It’s imperative to regularly check vendors’ support websites for updated firmware and drivers, and especially for drivers documented as causing PSODs, so that you can respond as soon as possible by upgrading them.

At Runecast, we regularly analyze the entire VMware Knowledge Base (kb.vmware.com), which consists of more than 30,000 articles. We extract actionable insights from the KBs in order to proactively make virtualized infrastructures more resilient, secure and efficient. We are very familiar with the PSOD and are able to identify most of the preconditions that can lead to it. By proactively analyzing your environment, Runecast Analyzer will help you steer away from these issues, so you can have the peace of mind that most PSODs lurking in your environment are prevented.

Proactive PSOD Checks with Runecast
PSOD Findings Details View on Runecast
