An Inquiry into the Nature of (Not) Reporting Data Center Problems


It is widely known that ‘technical issues’ are a primary factor leading to data center outages and VMware errors such as the ill-fated Purple Screen of Death (PSOD). But how might communication rank as a factor? The VMware Knowledge Base (VMware KB), technical forums, and chat groups remain the go-to places for troubleshooting problems like “vCenter error” or “hypervisor is not running” – for example, Reddit’s VMware group with about 75K members or the Telegram VMware chat group with almost 2K members. But despite the (already clichéd) “Fail fast” mantra of today’s Agile approach, presenting failures and near-failures outside of help venues seems not to have caught on. That is true at conferences, where people go to get psyched up and motivated, and in company meetings, where a too-common (political) aim is to make the team look good.

Presentations often center instead on the ‘failures of others’ as a contrast to the presenter’s own ‘genius’ practices and ideas. If part of the motive is the ever-present requirement to drive sales revenue or make a team look good, then it’s a natural reaction to focus only on our strengths and on the incompetence of the competition or of another team. But as colleagues and audiences grow increasingly savvy about how the world works, perhaps they’d benefit more from seeing how companies and teams deal with problems that can affect everyone.

Instead of presenting, for example, “How our company stays immune to vSphere errors,” a better communication approach might be “When our company nearly failed at handling our vSphere errors, this is what we did next.” In an industry where downtime can cost billions, there is no room for grandstanding (leave that to motivational speakers). When your responsibility is to protect the data center, the first step toward doing so as effectively as possible is to admit the issues, including the mistakes, that you learn from daily.

Mercedes-Benz had an ad campaign several decades ago that said something like “All cars break down sooner or later. Ours just do it later.” Certainly this angle was a refreshing turn from the advertising norms at the time (and still today). In 2019, we know well enough about entropy and that we are not immune to the breakdown of all things.



What has not yet caught on as a mainstream concept: No significant breakthrough occurs without a preceding breakdown. It’s like the quote that is often attributed to Henry Ford: “If you want something done the most efficient way possible, give it to a lazy man.” Likewise, it can be said that no breakdown occurs without an eventual corresponding breakthrough of some sort. For our breakthroughs to have real significance, they should be paired with our own breakdowns – not necessarily those of our competitors. The latter is easily distinguishable as a disingenuous approach.

The Humanoid Factor

“[I]t was designed to operate with minimal assistance from humans, who were without exception the moving part most likely to fail.” ― John Scalzi, The End of All Things

We are also quite conscious that human error is the yet-to-be-determined variable in any and all equations, and human error does not seem to impress as much as it might have when we first began to evaluate it in business. Acronyms like SNAFU/FUBAR and even Darwin Awards seem now to be passing fads, and a quick web search turns up plenty of statistics to support human error as another leading cause of data center outages.

[Image: Problems in Datacenter]

More impressive to note are the ways in which we humans compensate for our errors, how we rise to the occasion and seemingly perform impossible feats when faced with the adverse effects of our mistakes. The most useful learning moments are often the real-life actions that a team (or individual) took to remedy a seriously screwed-up situation.

The stereotype that the ‘IT Crowd’ lacks communication and soft skills doesn’t have to be a norm any more than ‘starving artist’ needs to be. To exist in a less-reactive present, envision what the ideal end game could look like, then set goals working backwards to right now. That is, if you want a ‘bulletproof’ data center a year from now, then the goals you set to get there will likely require being more transparent in your communications today.

If one thinks back to a grandparent and his or her philosophies on life, work, and relationships, it’s evident that with maturity comes a more macro understanding that everything happens as part of a seemingly infinite series of causes and effects. As children, we have a tendency to blame things around us – siblings, cousins, neighbors, the dog, the Devil, the weather, etc. – for our own errors in judgement, whereas a mature mind acknowledges (with only minor insecurity) that human error is inevitable. The true and most compelling test of character is how we respond to it. As the old Zen-like saying goes: “You can’t control the things around you, you can only control your response to them.” So... response + ability = responsibility.

The Evolution

Any shift toward greater transparency reveals needs for new innovations and technologies. This is what happened when a team at IBM was tasked with building the company’s VMware Center of Excellence, to manage the accounts of all IBM customers running VMware in the EMEA region.

The team implemented the Defect Prevention Process (DPP) to investigate problem tickets, analyze the root causes with the aim of finding patterns, and make recommendations for how to reduce such issues. They soon observed that around 90% of the root causes across all customers’ VMware-based environments stemmed from known problems already documented in VMware Knowledge Base (KB) articles or in other collective knowledge sources such as forums, blogs, and white papers.

At one point, a customer’s ESXi host failed and High Availability (HA) did not kick in. The result was a major outage with service impact that lasted about two days before the team finally found the root cause – a configuration issue – in a KB article about ESXi patches. Making a world-class decision, IBM tasked the team with checking all other customers to see if any of them could also be at risk from the same issue.

Some time later, after members of this IBM team had left the company to pursue other opportunities, the idea was born to use the collective knowledge available (including VMware Best Practices), automate the intelligence gathering, and proactively analyze environments for potential issues, in real time against these tens of thousands of sources, before they could become critical problems. The ex-IBM team – including VMware Certified Design Expert (VCDX) #74, Stan Markov – named their new automated predictive-analytics software Runecast Analyzer and built the company Runecast Solutions around it, leading to an entirely new direction in must-have VMware tools.
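To make the idea of checking every environment against documented issues concrete, here is a minimal Python sketch. It is not how Runecast Analyzer works internally; the rule IDs, the host fields (in_cluster, ha_enabled, build), and the build threshold are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class KnownIssue:
    """A documented problem plus a check that decides if a host is exposed."""
    issue_id: str                      # hypothetical ID, not a real KB number
    summary: str
    applies_to: Callable[[dict], bool]


# Hypothetical rules standing in for entries distilled from KB articles.
KNOWN_ISSUES = [
    KnownIssue(
        "EXAMPLE-001",
        "Clustered host without High Availability enabled",
        lambda h: h.get("in_cluster", False) and not h.get("ha_enabled", False),
    ),
    KnownIssue(
        "EXAMPLE-002",
        "Host build older than an assumed patched build (1000)",
        lambda h: h.get("build", 0) < 1000,
    ),
]


def scan_hosts(hosts, issues=KNOWN_ISSUES):
    """Return {host name: [matching issue IDs]} for hosts with findings."""
    findings = {}
    for host in hosts:
        matched = [i.issue_id for i in issues if i.applies_to(host)]
        if matched:
            findings[host["name"]] = matched
    return findings


# Made-up inventory; in practice this would come from an inventory export.
hosts = [
    {"name": "esxi-01", "in_cluster": True, "ha_enabled": False, "build": 1200},
    {"name": "esxi-02", "in_cluster": True, "ha_enabled": True, "build": 900},
    {"name": "esxi-03", "in_cluster": True, "ha_enabled": True, "build": 1200},
]

findings = scan_hosts(hosts)
for name, issue_ids in findings.items():
    print(f"{name}: {', '.join(issue_ids)}")
```

The point of the sketch is the shape of the approach: once a root cause is written down as a rule, the same rule can be run against every customer environment, not only the one that just failed.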

Today, Runecast Solutions Ltd. is headquartered in London, UK, with several offices worldwide, and is a leading provider of patent-pending, actionable predictive analytics for VMware vSphere, vSAN, NSX and Horizon performance management. Its award-winning Runecast Analyzer software, regularly lauded by industry experts, provides real-time, continuous VMware support intelligence for companies of all sizes. IDG Connect named Runecast one of “20 Red-Hot, Pre-IPO Companies to Watch in the 2019 B2B Tech” space. 

In this case, the IBM policy of openness about a discovered vulnerability led to the development of an innovative technology that gives VMware admins an almost omniscient view – one they hadn’t previously realized they were missing.

From a review in Virtualizationhowto.com: "Runecast is just one of those products that once you see it, you have to have it for your VMware vSphere environment. You then ask yourself, why have I not been running this all along?" –Brandon Lee

The Conclusion

Doctors take the Hippocratic Oath to be qualified professionals. Attorneys take the Bar Exam. I would posit that, as global technology professionals, we must take full responsibility for the evolution of technology at this moment in its history – which includes honest communication about the state of things. Like it or not, we have all somehow stepped into a game that requires constant leveling up, and it’s clear that the next level is more firmly in our grasp if we can be more transparent publicly about where we are succeeding – and failing – at any given moment. If we can do that... we can do anything.

In your quest for leveling up, the article that inspired this commentary mentions eight areas that the author feels are “the most important factors contributing to a data center’s ability to deliver a very high level of operational excellence”: Staffing, Training, Resources, Incident Reporting, Escalation, Communication, Back-up Plans, and knowing well your Top 10 Incidents. The author further explains how best to take full professional responsibility in each of these eight areas. Also, check out what the team at Runecast can do for your data center right now with a free trial of Runecast Analyzer.


Your Runecast Team