A few weeks ago, this tweet by David Peñaloza riled me up a bit.
I have problems with people who believe that EVERYTHING can be escalated, or solved by applying more pressure.
Recently I’ve encountered this several times. What’s making people think they can always have power over others? Specially between companies. Can you “escalate”
— David Peñaloza (@davidsamuelps) November 13, 2020
So much so that I tweeted this in response:
This idea strikes a nerve with me. And also has the “shut up because you’re going to say more than is wise” alarms going off. I’m gonna think and understand my visceral agreement before I say more. https://t.co/PyB1srfKXo
— Eyvonne Sharp (@SharpNetwork) November 13, 2020
Whenever I respond with intense emotion, I try to unpack my thoughts before diving into a rant. I’ve had some time to think.
Anyone who’s been a key technical contributor during a large-scale enterprise outage knows the routine. An outage occurs. On-call teams across the organization jump on a call. Each group opens tickets with every vendor in the environment. Leadership demands 15-minute updates. Engineers troubleshoot and gather data in real time, and that data is heard, misunderstood, miscommunicated, and relayed out of context. If the outage is severe enough, vendor account teams are engaged. Someone demands escalation.
The details may vary depending on the scale of the outage, the IT team’s health, and the level of vendor investment within the organization. In my experience, this scenario plays out with shocking regularity.
First, I want to acknowledge that many times escalation is necessary. If a team is stuck and isn’t making progress, they need to get unstuck. Sometimes, engineers are too close to the problem and cannot see it clearly. Other times, a support engineer or team lead does not have the skills to deal with the issue at hand. If a team repeatedly needs fifteen more minutes with no substantive change or new information, intervention may be necessary. These issues may require a nudge, push, or demand from someone in authority.
More commonly, the demand for escalation arises from a fundamentally flawed assumption: the belief that those with executive power and positional authority can fix a problem merely by applying increasing pressure. Many have operated under this assumption for so long that it has become a culturally ingrained, self-fulfilling prophecy.
Several incorrect beliefs feed this false assumption.
False Assumption 1: The people working on the issue are incompetent.
In many cases, the immediate demand for escalation comes from believing that the individuals working on the issue are incompetent, in over their heads, or uninformed. Often, IT leaders trust vendor representatives more than they trust their own people, the ones who build, maintain, and support systems day-to-day. This should not be. The primary source of documentation, in-depth knowledge, and operational guidance must be the team responsible for the system’s ultimate operation. Vendors can and should help. They have deep expertise in their particular product but often lack foundational organizational details, knowledge of the interaction surfaces between solutions, and the business drivers for the systems in question.
This lack of trust is toxic to an IT organization. It exposes the environment to risk, devalues the highest contributors, and leaves you vulnerable to unscrupulous vendors. If a vendor representative is unhelpful, by all means, escalate. However, escalation within minutes of the report of an outage signals larger organizational problems.
False Assumption 2: More pressure will fix the issue faster.
No one wants to sit on a 24-hour escalation call. No. One. Yelling, threats of penalty, and blame-shifting do nothing to resolve an issue. In my experience, these behaviors delay resolution, destroy culture, and demoralize the very people required to solve the problem. Escalation often brings more non-productive eyes to the issue, requires more communication, provides more avenues for misinformation, and piles pressure onto the small troupe of people who can resolve the issue. Adding stress, requiring excessive updates, and threatening people with their jobs will not improve their clarity of thought. Pressure can also push teams to make changes before fully considering them, which creates additional problems.
I once participated in a 200-member outage call in which the most senior leader demanded all troubleshooting happen on that call. Eventually, a brave group of engineers created a secondary call where they talked openly about the data they were gathering and resolved the issue while staying engaged on the “leadership” call. This level of dysfunction increases cognitive load to an unmanageable level and dramatically increases MTTR.
False Assumption 3: When systems come back, everything’s fine.
After a severe outage, teams breathe a sigh of relief, congratulate one another on their hard work, and continue with their daily lives unchanged. Many leaders require a root cause analysis (RCA) to explain the outage. In the absence of a healthy culture, this process is worthless. Without two-way trust, the people writing the RCA will withhold information, include vague explanations, and lay blame in the most convenient place, often a recently departed employee or a vendor failure. Based on an incomplete RCA, management will demand changes to ensure this never happens again. Ill-informed solutions will be implemented that increase technical debt, may not mitigate the risk, and may add risk in another area of the IT stack.
When an organization insists on escalation at the first sign of a problem, it signals mistrust, an unhealthy culture, and organizational dysfunction. Pricey support contracts ensure vendors will cope with these challenges. IT teams inside these organizations suffer most as they’re disincentivized, demoralized, and discouraged. Over time, the attrition of great talent reinforces wrongheadedness and furthers the negative spiral.
If your organization has developed a knee-jerk habit of escalation at the first sign of a problem, consider a healthy dose of self-reflection. What harmful attitudes and behaviors drive your escalation habit? What can you change today, before your next outage, to make life better for everyone?