Incidents are lessons, not failures

By Eduardo Crespo, VP of EMEA

Jul 26, 2024

PagerDuty

Delivering digital operations excellence - DevOps, incident management, and keeping organisations running - is a constant challenge. As customers' digital expectations rise, so does the complexity of the tech stack and its cloud service integrations. But insisting on 100% uptime and rushing through incident management without taking the learnings into account creates a poor culture that can damage the DevOps team's ability to do its job. This is not how a business creates resilient infrastructure and high-performing teams.

A better strategy is to encourage a culture of learning and improvement without blame. This requires ensuring that learnings are measured just as much as metrics like MTTR, MTTA, MTBF, etc. Organisations best develop this capability via incident and infrastructure analysis and a mature acceptance of operational incidents as an inevitable part of digital operations. Thus, incidents too are a part of the process of improvement, not something that can be totally banished.

What's hot and what's not

Our research into the impact of downtime and disruption suggests that incidents are costly to ignore. In fact, the average cost of an incident is reported as $800k. Yet incidents are a fact of life. So planning for them, and for the investigation and analysis that follow, is a step on the road to mature digital operations that get better and better over time.

But in the real world, what has often happened is that some "unlucky engineer" is handed the task of running an investigation after the fact. Working backwards like this, without the right prep work, they struggle to find the data and feedback they need from the tools and parties involved. And, unless the business has a culture of learning, it may be that no one else in the business cares about anything other than the top-line answer to: 'Will that ever happen again?' That's not a hot strategy for deeply embedding business improvements.

Aim for continuous improvement without finger-pointing. Understanding that improvement comes from facing and overcoming incidents is a hotter business strategy. It encourages long-term planning, documentation, and knowledge sharing, all of which support stronger digital operations and a better employee experience for the tech team, and help keep the right metrics in the green.

Improving the cultural and operational response

Leading organisations have developed a robust approach to operational resilience. First, in the face of operational threats, these organisations do a better job of identifying patterns across their tech stack, tools and teams. This allows for both continuous improvement and the development of more resilient systems and teams. This inevitably means a focus on automation.

With the right tooling, technical staff can uncover and extract the information they require to fix problems, strengthen systems against the same fate recurring, and encode that knowledge within the business. The days of binders full of pages - with added yellow sticky notes offering secret lore to the initiated - are long gone. SaaS solutions that amplify engineering know-how are in. The engineer in the hot seat is able to quickly uncover the patterns in and across incidents. That's how talent can find root causes, unexpected relations, and better solutions, faster.

A step beyond this, for larger organisations, is applying that pattern spotting to the human element too. If the same people are repeatedly brought in to solve incidents, particularly where lessons are not learnt or investments not made, it can indicate who may be on the way to burnout or to leaving. Those may well be the people with the greatest expertise and knowledge of the unique nature of the organisation: the people who must be retained to support long-term organisational resilience.
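As a rough illustration of that kind of human-element pattern spotting, the Python sketch below flags people who appear in a large share of incidents. It is a minimal example under stated assumptions, not a description of any particular product: the function name `overloaded_responders`, the 50% threshold, and the sample names are all hypothetical.

```python
from collections import Counter


def overloaded_responders(incident_responders, share_threshold=0.5):
    """Return responders involved in more than share_threshold of incidents.

    incident_responders is a list with one entry per incident, each entry
    being the list of people pulled in to resolve it (hypothetical shape).
    """
    total = len(incident_responders)
    if total == 0:
        return {}
    counts = Counter(
        person
        for responders in incident_responders
        for person in set(responders)  # count each person once per incident
    )
    return {
        person: n / total
        for person, n in counts.items()
        if n / total > share_threshold
    }


# Illustrative data only: four incidents and who responded to each.
incident_responders = [
    ["ana", "ben"],
    ["ana"],
    ["ana", "chris"],
    ["ben", "dee"],
]
print(overloaded_responders(incident_responders))  # {'ana': 0.75}
```

A high share is only a prompt for a conversation, not a verdict: it may point to a knowledge silo or an uneven on-call rota as much as to an individual at risk.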

The best solutions will help by surfacing things that are hard to even pose a question about. For example, patterns whereby incidents increase around code freeze dates, or cases where the costs from incidents rise beyond the cost of an upgrade that would manage them. A little help from automation can boost the power of the professionals managing digital operations, giving them access to the data they need to be more strategic. This power, applied to cross-incident analysis, offers data to take post-incident analysis to the next level. Beyond this, it empowers teams to make the changes required to reduce the recurrence or cost of incidents for the long term.
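The two examples above (incidents clustering around code freezes, and incident costs overtaking the cost of an upgrade) can be sketched in a few lines of Python. This is a simplified illustration, not a real tool's API: the `Incident` fields, the three-day window, and all figures are assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Incident:
    opened: date   # day the incident started
    service: str   # affected service (hypothetical field)
    cost: float    # estimated cost of the incident


def near_freeze_share(incidents, freeze_dates, window_days=3):
    """Fraction of incidents opened within window_days of any code freeze."""
    if not incidents:
        return 0.0
    near = sum(
        1
        for inc in incidents
        if any(abs((inc.opened - f).days) <= window_days for f in freeze_dates)
    )
    return near / len(incidents)


def services_cheaper_to_upgrade(incidents, upgrade_costs):
    """Services whose accumulated incident cost exceeds the upgrade cost."""
    totals = {}
    for inc in incidents:
        totals[inc.service] = totals.get(inc.service, 0.0) + inc.cost
    return sorted(
        s for s, total in totals.items()
        if s in upgrade_costs and total > upgrade_costs[s]
    )


# Illustrative data only.
incidents = [
    Incident(date(2024, 3, 14), "checkout", 40_000.0),
    Incident(date(2024, 3, 16), "checkout", 35_000.0),
    Incident(date(2024, 5, 2), "search", 5_000.0),
    Incident(date(2024, 6, 20), "checkout", 30_000.0),
]
freeze_dates = [date(2024, 3, 15), date(2024, 6, 21)]
upgrade_costs = {"checkout": 90_000.0, "search": 50_000.0}

print(near_freeze_share(incidents, freeze_dates))            # 0.75
print(services_cheaper_to_upgrade(incidents, upgrade_costs))  # ['checkout']
```

With this sample data, three of the four incidents fall within three days of a freeze, and the checkout service has already cost more in incidents than its upgrade would. Neither result proves causation, but both are questions worth raising in a post-incident review.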

A cultural shift is also required. Any stigma about incidents must be replaced with acceptance: it's better to get ahead of them early and deal with them in a transparent manner, then extract the learnings and improvements they offer. That keeps teams focussed on action, not politics.

The learning zone: Where failures are lessons

If organisations get this combination of cultural and technical factors right, they will change the behaviours that drive better digital and customer service delivery, as well as a better workplace and cultural experience. However, it can be tough to change people's mindsets around 'failure'. Getting there may mean different paths for different organisations, each with its own values and people dynamics.

As a foundation, your people must feel safe sharing truths. Leaders must be able to provide feedback and guidance that is helpful and supportive, not carrying the weight of 'the boss's gaze'. Research from the Harvard Business School describes the 'learning zone' as the place where high performance expectations and psychological safety meet. It's the place to operate in if every incident is to become a lesson learned.
