The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.
Most engineering teams are no strangers to key performance indicators (KPIs), those metrics tracking progress toward critical goals and targets. Ideally, tech leaders design KPIs to focus teams on what matters and prove their contribution to the company’s overall performance. Of course, KPI data should also uncover critical information that guides informed decision-making. For engineering teams tasked with managing the customer experience, KPIs often track availability.
Mean time to resolution (MTTR) is a metric that transcends industry and technology. It’s a measure of how quickly, on average, support teams identify, act on, and resolve IT issues and incidents. Because MTTR directly relates to service quality, maintaining a low MTTR is a critical goal for DevOps and SRE teams. These teams have a vested interest in resolving issues quickly because escalating incidents to higher levels of the support team increases response and resolution times.
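As a rough illustration of the "on average" in that definition, MTTR over N incidents can be written as below. The symbols are assumptions introduced here, not taken from the post: t_i^open is when incident i was opened and t_i^resolved is when it was resolved. Teams differ on where they start the clock (fault onset, detection, or ticket creation), so treat this as one possible convention.

```latex
\[
\mathrm{MTTR} \;=\; \frac{1}{N} \sum_{i=1}^{N} \left( t_i^{\mathrm{resolved}} - t_i^{\mathrm{open}} \right)
\]
```

For example, three incidents that took 30, 60, and 90 minutes to resolve give an MTTR of 60 minutes.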
I’ve had the honor and privilege of authoring The SRE Report for the last three years. For the 2023 version, this included working with some amazing individuals like Anna Jones, Kurt Andersen, and Steve McGhee. Download The SRE Report 2023 here (no registration required).
In 2021, the Biden administration issued an executive order outlining that the government and private sector need to work together to combat cyberthreats and improve the nation’s collective cybersecurity stance. As cyberattacks become more common and more costly, the United States — like other nation-states — needs to do everything it can to prevent attacks and rapidly respond to them when they occur, which requires modernizing its approach to incident response.
It doesn’t matter if you’re a startup or in the Fortune 500: cost optimization, tool consolidation, and efficiency efforts are top of mind. Removing toil and automating more of the incident response process not only helps teams resolve incidents faster but also makes them more efficient. In a resource-strapped world, protecting developer and responder time and focus is critical to reducing total cost of operations and optimizing customer experience.
Hybrid and remote work is now the status quo. Companies campaigning for workers to return to the office are facing resistance, with some employers finding that they’re losing employees to jobs that give prospective hires the flexibility they want. Flexible work models have become a competitive advantage in a strained labor market. According to the latest Future of Work report from Accenture, 63% of high-growth companies have adopted a “productivity anywhere” workforce model.
In this post, we'll learn all about the incident metric mean time to detect (MTTD). We'll see how to measure it and look at its relationship with other incident metrics like MTTR (mean time to recover). Both metrics give useful insights into your incident recovery ability.
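A minimal sketch of how the two metrics might be computed from incident timestamps. The incident records, the field names (`started`, `detected`, `recovered`), and the helper `mean_minutes` are hypothetical and used only for illustration. It also assumes both metrics are measured from the moment the fault began; some teams measure MTTR from detection instead.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident log: when the fault began, when it was
# detected, and when service was recovered.
incidents = [
    {"started": datetime(2023, 1, 5, 10, 0),
     "detected": datetime(2023, 1, 5, 10, 12),
     "recovered": datetime(2023, 1, 5, 11, 0)},
    {"started": datetime(2023, 2, 9, 14, 0),
     "detected": datetime(2023, 2, 9, 14, 4),
     "recovered": datetime(2023, 2, 9, 14, 30)},
    {"started": datetime(2023, 3, 21, 9, 0),
     "detected": datetime(2023, 3, 21, 9, 20),
     "recovered": datetime(2023, 3, 21, 10, 30)},
]


def mean_minutes(records, start_key, end_key):
    """Average elapsed minutes between two timestamps across records."""
    return mean(
        (r[end_key] - r[start_key]).total_seconds() / 60 for r in records
    )


# MTTD: fault start -> detection. MTTR: fault start -> recovery
# (an assumed convention; see the note above).
mttd = mean_minutes(incidents, "started", "detected")
mttr = mean_minutes(incidents, "started", "recovered")

print(f"MTTD: {mttd:.1f} min, MTTR: {mttr:.1f} min")
```

Under this convention detection time is a component of recovery time, so any reduction in MTTD also lowers MTTR, provided the time spent on repair stays the same.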
A global leader in SaaS-based and on-premises software solutions that power innovative digital experiences was looking to replace the internal tool it used to resolve outages, service degradation, data center connection loss, and other incidents.