The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.
Last week, we spent some time talking with Gergely Orosz about what happens after an incident is over and you're looking back on how things went. If you haven't read it already, grab a coffee, get comfortable, and read Gergely's full post, Postmortem Best Practices. But before you do, here's some bonus material on a few of our points.
People are more than happy to talk about their successes, but ask them about their failures and they can be much more hesitant to share. Interestingly enough, failure is entangled with the emotion of shame. Yet failure is integral to achieving anything novel, and the lessons it teaches are unparalleled. So let's figure out why people fear failure and find ways to get more comfortable with it.
A Service Level Agreement (SLA) is a document that details the level of service a vendor guarantees for its product or service. It generally sets out metrics such as uptime expectations, along with any compensation owed if those levels are not met. For example, if a provider advertises 99.9% uptime, the allowed downtime in an average month is about 43 minutes and 50 seconds (0.1% of roughly 30.44 days). If downtime exceeds that, the SLA has technically been breached, and the customer may be entitled to some form of remuneration, depending on the agreement.
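To make the arithmetic concrete, here is a minimal Python sketch of an uptime "downtime budget." It assumes the SLA is a simple uptime percentage measured over an average calendar month; real agreements define their own measurement windows, exclusions such as scheduled maintenance, and how partial outages count. The names `downtime_budget`, `sla_breached`, and `AVERAGE_MONTH` are hypothetical, chosen for illustration only.

```python
from datetime import timedelta

# Assumption: an "average" calendar month of 365.25 / 12 = 30.4375 days.
AVERAGE_MONTH = timedelta(days=365.25 / 12)


def downtime_budget(uptime_percent: float, period: timedelta = AVERAGE_MONTH) -> timedelta:
    """Maximum downtime allowed in `period` at the given uptime target."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    # The unavailable fraction is whatever the uptime target does not cover.
    return period * ((100 - uptime_percent) / 100)


def sla_breached(observed_downtime: timedelta, uptime_percent: float,
                 period: timedelta = AVERAGE_MONTH) -> bool:
    """True if observed downtime in `period` exceeds the budget for the target."""
    return observed_downtime > downtime_budget(uptime_percent, period)


if __name__ == "__main__":
    print(downtime_budget(99.9))                      # 0:43:49.800000
    print(sla_breached(timedelta(minutes=45), 99.9))  # True
```

Measuring over a fixed 30-day month instead (`downtime_budget(99.9, timedelta(days=30))`) gives 43 minutes and 12 seconds, which is why the exact window written into the agreement matters.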
It’s 2 AM and you’ve been paged while you’re still waking up: how well can you find what you need to fix the latest mistake? When the incident begins, it might affect only a single service. But as time passes (your brain boots up, the coffee is poured, the docs are read), the incident may be escalating to other services and teams, and if those are outside your scope of ownership, you might not even see their alerts.
Companies that underwent accelerated digital transformations over the past 18 months are looking for ways to improve their operational maturity so they can handle the resulting increase in complexity. Doing so is paramount to an organization’s future success.
Technology companies are at the forefront of innovation, changing the way consumers and the general public go about their everyday lives. As the late Stan Lee so wisely put it, “with great power comes great responsibility,” and this heightened pressure often leaves little room for error when issues arise, which happens more often than you’d think.