You've just made it through a particularly tough incident. It was a short outage affecting a subset of customers, so not exactly the end of the world, but bad enough that resolving it took multiple people across several teams. In any case, the incident was well managed, and the dust has settled. Now what? Most guidance would say that, given the severity of the incident, putting together a post-mortem document is a good idea. You've done that too, so what's next?
No one wants to be on the receiving end of the blame game, especially in the wake of a major incident. Sure, you know you were the one who made the final change that caused the incident. And hopefully it was a small one that didn't cause any SEV-1s. Still, the weight of knowing you caused something bad should be punishment enough, right? Unfortunately, sometimes fingers get pointed, your name gets called, and suddenly everyone knows that you're the person who created more work for everyone else.
If you’re just starting out in the world of incident response, then you’ve probably come across the phrase “post-mortem” at least once or twice. And if you’re a seasoned incident responder, the phrase probably evokes mixed feelings. To clarify: here, we’re talking about post-mortem documents, not meetings. It’s a distinction we have to make, since lots of teams use the phrase to refer to the meeting they hold after an incident.
We recently moved our infrastructure fully into Google Cloud. Most things went very smoothly, but there was one issue we came across last week that just wouldn’t stop cropping up. What follows is a tale of rabbit holes, red herrings, table flips and (eventually) a very satisfying smoking gun. Grab a cuppa, and strap in. Our journey starts, fittingly, with an incident getting declared... 💥🚨
A few months ago we announced Status Pages – the most delightful way to keep customers up to date about ongoing incidents. We built them because we realized there was a disconnect between what customers needed to know about incidents and how easily accessible that information was. As we built them, we focused on designing a solution that powered crystal-clear communication without the overhead, all beautifully integrated into incident.io.
As consumers, we expect the products and software we buy to work 100% of the time. Unfortunately, that’s impossible. Even the most reliable products and services experience the occasional disruption: crashes, bugs, timeouts. There are a ton of contributing factors, so it's impossible to distill disruptions down to a single cause. That said, technology is becoming more and more sophisticated, and so is the infrastructure that supports it.
Sometimes, two concepts overlap so much that it’s hard to view them in isolation. Today, incident management and problem management fit this description to a tee. This wasn’t always the case. For a long time, these two ITIL concepts were seen as distinct, with specialized roles overseeing each. Incident management existed in one corner and problem management in the other. Then came the DevOps movement, and the lines suddenly blurred. So where do they stand today?
At the beginning of May, I joined incident.io as the first site reliability engineer (SRE), a very exciting but slightly daunting move. With only some high-level knowledge of what the company and its systems looked like prior to this point, it’s fair to say I wasn’t entirely sure what exactly I’d be working on or how I’d deliver it.
🎵 Gotta give the people, give the people what they want! 🎵 You've been asking. And we've been listening. Over the past few weeks, we've been shipping frequently requested features to help you take your incident management to the next level. It may be the dog days of summer, but let's ignore that, yeah? Just take a look at this recent changelog: it's the biggest one we've ever published.