Latest Posts

Customers over control: how we measure On-call reliability

May 28, 2026 By Article In Incident.io

Our On-call product has a lot of great features: configuring escalation paths, viewing rotas and schedules, requesting cover, etc. However, when framing its reliability, we reduce it down to two critical pieces of functionality: It’s not that we’re happy if only these parts are working, but they are the most important parts. In this post, I'll go into more detail on how we think about their reliability.

Engineering teams in 2027

May 19, 2026 By Article In Incident.io

There's a conversation I keep having with our design partners at incident.io. It starts when I ask "what are you doing with AI internally?" and lands in a similar place every time. The shape of how their engineering teams work is changing fast. Not in vague "AI is transforming everything" ways, but in concrete, repeatable patterns. Different companies are building the same things. The frontier teams are six to twelve months ahead of the average, and they're describing the same future.

Humans aren't fast enough for 4 9's

May 11, 2026 By Article In Incident.io

When thinking about Service Level Objectives (SLOs) and contractual Service Level Agreements (SLAs) for availability, I always like to put the percentages into concrete numbers. It’s easy to lose track of what “99.95%” availability actually means, and even easier to underestimate how much harder it is to achieve 99.99% than 99.95%. On a monthly basis, and in concrete terms, 99.95% availability means you get 21 minutes and 55 seconds of downtime.
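As a rough illustration of that arithmetic, here is a minimal Python sketch. It assumes a "month" means an average calendar month (365.25 days divided by 12), which is one common convention and may not be the one the post uses. The function name `monthly_downtime_budget` is hypothetical and only for illustration.

```python
# Assumption: a "month" is an average calendar month (365.25 days / 12).
SECONDS_PER_MONTH = 365.25 / 12 * 24 * 60 * 60  # 2,629,800 seconds


def monthly_downtime_budget(availability_percent: float) -> tuple[int, int]:
    """Return the allowed downtime per month as (minutes, seconds).

    The result is rounded to the nearest whole second.
    """
    allowed_fraction = 1 - availability_percent / 100
    total_seconds = round(SECONDS_PER_MONTH * allowed_fraction)
    return divmod(total_seconds, 60)


for target in (99.95, 99.99):
    minutes, seconds = monthly_downtime_budget(target)
    print(f"{target}% -> {minutes} min {seconds} s")
# 99.95% -> 21 min 55 s
# 99.99% -> 4 min 23 s
```

Under the same assumption, moving from 99.95% to 99.99% cuts the monthly budget from roughly 22 minutes to under four and a half.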

Who's on call? How Claude helped us calculate this 2,500x faster

Apr 28, 2026 By Article In Incident.io

Schedules are a core part of any on-call system. In ours, they define who to page and when. But people use them in lots of other ways too: checking their next shift, asking for cover while at the gym, keeping a Slack user group up to date, or updating a Linear triage responsibility. For many of our customers, they’re one of the main ways they interact with our product, and as they’re such a foundational part of On-call, it’s very important they work well.

What does using AI for post-mortems actually mean?

Apr 23, 2026 By Article In Incident.io

Everyone is using AI to help with post-mortems now. The pitch is obvious: post-mortems are time-consuming, the blank page is brutal, and AI is very good at producing structured, confident-sounding documents quickly. We're not here to push back on that. We've built AI into our own post-mortem experience, pulling your Slack thread, timeline, PRs, and custom fields together and giving your team a meaningful starting point in seconds. We think that's genuinely valuable, and the teams using it agree.

How it feels to run an incident with AI SRE

Apr 23, 2026 By Article In Incident.io

We've been building the broader incident.io platform for several years now, and one thing we've learned is that UX matters more here than almost anywhere else. When an incident fires, there's no room for poorly designed interfaces or fumbling through features you haven't touched in a while. The product has to be ergonomic: easy to pick up, easy to navigate, with the right things at your fingertips at exactly the right moment. We've put a lot of effort into this over the last 5 years.

Why post-mortem action items die

Apr 16, 2026 By Article In Incident.io

You can run the best debrief of your life. Honest timeline, blameless tone, real insights. People leave the room nodding. And then nothing happens. This is the last-mile problem of post-mortems, and it's an easy trap to fall into. When you've just been through a stressful incident, getting the service back up is the priority. Once it's over, the post-mortem itself can feel like the finish line. You've documented what happened, been honest about it, identified what went wrong. It feels like the work is done.

How to migrate your paging tool without breaking your team

Mar 20, 2026 By Article In Incident.io

Most engineering teams don’t migrate their on-call and paging systems unless absolutely necessary. No matter how painful their current solution, it's one of those changes that people put off for as long as possible because the cost is real. The disruption, the retraining, the risk of missing a critical page during the transition. It's not something you do on a whim.

How Catalog changes the game for long-term maintenance

Mar 18, 2026 By Article In Incident.io

Every incident platform needs to know who owns what. Which team owns which service. Which backlog to send follow-ups to. Which escalation path to page when something breaks. The problem is that most platforms encode this ownership logic separately in every configuration: alert routing, workflows, ITSM ticket syncing, and more. Each one maintains its own copy of the same information, in its own format.

The post-mortem problem

Mar 4, 2026 By Article In Incident.io

Post-mortems are one of the most consistently underperforming rituals in software engineering. Most teams do them. Most teams know theirs aren't working. And most teams reach for the same diagnosis: the templates are too long, nobody has time, and nobody reads them anyway. These aren't wrong observations. But they're symptoms, not causes. The actual problem is that somewhere along the way, the post-mortem stopped being a piece of communication and became a compliance artifact.
