Operations | Monitoring | ITSM | DevOps | Cloud

Latest Posts

The Incident Maturity Model

I want to walk you through how incident management has evolved, drawing from real data and the experiences of some of the most sophisticated tech organizations out there. I'll also introduce you to a framework we’ve developed at incident.io: the Incident Maturity Model. This framework is the result of thousands of conversations with companies and provides a clear roadmap to help your organization improve its incident management practices—no matter where you're starting from.

The flight plan that brought UK airspace to its knees

On August 28th, 2023—right in the middle of a UK public holiday—an issue with the UK’s air traffic control systems caused chaos across the country. The culprit? An entirely valid flight plan that hit an edge case in the processing software, partly because it contained a pair of duplicate airport codes.

How we page ourselves if incident.io goes down

Picture this: your alerting system needs to tell you it's broken. Sounds like a paradox, right? Yet that’s exactly the situation we face as an incident management company. We believe strongly in using our own products - after all, if we don’t trust ourselves to be there when it matters most, why should the thousands of engineers who rely on us every day? However, this poses an obvious challenge.

Organizing ownership: How we assign errors in our monolith

At incident.io, we run on a monolith. This brings a whole load of benefits that we don’t want to give up any time soon. We don’t have to worry about the speed of internal network requests, complex deployments, or optimizing work that touches multiple services. This blog post isn’t about the relative benefits of monoliths though (but we’ve written more about that here if you are interested)! Ownership in monoliths is tricky.

How we handle sensitive data in BigQuery

As a provider of incident management software, we at incident.io manage sensitive data regarding our customers. This includes Personally Identifiable Information (PII) about their employees, such as emails, first names, and last names, as well as confidential details regarding customer incidents, such as names and summaries. Consequently, we approach the management of this data with a great deal of care.

Lessons from 4 years of weekly changelogs

Writing a meaningful update for customers every week has been held sacred at incident.io since we started the company. We've written over 200 of them in the past 4 years, and we recently celebrated going 2 years straight without missing a single a single week The numbers themselves are not the goal, but the consistency of this habit and what it represents for our customers and our team is very real, and special to me.

Observability as a superpower

With every job I have, I come across a new observability tool that I can’t live without. It’s also something that’s a superpower for us at incident.io: we often detect bugs faster than our customers can report them to us. A couple of jobs ago, that was Prometheus. In my previous job, it was the fact that we retained all of our logs for 30 days, and had them available to search using the Elastic stack (back then, the ELK stack: Elasticsearch, Logstash, and Kibana).

What is DORA and how will it affect me?

The Digital Finance Strategy is a European directive that aims to support and develop digital finance in Europe while maintaining financial stability and consumer protection. There are three main components to the package: In this blog post, we’ll attempt to summarize the 113-page DORA proposal, highlighting how it will apply to incident management at financial entities. Side note: we also wrote a blog post about the other DORA, also known as the DevOps Research and Assessments.