Operations | Monitoring | ITSM | DevOps | Cloud

Incident.io

On-call by default

Like many SaaS businesses, we have an on-call rota to enable us to provide 24x7 cover if there are problems with incident.io. We have a 'pager' which will alert the relevant person if something unexpected happens in our app, so that they can investigate and fix it if needed. Note: This was adapted from an internal document we wrote about how we think about on-call at incident.io.

Breaking down complex projects into smaller, shippable increments

Building a complex new product can be scary. What if no-one gets value from it? What if it doesn't work? What if it's hard to change? One way to mitigate these risks is to break down the product into smaller shippable increments, allowing you to capture feedback early and confirming the most important assumptions before fully committing to a solution.

Workflows: your process, automated

After many weeks of work, we're delighted to announce the latest feature of the incident.io platform: Workflows. Configure your processes once, and we'll make sure you follow them, every time ✨ A little while ago, I was asked the question: “what makes a good incident response?”. Whilst there’s infinite nuance in the answer, mine was pretty straightforward. The best incidents are founded on principles of communication, coordination, and clear roles and responsibilities.

Building safe-by-default tools in our Go web application

At incident.io, we're acutely aware that we handle incredibly sensitive data on behalf of our customers. Moving fast and breaking things is all well and good, but keeping our customer data safe isn't something we can compromise on. We run incident.io as a multi-tenant application, which means we have a single database (and a single application).

Deploying to production in <5m with our hosted container builder

Fast build times are great, which is why we aim for less than 5m between merging a PR and getting it into production. Not only is waiting on builds a waste of developer time — and an annoying concentration breaker — the speed at which you can deploy new changes has an impact on your shipping velocity. Put simply, you can ship faster and with more confidence when deploying a follow-up fix is a simple, quick change.

5 ways incidents made me a better engineer

Incidents are a great opportunity to gather both context and skill. They take people out of their day-to-day roles, and force ephemeral teams to solve unexpected and challenging problems. In my career, I've found incidents can be a great accelerator - for both myself and others around me. It was after leading my first incident at GoCardless that I started to feel really comfortable in the codebase and the team.

Logs and tracing: not just for production, local development too

We're a small team of engineers right now, but each engineer has experience working at companies who invested heavily in observability. While we can't afford months of time dedicated to our tooling, we want to come as close as possible to what we know is good, while running as little as we can- ideally buying, not building. Even with these constraints, we've been surprised at just how good we've managed to get our setup.

Now you see me, now you don't: feature-flagging with LaunchDarkly at incident.io

At incident.io, we ship fast. We're talking multiple times a day, every day (yes, including Fridays). Once I merge a pull request (PR), my changes rocket their way into production without me lifting a finger. 💅 It's when we tackle larger projects that this becomes a bit more complicated. We recently launched Announcement Rules, which let you configure which channels incident announcements are posted in depending on criteria you define.

Five steps to better customer communication

When you’re deep into an incident and there’s alerts firing, decisions to be made, and people to escalate to, it’s easy for outward communication with your customers to fall off the priority list. In many regards this makes sense; it seems natural to put all of your focus and energy into minimising the impact and getting things back on track as soon as possible.

Postmortem Pitfalls

Last week, we spent some time talking to Gergely Orosz about our thoughts on what happens when an incident is over, and you're looking back on how things went. If you haven't read it already, grab a coffee, get comfortable, and read Gergely's full post Postmortem Best Practices here. But before you do that, here's some bonus material on some of our points.