
The latest news and information on incident management, on-call, incident response, and related technologies.

Incident Report: Exercises, Cleanups, and Evacuations

Every year, Honeycomb runs disaster recovery scenarios in multiple environments, including in production. Although each of our instances runs in a single region, across at least three Availability Zones (AZs), we have multiple plans for partial regional failures, particularly zonal failures. One of these tests ran on December 5th, and once it completed successfully, its cleanup steps followed.

Secure access at the speed of incident response

Picture this: it's 2am, your pager goes off, and you're staring at a production database that's on fire. You know exactly what's wrong. You know exactly how to fix it. But you can't touch anything because you're waiting on someone to approve your access request. Meanwhile, your customers are down, your SLAs are bleeding out, and you're refreshing Slack hoping someone in security is awake to click "approve." This is the incident response tax that too many teams pay.

Boosting Rust developer productivity with Cursor - Our journey at ilert

AI-assisted coding has evolved from a novelty into an industry standard. At ilert, we started our adoption in mid-2023, quickly realizing that success depends heavily on proper context and workflows. This is particularly acute with Rust. While the language is central to our backend infrastructure, its strict compiler rules and distinct idiomatic approaches make it notoriously difficult for modern LLMs to master.
Sponsored Post

What to Say When Things Break: Outage Notification Templates for Ops Teams

This practical guide explains what to say when systems break, offering ready-to-use outage notification templates and best practices to help ops teams communicate clearly during incidents. Learn how effective outage communication can reduce confusion, manage user expectations, and maintain trust during service disruptions.

Response Team @ incident.io

When an incident hits, every second counts. The response team at incident.io builds the tools that make sure engineers aren't flying blind when it matters most. Sam, Tech Lead of the response team, takes us inside what it's really like to build the core of incident.io: the high technical bar, the art of prioritisation, and why there's no shortage of meaningful work to do. If you're an engineer who wants to work on something that genuinely makes other engineers' lives better, this one's for you.

Platform Engineering 101: What It Is, How It Differs from SRE and DevOps, & Why It Matters for Incident Response

Platform engineering has emerged as a response to the growing complexity of modern software delivery. As organizations adopt Kubernetes, microservices, CI/CD pipelines, and infrastructure as code, they are creating dedicated teams responsible for building and operating the internal platforms that power developer workflows.
Sponsored Post

Forwarding Microsoft SCOM Alerts to the Service Desk

Modern IT operations rely heavily on monitoring solutions like System Center Operations Manager (SCOM) to detect issues across servers, applications, and services. While SCOM excels at generating alerts, organizations often struggle to ensure these alerts translate into actionable incidents in their IT Service Management (ITSM) platforms. Without proper integration, critical alerts may be missed, tickets may be created manually, and incident resolution can be delayed.