The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Feature Spotlight - Task Lists

When an incident occurs, teams often perform a known set of steps in a specific order to help identify and triage the incident. For Base and Advanced plan users, the Incidents menu includes a Task Lists section where teams can build out priority lists for different incident types or use cases, such as a list of failover tasks or the tasks required to perform a deployment rollback. With task lists, Incident Commanders can be sure that resolvers know exactly what needs to be done to quickly resolve incidents.

Scientific Incident Management with Dan Slimmon

Dan Slimmon is an incident management veteran who has worked at Etsy and HashiCorp and now leads consulting and training on pragmatic, non-bureaucratic incident response. In this episode, Dan shares his philosophy on "scientific incident response," the importance of hypothesis-driven troubleshooting, and why incidents should be seen as normal in complex systems.

Opsgenie is shutting down. Here's what that means, and how incident.io can help

Atlassian recently announced they’ll be shutting down Opsgenie, their popular on-call alerting tool. After June 4, 2025, no new Opsgenie accounts will be created, and by April 5, 2027, the service will shut down completely. Users don’t seem happy about it. If you’re currently using Opsgenie, this news is significant. A key part of your incident response process is disappearing, and Atlassian suggests moving to their other products, like Jira Service Management or Compass.

A seven-step framework for running incident debriefs

Ever wrapped up an incident, thought 'Phew, glad that’s over,' only to feel your stomach drop when you see the dreaded "Incident Debrief" on your calendar? We've all been there. Incident debriefs don't need to feel like sitting through your least favorite school subject. They can (and should!) actually be engaging and useful. At incident.io, we've found a simple, repeatable, and blameless framework.

How we responded to a 2+ hour partial outage in Grafana Cloud

On Tuesday, Feb. 18, 2025, we experienced an outage that lasted approximately 150 minutes and impacted roughly 25% of our Grafana Cloud services. To our customers: we are very sorry and more than a little embarrassed that we stepped outside our own processes and advice to cause this. You rely on us to help monitor and troubleshoot your environments, and this type of incident obviously makes it harder for you to do that.

EMEA Rundeck by PagerDuty Meetup - March 2025

Join us for an informal 1-hour virtual event where the open-source Rundeck by PagerDuty community comes together to share automation stories and use cases. Whether you're new to Rundeck or looking to elevate your automation game, this meetup is packed with valuable takeaways for everyone! CERN Orchestrates with Rundeck.

ITSM vs ITIL: Differences and How They Align

Understanding ITSM and ITIL is essential to strengthen your IT service management. Although they are closely related and often used interchangeably, ITSM and ITIL have distinct purposes and methodologies. To gain efficiency and competitive advantage in IT management, understanding their differences while exploring how they complement each other is a must.

The Importance of Customer Experience for Business Success

In today’s customer-centric landscape, businesses must go beyond just ensuring high availability and fast response times. Customers now expect seamless, personalized digital experiences, with little to no disruptions to service, and failing to meet these expectations can drive them to competitors. Studies show that companies prioritizing customer experience (CX) achieve significantly higher revenue growth and retention rates.