The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

How to Fine Tune Your IncidentHub Alerts

IncidentHub can send outage alerts to many external systems. You can choose from Slack, Webhook, Email, Discord, PagerDuty, and more. Alerts are effective only when they are relevant and actionable. In this article, we will explore how to fine-tune your IncidentHub alerts to receive only the relevant ones for your third-party services.

OpsGenie vs. PagerDuty: Which Incident Management Tool Should You Choose in 2025

If you’re comparing OpsGenie vs. PagerDuty, there’s something important you need to know right away: OpsGenie is shutting down. OpsGenie has been a trusted ally for incident teams for over a decade. In our Ode to OpsGenie, we celebrated its legacy—from simplifying on-call rotations to reducing alert noise effectively. Atlassian announced that OpsGenie sales will stop on June 4, 2025, with a complete shutdown by April 5, 2027.

Incident management vs. problem management: A practical guide for SREs

In Site Reliability Engineering (SRE), distinguishing incident management from problem management is crucial. While both processes aim to maintain system reliability, they fulfill distinct roles: incident management focuses on quickly resolving immediate disruptions, whereas problem management identifies and rectifies root causes to prevent recurrence. Effectively combining these processes helps minimize downtime, enhances system resilience, and fosters a proactive operational approach.

Infrastructure Monitoring: A Comprehensive Guide to Integrating Effective Alerting

Imagine you’re the IT guardian of a busy company. Every day, you rely on infrastructure monitoring tools to keep an eye on your servers, networks, and applications. These tools are your early warning system – they spot glitches before they become full-blown problems. But what happens when an alert is missed or delayed? That’s where effective alerting comes in.

Mastering incident routing: a critical component in incident management

Imagine this: a high-priority alert is triggered, but it’s routed to the wrong team, or delayed by manual triage. By the time the right person is notified, the issue has escalated, and users are starting to notice. Technical failures don’t always cause these kinds of incidents. More often, they stem from something simpler: poor alert routing.

Do You Still Need an ITSM Platform in 2025?

The world of IT has undergone a seismic shift over the past two decades. What was once a landscape dominated by physical servers, on-premise data centers, and monolithic applications has transformed into a dynamic ecosystem of cloud-native architectures, microservices, and distributed systems. Yet, many enterprises still rely on traditional IT Service Management (ITSM) tools that were designed for a bygone era.

Navigating the role of an incident commander

When critical services fail, every second counts. Teams scramble, information floods in, and clarity quickly dissolves into confusion. In these high-pressure moments, a single point of leadership, the incident commander, can mean the difference between a quick recovery and prolonged disruption.

How Should You Compensate Your Employees for Being On Call?

In today’s fast-paced, always-connected world, many businesses require employees to be on call to ensure smooth operations and quick responses to critical issues. However, compensating employees for being on call can be a tricky subject. It’s important to strike a balance between fairness, accountability, and incentives for the right behaviors. Let’s explore four common methods of compensating employees for being on call, along with their advantages and disadvantages.

Best Practices and Demo: Grafana Cloud's End-to-End IRM Solution | Grafana Labs

Grafana Cloud’s Incident Response and Management solution provides workflows that span creating alerts and SLOs, managing on-call and incident response, and learning from postmortems – all within the context of your observability stack. In this session, you’ll learn best practices for making the most of this IRM solution, including leveraging the historical incident data that’s accessible within Grafana Cloud.