The latest news and information on incident management, on-call, incident response, and related technologies.

Incident Alerting: What We Believe It Should Do

Incident alerting is a critical part of modern operations, yet it’s often misunderstood or reduced to “sending notifications.” In reality, it is about ensuring that the right people are informed at the right time – and that incidents move from detection to action without confusion or delay. This page explains why fast, reliable alerting matters, where it fits between monitoring and incident response, and what best practices look like.

5 Offbeat on-call rotations that work

Most teams choose standard on-call patterns like weekly or daily rotations. But sometimes a less conventional rotation can solve a specific problem or just fit better with how your team works. This guide walks you through five offbeat on-call rotations. For each, we look at why it might work for you and the challenges involved. This helps you see the full picture before you decide to try them out. Let’s dive in!

Follow-the-sun and other on-call models

Most teams run on-call using rotation-based schedules where responsibility shifts every few days or weeks. But some situations call for different models that change who responds based on time zones, expertise, or the type of incident that triggers. This guide walks you through six on-call models that work outside the standard rotation patterns.
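As a rough illustration of a time-zone-based model such as follow-the-sun, the sketch below picks the responding team from the current UTC hour. The region names and shift boundaries are hypothetical assumptions chosen for the example, not taken from the guide, and a real schedule would also need handoffs, overrides, and escalation.

```python
from datetime import datetime, timezone

# Hypothetical regional teams and the UTC hours they cover
# (start inclusive, end exclusive). These values are assumptions.
SHIFTS = [
    ("APAC", 0, 8),
    ("EMEA", 8, 16),
    ("AMER", 16, 24),
]


def on_call_region(now: datetime) -> str:
    """Return the region on call at a given timezone-aware moment."""
    # Normalize to UTC so the lookup does not depend on the caller's zone.
    hour = now.astimezone(timezone.utc).hour
    for region, start, end in SHIFTS:
        if start <= hour < end:
            return region
    raise ValueError("shift table does not cover every hour")


if __name__ == "__main__":
    moment = datetime(2026, 1, 5, 9, 30, tzinfo=timezone.utc)
    print(on_call_region(moment))  # prints "EMEA"
```

Passing a timezone-aware `datetime` matters here: a naive value would be interpreted as local time by `astimezone`, which could route the alert to the wrong region.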

Turning Data Into Decisions with the xMatters Incident AI Agent

When an incident hits, the gap between awareness and action can make all the difference. Responders know the pain: endless tool-switching, chasing updates, and fragmented data. It’s not a lack of capability that slows response; it’s the lack of context and connection. That’s why we built the xMatters Incident AI Agent, a purpose-built, conversational assistant that brings intelligence and automation directly into the heart of incident response.

AWS CloudFront Outage (Feb 2026): Timeline, Cascade, and Lessons

At approximately 9:15 PM UTC on February 10, 2026, Amazon CloudFront began returning NXDOMAIN responses for DNS queries against specific distributions. In practical terms: DNS was telling users that services behind those distributions simply didn't exist. The root cause was a DNS resolution failure within CloudFront's infrastructure that quickly spread to eight interconnected AWS services.

ilert now supports a native WhaTap integration

ilert now supports a native WhaTap integration, connecting AI-native observability with AI-first incident management in a seamless workflow. This integration allows DevOps, SRE, and IT teams to move instantly from detection to resolution – cutting through alert noise, improving coordination, and dramatically reducing MTTR in even the most complex IT environments.

How to Create and Manage Incidents in Uptime.com

Learn how to create and manage incidents on your Uptime.com Status Page to keep your subscribers informed about service disruptions and maintenance events in real-time. In this tutorial, we'll cover understanding incident statuses (Investigating, Identified, Monitoring, Resolved, and more), three ways to create a new incident, configuring incident details and timelines, adding updates with Markdown formatting, managing and editing incidents, notifying Status Page subscribers, and using the REST API for incident management.

Reference architecture: The blueprint for safe and scalable autonomy in SRE and DevOps

Everyone wants autonomous incident response. Most teams are building it wrong. The ultimate goal of autonomy in SRE and DevOps is the capacity of a system to not only detect incidents but to resolve them independently through intelligent self-regulation. However, true autonomy isn't born from automating random, isolated tasks. It requires a stable foundation: a Reference Architecture.