Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Weaving AI into the fabric of the company | incident.io

At incident.io, we’ve spent the past year shifting how we work to incorporate the AI into both how we build and what we build. The result? AI has become a fundamental pillar of our company. This is the story of how we built reliable AI for reliability itself — reshaping how teams manage and resolve incidents. From early experiments to a company-wide culture of building with AI, this is how we’re redefining incident response for the future.

Replacing AT&T Email-to-Text with OnPage's Critical Alerting

When AT&T officially shut down its email-to-text and text-to-email service on June 17, 2025, a quiet but essential part of many organizations’ communication workflows disappeared overnight. Messages that used to be sent to addresses like simply stopped delivering. For teams who relied on those alerts to reach the on-call clinician, engineer, technician, or service lead — this created an unexpected and urgent gap. This wasn’t just a convenience feature going away.

How Can I Use Categories in SIGNL4 to Quickly Identify Alert Types?

When teams manage a high volume of alerts, it’s easy for things to start blending together. A system outage, a temperature warning, a network slowdown – without a way to quickly identify what’s what, it takes longer to triage and prioritize. Especially on mobile, scrolling through a list of similar-looking alerts can slow your response and add confusion during incidents.

BigPanda Acquires Velocity: Accelerating the Future of Agentic IT Operations

Today marks an exciting milestone for BigPanda and for the future of IT Operations. We’re thrilled to announce that BigPanda has acquired Velocity, an AI SRE company whose technology and team share our passion for transforming how enterprises keep the digital world running. Velocity brings deep expertise in Site Reliability Engineering (SRE) and major incident response, developed alongside some of the world’s most sophisticated technology organizations.

Why Agentic AI Adoption Is Accelerating in Europe and What Comes Next

Across Europe, the cautious optimism business leaders held towards AI agents has evolved into more widespread enthusiasm. What was once a curiosity is now core to how many European organizations operate, respond, and innovate. According to PagerDuty’s latest agentic AI survey, three-quarters or more of organizations in France, Germany, and the UK are deploying multiple AI agents. This growing confidence reflects a broader trend.

How to Choose an AI SRE Solution

The AI SRE landscape has exploded over the past year, with vendors racing to add artificial intelligence capabilities to their platforms. For engineering leaders evaluating these solutions, the sheer number of options can feel overwhelming. Some vendors are building AI-native solutions from scratch, while others are retrofitting AI onto existing workflows. Cloud providers are embedding agents into their ecosystems, and observability platforms are adding intelligence layers to their telemetry data.

Detecting an AWS Outage and DR Lessons

A few weeks ago, on 20th October 2025, AWS suffered a widespread outage in its US-EAST-1 region that affected a large number of customers globally. More than 1,000 apps and websites were impacted including major banks and popular games, streaming and social platforms such as WhatsApp, Snapchat, Fortnite and Pokémon Go.