The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Replacing AT&T Email-to-Text with OnPage's Critical Alerting

When AT&T officially shut down its email-to-text and text-to-email service on June 17, 2025, a quiet but essential part of many organizations’ communication workflows disappeared overnight. Messages that used to be sent to those gateway addresses simply stopped delivering. For teams that relied on those alerts to reach the on-call clinician, engineer, technician, or service lead, this created an unexpected and urgent gap. This wasn’t just a convenience feature going away.

How Can I Use Categories in SIGNL4 to Quickly Identify Alert Types?

When teams manage a high volume of alerts, it’s easy for things to start blending together. A system outage, a temperature warning, a network slowdown – without a way to quickly identify what’s what, it takes longer to triage and prioritize. Especially on mobile, scrolling through a list of similar-looking alerts can slow your response and add confusion during incidents.

BigPanda Acquires Velocity: Accelerating the Future of Agentic IT Operations

Today marks an exciting milestone for BigPanda and for the future of IT Operations. We’re thrilled to announce that BigPanda has acquired Velocity, an AI SRE company whose technology and team share our passion for transforming how enterprises keep the digital world running. Velocity brings deep expertise in Site Reliability Engineering (SRE) and major incident response, developed alongside some of the world’s most sophisticated technology organizations.

Why Agentic AI Adoption Is Accelerating in Europe and What Comes Next

Across Europe, the cautious optimism business leaders held towards AI agents has evolved into more widespread enthusiasm. What was once a curiosity is now core to how many European organizations operate, respond, and innovate. According to PagerDuty’s latest agentic AI survey, three-quarters or more of organizations in France, Germany, and the UK are deploying multiple AI agents. This growing confidence reflects a broader trend.

How to Choose an AI SRE Solution

The AI SRE landscape has exploded over the past year, with vendors racing to add artificial intelligence capabilities to their platforms. For engineering leaders evaluating these solutions, the sheer number of options can feel overwhelming. Some vendors are building AI-native solutions from scratch, while others are retrofitting AI onto existing workflows. Cloud providers are embedding agents into their ecosystems, and observability platforms are adding intelligence layers to their telemetry data.

Detecting an AWS Outage and DR Lessons

A few weeks ago, on 20th October 2025, AWS suffered a widespread outage in its US-EAST-1 region that affected a large number of customers globally. More than 1,000 apps and websites were impacted, including major banks and popular gaming, streaming, and social platforms such as WhatsApp, Snapchat, Fortnite and Pokémon Go.

Jira Service Management (JSM) Review for On-Call Management (2025)

OpsGenie is shutting down. And Atlassian recommends migrating to Jira Service Management (JSM). But if you’re not sure JSM is the right fit for your team’s on-call management needs, this review will help you decide. I signed up for JSM and put it through real-world testing. I created on-call schedules, rotations, and overrides. Then, I reviewed JSM’s on-call management across 4 key criteria. For each criterion, I shared what I liked and what I didn’t.

How Rootly works with Slack | An end-to-end demo.

Rootly is the AI-native on-call and incident management platform that helps you resolve incidents faster, improve system resilience, and streamline on-call operations. It’s your always-on SRE copilot that automates root cause analysis and identifies patterns that drive continuous improvement—trusted by thousands of companies like LinkedIn, NVIDIA, Replit, Elastic, Canva, Clay, Tripadvisor, and Grammarly.