The latest news and information on incident management, on-call, incident response, and related technologies.

AWS And Azure Outages Will Recur - Here's How You Ensure Resilience

Nov 18, 2025 By Keith MacKenzie In CloudZero

The cloud has long promised limitless scalability and near-perfect uptime. But if you tried to access your Microsoft 365 dashboard or recline your smart bed last week, and got nothing but a spinning icon, you weren’t alone. In the span of 10 days, both Amazon Web Services (AWS) and Microsoft’s Azure Cloud suffered widespread outages that rippled across industries.

Cloudflare outage: another wake-up call for resilience planning

Nov 18, 2025 By Mehdi Daoudi In Catchpoint

Another day, another massive Internet disruption, and this time it’s Cloudflare taking huge parts of the Internet offline. This incident is not an anomaly. It is part of a recurring pattern that has become standard in digital infrastructure. We have reached an inflection point in digital operations. Outages at major cloud and content delivery network (CDN) providers are now expected. The only real uncertainty is when the next one will happen.

GPT-5.1 is here: does it spend less tokens? #ai #sre

Nov 18, 2025 By Rootly In Rootly

View Video

Reliability lessons from the 2025 Microsoft Azure Front Door outage

Nov 17, 2025 By Gavin Cahill In Gremlin

On October 29th, 2025, Azure Front Door suffered an outage that impacted Microsoft services on a global level, including Microsoft 365, Outlook, Xbox Live, Copilot, and more. It also affected Microsoft Azure, meaning companies like Costco, Starbucks, and Alaska Airlines ran into issues for both customer-facing and internal systems. The root of the issue was a misconfiguration in the data plane for Azure Front Door and the Azure Content Delivery Network.

Manual Call Forwarding vs. Schedule-Based Call Routing: What's the Better Way to Handle On-Call Support?

Nov 17, 2025 By Ritika Bramhe In OnPage

When your team shares one support number, someone has to decide who gets the calls when customers need help after hours. And if your team rotates on-call responsibilities weekly, which is common in IT (SRE, DevOps, ITOps, etc.), clinical, and field engineering teams, you’ve probably relied on manual call forwarding at some point. On paper, it seems straightforward: update the forwarding number each week to point to the person who’s on call. In practice? It often turns into a scramble.
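As a rough illustration of the schedule-based alternative named in the title, the weekly update can be replaced by a lookup computed from the date. This is a minimal sketch, not OnPage's implementation: the function name `on_call_for`, the rotation start date, and the phone numbers are all hypothetical.

```python
from datetime import date


def on_call_for(day: date, rotation: list[str], start: date) -> str:
    """Return the forwarding number for `day` in a weekly rotation.

    `rotation` is an ordered list of numbers (hypothetical values below);
    `start` is the first day of the first person's week.
    """
    weeks_elapsed = (day - start).days // 7
    return rotation[weeks_elapsed % len(rotation)]


# Hypothetical two-person rotation starting Monday, Nov 17, 2025.
rotation = ["+15550100", "+15550101"]
start = date(2025, 11, 17)
current_number = on_call_for(date(2025, 11, 26), rotation, start)
```

Because the number is derived from the schedule rather than set by hand, nobody has to remember to change the forwarding target at each handoff.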

Google Workspace outage on November 12: How StatusGator detected it first

Nov 14, 2025 By Colin Bartlett In StatusGator

On November 12, 2025, users around the world faced difficulty accessing Google Workspace products including Google Drive, Google Docs, Google Sheets, and Google Slides. While the outage did not impact every user, it was widespread and disruptive. StatusGator detected the incident early using real user data and issued an Early Warning Signal long before Google officially acknowledged the issue.

Jira Service Management (JSM) Review for Incident Management (2025)

Nov 14, 2025 By Sreekar In Spike

Atlassian is shutting down OpsGenie. New sales already stopped on June 4, 2025, and the platform will be completely offline by April 5, 2027. As an OpsGenie user, you now face a critical decision: Migrate to Jira Service Management (JSM), Atlassian’s recommended path, or choose a different solution. And if you’re not sure JSM is the right fit for your team’s incident management needs, this review will help you decide. I signed up for JSM and put it through real-world testing.

Bloom filters: the niche trick behind a 16× faster API

Nov 14, 2025 By Engineering In Incident.io

This post is a deep dive into how we improved the P95 latency of an API endpoint from 5s to 0.3s using a niche little computer science trick called a bloom filter. We’ll cover why the endpoint was slow, the options we considered to make it fast and how we decided between them, and how it all works under the hood.
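For readers unfamiliar with the data structure, a Bloom filter can be sketched in a few lines. This is a generic illustration, not the code from the Incident.io post: the class name, the bit-array size, the number of hashes, and the SHA-256 double-hashing scheme are all assumptions chosen for brevity.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd so it is never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


bf = BloomFilter()
bf.add("alert-123")  # hypothetical key
```

The useful property is the asymmetry: a negative answer is always correct, so a cheap in-memory check can rule out most candidates before any expensive lookup, at the cost of occasional false positives.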

Developer Guide to Customer Love Sprints

Nov 14, 2025 By PagerDuty Inc. In PagerDuty

Join this livestream for a behind-the-scenes look with the engineers who made it happen: more than 150 customer requests were turned into enhancements to our core incident management and more across PagerDuty.

View Video

Cascading Failures Aren't Inevitable: Lessons from the AWS DNS Outage

Nov 12, 2025 By Alan Mon In Speedscale

AWS outages grab headlines because they affect millions, but the root cause often comes down to something invisible: DNS failures and cascading service dependencies. The complexity of modern cloud systems, combined with the advanced technology powering platforms like AWS, makes these outages particularly challenging to diagnose and resolve. The recent AWS outage proves one thing: you can't prevent every DNS issue, but you can create resilient architectures and prevent a single failure from taking down your entire service if you test for it.
