Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

AWS And Azure Outages Will Recur - Here's How You Ensure Resilience

The cloud has long promised limitless scalability and near-perfect uptime. But if you tried to access your Microsoft 365 dashboard or recline your smart bed last week, and got nothing but a spinning icon, you weren’t alone. In the span of 10 days, both Amazon Web Services (AWS) and Microsoft’s Azure Cloud suffered widespread outages that rippled across industries.

Cloudflare outage: another wake-up call for resilience planning

Another day, another massive Internet disruption, and this time it’s Cloudflare taking huge parts of the Internet offline. This incident is not an anomaly. It is part of a recurring pattern that has become standard in digital infrastructure. We have reached an inflection point in digital operations. Outages at major cloud and content delivery network (CDN) providers are now expected. The only real uncertainty is when it will happen next.

Reliability lessons from the 2025 Microsoft Azure Front Door outage

On October 29th, 2025, Azure Front Door suffered an outage that impacted Microsoft services on a global level, including Microsoft 365, Outlook, Xbox Live, Copilot, and more. It also affected Microsoft Azure, meaning companies like Costco, Starbucks, and Alaska Airlines ran into issues for both customer-facing and internal systems. The root of the issue was a misconfiguration in the data plane for Azure Front Door and the Azure Content Delivery Network.

Manual Call Forwarding vs. Schedule-Based Call Routing: What's the Better Way to Handle On-Call Support?

When your team shares one support number, someone has to decide who gets the calls when customers need help after hours. And if your team rotates on-call responsibilities weekly, which is common in IT (SRE, DevOps, ITOps, etc), clinical and field engineering teams, you’ve probably relied on manual call forwarding at some point. On paper, it seems straightforward: update the forwarding number each week to point to the person who’s on call. In practice? It often turns into a scramble.

Google Workspace outage on November 12: How StatusGator detected it first

On November 12, 2025, users around the world faced difficulty accessing Google Workspace products including Google Drive, Google Docs, Google Sheets, and Google Slides. While the outage did not impact every user, it was widespread and disruptive. StatusGator detected the incident early using real user data and issued an Early Warning Signal long before Google officially acknowledged the issue.

Jira Service Management (JSM) Review for Incident Management (2025)

Atlassian is shutting down OpsGenie. New sales already stopped on June 4, 2025, and the platform will be completely offline by April 5, 2027. As an OpsGenie user, you now face a critical decision: Migrate to Jira Service Management (JSM), Atlassian’s recommended path, or choose a different solution. And if you’re not sure JSM is the right fit for your team’s incident management needs, this review will help you decide. I signed up for JSM and put it through real-world testing.

Bloom filters: the niche trick behind a 16× faster API

This post is a deep dive into how we improved the P95 latency of an API endpoint from 5s to 0.3s using a niche little computer science trick called a bloom filter. We’ll cover why the endpoint was slow, the options we considered to make it fast and how we decided between them, and how it all works under the hood.
Sponsored Post

Cascading Failures Aren't Inevitable: Lessons from the AWS DNS Outage

AWS outages grab headlines because they affect millions, but the root cause often comes down to something invisible: DNS failures and cascading service dependencies. The complexity of modern cloud systems, combined with the advanced technology powering platforms like AWS, makes these outages particularly challenging to diagnose and resolve. The recent AWS outage proves one thing: you can't prevent every DNS issue, but you can create resilient architectures and prevent a single failure from taking down your entire service if you test for it.