
The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Reliability lessons from the 2025 Cloudflare outage

On November 18, 2025, X, ChatGPT, Shopify, and many other major sites went offline simultaneously. Even Downdetector, Ookla’s popular outage tracking website, briefly went offline. What caused this issue? Why were so many major websites affected by it? And what steps can you take to reduce the impact on your own applications?

The 7 Most Common Incident Mistakes (and How to Prevent Them)

The hidden blockers slowing down your incident response and how to remove them before they become reliability risks. Incidents rarely go wrong because of one big failure. Most of the time, it’s a handful of small, familiar mistakes that slow teams down, muddy communication, or create confusion in the heat of the moment. Fortunately, these mistakes are predictable and fixable.

OnlineOrNot's lessons from Cloudflare's outage on 2025-11-18

On 2025-11-18 at 11:48 UTC, Cloudflare declared an incident affecting the global network (that also affected OnlineOrNot). OnlineOrNot monitors websites, APIs, web apps, and cron jobs, while providing status pages as well. While we partially mitigated the issue by enabling a fallback to AWS-based monitoring, between 13:00 UTC and 14:33 UTC failing checks went unreported, heartbeat checks over-reported, and status pages were unavailable.

Navigating External Outages: How Selector Cuts Through the Cloudflare Noise

Yesterday’s widespread Cloudflare outage reminds us how crucial external dependencies are to the stability of our own applications. When a key edge provider like Cloudflare goes down, the effect on your internal monitoring systems can look like a catastrophic internal failure, triggering a massive storm of alerts and sending engineering teams into frantic, misdirected debugging sessions.

How Datadog Feature Flags is resilient to cloud provider failures

As major incidents like AWS’s October 2025 outage illustrate, modern systems are immensely interconnected. A failure in one can lead to a cascade of downstream problems. In this case, issues with DNS resolution for DynamoDB led to widespread disruptions with other AWS services and, subsequently, thousands of applications and services that rely on that infrastructure.

Making Your Business Resilient Against Cloudflare-Like Outages

Cloudflare-like outages can cost your business a significant amount of money. This week’s Cloudflare global outage is a wake-up call for business resilience. You can stay resilient against such outages by regularly performing resilience testing and updating your application or infrastructure configurations.

It's Never Different This Time: LLM Reliability Without the Hype with Julien Simon

In this episode, Julien Simon, longtime voice in the open-source ML world, reminds us that even in the era of GenAI, reliability fundamentals haven’t changed. Julien breaks down why calling “the same model” from different providers can produce wildly different results, how deployment choices introduce hidden variability, and why reliability teams need to think of LLM systems as distributed systems.

Five ITOps best practices to stay ahead during major third-party outages

When external providers fail, whether it was the CrowdStrike outage last year, the AWS outage last month, or the Cloudflare DNS outage yesterday, the symptoms inside your environment often look like internal issues: timeouts, login failures, API errors, service degradation, or sudden spikes in dependency-related alerts. It’s natural for teams to start searching through their own infrastructure first, but none of these symptoms clearly points to your systems as the root cause.

AWS And Azure Outages Will Recur - Here's How You Ensure Resilience

The cloud has long promised limitless scalability and near-perfect uptime. But if you tried to access your Microsoft 365 dashboard or recline your smart bed last week, and got nothing but a spinning icon, you weren’t alone. In the span of 10 days, both Amazon Web Services (AWS) and Microsoft’s Azure Cloud suffered widespread outages that rippled across industries.

Cloudflare outage: another wake-up call for resilience planning

Another day, another massive Internet disruption, and this time it’s Cloudflare taking huge parts of the Internet offline. This incident is not an anomaly. It is part of a recurring pattern that has become standard in digital infrastructure. We have reached an inflection point in digital operations. Outages at major cloud and content delivery network (CDN) providers are now expected. The only real uncertainty is when the next one will happen.