Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Inside the Cloudflare Outage: Real-World Data from UptimeRobot

On November 18th, 2025, a large Cloudflare outage briefly broke big chunks of the internet. For several hours, users around the world were greeted with 500 errors, including platforms like X, ChatGPT, Spotify, and many others that run behind Cloudflare’s network. At UptimeRobot, we sit in a slightly unusual spot during events like this: So when Cloudflare has a bad day, we see it twice: once in the alerts we send to our customers, and again in how it affects parts of our own infrastructure.

Five key takeaways from EDUCAUSE 2025: Adopting AI while navigating change

Having just returned from the 2025 EDUCAUSE Annual Conference in Nashville, I want to share some insights on the future of campus IT from the higher education technology leaders in attendance. Every year, this conference provides an opportunity for technology providers and higher ed professionals to connect and explore the latest innovations in higher education technology. Two themes emerged as critical priorities.

Reliability lessons from the 2025 Cloudflare outage

On November 18, 2025, X, ChatGPT, Shopify, and many other major sites went offline simultaneously. Even Downdetector, Ookla’s popular outage tracking website, briefly went offline. What caused this issue? Why were so many major websites affected by it? And what steps can you take to reduce the impact on your own applications? ‍

The 7 Most Common Incident Mistakes (and How to Prevent Them)

The hidden blockers slowing down your incident response and how to remove them before they become reliability risks. Incidents rarely go wrong because of one big failure. Most of the time, it’s a handful of small, familiar mistakes that slow teams down, muddy communication, or create confusion in the heat of the moment. Fortunately, these mistakes are predictable and fixable.

How Datadog Feature Flags is resilient to cloud provider failures

As major incidents like AWS’s October 2025 outage illustrate, modern systems are immensely interconnected. A failure in one can lead to a cascade of downstream problems. In this case, issues with DNS resolution for DynamoDB led to widespread disruptions with other AWS services and, subsequently, thousands of applications and services that rely on that infrastructure.

Making Your Business Resilient Against Cloudflare Like Outages

Cloudflare-like outages can cost your business a significant amount of money. This week’s Cloudflare global outage is a wake-up call for business resilience. You can stay resilient against such outages by regularly performing resilience testing and updating your application or infrastructure configurations.

It's Never Different This Time: LLM Reliability Without the Hype with Julien Simon

In this episode, Julien Simon, longtime voice in the open-source ML world, reminds us that even in the era of GenAI, reliability fundamentals haven’t changed. Julien breaks down why calling “the same model” from different providers can produce wildly different results, how deployment choices introduce hidden variability, and why reliability teams need to think of LLM systems as distributed systems.

Five ITOps best practices to stay ahead during major third-party outages

When external providers fail—whether it was CrowdStrike outage last year, AWS outage last month, or the Cloudflare DNS outage yesterday—the symptoms inside your environment often look like internal issues: timeouts, login failures, API errors, service degradation, or sudden spikes in dependency-related alerts. It’s natural for teams to start searching through their own infrastructure first, but none of these symptoms clearly point to your systems as the root cause.

OnlineOrNot's lessons from Cloudflare's outage on 2025-11-18

On 2025-11-18 at 11:48 UTC, Cloudflare declared an incident affecting the global network (that also affected OnlineOrNot). OnlineOrNot monitors websites, APIs, web apps, and cron jobs, while providing status pages as well. While we partially mitigated the issue by enabling a fallback to AWS-based monitoring, between 13:00 UTC and 14:33 UTC failing checks went unreported, heartbeat checks over-reported, and status pages were unavailable.

Navigating External Outages: How Selector Cuts Through the Cloudflare Noise

Yesterday’s widespread Cloudflare outage reminds us how crucial external dependencies are to the stability of our own applications. When a key edge provider like Cloudflare goes down, the impact on your internal monitoring systems can look like a catastrophic, internal system failure triggering a massive storm of alerts and sending engineering teams into frantic, misdirected debugging sessions.