The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Inside the Cloudflare Outage: Real-World Data from UptimeRobot

Nov 20, 2025 By Tomas Koprusak In Uptime Robot

On November 18th, 2025, a large Cloudflare outage briefly broke big chunks of the internet. For several hours, users around the world were greeted with 500 errors on platforms like X, ChatGPT, Spotify, and many others that run behind Cloudflare’s network. At UptimeRobot, we sit in a slightly unusual spot during events like this. When Cloudflare has a bad day, we see it twice: once in the alerts we send to our customers, and again in how it affects parts of our own infrastructure.

Five key takeaways from EDUCAUSE 2025: Adopting AI while navigating change

Nov 20, 2025 By PagerDuty In PagerDuty

Having just returned from the 2025 EDUCAUSE Annual Conference in Nashville, I want to share some insights on the future of campus IT from the higher education technology leaders in attendance. Every year, this conference provides an opportunity for technology providers and higher ed professionals to connect and explore the latest innovations in higher education technology. Two themes emerged as critical priorities.

Reliability lessons from the 2025 Cloudflare outage

Nov 20, 2025 By Andre Newman In Gremlin

On November 18, 2025, X, ChatGPT, Shopify, and many other major sites went offline simultaneously. Even Downdetector, Ookla’s popular outage tracking website, briefly went offline. What caused this issue? Why were so many major websites affected by it? And what steps can you take to reduce the impact on your own applications?

The 7 Most Common Incident Mistakes (and How to Prevent Them)

Nov 20, 2025 By Jessica Abelson In FireHydrant

The hidden blockers slowing down your incident response and how to remove them before they become reliability risks. Incidents rarely go wrong because of one big failure. Most of the time, it’s a handful of small, familiar mistakes that slow teams down, muddy communication, or create confusion in the heat of the moment. Fortunately, these mistakes are predictable and fixable.

How Datadog Feature Flags is resilient to cloud provider failures

Nov 19, 2025 By Anthony Rindone In Datadog

As major incidents like AWS’s October 2025 outage illustrate, modern systems are immensely interconnected. A failure in one can lead to a cascade of downstream problems. In this case, issues with DNS resolution for DynamoDB led to widespread disruptions with other AWS services and, subsequently, thousands of applications and services that rely on that infrastructure.

Making Your Business Resilient Against Cloudflare Like Outages

Nov 19, 2025 By Uma Mukkara In Harness

Cloudflare-like outages can cost your business a significant amount of money. This week’s Cloudflare global outage is a wake-up call for business resilience. You can stay resilient against such outages by regularly performing resilience testing and updating your application or infrastructure configurations.

It's Never Different This Time: LLM Reliability Without the Hype with Julien Simon

Nov 19, 2025 By Rootly In Rootly

In this episode, Julien Simon, longtime voice in the open-source ML world, reminds us that even in the era of GenAI, reliability fundamentals haven’t changed. Julien breaks down why calling “the same model” from different providers can produce wildly different results, how deployment choices introduce hidden variability, and why reliability teams need to think of LLM systems as distributed systems.

Five ITOps best practices to stay ahead during major third-party outages

Nov 19, 2025 By Adam Blau In BigPanda

When external providers fail, whether it was the CrowdStrike outage last year, the AWS outage last month, or the Cloudflare DNS outage yesterday, the symptoms inside your environment often look like internal issues: timeouts, login failures, API errors, service degradation, or sudden spikes in dependency-related alerts. It’s natural for teams to start searching through their own infrastructure first, but none of these symptoms clearly point to your systems as the root cause.

OnlineOrNot's lessons from Cloudflare's outage on 2025-11-18

Nov 19, 2025 By Max Rozen In OnlineOrNot

On 2025-11-18 at 11:48 UTC, Cloudflare declared an incident affecting the global network (that also affected OnlineOrNot). OnlineOrNot monitors websites, APIs, web apps, and cron jobs, while providing status pages as well. While we partially mitigated the issue by enabling a fallback to AWS-based monitoring, between 13:00 UTC and 14:33 UTC failing checks went unreported, heartbeat checks over-reported, and status pages were unavailable.

Navigating External Outages: How Selector Cuts Through the Cloudflare Noise

Nov 19, 2025 By Stephen Ochs In Selector

Yesterday’s widespread Cloudflare outage reminds us how crucial external dependencies are to the stability of our own applications. When a key edge provider like Cloudflare goes down, the impact on your internal monitoring systems can look like a catastrophic internal system failure, triggering a massive storm of alerts and sending engineering teams into frantic, misdirected debugging sessions.
