The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

YouTube Outage (Feb 17, 2026). What Happened?

On February 17, 2026, YouTube went down for users worldwide. Starting around 8:00 PM ET, the platform's homepage, Shorts feed, sign-in system, smart TV apps, YouTube Music, and YouTube Kids all stopped working. Over 21,000 reports were logged on IsDown alone. The error message was the same everywhere: "Something went wrong." For consumer users, it was an inconvenience. For businesses that depend on YouTube — content teams, advertisers, media companies, live streamers — it was a blind spot.

The post-mortem problem

Post-mortems are required, time-consuming, and widely disliked, but they’re also one of the biggest opportunities to improve reliability. In this webinar, we talked about how to run post-mortems that actually lead to learning and improvement. We covered why most post-mortems fall flat and how to structure them effectively, then walked through a real example to show what good looks like in practice. The goal: fewer wasted hours, better outcomes, and post-mortems that actually matter.

AI Is Changing Healthcare Faster Than Most Systems Are Ready For

Healthcare is shifting fast, and artificial intelligence is no longer a future concept sitting in research labs or pilot programs. It’s already embedded in clinical workflows, operational systems, and patient interactions, often in ways that feel subtle, uneven, and sometimes uncomfortable.

How to Set Up SMS Alerting w/ OnPage

In this quick tutorial, learn how to set up SMS alerting in OnPage to ensure your team never misses a critical notification. We’ll walk you through the process step by step. This setup ensures reliable message delivery using redundancy rules, so important alerts reach the right person at the right time. Let us know if you have any other questions!

Why SIGNL4 Is the Right Alarm Management Software to Maximize Machine Availability

A plant runs at its best when equipment stays online, processes remain stable, tolerances are met, raw materials are delivered in time, and scrap stays low. That’s how operations teams hit production targets, meet customer SLAs, stay on schedule, keep costs under control, and maintain consistent quality. But does everything always run according to plan? Of course not.

Code Is Cheap, Reliability Isn't: Owning Production in the AI era w/ Swizec Teller

In this episode, Swizec Teller, author of the bestselling Scaling Fast, makes a bold claim: code is cheap, reliability is not. As AI coding tools accelerate feature development, the real competitive advantage shifts to operating systems reliably in production. We explore the hidden complexity of SRE work, the addictive nature of agentic coding, and why ownership — not automation — remains at the core of modern software engineering.

Amazon Web Services outage - February 10, 2026

On February 10, 2026, Amazon Web Services (AWS) experienced an outage that triggered widespread reports of CloudFront failures and DNS resolution issues. While AWS later acknowledged the incident, StatusGator detected the disruption earlier using Early Warning Signals, giving customers valuable lead time before the provider confirmed anything publicly.

4 on-call burnout signs (and how to address them)

Being on-call can sometimes feel overwhelming. If that feeling goes unnoticed for too long, it often turns into burnout. Early signs of burnout usually show up in how people respond to incidents or how they feel about the schedule. This guide walks through four such signs to watch for before on-call burnout sets in.

Claude outage - February 10, 2026

On February 10, 2026, Claude users around the world began reporting service failures affecting chat sessions, API integrations, and Claude Code workflows. The first verified outage report reached StatusGator at 19:33 UTC. StatusGator issued an Early Warning Signal at 20:24 UTC. Claude did not post an official “Investigating” update until 22:11 UTC. This incident clearly demonstrates the gap between real user impact and official status page updates.
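
To make the gap concrete, the three timestamps quoted above can be turned into elapsed minutes. This is only a minimal sketch of the arithmetic: the variable names and the `minutes_between` helper are hypothetical, introduced here for illustration, and are not part of any StatusGator tooling.

```python
from datetime import datetime, timezone

# Timestamps quoted in the summary above (all UTC, February 10, 2026).
first_report = datetime(2026, 2, 10, 19, 33, tzinfo=timezone.utc)
early_warning = datetime(2026, 2, 10, 20, 24, tzinfo=timezone.utc)
official_update = datetime(2026, 2, 10, 22, 11, tzinfo=timezone.utc)


def minutes_between(start: datetime, end: datetime) -> int:
    """Return the whole minutes elapsed from start to end (hypothetical helper)."""
    return int((end - start).total_seconds() // 60)


# Delay from the first verified report to the Early Warning Signal.
report_to_warning = minutes_between(first_report, early_warning)
# Lead time the Early Warning Signal gave over the official update.
warning_to_official = minutes_between(early_warning, official_update)
# Total delay from the first verified report to the official update.
report_to_official = minutes_between(first_report, official_update)

print(f"First report to Early Warning Signal: {report_to_warning} min")
print(f"Early Warning Signal to official update: {warning_to_official} min")
print(f"First report to official update: {report_to_official} min")
```

On these figures, the Early Warning Signal arrived 51 minutes after the first verified report and 107 minutes before the official "Investigating" update, so 158 minutes passed between the first report and the official acknowledgment.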