The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

When IT Alerts Go Bump in the Night: A Halloween Tale of IT Alerting with SIGNL4

As the witching hour approaches, your data center hums quietly – servers glowing like jack-o’-lanterns in the dark. Everything seems calm… until suddenly, your phone lights up with a chilling alert. CPU usage is spiking. Network latency is haunting your system. The ghost of downtime lurks nearby. Welcome to the spooky world of IT alerting – where nightmares come true if your team isn’t ready.

Detect and map third-party outages with Datadog External Provider Status

Modern applications depend on dozens of external cloud platforms, APIs, and SaaS services to function. But when those providers experience issues, engineers often spend valuable time asking a basic question: Is the problem with us or with them? Provider-maintained status pages are often slow to update, leaving teams waiting for confirmation while incidents escalate. This delay wastes valuable time, prolongs investigations, and risks customer trust.

The Hidden Risk of DNS - Lessons from the AWS Outage & Why You Need DNS Spy Monitoring NOW

On October 20, 2025, much of the internet came to a halt. Apps wouldn’t load. Payments failed. Cloud dashboards went dark. From Fortnite to Alexa, Snapchat, and countless business platforms, users across the world were suddenly offline — all because DNS broke inside Amazon Web Services’ (AWS) US-East-1 region.

Amazon Isn't Eating Its Own DNS Dog Food

On October 19-20, 2025, Amazon Web Services (AWS) experienced a significant outage (AWS status) affecting its US-EAST-1 region in northern Virginia. The root cause was DNS resolution failures for DynamoDB’s API endpoints, which cascaded across AWS’s interconnected services, disrupting major platforms including Snapchat, McDonald’s, Disney+, Roblox, Coinbase, Reddit, and Amazon’s own services.

PagerDuty Joins AWS QuickSuite: Connect Your Incident Management with 1,000+ Applications

Today, we’re announcing that PagerDuty is now available in AWS QuickSuite through the Model Context Protocol (MCP). This means PagerDuty’s incident management capabilities can now connect with the 1,000+ applications and data sources that QuickSuite integrates with, from AWS services to enterprise SaaS platforms, all accessible through natural language.

AWS Outage: How do you prepare for the failure of your own safety net?

When AWS’s massive outage struck, it didn’t just take down cloud services, apps, and enterprise platforms. It also knocked out many of the monitoring systems organizations depend on for real-time answers. Observability companies, including Datadog, New Relic, Checkly, Dynatrace, SpeedCurve, and Splunk Observability, lost visibility or functionality precisely when organizations needed them most.

A Launch Day in the Life with AI Teammates

Alex, an SRE at Greenagonia, starts the day knowing there’s a big launch coming. Pre-orders suggest a 5-10x increase in normal traffic, which means coffee needs to be extra strong this morning. As Alex scans through overnight alerts, he realizes he’s completely forgotten about a dentist appointment that overlaps with his upcoming on-call shift. Six months ago, this would have meant frantic Slack messages or at least one phone call. Today? Alex’s AI teammate has it covered.

7 Ways Your Incident Management Just Got a Boost (New Feature Rundown)

All the things you may have missed that will make your incident management smarter, faster, and simply easier. We ship updates every week because we want you to get the most out of FireHydrant. But we also know it's hard to stay up to date and read every week's changelog (even though we know reading changelogs is the highlight of your week).

Experimenting With Different Scripts

It all began when I spun up an AWS t4g.small burstable instance for a side project. Nothing unusual, just another day in the cloud. But the moment I connected through SSH, something caught my eye. The system greeted me with a temperature reading of -273.5°C. Wait… what? That’s 0 Kelvin, the point where atomic motion completely stops. In other words, absolute zero, a state that’s theoretically impossible for anything to operate in.