The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

How agentic IT operations lay the foundations for SRE success at scale

Dec 15, 2025 By Manish Agarwal In BigPanda

When something breaks in a modern digital service, customers feel it instantly. Pages stall, requests time out, and carts are abandoned, while frustration grows long before a root cause is identified. What the world never sees is the engineering effort required to keep these systems healthy in the first place. Site Reliability Engineers (SREs) carry that responsibility every day.

Scrapers Take Down GitHub: December 11 Outage Timeline

Dec 12, 2025 By Colin Bartlett In StatusGator

On December 11, 2025, GitHub experienced intermittent disruptions that frustrated users across the globe. Developers everywhere started seeing random errors, 503s, unicorns, and CI pipeline failures. Very quickly it became clear something was wrong, even though GitHub’s status page still said ALL SYSTEMS OPERATIONAL. After the incident was over, GitHub published a postmortem that revealed the cause: scrapers. Automated tools hit GitHub with enough traffic to overwhelm key backend systems.

AI Reliability, Part 2: When the Datacenter Becomes the Bottleneck

Dec 12, 2025 By Ritika Bramhe In OnPage

In Part 1, we talked about all the hidden complexity inside AI systems: the pipelines, GPUs, embeddings, vector databases, orchestration layers, and everything else that quietly determines how reliable an AI-first product really is. But all of that software still rests on something far less glamorous: the physical infrastructure underneath it.

The Reality of GenAI in Production with Eduardo Ordax (AWS)

Dec 12, 2025 By Rootly In Rootly

GenAI demos are easy. Production is where everything breaks. In this episode, Eduardo Ordax, Principal GTM GenAI at AWS, breaks down what actually stops companies from shipping reliable AI systems, and why the real blockers have little to do with technology.

Major Cloud Outages of 2025

Dec 12, 2025 By Hrishikesh Barua In IncidentHub

Cloud outages in 2025 ranged from minor ones affecting a subset of users to major ones affecting hundreds or thousands of users. Services like Cloudflare and AWS, on which many other services depend, experienced outages that affected many users through cascading failures. Let's look at some of the major cloud outages in 2025.

Microsoft Teams outage - December 10th, 2025

Dec 11, 2025 By Colin Bartlett In StatusGator

On the morning of December 10, 2025, Microsoft Teams experienced a service disruption affecting users across Australia. Although Microsoft 365 users reported issues across several apps, the hardest-hit service was Microsoft Teams, which became completely unusable for many organizations. While Microsoft did not acknowledge the incident until 03:46 UTC, StatusGator identified the issue at 02:52 UTC through incoming outage reports and delivered an Early Warning Signal at 03:01 UTC.

Getting Started With Spike

Dec 10, 2025 By Sreekar In Spike

Welcome to Spike! Whether you’ve just set up your account or joined an existing team, this guide will help you understand how to receive and respond to incidents.

What Is IT Incident Response?

Dec 10, 2025 By SIGNL4 In SIGNL4

“We’ve got a new alert – have you seen it yet?” “Which one? The CPU spike or the unusual login?” “The login. Same region as yesterday. But the CPU thing looks suspicious too.” “…Alright, I’ll check the firewall logs. You take the containers.” “Perfect. Let’s hope this doesn’t turn into another all-hands situation.” Does this conversation sound familiar?

Every Business Needs a Robust Incident Response Strategy

Dec 10, 2025 By OpsMatters In OpsMatters

In today's digital landscape, businesses face an increasing number of cyber threats that can compromise sensitive data, disrupt operations, and tarnish their reputation. As companies adopt more complex technological solutions, they must be prepared for the inevitable risk of security incidents. Having a well-established, effective incident response strategy is no longer optional but essential. This article explores why incident response solutions are critical for every business and how they play a pivotal role in safeguarding an organization's assets, reputation, and continuity.

When major IT incidents occur, AI can deliver speed and transparency

Dec 8, 2025 By Katie Petrillo In BigPanda

The recent Cloudflare outage served as a stark reminder of how fragile the global digital ecosystem can be due to a single point of failure. In a matter of minutes, thousands of websites that rely on Cloudflare’s CDN, from Fortune 500 brands to SaaS platforms and consumer apps, went offline for hours. The business impacts were severe, with Shopify alone suffering over $4 million in losses while downstream merchant impacts potentially exceeded $170 million.
