Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Shopify Outage 2025: Rise of the Commerce Kaiju

It was a normal day in the land of eCommerce. Birds were singing, dashboards were loading, and merchants everywhere felt cautiously optimistic. Then the ground trembled. A tiny glitch. A flicker. A warning log no one read. And suddenly— BOOM! Shopify burst out of the digital ocean like a gigantic scaly beast that woke up on the wrong side of the server rack. Checkouts froze mid-purchase. Product pages stopped producting. Merchants stared blankly at blank screens. The Commerce Kaiju had arrived.

Cloudflare was down again: Here's what happened.

On December 5, 2025, the internet faced another major disruption – the second significant Cloudflare-related outage in just a few weeks. A similar widespread incident occurred on November 18, which we covered in detail in our post The internet broke again – StatusGator can help. Today’s outage reinforces how quickly issues within core internet infrastructure can ripple outward and impact thousands of services simultaneously.

Towards a more resilient StatusGator

Between October 20 and December 5, 2025, a rapid succession of major outages across multiple cloud providers disrupted large portions of the internet. Each of these events affected StatusGator in different ways. After each incident, we implemented improvements to strengthen our reliability. This post summarizes the impact of each outage, the changes made, and the architectural work now underway to ensure StatusGator remains available during the moments when it is needed most.

Introducing the BigPanda Triage Agent and the future of agentic L1 operations

If you’ve been following the development of BigPanda AI Detection and Response (ADR), you’re aware of our mission to automate Level 1 (L1) operations and eliminate the need for manual, time-consuming investigations. In our last update, we highlighted the manual, complex, and time-consuming processes that hinder modern IT teams. Enterprises spend billions on observability tools based on the false belief that more coverage equals total visibility.

PagerDuty Becomes Newest AWS Software Partner to Earn Resilience Competency

As enterprise system failures cost businesses an estimated $400 billion annually in lost revenue and productivity, PagerDuty announced it has achieved the Amazon Web Services (AWS) Resilience Services Competency in the software category - becoming one of the first AWS Software Partners to earn the designation. This achievement validates PagerDuty's ability to help enterprises architect, deploy and maintain mission-critical systems that can withstand failures and recover rapidly with minimal business disruption.

Turning Incidents Into Insight: The Continuous AI Operations Loop Explained

Modern systems generate enormous volumes of operational data. Yet, most incident workflows still treat every outage like a one‑off fire drill: an alert fires, responders scramble, the issue is resolved, the status page goes green—and the organization learns almost nothing from the experience. Meanwhile, the same patterns quietly repeat in code releases, logs, traces, and support tickets until they erupt into the next ‘unexpected’ incident.