The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

New features: AI SRE, Merge alerts, and Status pages for thousands of services

As we head into the holiday season, the ilert team is doing the opposite of slowing down; we’re ramping up. Over the past weeks, we’ve shipped a wave of impactful improvements across alerting, AI-powered automation, mobile app, and status pages. From major upgrades that reshape how teams triage incidents to smaller refinements that remove daily friction, this release is packed with updates designed to make on-call and operations smoother, smarter, and faster. Let’s dive in.

Shopify Outage 2025: Rise of the Commerce Kaiju

It was a normal day in the land of eCommerce. Birds were singing, dashboards were loading, and merchants everywhere felt cautiously optimistic. Then the ground trembled. A tiny glitch. A flicker. A warning log no one read. And suddenly, BOOM! Shopify burst out of the digital ocean like a gigantic scaly beast that woke up on the wrong side of the server rack. Checkouts froze mid-purchase. Product pages stopped producting. Merchants stared blankly at blank screens. The Commerce Kaiju had arrived.

Cloudflare was down again: Here's what happened.

On December 5, 2025, the internet faced another major disruption – the second significant Cloudflare-related outage in just a few weeks. A similar widespread incident occurred on November 18, which we covered in detail in our post “The internet broke again – StatusGator can help.” Today’s outage reinforces how quickly issues within core internet infrastructure can ripple outward and impact thousands of services simultaneously.

Towards a more resilient StatusGator

Between October 20 and December 5, 2025, a rapid succession of major outages across multiple cloud providers disrupted large portions of the internet. Each of these events affected StatusGator in different ways. After each incident, we implemented improvements to strengthen our reliability. This post summarizes the impact of each outage, the changes made, and the architectural work now underway to ensure StatusGator remains available during the moments when it is needed most.

Introducing the BigPanda Triage Agent and the future of agentic L1 operations

If you’ve been following the development of BigPanda AI Detection and Response (ADR), you’re aware of our mission to automate Level 1 (L1) operations and eliminate the need for manual, time-consuming investigations. In our last update, we highlighted the manual, complex, and time-consuming processes that hinder modern IT teams. Enterprises spend billions on observability tools based on the false belief that more coverage equals total visibility.

PagerDuty Becomes Newest AWS Software Partner to Earn Resilience Competency

As enterprise system failures cost businesses an estimated $400 billion annually in lost revenue and productivity, PagerDuty announced it has achieved the Amazon Web Services (AWS) Resilience Services Competency in the software category, becoming one of the first AWS Software Partners to earn the designation. This achievement validates PagerDuty's ability to help enterprises architect, deploy, and maintain mission-critical systems that can withstand failures and recover rapidly with minimal business disruption.

From Noise to Notified: Making Azure Sentinel Alerts Actionable

Modern security operations are overflowing with data, and organizations rely heavily on Azure Sentinel alerts and Microsoft Sentinel alerts to maintain visibility across hybrid environments. From firewalls and endpoints to cloud workloads and identity systems, thousands of signals compete for attention every second. For most security teams, the challenge isn’t detection anymore – it’s action.