The latest news and information on Incident Management, On-Call, Incident Response, and related technologies.

Turning Incidents Into Insight: The Continuous AI Operations Loop Explained

Modern systems generate enormous volumes of operational data. Yet, most incident workflows still treat every outage like a one‑off fire drill: an alert fires, responders scramble, the issue is resolved, the status page goes green—and the organization learns almost nothing from the experience. Meanwhile, the same patterns quietly repeat in code releases, logs, traces, and support tickets until they erupt into the next ‘unexpected’ incident.

Shopify Cyber Monday outage - December 1, 2025

On December 1, 2025, Cyber Monday, the biggest online shopping day of the year, Shopify suffered a widespread outage that left many merchants unable to access their stores or process orders. At a time when every minute of uptime translates directly into revenue, the disruption caused immediate concern across the ecommerce community. StatusGator detected the issue within minutes, sending an Early Warning Signal 10 minutes before Shopify published its official acknowledgement.

Introducing a More Flexible On-Call Schedule

Today, we are introducing several new on-call features: Add Gaps to on-call, Scheduled Layers, Handoff Days, and more. Flexibility in on-call schedules is the single focus of this release. These features give you much finer control over when people are on-call, how handoffs work, and what your schedule looks like around holidays and time off.

AI agents just got smarter thanks to PagerDuty + AWS

We are on the ground with AWS and announcing innovations that give customers more powerful AI agents for incident management. These new and improved integrations bring PagerDuty context into the AWS ecosystem for faster resolution and more connected data across the business. And, with our new competency, we take this a step further by codifying these best practices into our joint customers’ day-to-day operations. Announced today, here are some of the highlights.

OnPage Introduces Multi-Language Mobile App Localization on iOS & Android

As organizations continue to adopt OnPage across regions and operational environments, providing an experience that feels natural and intuitive for every user has become increasingly important. Clear communication is essential in time-sensitive workflows, and being able to use the app in one’s preferred language supports clarity, confidence, and consistency. To support our growing global user base, OnPage is introducing multi-language localization across its mobile applications.

How ilert's holidays and support hours keep teams sane

The end of the year brings pressure. (Oh, we know!) Customer demand spikes, response expectations stay high, and engineering teams are juggling production issues, releases, and time off. For many teams, this is when on-call becomes chaotic: schedules break, notifications hit at the wrong time, and coverage gaps appear exactly when you can’t afford them. ilert's Holidays and Support hours features were built to fix that.

AI Infrastructure Is Creating a New Wave of Incidents, And Why Enterprises Need a Modern On-Call Strategy

Over the last few years, AI has quietly shifted from a fascinating experiment to a core operational system. Enterprises aren’t just building prototypes anymore — they’re deploying LLMs into production environments where uptime directly affects customer interactions, revenue flows, and business continuity. AI has essentially become a new layer of critical infrastructure. Because of that shift, the definition of “reliability” is changing.