The latest news and information on Incident Management, On-Call, Incident Response, and related technologies.

Silent Failure in Production ML: Why the Most Dangerous Model Bugs Don't Throw Errors

You’ve done it. Your machine learning model is live in production. It’s serving predictions, powering features, and quietly doing its job. Dashboards are green. There are no errors in the logs. Nothing appears broken. And yet, something is wrong. Predictions are getting less reliable. Users are waiting a little longer for responses. Conversion rates are slipping. Trust is eroding, but no alert fires, no system crashes, and no one knows there’s a problem until the damage has been done.

PagerDuty x Backstage Plugin Demo: Eliminate Context Switching for On-Call Engineers

Join Rocío, Product Manager of the Forward Deploying Engineering team at PagerDuty, as she demonstrates how the PagerDuty Backstage plugin transforms incident response by bringing critical operational data directly into your developer portal.

Weekly vs. split-week on-call rotations: A guide to finding the right rhythm

When you move past daily rotations but find anything longer than a week feels too stretched out, you often end up choosing between weekly and split-week rotations. Weekly rotations give you a full seven days before handing off. Split-week rotations break that time into smaller chunks like 2-day, 3-day, or 4-day shifts. Each approach creates a different rhythm for your team. This guide compares both patterns across three key criteria.
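To make the difference in rhythm concrete, here is a minimal Python sketch that lays out a weekly and a split-week schedule over the same two weeks. The function `build_rotation`, the team names, the start date, and the 3-day/4-day split are all illustrative assumptions, not taken from the guide or from any scheduling tool.

```python
from datetime import date, timedelta
from itertools import cycle


def build_rotation(start, days, shift_lengths, engineers):
    """Return (shift_start, shift_end_exclusive, engineer) tuples covering `days` days.

    Hypothetical helper: shift lengths and engineers are both cycled in order.
    """
    shifts = []
    people = cycle(engineers)
    lengths = cycle(shift_lengths)
    cursor = start
    end = start + timedelta(days=days)
    while cursor < end:
        # Clamp the last shift so the schedule never runs past the window.
        shift_end = min(cursor + timedelta(days=next(lengths)), end)
        shifts.append((cursor, shift_end, next(people)))
        cursor = shift_end
    return shifts


team = ["alice", "bob", "carol"]  # hypothetical team
start = date(2025, 1, 6)          # a Monday, chosen for illustration

weekly = build_rotation(start, 14, [7], team)    # one handoff per week
split = build_rotation(start, 14, [3, 4], team)  # alternating 3-day and 4-day shifts

for shift_start, shift_end, person in split:
    print(f"{shift_start} to {shift_end} (exclusive): {person}")
```

Over the same 14 days, the weekly pattern produces two shifts and the split-week pattern produces four, so each person is on call for less time at a stretch but the team hands off twice as often.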

PagerDuty + OOPS Meetup: AI in Incident Management

AI is transforming industries at pace, and Incident Response is no exception, raising important questions about how humans and automation should work together when systems are failing and pressure is highest. Panelists: Andrew White (Technology Director, checkout.com), James Pickles (Senior Solutions Consultant, PagerDuty), Sarah Wells (Independent Consultant, former Technology Director at FT), Suraj Singh Dadwal (Team Lead, Incident & Problem Management, IG)

Event Intelligence Solutions Part Three: Best Practices for Successful Adoption

As Event Intelligence Solutions (EIS) move from early adoption to operational necessity, many enterprises are realizing that success depends on more than selecting the right technology. For Banking and Financial Services organizations, effective adoption requires a clear strategy, disciplined execution, and strong alignment with business priorities, regulatory demands and, not least, customer expectations.

Transform IT major incident management with customizable AI Workflows from BigPanda

Enterprise Management Associates found that major IT service outages are increasing in cost, frequency, and duration, with unplanned downtime costing large enterprises nearly $25,000 per minute, or $1.5 million per hour. When every minute costs $25,000, you can’t afford to waste engineering time on coordination tasks like creating channels, paging experts, typing summaries, and posting updates. An agentic AI-powered incident assistant can eliminate that waste and reduce bridge call costs.

2-day vs. 4-day on-call rotations: Which one fits your team

Teams that find a weekly rotation too long and a daily rotation too short often end up choosing between 2-day and 4-day rotations. This guide compares the two across three key criteria. For each criterion, we discuss how it works for 2-day and 4-day rotations and recommend which to choose when. We also include a comparison table that gives you the key information at a glance. Let’s dive in!

How to choose the right on-call rotation

Choosing an on-call rotation is about finding a rhythm that balances your team’s well-being and your system’s reliability. The right on-call rotation helps prevent burnout and makes on-call duties sustainable over the long run. This guide walks you through different on-call rotation patterns, from daily rotation to after-hours rotations. We’ll look at why you might choose a particular rotation and the challenges that often come with it.

Why a month is too long to be on-call

There is often a temptation to stretch on-call shifts to a month or longer, especially when incident volume is low. The logic seems sound. If the phone rarely rings, it feels unnecessary to hand off on-call duties every week. But looking strictly at incident volume often misses the human side of the equation. Being on-call isn’t just about answering pages. It is also a state of mind. Even when it is quiet, simply being on-call can create fatigue of its own.