The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

HIPAA-Compliant Messaging and Clinical Communication

In today’s fast-paced healthcare environment, patient outcomes rely entirely on immediate, accurate, and secure information transfer. Mismanaged communication is costly; industry data suggests that communication failures contribute to an estimated $12 billion in annual revenue loss and are linked to nearly 30% of malpractice claims.

What Is an Incident Commander? Role, Skills, and Best Practices

The fastest incident response teams treat coordination as a craft. Someone owns the call, drives the decisions, and keeps everyone moving in the same direction while the team puts the system back together. That person is the incident commander (IC), and getting the role right is what separates your 15-minute fix from a four-hour war room where nobody’s sure who’s making the call.

PagerDuty Appoints John DiLullo as Chief Executive Officer

Jennifer Tejada Transitions to Executive Chair of Board of Directors After Serving as CEO Since 2016. John DiLullo Brings Deep Enterprise, Product and Go-to-Market Leadership Experience to Lead Next Phase of Growth. Company Reaffirms First Quarter and Full Fiscal Year 2027 Guidance.

What is the Mean Time to Resolution (MTTR)? Why It Matters and How to Resolve

How quickly can you restore service when an incident hits your system? Most IT teams are not slowed down by detecting incidents. The challenge starts after something breaks, when the goal is to bring services back online as quickly as possible. Modern systems are highly distributed. Alerts arrive from multiple tools, dependencies are complex, and it is often difficult to immediately understand what actually failed.

New in PagerDuty's Slack Experience: Dedicated Channels, Quick Declare & New On-Call Paging Commands

For teams that live in Slack, incident management is getting a whole lot smoother. Early access (EA), planned for May, includes dedicated incident channels, one-click escalation, centralized configuration, onboarding tutorials, and new commands to page responders without leaving Slack.

Humans aren't fast enough for 4 9's

When thinking about Service Level Objectives (SLOs) and contractual Service Level Agreements (SLAs) for availability, I always like to put the percentages into concrete numbers. It’s easy to lose track of what’s meant when saying “99.95%” availability, and even more is lost when thinking how much harder it is to achieve 99.99% compared to 99.95%. On a monthly basis, and in concrete terms, 99.95% availability means you get 21 minutes and 55 seconds of downtime.
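The arithmetic behind that figure can be sketched in a few lines of Python. This is an illustrative sketch, not the author's own calculation: the function name `monthly_downtime_budget` is hypothetical, and the "average month" of 365.25 / 12 days is an assumption, chosen because it reproduces the 21 minutes and 55 seconds quoted above.

```python
# Assumption: an "average month" is 365.25 / 12 days, which reproduces the
# 21 minutes 55 seconds quoted for 99.95% availability.
AVERAGE_MONTH_SECONDS = 365.25 / 12 * 24 * 60 * 60  # 2,629,800 seconds


def monthly_downtime_budget(availability_percent: float) -> tuple[int, int]:
    """Return the allowed downtime per average month as (minutes, seconds)."""
    unavailable_fraction = (100 - availability_percent) / 100
    # Round to the nearest whole second before splitting into minutes/seconds.
    total_seconds = round(AVERAGE_MONTH_SECONDS * unavailable_fraction)
    return divmod(total_seconds, 60)


# Hypothetical set of targets to compare side by side.
for target in (99.9, 99.95, 99.99):
    minutes, seconds = monthly_downtime_budget(target)
    print(f"{target}% -> {minutes} min {seconds} s per month")
```

Under the same assumption, 99.99% leaves roughly 4 minutes and 23 seconds per month, about a fifth of the 99.95% budget.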

AWS outage takes down more than 150 cloud services

On May 7th and 8th, 2026, Amazon Web Services (AWS) experienced an outage affecting Amazon Elastic Compute Cloud (EC2) in the dreaded US East 1 region. Located in Northern Virginia, us-east-1 (or just “US East”, as it is known) is the original AWS region. It has been the subject of some of the internet’s most high-profile and destructive outages and remains Amazon’s least reliable region.

KPI vs SLA: What's the Difference?

Why Confusing Them Costs You More Than a Missed Target

Every operations leader tracks KPIs. Every enterprise IT team has SLAs. Both involve targets, both involve measurement, and both surface in the same board reviews and vendor conversations. So it is not surprising that the two get treated as variations of the same thing.

How to Customize an SLA Template

A Practical Guide for Help Desk, IT Operations, and Enterprise SRE Teams

A service level agreement template is only useful if it can be customized. The version that ships with your ITSM platform was designed to be generic enough to apply anywhere, which makes it precise enough to apply nowhere. The teams that maintain defensible SLAs are not the ones with the most sophisticated legal language.