The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Customers over control: how we measure On-call reliability

May 28, 2026 By Article In Incident.io

Our On-call product has a lot of great features: configuring escalation paths, viewing rotas and schedules, requesting cover, etc. However, when framing its reliability, we reduce it down to two critical pieces of functionality. It’s not that we’re happy if only these parts are working, but they are the most important parts. In this post, I'll go into more detail on how we think about their reliability.

MCP Community: PagerDuty MCP + ServiceNow in Action

May 28, 2026 By PagerDuty Inc. In PagerDuty

We're going live to show what's new in the PagerDuty MCP Server and what you can build with it today. Streamers: Ignacio Garces, Forward Deployed Engineer (ServiceNow SME), PagerDuty; José Reyes, Engineering Manager, PagerDuty.

Every pilot is ready for engine failure: are your engineers? w/ Hamed Silatani (Uptime Labs)

May 28, 2026 By Rootly In Rootly

Every pilot who's never had an engine failure is still ready for one. The same can't be said for most software engineers facing their first major incident. Hamed Silatani, co-founder and CEO of Uptime Labs, and former Head of Reliability Engineering at IG Group, has spent two decades watching engineers learn incident response the hard way: alone, under pressure, with no training.

Root Cause Analysis: How Engineering Teams Fix Production Issues Faster?

May 28, 2026 By Mohana Ayeswariya J In Atatus

When a production incident strikes (a sudden latency spike, a cascading API failure, a service returning 500s at scale), every minute of downtime has a cost. Root cause analysis (RCA) is the process that turns that chaos into a clear answer: what actually broke, and why. Not the symptom that triggered the alert. The underlying cause.

How BigPanda and ServiceNow are redefining agentic IT operations for enterprise IT

May 27, 2026 By Sam Osborn In BigPanda

Enterprise ITOps leaders are realizing that legacy incident management processes are collapsing under the weight of today’s sprawling, hybrid-cloud enterprise environments. Monitoring and observability tools generate a relentless flood of alerts across cloud platforms, infrastructure, applications, and services. The signals are there, but the volume of noise makes it harder than ever to identify what’s urgent.

SIGNL4 Update: Centralize alerts. Automate response. Easier than ever.

May 27, 2026 By SIGNL4 In SIGNL4

Get ready for the new SIGNL4 update. The completely redesigned API makes it easier than ever to connect your systems and tools and consolidate alerts from every source – so nothing gets missed. With the new Automation menu, you can now manage automated alert routing and filtering from one central place, ensuring the right alerts reach the right person at the right time.

Best Practices in the Slack Experience

May 26, 2026 By PagerDuty Inc. In PagerDuty

PagerDuty’s Slack experience is evolving to help your teams organize better and resolve incidents faster. Use Triage Channels to collect telemetry and updates from your systems. Create dedicated Incident Channels for coordination and resolution. Give stakeholders the updates they need in Announcements Channels. Everyone in your organization can get the information they need easily.

Shopify outage on May 22, 2026 impacted merchants worldwide

May 23, 2026 By Colin Bartlett In StatusGator

On May 22, 2026, merchants using Shopify experienced a brief but widespread disruption that affected access to product pages, collections, and administrative tools. While the outage lasted less than an hour, it created immediate challenges for businesses that rely on Shopify to manage inventory, update products, and operate online stores. StatusGator detected the developing incident at 10:20 UTC using Early Warning Signals, 18 minutes before Shopify officially acknowledged the outage at 10:38 UTC.

The $600 billion wake-up call: New Splunk research reveals downtime is a systemic business crisis

May 19, 2026 By Splunk In Splunk

$600 billion annual impact: Aggregate downtime costs for the Global 2000 have soared 50% in two years. $15,000 per minute: The average cost of downtime for organisations, highlighting the immediate financial impact of service disruptions. 3.4% stock price drop: The average decline in shareholder value following a single downtime incident.

Microsoft Fabric outage disrupted analytics workloads on May 18, 2026

May 19, 2026 By Andy Libby In StatusGator

On May 18, 2026, organizations using Microsoft Fabric experienced a multi-hour outage that disrupted analytics workloads, reporting systems, and access to platform services across several regions. StatusGator detected the developing incident at 14:00 UTC using Early Warning Signals, 37 minutes before Microsoft officially acknowledged the outage at 14:37 UTC.
