The latest news and information on incident management, on-call, incident response, and related technologies.

AI Didn't Change the Game, It Just Exposed Your Bottlenecks w/ Ganesh Datta (CTO, Cortex)

Every engineering org says they want to improve reliability — but most can't even agree on what "good" looks like. Ganesh Datta, Co-Founder and CTO of Cortex, has spent the better part of a decade helping companies confront that gap.

From Alerting Tool to Critical Communication Platform

Modern operations don’t break down only because alerts are misconfigured or missed. They also break down when systems are difficult to manage, slow to adapt, or lack visibility into what’s actually happening in real time. Across industries, teams are managing an increasing volume of critical events: critical system alerts; after-hours urgent calls from patients, clients, or even emergency lines; voicemails; answering service calls; emergency notifications; and time-sensitive clinical communication.

How to Prevent and Resolve Incidents Using Model Context Protocol (MCP)

The rapid pace of modern software development, fueled by AI-driven coding and accelerated deployment cycles, has resurfaced a challenge that many development teams already struggled with: the speed of incident response must now match the speed of change. Every day, teams ship code faster than ever, which inevitably increases the risk of a new issue making it to production. The traditional approach—where engineers waste time jumping between disconnected tools—is no longer sustainable.

Updated Web Management Console Demo | On-Call Management, Hospital Communication & Call Routing

See the next-generation OnPage Enterprise Web Management Console in action, built to simplify on-call scheduling, incident alerting, critical communication workflows and post-event reporting. In this demo, we walk through how teams can manage on-call schedules and escalation paths; send and track critical alerts in real time; gain visibility into alert activity, read rates, and response timelines; configure contact groups and communication workflows; and use the new Lines Management module to set up call routing, menus, and rules through a self-service interface.

Best Secure Messaging Apps for Healthcare Workers (2026 Buyer's Guide): OnPage

Secure messaging apps for healthcare workers are platforms designed to enable HIPAA-compliant communication, real-time collaboration and coordination, and urgent alerting across clinical teams for timely response. In modern hospitals, communication is no longer just about sending messages. It’s about ensuring the right person receives the right information and acts on it quickly.

Fear, Identity & Flaky Tests: AI in Reliability w/ Dana Lawson (CTO, Netlify)

The self-healing systems that SREs have dreamed about for a decade aren't a distant promise anymore — they're already being built, and the biggest barrier left is cultural. Dana Lawson, CTO at Netlify, has spent over 25 years in the trenches of developer infrastructure, from sysadmin roots to running the platform that powers 5% of the internet.

Building an Alert Routing setup that never misses a critical incident

Critical incidents have a direct impact on your business revenue and the trust your customers place in you. The longer a critical incident goes unnoticed, the higher the stakes. A reliable alert routing setup automatically catches these incidents the moment they trigger and gets them to the right person without delay. This guide walks you through how to build that reliable routing setup.

How to handle midnight incidents without waking everyone up

When a midnight incident triggers, the goal is not to wake your entire team. It’s to reach the one person who can act on it. Everyone else should sleep through it undisturbed. The difference between a team that handles midnight incidents well and one that doesn’t usually comes down to a few decisions made ahead of time. Which incidents actually need a midnight response? Who should get the call? And what should happen to everything else? This guide walks through those decisions.
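
The three decisions above can be sketched as a small function. This is only an illustration, not the guide's method: the quiet-hours window (22:00 to 07:00), the action strings, and the names `is_night` and `midnight_action` are all assumptions made for the example.

```python
from datetime import time

# Assumed quiet-hours window; a real team would pick its own.
NIGHT_START = time(22, 0)
NIGHT_END = time(7, 0)


def is_night(at: time) -> bool:
    """True when `at` falls inside the window that wraps past midnight."""
    return at >= NIGHT_START or at < NIGHT_END


def midnight_action(is_critical: bool, at: time, on_call: str) -> str:
    """Decide who gets disturbed, if anyone (hypothetical action labels)."""
    if not is_night(at):
        # Daytime: a normal notification to the on-call person is enough.
        return f"notify:{on_call}"
    if is_critical:
        # Night and critical: reach only the one person who can act.
        return f"call:{on_call}"
    # Night and not critical: nobody is woken; it waits for the morning.
    return "hold-until-morning"


print(midnight_action(True, time(2, 30), "alice"))
print(midnight_action(False, time(2, 30), "alice"))
```

The point of the sketch is that the "who sleeps" decision is made ahead of time in a rule, not at 2 a.m. by whoever happens to see the alert.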

Routing incidents the way their severity and priority demand

Severity and priority are two labels that describe different things about an incident. Severity covers the blast radius: how much of your system or how many customers are affected. Priority covers the urgency: how quickly someone needs to act. Routing rules then use these labels to load the right escalation policy for each incident. This guide covers how to define your severity and priority levels and map them to escalation policies.
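
As a rough sketch of the idea, a routing rule can be modeled as a lookup from a (severity, priority) pair to an escalation policy. The level names, the policy names, and the `select_policy` helper below are hypothetical and chosen only for illustration; they do not come from any particular product.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Blast radius: how much of the system or how many customers are hit."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


class Priority(IntEnum):
    """Urgency: how quickly someone needs to act."""
    P1 = 1
    P2 = 2
    P3 = 3


@dataclass(frozen=True)
class Incident:
    title: str
    severity: Severity
    priority: Priority


# Hypothetical mapping from labels to escalation policy names.
ROUTING_RULES = {
    (Severity.SEV1, Priority.P1): "page-primary-then-manager",
    (Severity.SEV1, Priority.P2): "page-primary",
    (Severity.SEV2, Priority.P1): "page-primary",
    (Severity.SEV2, Priority.P2): "notify-team-channel",
}

# Assumed fallback for any combination without an explicit rule.
DEFAULT_POLICY = "ticket-only"


def select_policy(incident: Incident) -> str:
    """Return the escalation policy name for an incident's labels."""
    key = (incident.severity, incident.priority)
    return ROUTING_RULES.get(key, DEFAULT_POLICY)


outage = Incident("Checkout API down", Severity.SEV1, Priority.P1)
print(select_policy(outage))
```

Keeping the two labels as separate keys makes it possible to route a wide but non-urgent incident differently from a narrow but urgent one.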