The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

PagerDuty Incident Responder custom agent for GitHub is now Generally Available!

This custom agent in GitHub’s AI ecosystem gives users access to PagerDuty data (including change correlation, incident data, and more) directly in GitHub Copilot, cutting down on context switching and speeding up resolution. The agent can help users analyze incident context, identify recent code changes, and suggest fixes via GitHub PRs. Learn more about PagerDuty’s MCP capabilities for GitHub and other tools here.

SLA, SLO, and SLI: Understanding the Foundations of Service Reliability

Last week, I ordered a pizza on a food delivery app, and it promised delivery in 30 minutes. Similarly, all digital services (apps, websites, cloud platforms, and so on) make promises about speed, uptime, and reliability. The difference is how they track and measure those promises. That’s where SLA, SLO, and SLI come in. Together, these three concepts define what “reliable” actually means. They turn a vague claim like “99.9% uptime” into something you can measure, track, and act on.
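To make the “99.9% uptime” claim concrete, here is a minimal Python sketch of the arithmetic behind it. It is an illustration only, not taken from the article: the function names, the 30-day window, and the request counts are all assumptions. It treats the SLI as the measured fraction of successful requests and the SLO as the target that fraction is compared against.

```python
# Illustrative sketch only: names, window length, and sample numbers
# are assumptions, not part of the original article.

def sli_availability(good_requests: int, total_requests: int) -> float:
    """SLI: the measured fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return good_requests / total_requests


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime an availability SLO tolerates over the window, in minutes."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def meets_slo(sli: float, slo_target: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo_target


if __name__ == "__main__":
    target = 0.999  # the "99.9% uptime" objective
    print(round(allowed_downtime_minutes(target), 1))  # minutes per 30 days

    measured = sli_availability(good_requests=999_500, total_requests=1_000_000)
    print(measured, meets_slo(measured, target))
```

Under these assumptions, a 99.9% target over a 30-day window tolerates roughly 43.2 minutes of downtime. An SLA would then attach consequences, such as service credits, to missing an agreed target.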

Bring incident response to your AI stack with ilert's MCP Server

ilert’s engineering team has developed an open Model Context Protocol (MCP) server that enables AI assistants to securely interact with your alerting and incident management workflows, from determining who is on call to creating incidents. In this article, we provide a simple explanation of MCP, outline the reasons behind our investment in it, describe the high-level architecture, and explain how to connect Claude, Cursor, and other MCP clients to ilert today.

Integration & Data Ingestion: Strengthening AIOps Observability

Large enterprises face the challenge of managing high-volume, highly diverse data streams that span both legacy and modern digital systems and applications. To gain timely, accurate insight across this kind of complexity, IT teams need observability platforms that do more than just monitor: they must also unify, contextualize, and enrich data so teams can act effectively to protect the availability of the services their customers rely on.

Disaster Recovery: Everything You Need to Know

With cyberattacks and cloud outages on the rise, maintaining system resilience is critical. A robust Disaster Recovery (DR) strategy enables teams to prepare for unexpected events and ensures they can recover critical systems and data with minimal disruption. This blog will cover what disaster recovery is, why it matters, and the key components of an effective Disaster Recovery Plan. We’ll also walk through the steps for creating your own strategy.

Top tips for smoother IT incident management

Top tips is a weekly column where we highlight what’s trending in the tech world and share ways to stay ahead. This week, we’re talking about something every IT team knows too well: incidents. Whether it’s a sudden server crash, a network outage, or a system slowdown right before an important client call, incidents always seem to strike at the worst possible time. No matter how strong your IT setup is, issues are bound to happen.

Your Next Incident Has Already Started. You Just Haven't Noticed Yet.

The best way to minimize the impact of an incident is to catch it early, before small issues snowball into major disruptions. That requires maintaining healthy systems and ensuring sufficient resources are available when problems arise. But developers and IT operations pros working in large enterprises face a challenge: Complex systems operate in an inherently degraded state. In his essay “How Complex Systems Fail,” Dr.

Your Top Engineers Should Be More than Expensive Button-Pushers

The engineer you pay $200,000 a year just spent an hour copy-pasting data between dashboards. Again. Software engineers have critical skills that are in very high demand. And yet, many world-class engineers spend too much of their time clearing tickets, routing alerts, and responding to the same types of incidents over and over again. This operational toil is costing you.

DNS Outages Expose Hidden Risks. Edwin AI Finds Them Faster.

The recent AWS outage exposed how fragile the internet remains. Amazon traced the hours-long disruption to a DNS error—a small failure with massive reach. For most organizations, DNS operates quietly in the background. When it fails, every digital service connected to it stops. One of LogicMonitor’s valued customers, IG Group, faced a similar event less than ten hours after enabling Edwin AI.

Demo Roundups! What's New in Schedules: Flexible Shifts + AI Conflict Resolution

Manual scheduling and on-call gaps cost your team sleep and sanity. Join us for a demo of PagerDuty's latest schedule experience improvements. From iCal-compatible shift management to AI-powered conflict resolution, see firsthand how to build bulletproof on-call coverage with minimal operational overhead.