The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Speed, Scale, and Special Sauce: The Evolution of the PagerDuty Brand

At PagerDuty, our purpose is to empower teams with the time and efficiency to build the future. That means that our own teams are constantly building and relentlessly innovating to help organizations drive transformative change in the way they operate.

Everything you need to know about IT Operations Analytics

Data is both a challenge and an asset for IT professionals, who rely on IT Operations Analytics (ITOA) to guide them toward operational excellence, system reliability, and swift incident resolution. Whether you want to understand what ITOA is and how it connects to related technologies, are considering how to use it within your organization, or are curious about its efficiency and cost-saving benefits, we’ve got you covered.

Panel Discussion: Modern Monitoring and Observability

Struggling with effective monitoring for your services? Not sure how to handle the volume of information your environment creates? Join us for a panel discussion about Monitoring and Observability, featuring Jason Hand of Datadog, Ernest Mueller of Accenture, Steve McGhee of Google, and Peco Karayanev of PagerDuty. Hosted by PagerDuty DevOps Advocate Mandi Walls.

Do you need better cloud observability - or AI-powered cloud visibility?

Maybe you’re still using monolithic applications, built and refined over many years. You understand that shifting to microservices or containerized architectures is a huge and daunting task. You’re probably grappling with the limitations of legacy systems—maybe they’re slow, tough to update, or can’t scale as you’d like. And you’re likely using more traditional IT monitoring tools or even some cloud observability tools.

Kubernetes Incident Management: A Practical Guide

As more organizations embrace containerized applications, Kubernetes has emerged as the leading platform for orchestrating these containers. However, its complexity, combined with the inevitable reality of IT incidents, demands a well-defined strategy for managing disruptions. This article introduces Kubernetes incident management, describes common Kubernetes errors, and provides practical guidance to efficiently handle incidents.

AI-Generated Runbooks

AI-generated Runbooks lower the barrier to entry for new automation developers and speed up the creation of new automation for experienced authors. This feature works seamlessly with the user’s preferred scripting language, offering a low-code solution for what used to be a high-code task. Watch how Runbook Automation users can describe the task they wish to automate in plain English and let AI build an automation template for that task.

Avoiding a Major Incident with PagerDuty AIOps

A global retailer has a major incident occurring and the team doesn’t know it yet. Before PagerDuty AIOps, the NOC would get hit by alert storms and page multiple teams, which resulted in large conference calls and customer downtime. Now, a major incident right before Black Friday has been averted with PagerDuty AIOps. The result is a better overall customer experience, no matter how stressed the system is.