Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

How to build a customer advisory board

Regardless of where you are in your product journey, it is imperative that you constitute a customer advisory board whose members can share perspectives on their business challenges. Their input gives you insight into how to shape your road map, develop new features, and formulate your vision, along with constant feedback on your product. So, how many customers should you include in a customer advisory board? Should you target higher-level stakeholders or individual users?

Keeping PagerDuty Always On With Remote Incident Response

Earlier this month, many areas of the internet experienced a major incident caused by a router misconfiguration within a highly used service provider. This led to cascading service failures, causing widespread outages and disruptions for several well-known SaaS organizations. When the outage occurred, our teams at PagerDuty immediately noticed a global spike in events and incidents.

How to Improve On-Call with Better Practices and Tools

In the era of reliability, where mere minutes of downtime or latency can cost hundreds of thousands of dollars, 24x7 availability and on-call coverage to respond to incidents have become a requirement for the vast majority of organizations. But setting up an on-call system that drives effective incident response while minimizing the stress placed on engineers isn’t a trivial task.

What's New: Updates to Visibility Console, Event Intelligence, Analytics, and More!

We’re excited to announce a new set of product updates and enhancements to the PagerDuty platform! PagerDuty partners with organizations to help teams create efficiencies across IT organizations and protect customer relationships. These updates will help further improve your team’s ability to manage and reduce noise, automate critical response workflows, and quickly mobilize a response in order to mitigate disruptions across your digital operations when seconds matter.

Enabling the Stripe and Lyft Platforms Through Modern Safety Science

Jacob Scott is an experienced engineer and enthusiastic participant in the resilience engineering community, having spent time caring for the technology systems powering high-growth startups as well as unicorns like Lyft and Stripe. He is deeply passionate about how to apply learnings from modern safety science to real, complex socio-technical systems.

Root Cause Changes: are they the "Elephant in the NOC?" Here's the CTO Perspective

Ask any IT Ops practitioner what the first question they ask is when joining an emergency bridge call, and you’ll get the same answer: “What changed?” Our customers report that changes in their IT environments cause 60% to 90% of the incidents they see. Yet for some reason enterprises still find it difficult to deal with changes and correlate them to the IT incidents they may have caused.

New Integration: Create Zoom incident bridges automatically

Incident response doesn’t only happen in Slack, so today we’re happy to announce our integration with Zoom to create incident bridges automatically. Using the power of FireHydrant Runbooks, a Zoom meeting can be added with fully customizable titles and agendas based on your incident details. Let’s dive into how it works.

Evan Niedojadlo from Peddle shares his thoughts on being an SRE

Evan Niedojadlo is an SRE at Peddle based in Austin, TX. He is currently on a small team and works across the SRE, Ops, and Security areas of the organization. In his free time, he enjoys building communities, reading, music, helping others learn, and being outside.

Defining your Sev-1s

One of the primary things you need to figure out when your team is formulating its incident management process is how to describe in words what a Sev0 (your highest incident priority) looks like. “Website doesn’t work” is certainly not enough. “Website is up but a key resource (e.g. a CSS file) is missing, rendering the website unusable” is still not enough. “A single page on the website is 404’ing” is not a major incident but could be a minor one.
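One way to get past vague descriptions is to write the definition down as a measurable rule. The sketch below is purely illustrative and is not taken from the article: the `Impact` fields, the `MINOR` label, and the 50% threshold are all hypothetical assumptions that a team would replace with its own criteria.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV0 = 0   # highest incident priority, as described in the text
    MINOR = 1  # hypothetical label for lower-priority incidents


@dataclass
class Impact:
    # Hypothetical, illustrative measurements of an incident's reach.
    pages_failing: int  # number of pages currently returning errors
    total_pages: int    # number of pages the site serves


def classify(impact: Impact, major_fraction: float = 0.5) -> Severity:
    """Map measured impact to a severity.

    The 0.5 default threshold is an assumption for this sketch only.
    """
    if impact.total_pages <= 0:
        raise ValueError("total_pages must be positive")
    failing_share = impact.pages_failing / impact.total_pages
    if failing_share >= major_fraction:
        return Severity.SEV0
    return Severity.MINOR


# A single 404'ing page out of 200 is classified as a minor incident.
single_page = classify(Impact(pages_failing=1, total_pages=200))
# Most of the site failing crosses the assumed threshold.
most_pages = classify(Impact(pages_failing=150, total_pages=200))
```

Writing the rule as code forces the team to name the measurement and the threshold, which is exactly what a phrase like “website doesn’t work” leaves out.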