Operations | Monitoring | ITSM | DevOps | Cloud

Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

See Global Event Orchestration End-to-End

Global Event Orchestration’s powerful decision engine enriches events, controls their routing, and triggers self-healing actions based on event data. Teams can use this functionality across any or all services within PagerDuty. This feature is a continued investment in Event Orchestration, demonstrating PagerDuty’s commitment to providing customers with best-in-class automation capabilities. Check out this live demo from Principal Product Manager Frank Emery.

Assembly time is where you have the most control of an incident

The FDNY EMS Command responds to more than 4,000 calls per day. They range from car accidents to building fires to cats stuck in trees, and responses vary accordingly. Sometimes they might take hours, sometimes they take just a few minutes. With such unpredictable conditions, the FDNY focuses on improving what they call “response time.” That’s the amount of time between a 911 call being made and emergency responders arriving on the scene. This might sound familiar.

Trust shouldn't start at zero

How often have you heard the phrase “trust is earned” in life? While well-meaning, I think this can actually lead to some strange behaviour at work, especially when you’re on a fast growing team. Startups experience a lot of chaos and unknowns your teams need to navigate, so it’s vital to know you can trust the people around you. As you grow, how you set expectations around trust as people join your team can impact your ability to hire, onboard, ship and ultimately, survive.

How to Manage Customer Support Channels in Slack: A Step-by-Step Plan

As more and more teams transition to remote work, collaboration tools like Slack have become increasingly popular. Slack's chat-based communication platform makes it easy to keep teams connected and informed, but it can also create challenges when it comes to managing support channels. In this post, we'll explore different approaches to building a Slack-based support system and provide some tips for success.

Debugging Kubernetes with Automated Runbooks & Ephemeral Containers

In our previous blog, we discussed the difficulty in capturing all relevant diagnostics during an incident before a “band-aid” fix is applied. The most common, concrete example of this is an application running in a container and the container is redeployed—perhaps to a prior version or the same version—simply to solve the immediate issue.

10 Mistakes to avoid when framing your IT Incident Management Strategy

An IT incident is an unplanned disruption that negatively impacts an IT service. As the importance of IT to the business has increased, the impact of IT incidents has become greater. IT incidents can result in revenue loss, loss of employee productivity, SLA financial penalties, government fines, and more. An effective IT incident management strategy is now essential in every organization. For a business like Amazon whose entire business relies on IT, a single second of slowness can cost over $15,000.

How to get started with incident management metrics

Tracking incident metrics can help you discover patterns in the causes and costs of incidents and help you understand brittle parts of your organization. We've seen them help teams zero in on things like: But it can be intimidating to get started. Do you really need metrics if you're a small team or just beginning to formalize your incident management program? I say yes. The key is to start with something manageable and grow.

How Abbott transformed its incident management process with Workflow Automation

Eliminating errors and streamlining the incident management process are top priorities for many ITOps, NOC, SRE, and DevOps teams. With organizations using multiple tools in their IT stack, manually finding the right information at the right time becomes crucial during incident triage. By automating tasks and workflows, businesses can eliminate manual tasks that are time-consuming, repetitive, and prone to mistakes.

The Rise of ServiceOps: Unifying IT Service Delivery

With the complex and steadfast growth of IT service delivery processes, organizations and their internal teams have come to rely on several tools in their toolbox to deliver best-in-class products and services. The use of AIOps, AI/ML, and overall automation has shaped modern delivery methods, but what we call this process, and how we grow to advance it, has yet to find a definition that’s universally recognized.

Reflecting on one of the biggest incidents in our history

We have to come clean. During KubeCon, we experienced an incident that we weren’t ready to discuss until now. This incident caused quite a disruption and, had it been left unresolved, would have had a massive snowball effect. At the time, we didn’t want to raise any alarms, so we kept it quiet while our team rallied to resolve it. And to be honest, most folks probably didn’t even realize that it happened since we moved so quickly.