Operations | Monitoring | ITSM | DevOps | Cloud

Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

How to Scale for Reliability and Trust

As more people depend on your product, reliability expectations tend to grow. For a service to continue succeeding, it has to be one customers can rely upon. At the same time, as you bring on more customers, the technical demands put on your service increase as well. Dealing with both the increased expectations and challenges of reliability as you scale is difficult. You’ll need to maintain your development velocity and build customer trust through transparency.

What's New: Improvements to On-Call Schedule Exceptions

We’re excited to present a feature update to the OnPage platform. The new update will bring more flexibility and resiliency to a team’s on-call management workflow. With the new scheduling capabilities, OnPage system administrators can create exceptions to configured, recurring on-call schedules.

Phoenix Project: Sometimes you have to look back to look forward

It has been eight years since The Phoenix Project was published and a lot has changed since then! I started to think about what we’ve learned in that time. It starts with the theory of constraints. I still see it all the time. Organizations take actions which are merely temporary, putting out fires but not solving for the underlying causes of those fires.

Mattermost Incident Collaboration now includes improved communication, automation, and history for incident response teams

Teams are always looking for a speed advantage, and that comes from planning, crisp execution, and teamwork. To this end, we’re excited to release new enhancements to Incident Collaboration to help make life easier for DevOps teams during incident response. The Mattermost platform includes built-in Incident Playbooks with predefined response plans and task lists. Playbooks can be customized to your environment and specific use cases.

Say goodbye to guessing: Introducing Automatic Incident Triage by BigPanda

Low MTTR is the much-desired nirvana-state in IT Operations. One of the most painful parts of the incident management lifecycle, which prevents the achievement of this nirvana, is triage: the time it takes first incident responders to determine the next action when facing a barrage of IT incidents. Why?

IT Incident Response is Improved with a Corporate Status Page

To understand the impact that stovepipes have on incident response, one need look no further than the 9/11 terrorist attacks that occurred in the United States. The CIA, DoD, and FBI all knew about the Al Qaeda terror threats before the planes hit the World Trade Center, but the 9/11 Commission found that a lack of data and intelligence sharing among the agencies limited each agency’s understanding of the looming terrorist threat; thereby, limiting their incident response.

PagerDuty for AIOps & Automation: Innovate & Automate Faster

We continue to improve our AIOps and machine learning capabilities to help customers reduce noise, quickly identify root cause, and automate the resolution of critical, business-impacting issues. This will help organizations further increase cost savings, reduce mean time to resolution (MTTR), and preserve people hours. The following capabilities empower responders to gain control, deliver critical context for faster root cause identification, assess impact, and automate actions with minimal configuration.