Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Product update: ensure consistent data across all your retros with two new features

FireHydrant captures your incident, from declaration through remediation, and gives you a framework to run your retrospectives. But retrospectives are only as effective as their inputs. Now we're delivering a better way to learn from and analyze retrospectives by guaranteeing consistent, structured, and sufficient data from your team.

OnCallogy Sessions

Being on call is challenging. It means signing up to operate complex services while being interruptible at any hour of the day or night, with limited context. It’s therefore critical to have proper on-call onboarding procedures, offer regular training sessions, and continuously improve documentation. We also need to make sure people feel safe by providing ways to reduce their stress and by making room for questions that surface uncertainties around our operations.

Conflict Management and the Major Incident Management Process

Major incidents are, by their very nature, stressful and intense. The ITIL 4 definition of a major incident is: High-stress situations can cause conflict that, left unchecked, could delay the fix effort. Since we already have a definitive guide on incident management, this blog post will focus specifically on the major incident management process.

xMatters remains a G2 Grid Report Leader

Businesses worldwide and their technical teams use G2, the leading business solution review platform, to analyze software, gather user feedback, and make informed decisions about technology. Although we value all the recognition we’ve earned on G2 over the years, there’s one that always stands out and makes us feel extra proud of what we’ve accomplished so far.

Debug issues and automate remediation with Shoreline and Datadog

Shoreline is an incident response automation service that enables DevOps engineers and site reliability engineers (SREs) to quickly debug and remediate issues at scale and develop automated routines for incident management. Using Shoreline’s proprietary Op language, customers can run debug commands across all their hosts simultaneously and then deploy custom scripts via Actions to trigger automated remediations.

How to use PagerDuty with Blameless

Blameless integrates with PagerDuty so you can notify teams and key stakeholders during an incident. We also help you search escalation policies and on-call rotation schedules. In this video, our Solutions Engineer walks you through the initial setup and configuration in the Blameless UI, then demonstrates how the integration works in real time. If you use Slack or Microsoft Teams for internal communications, you'll also learn how to access and manage the PagerDuty integration from within those tools.

Keep Stakeholders Informed During Major Incidents

During major incidents, it’s crucial that all stakeholders receive the status updates they need. Those communications, however, need to be tailored to what each stakeholder actually needs and delivered in a streamlined format that works best for them. Just like alert fatigue, communication fatigue can be detrimental during an outage or other service reliability issue.