The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Conflict Management and the Major Incident Management Process

Major incidents are, by their very nature, stressful and intense, and ITIL 4 gives them their own formal definition. High-stress situations can cause conflict that, left unchecked, could delay the fix effort. Since we already have a definitive guide on incident management, this blog post will focus specifically on the major incident management process.

xMatters remains a G2 Grid Report Leader

Businesses worldwide and their technical teams use G2, the leading business solution review platform, to analyze software, gather user feedback, and make informed decisions about technology. Although we value all the recognition we’ve earned on G2 over the years, there’s one that always stands out and makes us feel extra proud of what we’ve accomplished so far.

Debug issues and automate remediation with Shoreline and Datadog

Shoreline is an incident response automation service that enables DevOps engineers and site reliability engineers (SREs) to quickly debug and remediate issues at scale and develop automated routines for incident management. Using Shoreline’s proprietary Op language, customers can run debug commands across all their hosts simultaneously and then deploy custom scripts via Actions to trigger automated remediations.

Making Go errors play nice with Sentry

Here at incident.io, we provide a Slack-based incident response tool. The product is powered by a monolithic Go backend service that serves an API powering Slack interactions, serves an API for our web dashboard, and runs background jobs that help run our customers’ incidents. Incidents are high-stakes, and we want to know when something has gone wrong. One of the tools we use is Sentry, which is where our Go backend sends its errors.

Four Use Cases for Optimizing Your Cloud Operations With PagerDuty Runbook Automation

The cloud is easy and powerful—until it’s not. Once companies have customers, commitments, and compliance concerns, they often have to create cloud operations teams to manage the cloud on behalf of their fellow employees. Often, organizations that migrate to the cloud find themselves hampered by inefficient cloud operations if they haven’t standardized their IT procedures for operability.

Keep Stakeholders Informed During Major Incidents

During major incidents, it’s crucial that all stakeholders are provided with the status updates they need. Those communications, however, need to be tailored to what each stakeholder actually needs and provided in a streamlined format that works best for them. Just like alert fatigue, communication fatigue can be detrimental during an outage or other service reliability issue.