Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Resilience in Action E9: Vulnerability, Compassion, and Post-Incident Reviews in the Emergency Room with Dr. Al'ai Alvarez

What can software engineers learn from post-incident reviews that physicians do in the emergency room? In our ninth episode, Christina, member of the Blameless strategy team, guest-hosts the podcast to interview both Kurt Andersen and Al'ai Alvarez, MD (@alvarezzzy). Dr. Alvarez is an assistant clinical professor of Emergency Medicine at Stanford. Clinically, he’s an emergency physician.

Chapter Nine: In Which Dinesh Experiments with Chaos Engineering

Another day, another drama! This one, though, is very much of my own making. I have been wanting to try my hand at a bit of chaos engineering for some time now, but C&Js just hasn’t been ready. Sarah’s been up for it too, though, at Animapanions. And now that our CIO, Charlie, has seen MTTR drop across every single technology team, thanks to the rollout of Moogsoft and the new incident management system (kudos to James), it’s pilot day.

What is Incident Management in IT and Why does it matter?

Incident management is the process of identifying and resolving problems that occur in IT services. Incident Management is also used as a metric to measure the health of the IT Service Desk. Let’s discuss what incident management is, why it matters to your business, and how you can apply it to your organization.

Splunk On-Call prevents and cuts downtime episode length by half

Your Answer: Escalate the right alerts to the right on-call people for fast collaboration and issue resolution with Splunk On-Call. Reduce burn-out and make on-call suck less with a complete ChatOps experience that's integrated with your IT stack and incident reporting.

Most frequently asked questions surrounding Google's Cloud Operations Sandbox

Cloud Operations Sandbox serves as a simulation tool for budding SREs to learn the best practices from Google and apply them to real cloud services. In this blog, we have compiled a list of FAQs surrounding the use of Google's Cloud Operations Sandbox. The Google SRE sandbox provides an easy way to get started with the core skills you need to become an SRE.

Dun & Bradstreet Reduces Mean Time to Resolution with xMatters

How does a business continue to improve its incident management processes when it’s already using some of the best tools on the market? Join Nick Romanelli, Site Reliability Engineering Lead at Dun & Bradstreet, and Zoe Na, Customer Success Manager at xMatters, as they discuss how Dun & Bradstreet has been able to use xMatters to reduce MTTR and streamline major incident management. With its innovative use of Flow Designer, Dun & Bradstreet has created unique workflows that you’re going to want to know about!

Hear From Product Automation & AIOps Lightning Talk

Learn about what's new with PagerDuty Runbook Automation & AIOps from the Summit 2021 Launch. Our Product team shares how you can benefit from our latest updates and enhancements and enjoy demos that were recorded live from Summit 2021 featuring PagerDuty Runbook Actions, Customer Change Event Transformer, Change Correlation, and Outlier Incident.