Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Moogsoft AIOps: Enterprise Platform Showcase

Recorded live at the Moogsoft User Conference, several product experts do a deep dive on Moogsoft Enterprise, our flagship AI platform for IT Operations. They go into specific features and capabilities, including the workflow engine and topology visualization. https://www.moogsoft.com/aiops-enterprise/

Moogsoft AIOps Science with Dr. Robert Harper

Moogsoft has over 50 mathematical patents and counting. Our AI is not an add-on feature; it's core to our platform. In this session at the recent Moogsoft User Conference, Dr. Robert Harper, Moogsoft’s Chief Science Officer & VP of Engineering for Advanced Algorithms, explains how our algorithmic suite and AI techniques offer our customers the true benefits of AIOps. By delivering deep contextual insights, Moogsoft AIOps enables IT Ops and DevOps teams to solve IT problems faster than products that rely on rules-based approaches.

Pavlos Ratis shares his experience on being an SRE

Pavlos is a Site Reliability Engineer based in Munich, Germany. He likes building software and expanding his knowledge of the reliability of services and their infrastructure. He has created several open-source SRE projects, such as awesome-sre, Wheel of Misfortune, Availability Calculator, and awesome-chaos-engineering, to help teams and individuals get on board with SRE culture.

Postmortems vs. Retrospectives: When (and How) to Use Each Effectively

When we announced the launch of our Retrospectives Guide, we wrote about the value of scaling the continuous improvement mindset beyond Product Development at PagerDuty by establishing the RetroDuty community. In this installment of our blog post series on retrospectives, I highlight the differences between postmortems and retrospectives. You might have heard of postmortems and/or retrospectives before reading our guides.

OnPage's New Two-Way Dispatcher and User Communications Feature

OnPage is pleased to announce a new, innovative two-way dispatcher and user communications feature launching next month, allowing system administrators to communicate with on-call healthcare providers. OnPage wanted to launch a feature where a console dispatcher can initiate and send secure messages to the right providers. After receiving a dispatcher message, on-call providers can reply to it. This new feature converts one-way communications into two-way threaded exchanges.

Site Reliability Engineering: Why You Should Adopt SRE

Site reliability engineering is a term coined by Google engineer Benjamin Treynor in 2003, when he was tasked with making sure that Google services were reliable, secure and functional. He and his team eventually wrote the book on SRE, which is available online for free to anyone interested in researching and implementing SRE best practices.

Severity Matrix Updates

We’re on a mission to make responding to incidents a bit less chaotic. One of the best features we offer (we’re definitely not biased, no way) is a simple way to define how a severity gets determined when you open an incident. We call it the severity matrix, and today it has a new look. Previously, we had a preset list of conditions and impact that allowed you to pick a severity that matched them.
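To make the idea concrete, here is a minimal sketch of how a preset matrix of conditions and impact could map to a severity. This is an illustration only, not the product's actual data model or API: the condition names, impact names, severity levels, and the `pick_severity` helper are all hypothetical.

```python
from enum import Enum


class Severity(Enum):
    # Hypothetical severity levels; a real matrix may define more or fewer.
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


# Hypothetical preset matrix: (condition, impact) -> severity.
# The condition and impact labels are assumptions made for this example.
SEVERITY_MATRIX = {
    ("unavailable", "all customers"): Severity.SEV1,
    ("unavailable", "some customers"): Severity.SEV2,
    ("degraded", "all customers"): Severity.SEV2,
    ("degraded", "some customers"): Severity.SEV3,
}


def pick_severity(condition: str, impact: str) -> Severity:
    """Return the severity that matches the given condition and impact."""
    try:
        return SEVERITY_MATRIX[(condition, impact)]
    except KeyError:
        # Fail loudly instead of guessing a severity for an undefined pairing.
        raise ValueError(
            f"no severity defined for {condition!r} / {impact!r}"
        ) from None


if __name__ == "__main__":
    print(pick_severity("degraded", "all customers").name)
```

Keeping the matrix as plain data means a team could change what counts as a given severity without touching the lookup logic.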

From Mayhem to Modernization: The Evolution of Critical Incident Management

Let’s face it, managing a critical incident has never been a walk in the park. Even in the “good old days,” before the great cloud revolution and the onslaught of digital transformations, an incident often meant mayhem. Processes were manual, time-consuming, and difficult to execute, document, and learn from. Getting all the right people in the “same room” at the right time was nearly impossible. Lots of time was wasted chasing down the right folks.