
Incident Management

The latest news and information on Incident Management, On-Call, Incident Response, and related technologies.

Reduce MTTR and Take Automation to a New Level with PagerDuty Global Event Orchestration

PagerDuty’s Global Event Orchestration is now generally available. Global Event Orchestration’s powerful decision engine enriches events, controls their routing, and triggers self-healing actions based on event data. Teams can use this functionality across any or all services within PagerDuty. This feature is a continued investment in Event Orchestration, demonstrating PagerDuty’s commitment to providing customers with best-in-class automation capabilities.

Incident Response Guide

Site reliability engineering (SRE) is a critical discipline that focuses on ensuring the continuous availability and performance of modern systems and applications. One of the most vital aspects of SRE is incident response, a structured process for identifying, assessing, and resolving system incidents that can lead to downtime, revenue loss, and brand reputation damage.

Automated Incident Management

Automated Incident Management is the process of automating some or all of these tasks through various means. Automated incident management can improve incident response time and reduce unnecessary work, such as when an issue has minimal impact. AlertOps can help automate incident management by creating tickets in help desk systems, applying filters and rules, and escalating alerts.

Alarm Notification Software: SIGNL4 is test winner

The renowned German manufacturing magazine “Factory Innovation” recently conducted a comprehensive practical test of four leading alarm notification software products for industrial manufacturing in its latest issue (01/23). The four alarm systems evaluated were the Alarm Control Center from Alarm IT Factory (a spin-off of Siemens AG), ALERT 4.0 from Micromedia, the Alarm and Information Portal (AIP) from VIDEC, and SIGNL4 from Derdack.

Our A, B, Cs of external communications

Communication carries more weight than ever before. Businesses are far more connected to their customers given the number of channels they can communicate through: Twitter, Instagram, Facebook, and even TikTok. Because of this, it's essential to prioritize these lines of communication throughout your day-to-day. Some might even say that over-communicating is the best way forward. Why? No one likes a company that looks like a black box with zero insight into what's happening.

Time to Resolution: What is it, Why You Need it, And How to Calculate it

Ready, set, go: when it comes to customer service, it's a race against the clock. Customers expect lightning-fast responses and complete solutions to their problems. But what happens when your help desk can't keep up with the pace? The answer is simple: frustration, dissatisfaction, and potentially lost clients. That's why measuring and improving Time to Resolution (TTR) is crucial. As a customer, there's nothing more irritating than dealing with a slow or ineffective help desk.

How to prepare for, deal with, and recover from IT outages

The average cost of an IT outage is $12,900—per minute. And when it comes to a “significant outage,” organizations reported the average overall cost was a whopping $1,477,800. On the latest podcast episode of That’s great IT, I spoke with Scott Lee, AVP for infrastructure and ITOps at Arch Mortgage Insurance Company, part of Arch Capital Group, about how organizations can best navigate IT outages.

Global Event Orchestrations Demo

Frank Emery, Principal Product Manager, joins the Twitch stream to talk about and show off enhancements to Event Orchestration, featuring the new Global Event Orchestrations feature. Global orchestration rules will enable your organization to suppress, annotate, and customize events for all services in your PagerDuty account. This new feature is available to all accounts with AIOps plans.

Transforming Incident Management with KPIs: A Comprehensive Guide

In modern times, the significance of digital experiences cannot be overstated across various industries. Thus, a well-designed and effective incident management system is essential to keep businesses running smoothly and prevent revenue loss. The ability to respond to and resolve incidents promptly enhances the dependability and trustworthiness of businesses in the eyes of their users. Conversely, failure to handle incidents efficiently can lead to negative consequences.