Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Why Your Team Needs an Automation Center of Excellence

Read the full ebook, The Value of Implementing an Automation Center of Excellence, here. Automation has been a proven change-maker for business operations for decades. In this era of technology and innovation, its use is geared towards streamlining repetitive tasks, boosting developer productivity, and reducing operational costs.

How to Improve Your Service Reliability with ilert Status Pages

According to the Uptime Institute, over the last year the number of IT incidents slowly declined while the average cost of each incident grew. As dependency on digital services increases, the cost of ⅔ of all outages exceeds $100,000. Stakes are rising, and more and more companies are investing in proactive incident management.

Harness AI for financial services IT

IT operations teams in the financial services industry face serious challenges. Customers expect a seamless experience across a complex landscape including online platforms, mobile devices, and ATMs. Competition is fierce. Technology evolution continually disrupts the marketplace. These factors create obstacles for the teams tasked with ensuring near-perfect service availability while continuing to innovate.

The power of context in root-cause analysis

The ability to quickly and accurately identify the root cause of IT incidents is paramount. According to EMA Research, more than 80% of IT professionals said a solution that could generate an accurate summary of alerts and incidents, including the likely root cause, would be transformational or high value. Respondents noted that such a solution would reduce mean time to resolution (MTTR) by 10 to 30 minutes.

Better multi-timezone support for On-call overrides

Today, we are bringing enhancements to on-call overrides. For many remote teams using Spike, we are addressing the need to manage overrides across multiple time zones. This new design makes it easy to see override times in the local time of the person taking over. It adds clarity and helps you be mindful about on-call times. We also focus on clearly showing who is taking over on-call duties, enhancing overall management and coordination.

AIOps use cases: Technical, operational, and business

ITOps stands at a crossroads: Teams need help managing high volumes of alerts and coordinating between different tools and teams. They must balance the agility offered by cloud technologies and the stability provided by on-premises solutions. Success depends on adaptability and clarity, and on synchronized technology stacks that keep IT operations seamless. AIOps, a term coined by Gartner, provides a straightforward way to improve IT operations.

MTBF, MTTR, MTTF, MTTA: Incident Metrics Explained

When it comes to managing incidents and ensuring operational efficiency, understanding key metrics is crucial. Among the most important are MTBF (Mean Time Between Failures), MTTR (Mean Time To Repair), MTTF (Mean Time To Failure), and MTTA (Mean Time To Acknowledge). In this blog, we'll explore these metrics along with some best practices and practical applications.
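As a rough illustration of how these metrics relate to each other, here is a minimal Python sketch. It is not taken from the blog itself: the `Incident` record, its field names, and the one-week observation window are assumptions made for the example. It also assumes MTTA and MTTR are both measured from the moment the incident started, and that MTBF is total uptime in the window divided by the number of failures. Teams define these start and end points differently, so treat the numbers as one possible convention.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions for this sketch."""
    started: datetime       # when the failure began
    acknowledged: datetime  # when a responder acknowledged the alert
    resolved: datetime      # when service was restored


def mean_delta(deltas):
    """Average a collection of timedeltas."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)


def mtta(incidents):
    """Mean Time To Acknowledge: average of (acknowledged - started)."""
    return mean_delta(i.acknowledged - i.started for i in incidents)


def mttr(incidents):
    """Mean Time To Repair: average of (resolved - started)."""
    return mean_delta(i.resolved - i.started for i in incidents)


def mtbf(incidents, window_start, window_end):
    """Mean Time Between Failures: uptime in the window / number of failures."""
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    uptime = (window_end - window_start) - downtime
    return uptime / len(incidents)


# Two made-up incidents inside a one-week observation window.
incidents = [
    Incident(datetime(2024, 1, 2, 10, 0), datetime(2024, 1, 2, 10, 5), datetime(2024, 1, 2, 11, 0)),
    Incident(datetime(2024, 1, 5, 14, 0), datetime(2024, 1, 5, 14, 15), datetime(2024, 1, 5, 16, 0)),
]
window_start = datetime(2024, 1, 1)
window_end = datetime(2024, 1, 8)

print("MTTA:", mtta(incidents))
print("MTTR:", mttr(incidents))
print("MTBF:", mtbf(incidents, window_start, window_end))
```

MTTF is left out on purpose: it is usually applied to components that are replaced rather than repaired, so it would need a different kind of record (time in service until failure) than the incident log assumed here.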

How the PagerDuty Operations Cloud Can Play a Part in Your Digital Operational Resilience Act (DORA) Strategy

Since I wrote DORA vs DORA!, a number of people have asked if I could give more practical advice on how the PagerDuty Operations Cloud can play a part in helping firms in the Financial Services Industry (FSI) to meet their obligations under DORA. Let me try to do that now.

Building the Best Incident Response Team

When it comes to critical incident management, IT teams require a structured approach that will ensure that any cybersecurity event is swiftly remediated. And no incident management plan is complete without a clearly defined incident response team. Whether your team is looking to establish an incident response team from scratch or just improve existing response practices, this blog will help your organization understand what it takes to build the best incident response team.

Redefining incident management: the incident way

Gone are the days when incidents were manual to resolve, invisible to customers, and viewed through a negative lens. This is part two of the virtual event series, in which we dive into The Incident Way, our fresh take on what incidents should look like, and hear stories from customers putting these principles into practice.