Incident Management

The latest news and information on incident management, on-call, incident response and related technologies.

8 Strategies for Reducing Alert Fatigue

Site Reliability Engineers (SREs) and DevOps teams often deal with alert fatigue: so many alerts arrive that it's hard to keep up, which makes it tougher to respond quickly and adds stress on top of existing responsibilities. According to a study, 62% of participants noted that alert fatigue played a role in employee turnover, while 60% reported that it resulted in internal conflicts within their organization.

Supercharged with AI

One of the most painful parts of incident management is keeping on top of the many things that happen when you’re right in the middle of an incident. From figuring out and communicating what’s happening, to ensuring you learn from previous incidents, and even capturing the right actions – incidents are hard, but they don’t need to be this hard.

Empowering your AIOps journey: Rediscovering the power of BigPanda University

We hope this message finds you well in your start to 2024. As pioneers in the field of AIOps, we understand that the landscape is ever-evolving, and staying ahead requires continuous learning. That’s why we’re thrilled to remind you of a particularly invaluable resource at your fingertips—BigPanda University.

The Catchpoint 2024 SRE Report - Five Key Takeaways

Only emerging into the mainstream in the 2010s, SRE is a relatively new discipline in tech. It’s been rapidly adopted by a widening variety of organizations, implementing constantly evolving practices. For the last six years, Catchpoint has been running a survey to take the temperature of the latest developments and trends. Check out the full report here, and read on to see our analysis on five key takeaways.

Ultima Release - xMatters

The age of Ultima is upon us! While dragons, wizards, and dungeons may only appear on a fantasy map, it takes preparation and resilience to conquer the highest-level incidents in the real world. Let's explore what's new in your xMatters inventory: To help teams better understand the criticality of incidents, use service categorizations to sort your technical and application services into different tiers.

APAC Retrospective: Learnings from a Year of Tech Outages - Dismantling Knowledge Silos

As our exploration through 2023 continues from the second blog segment, “Mobilise: From Signal to Action”, one undeniable fact persists: Incidents are an unavoidable reality for organisations, irrespective of their industry or size. In the APAC region, a surge in regulatory enforcement has been observed against large corporations failing to meet service standards, resulting in severe penalties.

Mastering IT Alerting: A Short Guide for DevOps Engineers

$575 million was the cost of a huge IT incident that hit Equifax, one of the largest credit reporting agencies in the U.S. In September 2017, Equifax announced a data breach that impacted approximately 147 million consumers. The breach occurred due to a vulnerability in the Apache Struts web application framework, which Equifax failed to patch in time. This vulnerability allowed hackers to access the company's systems and exfiltrate sensitive data.

Debugging Go compiler performance in a large codebase

As we’ve talked about before, our app is a monolith: all our backend code lives together and gets compiled into a single binary. One of the reasons I prefer monolithic architectures is that they make it much easier to focus on shipping features without having to spend much time thinking about where code should live and how to get all the data you need together quickly. However, I’m not going to claim there aren’t disadvantages too. One of those is compile times.

Tech is Easy, People are Hard - Incidentally Reliable with Suresh Kumar Khemka (Head of Infra @apna)

Settle in and listen to Suresh Kumar Khemka (Head of Platform & Infra at apna) talk about platform engineering, balancing bureaucracy and velocity at startups and tech giants, and the rippling impact of downtime at an e-commerce company. Exclusively on The Incidentally Reliable podcast, made by SREs for SREs, hosted by Zenduty.

A New Approach To Incident Management

In recent years, IT departments have faced the challenge of adapting to an evolving landscape of demands. While the primary focus of traditional incident management solutions has been to reduce downtime, it's become clear that just reducing the amount of downtime isn’t sufficient. To truly mitigate the total impact of downtime, there must be a focus on reducing the damage and costs that accumulate while you are down.