Incident Management

The latest news and information on incident management, on-call, incident response, and related technologies.

How to detect anomalies in logs, metrics, and traces to reduce MTTR with Elastic Machine Learning

Elastic Observability has extensive machine learning capabilities that support and improve analysis in APM. Learn techniques for correlating and detecting anomalies of telemetry data from APM agents for a particular application.

Blameless culture drives incident learning and other key insights from Catchpoint's 2022 SRE Report

SRE is a constantly evolving field, responding to the challenges of increasing reliance on tech and the opportunities of its evolving abilities. Reliability has to remain a step ahead of the cutting edge, whether it’s navigating remote work, implementing AI assistance, or optimizing internal processes. But how do we know that SRE is keeping up? We’re proud and excited to announce the results of the SRE Survey we ran in partnership with Catchpoint.

Expanding Incident Response with Microsoft Teams

Last week we launched a number of features across the PagerDuty Operations Cloud portfolio to help teams minimize downtime and protect customer experience. One of the areas where PagerDuty continues to invest is collaboration and communication during incident response to ensure that all impacted stakeholders across the business are updated in real-time.

Managing a Slew of Monitoring Tools? Here's How to Make Them Talk.

Engineering teams use a lot of single-domain monitoring tools. In fact, the average team manages and maintains 16 monitoring tools — and up to 40 — according to Moogsoft’s State of Availability Report. While IT leaders select and implement these tools to save teams time, our research finds they do quite the opposite. Engineers spend far and away more time on monitoring than they do on any other task — innovative, value-creating tasks included.

The Importance of Role-Based Messaging in Healthcare

Do you remember the classic board game where you go back and forth with your opponent, deducing which character on the board each of you has selected? It’s still played by children today, and unfortunately by healthcare teams as well. Every day, healthcare teams without systems in place for role-based messaging are forced to play a game of “Guess Who?” to work out who is on call.

Building an incident management process

In this podcast, our panellists discuss the foundations that any team needs to put in place when designing their incident management process. Starting from the basics of defining what we really mean by an incident, to how to set your severity levels, roles and statuses, Chris and Pete share their tips for building solid foundations to run your incidents.

3 questions to ask in the build vs buy debate for incident response tooling

As a former incident responder and now as a responder advocate for FireHydrant, I’ve seen the “build vs. buy” debate play out many times. In fact, I even supported the tool that former employers used for managing incidents for years before they decided to buy (more on that in a future blog post).

Webinar: Real talk: automation for ITOps

IT operations move fast. If you’re an ITOps leader, you need to be moving just as fast to make sure your team has what it needs. Positioning your team for success isn’t easy: complexity in IT is increasing every year and can reach a point where it exceeds a person’s capacity to keep pace. In the face of massive growth, ITOps teams can face major challenges with productivity, burnout and efficiency.

Early stage data teams: a balancing act

Most well-established data teams have a clear remit and a well-defined structure for what they work on and when, from the scope of each role (engineer to analyst) to which part of the business they work with. At incident.io, we have a two-person data team (soon to be three), and both of us are Product Analysts.