Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Here's what SLIs AREN'T

SLIs, or service level indicators, are powerful metrics of service health. They’re often built up from simpler metrics collected from the system. SLIs transform lower-level machine data into something that captures user happiness. Your organization might already have processes with this same goal. Techniques like real-time telemetry and synthetic data also build metrics that meaningfully represent service health.
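As a rough illustration of how an SLI can be built up from simpler metrics, the sketch below computes an availability-style indicator as the share of successful requests among all requests in a window. The `RequestCounts` type, its field names, and the sample numbers are assumptions made up for this example; they are not taken from the article or from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class RequestCounts:
    """Hypothetical raw counters collected from a service over one time window."""
    total: int
    failed: int


def availability_sli(counts: RequestCounts) -> float:
    """Return the fraction of requests that succeeded (good events / all events)."""
    if counts.total == 0:
        # Assumption: with no traffic in the window, report the service as healthy
        # instead of dividing by zero.
        return 1.0
    return (counts.total - counts.failed) / counts.total


# Illustrative numbers only.
window = RequestCounts(total=10_000, failed=25)
sli = availability_sli(window)
print(f"availability SLI: {sli:.4f}")
```

The two raw counters say little on their own; the ratio is closer to what a user experiences, which is the kind of transformation described above.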

The MTTR that matters

“Mean time to X” is a common term used to describe how long, on average, a particular milestone takes to achieve in incident response. There’s mean time to detect, acknowledge, mitigate, etc. And then there’s the elusive “mean time to recover,” also known as “MTTR.” MTTR, a hotly debated acronym and concept, measures how long it takes to resolve an incident on average. The problem with MTTR, though, is that it doesn’t matter.
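To make the arithmetic behind a "mean time to recover" figure concrete, here is a minimal sketch that averages incident durations. The incident timestamps are invented for illustration, and the median is included only as a point of comparison; neither comes from the article.

```python
from datetime import datetime
from statistics import mean, median

# Hypothetical incidents as (started, resolved) timestamp pairs.
incidents = [
    (datetime(2021, 3, 1, 9, 0), datetime(2021, 3, 1, 9, 20)),
    (datetime(2021, 3, 4, 14, 0), datetime(2021, 3, 4, 14, 30)),
    (datetime(2021, 3, 9, 2, 0), datetime(2021, 3, 9, 8, 40)),
]


def durations_minutes(pairs):
    """Convert (start, end) pairs into durations in minutes."""
    return [(end - start).total_seconds() / 60 for start, end in pairs]


minutes = durations_minutes(incidents)
mttr = mean(minutes)       # the "mean time to recover" for this sample
typical = median(minutes)  # middle value, shown for comparison

print(f"durations (min): {minutes}")
print(f"MTTR: {mttr:.0f} min, median: {typical:.0f} min")
```

In this made-up sample, a single long incident pulls the mean well above the other two durations, which is one reason an average on its own can be hard to interpret.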

AIOps as a modern cockpit, and why that matters

Our human capacity for ingesting information and acting on it is constant. As the systems we operate grow more complex, we need to make sure we use technology that presents us with only the relevant information we need, exactly when we need it. In aviation, this lesson was learned long ago, and now IT Ops is catching up.

Press Release: iLert achieves Amazon RDS Ready designation

Cologne, Germany – iLert GmbH, a SaaS company for alerting, on-call management, and uptime monitoring, announced today that it has achieved the Amazon RDS Ready designation, part of the Amazon Web Services, Inc. (AWS) Service Ready Program. This designation recognizes that iLert has demonstrated successful integration with Amazon Relational Database Service (Amazon RDS).

Faster Incident Resolution with Context Rich Alerts

Labelling your alert payloads, although simple, can significantly reduce the time it takes for your team to respond to incidents. In this blog, learn how Squadcast's auto-tagging feature can be a game changer by enabling intelligent labelling and routing of alerts to ultimately reduce your MTTR. A frequent problem faced by on-call engineers when critical outages occur is pinpointing the exact point of failure.
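The general idea of labelling alert payloads and routing on those labels can be sketched in a few lines. This is a generic illustration, not Squadcast's implementation or API: the rule format, the `RULES` and `ROUTES` tables, the team names, and the sample alert are all hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical tagging rules: (payload field, regex, tag key, tag value).
Rule = Tuple[str, str, str, str]
RULES: List[Rule] = [
    ("message", r"(?i)database|postgres|mysql", "component", "database"),
    ("message", r"(?i)timeout|latency", "symptom", "latency"),
    ("severity", r"(?i)^critical$", "priority", "p1"),
]

# Hypothetical routing table keyed on the "component" tag.
ROUTES: Dict[str, str] = {"database": "dba-on-call"}
DEFAULT_TEAM = "platform-on-call"


def tag_alert(payload: Dict[str, str]) -> Dict[str, str]:
    """Attach a tag for every rule whose regex matches the named payload field."""
    tags: Dict[str, str] = {}
    for field, pattern, key, value in RULES:
        if re.search(pattern, payload.get(field, "")):
            tags[key] = value
    return tags


def route(tags: Dict[str, str]) -> str:
    """Pick a team from the component tag, falling back to a default team."""
    return ROUTES.get(tags.get("component", ""), DEFAULT_TEAM)


alert = {"message": "Postgres primary: connection timeout", "severity": "critical"}
tags = tag_alert(alert)
team = route(tags)
print(tags, "->", team)
```

With tags attached up front, the responder sees the likely component and symptom alongside the alert rather than having to infer them from the raw message.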

5 Steps to Building an Effective Clinical Communication Plan

Organizations require a well-crafted clinical communication plan to streamline workflows across care teams. The communication plan must include processes, hardware and software that improve how providers perform. An effective communication plan eliminates barriers across departments and ensures that all providers are informed of patient-related incidents. High-level healthcare administrators are responsible for designing, managing and launching the clinical communication plan.

Chapter 7: In Which Sarah Experiments with Observable Low-Code

This is the seventh chapter in a series of blog posts exploring the role that intelligent observability plays in the day-to-day life of smart teams. In this chapter, our DevOps Engineer, Sarah, experiments with low-code and Moogsoft in her team’s DevOps toolchain to rush a new feature out the door and keep up with a competitor.

Streamline incident management with BigPanda's offering in the Datadog Marketplace

BigPanda is a domain-agnostic AIOps platform that helps organizations detect and resolve incidents in their complex IT environments. By unifying and correlating data from monitoring, change, and topology tools, BigPanda enables teams to quickly pinpoint the root cause of issues and prevent costly outages.