Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

What Is Five 9s in Availability Metrics?

What comes to mind when you hear that an IT component has “five 9s availability”? Five 9s availability of >= 99.999% is the peak metric for IT availability. Five 9s predicts that a measured component — whether it is a server, communication line, app, service, or any other item — will be available at least 99.999% of the time during a specific period.

BigPanda and ServiceNow improve IT service management

By breaking down the silos between observability, IT operations, and service management, teams can improve service delivery and enhance IT incident management. However, this is more easily said than done. The average BigPanda customer uses more than 20 observability and monitoring data sources. Combining mountains of alert data with legacy event management systems can make it almost impossible to sift through the noise to find the most important alerts.

Don't get caught in the dark: Lessons from a Lumen & AWS micro-outage

While major outages like the recent CrowdStrike incident dominate headlines, those of us in the trenches ensuring Internet Resilience know that most of our issues are not necessarily global but localized by geography, autonomous systems, or something else. Micro-outages – those elusive, localized incidents – can pose the most persistent threat to observability.

Introducing: incident.io for Microsoft Teams

There’s a major outage. Support tickets are mounting. Everybody from engineering to legal is scrambling for information. You have more Teams notifications clamouring for attention than you do minutes to address them, and it’s hard to know where to begin. What comes next is a balancing act—mitigating the impact, updating colleagues, managing action items, or updating a status page that will be seen by millions.

Harness GenAI to enhance IT incident management

Advances in generative AI are rapidly transforming the IT operations landscape. According to Enterprise Strategy Group, 85% of organizations use or plan to deploy AI across many functional areas, including ITOps. AIOps platforms can apply advanced GenAI to quickly identify an incident’s root cause, impact, and recommend steps to resolution. When fed the correct information, AIOps gives IT teams immediate access to context-rich insights.

Building On-call: Continually testing with smoke tests

With the release of On-call, our system’s reliability had to be solid from the outset. Our customers have high expectations of a paging product—and internally, we would not be comfortable with releasing something that we weren’t sure would perform under pressure. While our earlier product, Response, was the core of a customer’s incident response process after an incident was detected, we’re now the first notification an engineer gets when something’s wrong.

Intelligent Alerting, Fewer Headaches: Insider View at ilert AIOps

You might have noticed that we released a series of AI-supported features last year. Intelligent alert grouping, developed to reduce alert fatigue, is the icing on the cake. ‍ With it, we combined all ilert AI features in a new powerful add-on that aims to reduce stress and give more clarity during IT incidents.