Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Sponsored Post

Best practices when managing an outage

There's never a good time for a service outage. From the moment it hits, it starts affecting your stakeholders: essential daily tasks are curtailed while your team enters emergency response mode. The surest way to mitigate damage and recover quickly is to follow a set of best practices, and it's far better to plan them before an outage occurs. If you wait until it happens before you start developing a response, you will be far behind where you need to be for a quick resolution. This guide will help you create a set of best practices for your organization so you can work toward faster and more effective responses.

Implementing SLAs, SLIs, and SLOs: A guide to monitoring best practices

Implementing SLAs, SLIs, and SLOs is essential for effective monitoring and maintaining optimal system performance. As companies grow, they may add a significant number of KPIs that burden their IT assets, leading to system sluggishness and employee complaints. Developers must balance business needs with IT processes, and SLAs, SLIs, and SLOs can help them achieve this balance.

Top 6 Tips for Improving MTTx

In our research for the inaugural State of Availability Report, we asked 1,900 engineers about mean time to detect (MTTD) and mean time to recovery (MTTR) as two leading incident management Key Performance Indicators (KPIs) strongly associated with availability. We learned that fewer than 15% of respondents are tracking their MTTD. It takes twice as long to discover an issue as it does to resolve it.

Best practices for IT incident management

Today, many digital technologies in IT can operate with minimal human intervention. However, while they boost productivity and drive growth, any failure or unpredictable behavior can pose a significant challenge for ITOps and DevOps teams. Effective IT incident management helps minimize the impact of incidents on business operations and ensures that systems are restored as quickly as possible.

The future of AI

It’s no secret that every ITOps leader faces an ever-increasing number of alerts. Since the dawn of digital, alerts have served an important purpose. But sometimes all those alerts become overwhelming noise, and sorting out what is and is not a priority becomes challenging. The good news is that artificial intelligence (AI) and machine learning (ML) are adept at processing large data sets in real time, finding patterns and aiding decision making.

How ITOps is evolving to support brick-and-mortar organizations

To hear Ehab Tarabay explain it, the need for retailers to continue evolving their digital operations is an age-old problem. I recently hosted Tarabay, head of workplace IT services at TMF Group, on our That’s Great IT podcast. As an avid information technology specialist with a track record of more than 20 years in the technology field, he had a unique perspective to share about the shift that’s happening in retail right now.

How to be successful with Unified Analytics

As an ITOps professional, it can be challenging to justify all of your actions to your organization. After talking with many of you, we saw first-hand the pains and gaps around showing the impact of your team and the constant struggle to measure how you’re improving. That’s where Unified Analytics comes into play.

The Incident Commander Role: Duties & Best Practices for ICs

Imagine that a critical incident (a major outage, cyberattack or disaster) occurs out of nowhere in your company. You'll want to minimize the damage and get back to normal operations as quickly as possible. But how will you do that if you have no idea how to manage such incidents? This is where incident commanders come in. They're trained professionals who lead the response to critical incidents.