The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

ROI of Reducing MTTR: Real-World Benefits and Savings

Mean Time to Repair (MTTR) is a critical metric in IT Operations and Incident Management. Reducing MTTR is not just a technical goal but a strategic business imperative, driving significant Return on Investment (ROI) through tangible and intangible benefits. This post examines the real-world benefits and savings achieved by reducing MTTR and why it matters in contemporary business environments.

Managing Vendor Incidents: Customer Impact That Isn't Your Fault

One of the first key tenets of cloud computing was that “you own your own availability”, the idea being that the public cloud providers were making infrastructure available to you, and your organization had to decide what to use and how to use it in order to meet your organization’s goals. The cloud providers have no knowledge of your applications or their KPIs.

PagerDuty Executive Spotlight Series: Vodafone

Vodafone is a Global 500 telecommunications company in Europe and Africa servicing over 320 million mobile customers across 21 markets. In this PagerDuty Executive Spotlight, we sat down with Ahmed Elsayed, UK CIO & Digital Engineering Director at Vodafone, to discuss his experience unifying a global engineering team to streamline the development and deployment of digital products and services to ensure an exceptional customer experience.

Incident Management vs Problem Management: Definition & Differences

Imagine this: your company’s website suddenly goes down during a peak sales hour, leaving customers frustrated and potential revenue lost. This situation calls for immediate action, which is where Incident Management comes into play. But what happens next? If this issue recurs, it signals the need for a deeper investigation—enter Problem Management.

Alerting with Twilio: Connect Your Monitoring with the Top-1 Communications Platform

You might be surprised. Why does ilert, a platform dedicated to alerting and incident management, publish anything about connecting monitoring solutions directly to Twilio, bypassing an incident management tool? Aren't they taking the bread out of their own mouths, you might think. Having worked on DevOps incident management since 2009, we believe every solution fits specific needs.

Balancing Centralization and Autonomy: The Key to Automation at Scale

The recent global outage reminds us that identifying issues and their impact radius is just the first step in a lengthy remediation process. Incidents are inevitable; how we prepare for and learn from them is what sets teams up to respond more effectively next time. As we saw from the remediation steps taken by enterprises around the world, implementing a known fix across a large number of environments, potentially managed by many distributed teams, can be a gargantuan challenge.

How Stress Affects Our Learning Abilities in Incidents (And What To Do About It)

While retrospectives provide a valuable pathway for learning outside of the flow of work, we also want learning to happen during an incident or unexpected event as it unfolds. This can be challenging due to the negative impact of stress on our ability to learn and navigate difficult situations. In this article, we’ll dig into how stress inhibits our ability to learn and what we can do about it.

Introducing Squadcast's Audit Logs: Enhanced Visibility and Control

Maintaining comprehensive records of user and entity-related changes within your Incident Management platform is crucial. Organizations have long relied on external analytics tools for these insights. However, the demand for an integrated solution within Squadcast has been growing. We are excited to introduce Squadcast's Audit Logs feature, designed to address this need directly within our platform.

Incident Metrics: Exploring MTTF

Metrics play a pivotal role in assessing performance, identifying areas for improvement, and ensuring optimal service delivery in IT. One such critical metric is MTTF (Mean Time To Failure). It measures the average amount of time a system or component is expected to operate before experiencing a failure. But what exactly is MTTF, and why is it essential to managing IT infrastructure?
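
As a rough illustration of the definition above, MTTF can be computed as the average of the observed operating times before failure. The sketch below assumes each entry is the number of hours one unit ran before it failed; the function name `mttf_hours` and the sample figures are hypothetical and do not come from the article.

```python
from statistics import mean


def mttf_hours(operating_hours):
    """Return Mean Time To Failure: the average operating time before failure.

    operating_hours: a sequence of hours each unit ran before failing
    (an assumed input format, chosen for illustration).
    """
    if not operating_hours:
        raise ValueError("need at least one observed failure")
    return mean(operating_hours)


# Hypothetical data: hours that four units ran before failing.
sample = [1200.0, 950.0, 1430.0, 1100.0]
print(mttf_hours(sample))  # 1170.0
```

The empty-input check is a deliberate choice: with no observed failures there is nothing to average, so raising an error is safer than returning zero.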