Operations | Monitoring | ITSM | DevOps | Cloud

Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

The Challenges of Rising MTTR - And What to Do

Data volumes are soaring. Environments are increasingly intricate. The risk of applications and systems encountering breakdowns is sky-high, and the mean time to recovery (MTTR) for production incidents is moving in the wrong direction. Disruptions not only jeopardize critical infrastructure but also have a direct impact on the bottom line of organizations. Swift recovery of affected services becomes paramount, as it directly correlates with business continuity and resilience.

Why you need an incident lead

In this clip, Adrian explains why it's important to have a dedicated incident lead. More about this episode: Today, good incident communication isn't a nice to have—it's an absolute must. But where do you even start? To help answer that question, we sat down with the VP of Engineering at SumUp, ⁠Adrián Moreno Peña⁠, to get his perspective on how organizations of all sizes can share stellar comms no matter the situation.

How SumUp benefitted from using incident.io

In this clip, Adrian explains how SumUp benefitted from using the incident.io platform. More about this episode: Today, good incident communication isn't a nice to have—it's an absolute must. But where do you even start? To help answer that question, we sat down with the VP of Engineering at SumUp, ⁠Adrián Moreno Peña⁠, to get his perspective on how organizations of all sizes can share stellar comms no matter the situation.

Building trust through incident communication with Adrián Moreno, VP of Engineering at SumUp

Today, good incident communication isn't a nice to have—it's an absolute must. But where do you even start? To help answer that question, we sat down with the VP of Engineering at SumUp, ⁠Adrián Moreno Peña⁠, to get his perspective on how organizations of all sizes can share stellar comms no matter the situation. We discuss: What it means to communicate during incidents Why Status Pages are critical in helping to build trust How you can have good comms even without a lead...and much more.

Future-Proofing IT Operations: Charter's Journey to Enhanced Reliability with Squadcast

Discover the transformative journey of Charter, a leader in global IT services, towards achieving unmatched operational reliability through the strategic use of Squadcast in this insightful webinar recording. Chris Ardagh from Charter shares valuable insights and experiences, highlighting how advanced incident management practices with Squadcast have allowed the organization to redefine benchmarks in reliability engineering.
Sponsored Post

Enterprise Incident Management: Guide & Best Practices

In today's rapidly evolving technological landscape, incident management has become a critical discipline for enterprises to ensure uninterrupted operations and an optimal customer experience. Effective incident management involves a systematic approach to promptly detecting, responding to, and resolving incidents.

What are Blameless Retrospectives? How Do You Run Them?

In most engineering organizations, everyone agrees that in complex systems, failure is inevitable. It’s possible to prevent the recurrence of certain incidents, reduce their impact, or shorten the time to resolution. However, it’s impossible to avoid them altogether. In the past, we asserted failures are a result of people’s mistakes. It was all about “the bad apple theory,” focused on finding the “guilty party” and removing them to prevent future failures.

Incident Response Team | Roles & Responsibilities Defined

When your organization faces outages, errors, security breaches, and other incidents, you need to have a plan in place to take appropriate actions as needed. However, you also need a capable team of experts filling critical roles and responsibilities to execute those actions and effectively collaborate to resolve issues quickly. An incident response team, therefore should be developed in a way that avoids skills gaps in expertise.

Incident Management Automation - What You Should Know

Automated incident management is the process of automating incident response to ensure that critical events are detected and addressed in the most efficient and consistent manner. In incident management, time is of the essence and the primary benefit of automated incident management is speed. With automation, you can accomplish time-consuming tasks much quicker. This brings down the incident response time and allows the team to focus their attention on matters that require their expertise.