Operations | Monitoring | ITSM | DevOps | Cloud

Incident Management

The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

SRE Essentials: Building a Team and Culture

What differentiates tech companies that weather digital storms with unwavering resilience? In many cases, the answer lies in a deeply ingrained SRE culture, which fosters proactive approaches to system reliability. Site Reliability Engineering (SRE) culture extends beyond mere tech tools and automated scripts. It emphasizes proactive care, shared responsibility, and continuous improvement, leveraging incident management software as a vital component in fostering these core values of SRE.

Tracking developer build times to decide if the M3 MacBook is worth upgrading

All incident.io developers are given a MacBook which they use for their development work. That meant when Apple released the M3 MacBook Pros in October, people naturally started asking questions like “wow, how much more productive might I be if my laptop looked that good?” and “perhaps we’d be more secure if our machines were Space Black 🤔” Pete’s (our CTO) response to this was “if you can prove it’s worthwhile, we’ll do it”

BigPanda's latest Unified Console features unveiled

In the fast-paced realm of incident management and response, the need to stay ahead is more vital than ever. In recognition of this, BigPanda has significantly enhanced the Unified Console, introducing a suite of new features designed to revolutionize incident handling. Let’s explore these transformative updates and how they can redefine your approach to incident management.

What is a multi-cloud management platform?

As an IT leader, you’re acutely aware of the struggles of juggling multiple cloud environments, from integration headaches to holistic incident management to monitoring multiple clouds at once. Seeking a more efficient multi-cloud management solution is crucial to alleviate these pressures and streamline your cloud operations.

Episode 23: Zero-Downtime Updates with Todd Whitney

With limited error budgets and low user tolerance for maintenance window, the ability to execute routine updates without a maintenance window is an increasingly important socio-technical capability. Hear from Todd Whitney, who recently spoke at HashiConf about how PagerDuty performs updates while upholding its promise to customers of taking zero maintenance windows.

How MSPs and MSSPs can reduce risk and liability for their clients

For 83% of companies, a cyber incident is just a matter of time (IBM). And when it does happen, it will cost the organization millions, coming in at a global average of $4.35 million per breach. Add to that stringent data protection laws and the growing frequency and reach of ransomware and other sophisticated attacks.

Why monitoring your application is important

Effective monitoring and observability tools are critical for modern enterprises. Daily operations, digital transformation, moving to a cloud-native architecture, and an ever-evolving tech stack all require ITOps, DevOps, and SRE teams to monitor increasingly complex systems. So what happens if your applications suddenly cease to function? Every moment of downtime translates to lost income, decreased customer satisfaction, and harm to your company’s reputation.

APAC Retrospective: Learnings from a Year of Tech Turbulence

Throughout 2023, one thing has become abundantly clear: regardless of an organization’s size or industry, incidents are inevitable. Recently across the APAC region, we’ve seen numerous regulatory bodies clamp down on large companies who are failing to provide acceptable service, with some handing out quite severe penalties. For many, the cost of an incident is no longer just lost revenue and customer trust, but financial penalties and business restrictions.