Incident Management

The latest news and information on incident management, on-call, incident response, and related technologies.

What is DORA and how will it affect me?

The Digital Finance Strategy is a European directive that aims to support and develop digital finance in Europe while maintaining financial stability and consumer protection. The package has three main components. In this blog post, we’ll attempt to summarize the 113-page proposal for the Digital Operational Resilience Act (DORA), highlighting how it will apply to incident management at financial entities. Side note: we also wrote a blog post about the other DORA, also known as DevOps Research and Assessment.

The keys to establishing resilient infrastructure

Infrastructure resilience is essential for any modern IT environment. Downtime is expensive. Beyond the stresses of day-to-day operations, you want to be confident that your IT systems will continue functioning during service disruptions, hardware failures, or natural disasters. Establish a reliable, resilient infrastructure to minimize downtime, improve customer trust, and protect your business’s revenue and reputation.

Downtime: Understanding and Minimizing Outages

Downtime isn’t just about systems going offline. It’s about how well your business can adapt and keep moving forward. Whether it’s a minor glitch or a large-scale outage, it affects revenue, productivity, and the trust your customers place in your services. For instance, in July 2024, CrowdStrike’s Falcon platform faced an outage that cost Fortune 500 companies an estimated $5.4 billion. Businesses that had proactive strategies recovered faster, minimizing the damage.

Top 5 IT outages detected by StatusGator

StatusGator is the world’s best status page aggregator: we aggregate the status of thousands of cloud services and hosted applications from their official status pages. But everyone knows official status pages are often behind, and in those critical moments before the status page is updated, you might be thinking “Is it just me? Or is it really down?” StatusGator’s Early Warning Signals solves that by alerting you before providers even acknowledge the incident.

G2: Squadcast Leads in Incident Management and Secures Key Wins Across IT Alerting

We’re thrilled to share that Squadcast has been recognized as a Leader for the second time in the Incident Management Category. This win celebrates our pioneering role in Unified Incident Management, where we bring together On-Call Management, Incident Response, Workflow Automation, AI/ML-powered Noise Reduction, and SLO tracking—all in one platform.

Best Practices for Choosing a Status Page Provider

Downtime is inevitable, but what sets successful businesses apart is how they handle it. A key part of incident management is incident communication with both internal and external stakeholders. A status page is a crucial tool for maintaining clear communication with users during outages or service interruptions. Numerous status page providers are available, each with different features. This article will guide you through best practices for selecting a provider that suits your needs.

Mastering regulatory compliance with incident.io

The origin of incident.io goes back to our days building Monzo, a UK-based bank, where Stephen, Pete, and I first crossed paths. As a bank, compliance with numerous regulations was, unsurprisingly, a top priority. When it came to incident management—something we were very involved in—this meant that every aspect of reporting, policy adherence, and root cause analysis (or "contributing factors," as we called it) had to be managed consistently and meticulously.

Demo Roundups! Operations Center Modernization

Solutions Consultants Nick Gallegos and Gurinder Singh show how the PagerDuty Operations Cloud addresses key challenges through Operations Center Modernization. Discover how it unifies your IT operations stack across Security, Network, and DevOps centers, automates remediation, and eliminates the need for a dedicated NOC by serving as a virtual operations center for distributed teams.

Update October 2024 - AI-based summary of alarm details and comprehensive audit logs

Our October update brings you AI-based summaries of alarm details, which make complex or technical content much easier to understand in a matter of seconds. In addition, there is now a comprehensive audit log that records every change made to the system in a traceable way. As always, you can find all the details in this blog article.