The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

The ultimate guide to on-call schedules

An Ultimate Guide to on-call schedules? You might think this sounds overly grandiose for what’s essentially putting people into a list and rotating through them. But you’d be flat-out wrong. Getting your on-call setup right is as real and as important as it gets, and getting it wrong can lead to prolonged incidents, burnt-out employees, and a damaged company reputation.

Custom Milestones: Empowering Enterprise Incident Management

Milestones have been central to our platform since day one, helping users track incident progress and drive automation. We're excited to introduce our enhanced Milestone feature, offering unparalleled customization. Now, you can fine-tune your incident management process to perfectly align with your organization's specific policies and workflows.

Preparedness as a Competitive Advantage: Building Resilience Year Round

The recent global IT outage is a stark reminder that even the most advanced organizations can have bad days. Major disruptions can have significant downstream impacts that can lead to disappointed customers, lost revenue, deferred processes and even legal action if the downtime is considerable. With the rapid pace of technological change and the continued digital transformation intensified by AI, disruptions are no longer “unexpected.” They are part of the normal course of business.

The Role of Technology in Enhancing Incident Response Call Etiquette

The interconnectedness of today's business environment has significantly heightened the complexity of incident response (IR). The need for immediate action, precise communication, and real-time collaboration is more critical than ever. However, beyond the technical precision required in solving problems, there lies an often overlooked aspect of effective IR management: the etiquette of incident response calls.

4 New Ways to Improve Incident Management with Event Orchestration

In an era where efficiency and smart technology integration are key, 71% of technical leaders report their companies are expanding their investments in artificial intelligence (AI) and machine learning (ML) this year. With the sheer volume of data coming into the enterprise and the need for timely response, monitoring every incoming alert around the clock is impractical, and human vigilance alone is too imprecise.

6 top incident management use cases for AI copilots

The news is filled with buzz about how companies approach AI. As a result, many organizations are trying to identify how AI can effectively support their business goals. There seem to be infinite use cases, but finding those that add the most value is often the first challenge. In the ITOps environment, generative AI copilots can effectively improve team efficiency, share knowledge, and support day-to-day tasks to deliver immediate value.

Myth vs. Reality: Lessons in Reliability from the July 19 Outage

It was 3 AM at Newark Liberty International Airport. I was groggy, waiting in line to get my boarding pass, only to be met with a blue screen on the check-in kiosk. Needing some coffee, I learned the vendor was only accepting cash. There was clearly a big outage, and I quickly checked our systems at PagerDuty. Major outages happen multiple times per year, so frequently that we have an internal dashboard (colloquially referred to as “the internets are broken”).

AlertOps Announces Integration with ServiceNow to Enhance Incident Management and Response

AlertOps announced its new integration with ServiceNow to enhance incident management and response capabilities for ServiceNow customers. This joint effort enables AlertOps to create better experiences and drive value for customers by providing real-time notifications, bi-directional data synchronization, and seamless integrations. ServiceNow’s expansive partner ecosystem and partner program are critical in supporting the Now Platform’s $275 billion forecasted market opportunity through 2026.

Achieving Faster Mean Time to Resolution (MTTR) with AIOps

In today’s fast-paced digital world, customer satisfaction is a top priority for nearly every business. To keep customers satisfied with their services and applications at all times, businesses must reduce downtime and guarantee quick resolutions. Excessive downtime can be costly for any business and its brand reputation. Hence, adopting practices that eliminate the issues responsible for downtime is crucial for maintaining seamless IT operations.

IT Outage Notification Templates and Incident Communication Examples

Outages cost businesses across industries millions, and sometimes billions. For example, Amazon may lose up to $34 billion in sales within an hour of downtime, and a service outage back in March cost Meta nearly $100 million in revenue. However, that’s not all that was lost. Due to poor outage notifications and a lack of resolution details, many Meta users were kept in the dark about the outage. A Reddit thread shows that many users were frustrated.