The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

Evaluating PagerDuty Alternatives in 2024 (Updated)

We live in times of instant gratification, where customers expect same-day delivery, round-the-clock tech support, and seamless browsing experiences. Disruptive technologies and continuous innovation have raised expectations for faster and uninterrupted delivery of services. This shift is compelling organizations to adapt their operations to meet these new demands and stay competitive.

Learning from Major Incidents: The Opportunities We're Missing

Although they are untimely, stressful, and likely to expose communication breakdowns, incidents can be a powerful tool for learning and growth in organizations. When a high-impact incident occurs, which seems to make the news on a weekly basis, the focus is often on two things: stabilizing the situation and controlling the narrative. Organizations often miss the opportunity incidents present: learning.

The Microsoft-CrowdStrike Outage: An In-Depth Analysis

On July 19, 2024, a significant outage struck systems worldwide, causing widespread disruptions across various industries. The outage was primarily linked to a faulty update to CrowdStrike’s Falcon Sensor, which led to severe issues on Windows systems. CrowdStrike is a leading cybersecurity company that specializes in protecting businesses from online threats.

Microsoft 365 Outage, MO821132: Users may be unable to access various Microsoft 365 apps and services

On Thursday evening, Microsoft 365 identified a global outage affecting users accessing various Microsoft 365 applications and services. Impacted users experienced login issues, unavailable Azure-hosted virtual machines, and constant loading screens in Microsoft 365 services, to name just a few of the issues.

A tough day for incident responders: lessons from the CrowdStrike update

Today marks a particularly challenging day for incident responders across the globe. As many of you may have noticed, a recent update from CrowdStrike has triggered widespread disruptions, causing chaos in various sectors. The ripple effects have been far-reaching and severe. The technical specifics of the issue are not the focus here, and there are experts better suited to dissect the cause; what's crucial is understanding the impact on those who manage such crises.

Nexthink Stops MS Outage From Hurting a Leading Consumer Goods Company

While individual blue screen errors are frustrating, the recent global system crashes caused by a CrowdStrike update incompatible with Microsoft Windows have wreaked havoc across entire industries since early Friday morning. Companies in the airline, media, and banking industries have faced significant disruptions, with thousands of customer-facing devices showing blue screens, leading to widespread travel delays and chaos.

UptimeRobot Alerts Spike 5x Due to Microsoft/CrowdStrike Global Issues

Given recent global events, UptimeRobot is experiencing an increased number of downtime notifications. We are currently sending out five times more notifications than usual due to a widespread outage impacting several critical services worldwide. Here’s a brief overview of the situation and how it affects our monitoring services.

The IT Scramble is On with a Microsoft Outage: Incident MO821132 - July 18, 2024

On July 18, 2024, at 6:38 pm ET, Vantage DX, Martello’s Microsoft 365 and Teams performance management solution, started to see indicators of a likely Microsoft outage impacting users’ ability to access various Microsoft 365 apps and services. Just over an hour later, at 7:41 pm ET, Microsoft issued a statement on X.

Global Microsoft Outage and Preventing Future Vulnerabilities

In a recent unexpected turn of events, a faulty component in the latest CrowdStrike Falcon update led to widespread outages, crashing Windows systems globally. The repercussions were felt across various sectors, including airports, TV stations, hospitals, and even emergency services in the U.S. and Canada. The glitch affected both Windows workstations and servers, bringing entire companies to a standstill and crashing fleets of hundreds of thousands of computers.