The latest News and Information on Incident Management, On-Call, Incident Response and related technologies.

AI Reliability, Part 2: When the Datacenter Becomes the Bottleneck

In Part 1, we talked about all the hidden complexity inside AI systems: the pipelines, GPUs, embeddings, vector databases, orchestration layers, and everything else that quietly determines how reliable an AI-first product really is. But all of that software still rests on something far less glamorous: the physical infrastructure underneath it.

Major Cloud Outages of 2025

Cloud outages in 2025 ranged from minor incidents affecting a subset of users to major ones affecting hundreds or thousands. Services such as Cloudflare and AWS, on which many other services depend, experienced outages whose effects cascaded to those dependents. Let's look at some of the major cloud outages of 2025.

Microsoft Teams outage - December 10th, 2025

On the morning of December 10, 2025, Microsoft Teams experienced a service disruption affecting users across Australia. Although Microsoft 365 users reported issues across several apps, the hardest-hit service was Microsoft Teams, which became completely unusable for many organizations. Microsoft did not acknowledge the incident until 03:46 UTC, whereas StatusGator identified the issue at 02:52 UTC through incoming outage reports and delivered an Early Warning Signal at 03:01 UTC.

What Is IT Incident Response?

“We’ve got a new alert – have you seen it yet?” “Which one? The CPU spike or the unusual login?” “The login. Same region as yesterday. But the CPU thing looks suspicious too.” “…Alright, I’ll check the firewall logs. You take the containers.” “Perfect. Let’s hope this doesn’t turn into another all-hands situation.” Does this conversation sound familiar?

Every Business Needs a Robust Incident Response Strategy

In today's digital landscape, businesses face an increasing number of cyber threats that can compromise sensitive data, disrupt operations, and tarnish their reputation. As companies adopt more complex technological solutions, they must be prepared for the inevitable risk of security incidents. Having a well-established, effective incident response strategy is no longer optional but essential. This article explores why incident response solutions are critical for every business and how they play a pivotal role in safeguarding an organization's assets, reputation, and continuity.

New features: AI SRE, Merge alerts, and Status pages for thousands of services

As we head into the holiday season, the ilert team is doing the opposite of slowing down; we’re ramping up. Over the past weeks, we’ve shipped a wave of impactful improvements across alerting, AI-powered automation, the mobile app, and status pages. From major upgrades that reshape how teams triage incidents to smaller refinements that remove daily friction, this release is packed with updates designed to make on-call and operations smoother, smarter, and faster. Let’s dive in.

When major IT incidents occur, AI can deliver speed and transparency

The recent Cloudflare outage served as a stark reminder of how fragile the global digital ecosystem can be when it depends on a single point of failure. Within minutes, thousands of websites that rely on Cloudflare’s CDN, from Fortune 500 brands to SaaS platforms and consumer apps, went offline, and they stayed down for hours. The business impacts were severe, with Shopify alone suffering over $4 million in losses while downstream merchant impacts potentially exceeded $170 million.