Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Service Reliability Engineering and related technologies.

Meet Your Virtual Responder: PagerDuty's SRE Agent for AI-Driven Reliability

Modern SRE teams face an overwhelming challenge: too many signals, too little time. Incidents are faster, systems are more complex, and reliability targets only get stricter. What if you had a teammate who could jump in instantly—context-aware, tireless, and armed with your runbooks, metrics, and alert data? Introducing PagerDuty’s SRE Agent, the next evolution in AI-driven operations.

How a Runtime Aware AI SRE Agent Transforms System Reliability

A runtime aware AI SRE extends existing AI SRE approaches by moving beyond telemetry correlation into runtime-validated reliability. While the majority of AI SRE tools accelerate incident triage using logs, metrics, and traces, they cannot confirm execution behavior if critical runtime signals were never captured. By generating on-demand evidence inside running services, AI SRES can eliminate slow redeploy cycles, ensuring your distributed systems remain resilient under real-world traffic conditions.

Multi-Agent AI SRE Has Landed and Its Built for Your Most Complex Stacks

Once upon a time, a monolith running on a handful of servers meant that incident management, even at 2:17 AM, was something a single generalist could handle. One person with enough context across the stack could reasonably diagnose whether the database was choking, a config had changed, or a server was running hot. They’d fix it and go back to sleep.

12 DevOps Tools You Should Be Using in 2026 (SREs Included)

When everything on the internet comes with an “AI-powered” tag attached and AI fatigue is in full gear, we come to the rescue with a list of tools and services for DevOps and SREs. No AI included. Twelve tools across infrastructure, security, observability, and incident management. Mostly open source. All of them solving specific problems without a chatbot in sight.

The Incident You Never Had: Deterministic Simulations w/ Will Wilson (Antithesis CEO)

Most reliability engineering happens after something breaks. Will Wilson thinks that's the wrong place to be. As co-founder and CEO of Antithesis, the autonomous testing platform that just raised $105M in a Series A led by Jane Street, Will has spent years building the infrastructure to catch failure modes before they ever reach production. His starting point is uncomfortable: the testing practices most teams rely on are structurally incapable of finding the bugs that cause real incidents.

8 Video Workflows That Optimize IT Operations

It wasn't that long ago when Agile revolutionized IT workflow, introducing a feedback-forward process that ensured each project task was perfected and approved before moving on to the next. To execute a task with high precision, an assigned team needs a reliable arsenal of tools, including video. Project managers also need updated tool stacks to lead complex projects to completion.

Why DevOps and SRE Teams are replacing 3-4 monitoring tools with Atatus?

Your on-call engineer gets paged. A critical service is down. Error rates are spiking. They open Sentry for errors. Flip to Grafana for metrics. Pivot to Kibana to search logs. Then jump to Lumigo, but that only covers the Lambda functions, not the Node.js backend throwing the actual errors. Three tabs become five. Five become eight. Half the incident is gone and your team is still piecing together what happened instead of fixing it. Sound familiar?

Olly for SREs: 3 ways I actually use it in production

There’s a moment after an alert where you’re not fixing anything yet. You’re trying to answer a much simpler question: Is it actually down? Sometimes it’s obvious. Sometimes it’s 20 alerts at once with no clear starting point. Sometimes it’s a small upstream degradation that might cascade. Sometimes it’s just a spike that resolves on its own. That first phase is orientation. Is the signal real or transient? Is it isolated or spreading? Root cause or symptom?

Burnout Doesn't Ask Permission: Recognizing, Recovering, and Rebuilding w/ Stephen Townsend

Burnout doesn't announce itself. For Stephen Townsend, SRE team lead and host of the Slight Reliability podcast, it crept in over months of mounting pressure on a massive transformation program, and announced itself overnight with an inability to sleep. In this episode, Stephen shares his personal burnout story with rare honesty: the physical symptoms he dismissed, the org structure that left him without autonomy, and the full year it took to recover.