The latest news and information on Site Reliability Engineering (SRE) and related technologies.

Why SRE agents need orchestration, not just more tools

May 19, 2026 By Mezmo In Mezmo

Single agents are a useful starting point for SRE workflows. They are not where the architecture should end. The first version is simple enough: connect an LLM to a few tools, give it a system prompt, and point it at your infrastructure. It can summarize an alert, pull logs, answer questions, and draft a useful next step. Then the workflow gets real. You add GitHub for runbooks, Kubernetes for cluster state, PagerDuty for incident context, Prometheus for metrics, and Mezmo for telemetry.

The Follow-the-Sun Field Log: Running an SRE Rotation Across Lisbon, Singapore and Austin in One Quarter

May 19, 2026 By OpsMatters In OpsMatters

Quick note before we start. At 03:17 on a Tuesday in Lisbon, a watch buzzes against a hotel pillow. Two seconds later a phone screen lights the ceiling: P1, payments-writer-secondary, error rate seventy-eight percent. The on-call lead is twelve thousand kilometres from her desk. The team's five-minute escalation service-level objective is already running. The next ninety seconds will decide whether this is a clean save or a long retro.

What broke when engineering went fully agent-based

May 15, 2026 By Rootly In Rootly

Last year, we went fully agent-based at Rootly. Cursor, Claude Code, Codex, all of it. The productivity gains were real. However, Rigel, senior engineering manager at Rootly, started noticing a pattern emerging in his team.

LLM Observability: Lessons From MLOps w/ Maria Vechtomova (Cauchy)

May 14, 2026 By Rootly In Rootly

For nine years, Maria Vechtomova was shouting about monitoring. Nobody cared, until LLMs arrived. As co-founder of Cauchy, Databricks MVP, and one of the most followed voices in MLOps, Maria has watched the field evolve from hand-built experiment trackers to today's flood of observability tools, and her central claim might surprise you: globally, nothing has changed. The fundamentals are the same: track your code, data, and models so you can roll back when something breaks.

Zero-Code OpenTelemetry for Vert.x

May 8, 2026 By Prathamesh Sonpatki In Last9

Drop a JAR on the JVM. Get distributed tracing, RxJava context propagation, log-trace correlation, and Vert.x internal metrics. No code changes. No Maven dependency. Java 8–21. Inside the design of last9/vertx-opentelemetry v2.3.4. Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.

The Journey to Production AI: Five Steps for SRE and Platform Teams

May 8, 2026 By Mezmo In Mezmo

In a recent webinar, The Journey to Production AI, Andre Elizondo walked through what separates a working agent demo from an agent worth trusting on a 2 a.m. page. Live polls during the session put numbers behind a pattern most platform teams already feel. Most teams are early. The ones who are further along did not get there by shipping a flashier demo. They got there by treating production AI as a platform problem.

New enhancements to PagerDuty's SRE Agent: triage faster without waking a human

May 6, 2026 By Ariel Russo In PagerDuty

AI’s promise and its capabilities often diverge: developers report much faster code production but little change in how incidents are handled. When the rate of change is faster than ever but the rate of recovery from incidents isn’t moving, developers wind up stuck in firefighting mode. And when these systems fail, it’s costly. According to PagerDuty’s State of AI-First Operations, over a third of surveyed companies report losing $500K per hour of downtime.

SRE Agent Enhancements for Autonomous Triage

May 5, 2026 By PagerDuty Inc. In PagerDuty

Triage just got turbocharged with our latest PagerDuty SRE Agent enhancements!

Stop ECS Containers From Collapsing Into One Service in OpenTelemetry

May 2, 2026 By Prathamesh Sonpatki In Last9

Why ECS containers collapse under service.name = aws_ecs and how to fix it for both EC2 launch type and Fargate, including the resource-vs-log-record pitfall that quietly breaks log filtering.
