The latest news and information on Site Reliability Engineering and related technologies.

How OpenTelemetry Auto-Instrumentation Works

Most developers use auto-instrumentation as it’s meant to be used — run the Java agent, add NODE_OPTIONS, and telemetry starts flowing. When it stops, though, figuring out why can be tricky. Maybe the agent didn’t load, maybe there’s a framework version mismatch, or something else entirely. Understanding how auto-instrumentation works makes it easier to spot and fix these issues.
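To make that concrete, here is a minimal Python sketch of the core trick auto-instrumentation relies on: wrapping a well-known library entry point at load time so every call produces telemetry. It mirrors the idea only; the real OpenTelemetry agents do this at class-load (Java) or module-load (Node) time, and the printed "span" stands in for a real exported span.

    # Minimal sketch of what auto-instrumentation does under the hood:
    # intercept a library entry point and wrap it with telemetry.
    import functools
    import time

    import requests  # the library being "instrumented" in this sketch

    def instrument_requests():
        """Monkey-patch requests.get so every call is timed like a span."""
        original_get = requests.get

        @functools.wraps(original_get)
        def traced_get(url, **kwargs):
            start = time.time()
            try:
                return original_get(url, **kwargs)
            finally:
                # A real agent would create and export an OTel span here.
                print(f"span: HTTP GET {url} took {time.time() - start:.3f}s")

        requests.get = traced_get

    instrument_requests()
    requests.get("https://example.com")  # now emits a "span" automatically

This is also why a framework version mismatch can silently break instrumentation: the entry point the agent expects to wrap has moved, so the patch never lands.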

15 PHP APM Tools Worth Using in 2025

PHP powers a large swath of the web — from blogs to storefronts to APIs. But with microservices, third-party dependencies, and scaling complexity, performance can slip in subtle ways. Your app might mostly work, but small delays, occasional spikes, and hidden bottlenecks build up. An APM tool helps you see inside the black box: which functions are slow, which DB queries are hogging time, which external calls are failing or stalling.
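Language aside, the raw signal an APM collects is per-operation timing. As a rough Python illustration (with a hypothetical slow_query standing in for a database call), this is the measurement an APM tool wires in automatically across every function, query, and outbound request:

    import functools
    import time

    def timed(fn):
        """Record a call's wall-clock duration -- the raw signal an APM collects."""
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                print(f"{fn.__name__}: {elapsed_ms:.1f} ms")
        return wrapper

    @timed
    def slow_query():
        time.sleep(0.2)  # stand-in for a DB query that is hogging time

    slow_query()  # prints roughly: slow_query: 200.x ms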

How to Scale Prometheus APM for Modern Applications

When developers monitor application performance, they pick one of two paths: traditional APM tools with distributed tracing and code profilers, or metrics-driven monitoring with Prometheus. The second approach — Prometheus APM — tracks the signals that matter most: request rates, error rates, latency, and resource utilization. No agents to install, no per-host pricing, just exporters and PromQL. For most teams, Prometheus APM is where monitoring starts.
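As a sketch of that exporter-based model, here is what instrumenting request rate and latency might look like with the official Python client (prometheus_client); the metric names, labels, and port are illustrative:

    # Expose request counts and latency as metrics; PromQL does the
    # aggregation server-side. Requires: pip install prometheus-client
    import random
    import time

    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter(
        "http_requests_total", "Total HTTP requests", ["path", "status"]
    )
    LATENCY = Histogram(
        "http_request_duration_seconds", "Request latency", ["path"]
    )

    def handle_request(path):
        with LATENCY.labels(path=path).time():
            time.sleep(random.uniform(0.01, 0.1))  # simulated work
        REQUESTS.labels(path=path, status="200").inc()

    if __name__ == "__main__":
        start_http_server(8000)  # Prometheus scrapes :8000/metrics
        while True:
            handle_request("/checkout")

Error rate then becomes a PromQL expression over the same counter (a ratio of error-labeled requests to total requests) rather than something the application computes itself.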

SRE Report Retrospectives: Have AIOps Predictions Held Up?

Welcome to a new blog series where we take a candid look at the predictions, insights, and bold claims we've made in previous SRE Reports and ask the uncomfortable question: How did we do? For the uninitiated, Catchpoint's SRE Report is our annual, practitioner-driven effort to capture the pulse of the global reliability community.

Pastries with SREs: Leveling up observability and donut dunkability

In this episode of Pastries with SREs, we explore what it really means to shift left with observability, moving from reactive firefighting to proactive performance. And yes, it starts with donuts. We unpack how SREs and IT Ops teams are often stuck reacting to incidents, battling alert fatigue and swivel-chair triaging. But what if you could pull in developers earlier, and give everyone a unified view of observability data?

Observability vs. Visibility: What's the Difference?

In modern IT systems—distributed services, cloud-native platforms, and dynamic networks—just knowing that something is “up” isn’t enough. Green checkmarks on dashboards don’t tell you why performance shifted, why latency crept in, or why a perfectly healthy-looking service suddenly failed. This is where the conversation around visibility and observability begins. They sound similar, but they solve very different problems.
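One way to make the distinction concrete in code: a health check gives you visibility (the service is up), while a rich, structured event gives you observability (you can ask why it slowed down after the fact). A small Python sketch, with illustrative field names:

    import json
    import time

    # Visibility: a binary signal. It tells you *that* the service responds.
    def health_check():
        return True  # the green checkmark -- no explanation attached

    # Observability: a structured event with enough context to answer
    # questions you didn't think to ask in advance.
    def emit_event(route, tenant, db_ms, total_ms):
        print(json.dumps({
            "ts": time.time(),
            "route": route,
            "tenant": tenant,
            "db_ms": db_ms,
            "total_ms": total_ms,
        }))

    emit_event("/checkout", "acme", db_ms=412.0, total_ms=430.0)
    # The event carries the why: nearly all of the time went to the database.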

OTel Naming Best Practices for Spans, Attributes, and Metrics

An incident’s in progress. Services are slow, customers are frustrated, and your dashboards… look fine. At least, until you search for payment metrics and get 47 different names for the same signal. Suddenly, the real issue isn’t latency — it’s inconsistency. The OpenTelemetry project recently published a three-part series on naming conventions to solve exactly this problem.
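The flavor of those conventions, sketched with the OpenTelemetry Python API (the payment.* names below are illustrative, but follow the published style: lowercase, dot-separated namespaces, low-cardinality span names):

    from opentelemetry import metrics, trace

    tracer = trace.get_tracer("checkout-service")
    meter = metrics.get_meter("checkout-service")

    # One metric name, disambiguated by attributes -- not 47 variants.
    payments = meter.create_counter(
        "payment.transaction.count",
        unit="1",
        description="Completed payment transactions",
    )

    # The span name stays low-cardinality ("POST /payments", not one
    # name per user or order); variable detail goes into attributes.
    with tracer.start_as_current_span("POST /payments") as span:
        span.set_attribute("payment.method", "card")
        span.set_attribute("payment.currency", "USD")
        payments.add(1, {"payment.method": "card"})

With no SDK configured, these API calls are no-ops, so the snippet is safe to run standalone; the point is the naming shape, not the pipeline.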

Docker Daemon Logs: How to Find, Read, and Use Them

Sometimes Docker behaves in ways that catch you off guard—containers don’t start as expected, image pulls stall partway, or networking takes longer than usual to respond. In those moments, the Docker daemon logs are your best reference point. These logs capture exactly what the Docker engine is doing at any given time. They give you a running account of system state, performance signals, and events that help you understand what’s happening beneath the surface.
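Where those logs live depends on the platform; on a systemd-based Linux host, the daemon typically writes to the journal under the docker.service unit. A small Python sketch of pulling recent warnings and errors from there (assuming journalctl is available and the unit is named docker.service):

    # Pull the last 200 Docker daemon log lines from the systemd journal
    # and surface warnings/errors. Other platforms log elsewhere, e.g.
    # log files under /var/log or the Docker Desktop log directory.
    import subprocess

    result = subprocess.run(
        ["journalctl", "-u", "docker.service", "-n", "200", "--no-pager"],
        capture_output=True,
        text=True,
        check=True,
    )

    for line in result.stdout.splitlines():
        if "level=error" in line or "level=warning" in line:
            print(line)  # dockerd lines include level=... fields by default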