The latest news and information on Site Reliability Engineering and related technologies.

How OpenTelemetry Auto-Instrumentation Works

Most developers use auto-instrumentation as it’s meant to be used — run the Java agent, add NODE_OPTIONS, and telemetry starts flowing. When it stops, though, figuring out why can be tricky. Maybe the agent didn’t load, maybe there’s a framework version mismatch, or something else entirely. Understanding how auto-instrumentation works makes it easier to spot and fix these issues.
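To make that concrete, here is a minimal Python sketch of the core trick auto-instrumentation relies on: wrapping a well-known library entry point at load time so every call produces telemetry. It mirrors the idea only; the real OpenTelemetry agents do this at class-load (Java) or module-load (Node) time, and the printed "span" stands in for a real exported span.

    # Minimal sketch of what auto-instrumentation does under the hood:
    # intercept a library entry point and wrap it with telemetry.
    import functools
    import time

    import requests  # the library being "instrumented" in this sketch

    def instrument_requests():
        """Monkey-patch requests.get so every call is timed like a span."""
        original_get = requests.get

        @functools.wraps(original_get)
        def traced_get(url, **kwargs):
            start = time.time()
            try:
                return original_get(url, **kwargs)
            finally:
                # A real agent would create and export an OTel span here.
                print(f"span: HTTP GET {url} took {time.time() - start:.3f}s")

        requests.get = traced_get

    instrument_requests()
    requests.get("https://example.com")  # now emits a "span" automatically

This is also why a framework version mismatch can silently break instrumentation: the entry point the agent expects to wrap has moved, so the patch never lands.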

15 PHP APM Tools Worth Using in 2025

PHP powers a large swath of the web — from blogs to storefronts to APIs. But with microservices, third-party dependencies, and scaling complexity, performance can slip in subtle ways. Your app might mostly work, but small delays, occasional spikes, and hidden bottlenecks build up. An APM tool helps you see inside the black box: which functions are slow, which DB queries are hogging time, which external calls are failing or stalling.
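Language aside, the raw signal an APM collects is per-operation timing. As a rough Python illustration (with a hypothetical slow_query standing in for a database call), this is the measurement an APM tool wires in automatically across every function, query, and outbound request:

    import functools
    import time

    def timed(fn):
        """Record a call's wall-clock duration -- the raw signal an APM collects."""
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                print(f"{fn.__name__}: {elapsed_ms:.1f} ms")
        return wrapper

    @timed
    def slow_query():
        time.sleep(0.2)  # stand-in for a DB query that is hogging time

    slow_query()  # prints roughly: slow_query: 200.x ms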

How to Scale Prometheus APM for Modern Applications

When developers monitor application performance, they pick one of two paths: traditional APM tools with distributed tracing and code profilers, or metrics-driven monitoring with Prometheus. The second approach — Prometheus APM — tracks the signals that matter most: request rates, error rates, latency, and resource utilization. No agents to install, no per-host pricing, just exporters and PromQL. For most teams, Prometheus APM is where monitoring starts.
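As a sketch of that exporter-based model, here is what instrumenting request rate and latency might look like with the official Python client (prometheus_client); the metric names, labels, and port are illustrative:

    # Expose request counts and latency as metrics; PromQL does the
    # aggregation server-side. Requires: pip install prometheus-client
    import random
    import time

    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter(
        "http_requests_total", "Total HTTP requests", ["path", "status"]
    )
    LATENCY = Histogram(
        "http_request_duration_seconds", "Request latency", ["path"]
    )

    def handle_request(path):
        with LATENCY.labels(path=path).time():
            time.sleep(random.uniform(0.01, 0.1))  # simulated work
        REQUESTS.labels(path=path, status="200").inc()

    if __name__ == "__main__":
        start_http_server(8000)  # Prometheus scrapes :8000/metrics
        while True:
            handle_request("/checkout")

Error rate then becomes a PromQL expression over the same counter (a ratio of error-labeled requests to total requests) rather than something the application computes itself.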

SRE Report Retrospectives: Have AIOps Predictions Held Up?

Welcome to a new blog series where we take a candid look at the predictions, insights, and bold claims we've made in previous SRE Reports and ask the uncomfortable question: How did we do? For the uninitiated, Catchpoint's SRE Report is our annual, practitioner-driven effort to capture the pulse of the global reliability community.

Pastries with SREs: Leveling up observability and donut dunkability

In this episode of Pastries with SREs, we explore what it really means to shift left with observability, moving from reactive firefighting to proactive performance. And yes, it starts with donuts. We unpack how SREs and IT Ops teams are often stuck reacting to incidents, battling alert fatigue and swivel-chair triaging. But what if you could pull in developers earlier, and give everyone a unified view of observability data?

Observability vs. Visibility: What's the Difference?

In modern IT systems—distributed services, cloud-native platforms, and dynamic networks—just knowing that something is “up” isn’t enough. Green checkmarks on dashboards don’t tell you why performance shifted, why latency crept in, or why a perfectly healthy-looking service suddenly failed. This is where the conversation around visibility and observability begins. They sound similar, but they solve very different problems.
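One way to make the distinction concrete in code: a health check gives you visibility (the service is up), while a rich, structured event gives you observability (you can ask why it slowed down after the fact). A small Python sketch, with illustrative field names:

    import json
    import time

    # Visibility: a binary signal. It tells you *that* the service responds.
    def health_check():
        return True  # the green checkmark -- no explanation attached

    # Observability: a structured event with enough context to answer
    # questions you didn't think to ask in advance.
    def emit_event(route, tenant, db_ms, total_ms):
        print(json.dumps({
            "ts": time.time(),
            "route": route,
            "tenant": tenant,
            "db_ms": db_ms,
            "total_ms": total_ms,
        }))

    emit_event("/checkout", "acme", db_ms=412.0, total_ms=430.0)
    # The event carries the why: nearly all of the time went to the database.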

OTel Naming Best Practices for Spans, Attributes, and Metrics

An incident’s in progress. Services are slow, customers are frustrated, and your dashboards… look fine. At least, until you search for payment metrics and get 47 different names for the same signal. Suddenly, the real issue isn’t latency — it’s inconsistency. The OpenTelemetry project recently published a three-part series on naming conventions to solve exactly this problem.
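The flavor of those conventions, sketched with the OpenTelemetry Python API (the payment.* names below are illustrative, but follow the published style: lowercase, dot-separated namespaces, low-cardinality span names):

    from opentelemetry import metrics, trace

    tracer = trace.get_tracer("checkout-service")
    meter = metrics.get_meter("checkout-service")

    # One metric name, disambiguated by attributes -- not 47 variants.
    payments = meter.create_counter(
        "payment.transaction.count",
        unit="1",
        description="Completed payment transactions",
    )

    # The span name stays low-cardinality ("POST /payments", not one
    # name per user or order); variable detail goes into attributes.
    with tracer.start_as_current_span("POST /payments") as span:
        span.set_attribute("payment.method", "card")
        span.set_attribute("payment.currency", "USD")
        payments.add(1, {"payment.method": "card"})

With no SDK configured, these API calls are no-ops, so the snippet is safe to run standalone; the point is the naming shape, not the pipeline.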

Docker Daemon Logs: How to Find, Read, and Use Them

Sometimes Docker behaves in ways that catch you off guard—containers don’t start as expected, image pulls stall partway, or networking takes longer than usual to respond. In those moments, the Docker daemon logs are your best reference point. These logs capture exactly what the Docker engine is doing at any given time. They give you a running account of system state, performance signals, and events that help you understand what’s happening beneath the surface.
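Where those logs live depends on the platform; on a systemd-based Linux host, the daemon typically writes to the journal under the docker.service unit. A small Python sketch of pulling recent warnings and errors from there (assuming journalctl is available and the unit is named docker.service):

    # Pull the last 200 Docker daemon log lines from the systemd journal
    # and surface warnings/errors. Other platforms log elsewhere, e.g.
    # log files under /var/log or the Docker Desktop log directory.
    import subprocess

    result = subprocess.run(
        ["journalctl", "-u", "docker.service", "-n", "200", "--no-pager"],
        capture_output=True,
        text=True,
        check=True,
    )

    for line in result.stdout.splitlines():
        if "level=error" in line or "level=warning" in line:
            print(line)  # dockerd lines include level=... fields by default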