Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Distributed Tracing and related technologies.

The Data Plane Reality: OTel Scales, While Topology UX Lags

OpenTelemetry won the architectural standards battle. At scale, though, telemetry breaks more like plumbing than code. It breaks quietly, across a graph, with a blast radius you don’t understand until it’s expensive. With over 65% of organizations now running more than 10 collectors in production, hybrid deployments across Kubernetes and VMs are accelerating fast. Telemetry standardization is no longer a project milestone. It is a baseline expectation.

Telemetry Talks ep. 5 - OpenTelemetry in the AI agents era

Telemetry Talks explores how OpenTelemetry’s CNCF graduation arrives at a pivotal moment for AI-powered development. Together with Alex Marshalov, we dive into vibe coding, AI agents, and the growing need for observability in GenAI systems — from prompts and token usage to reasoning chains and distributed traces — using the VictoriaMetrics stack and OpenTelemetry as the foundation for understanding the next generation of autonomous software.

Use This OTel Processor to Prevent Your Dashboards From Breaking

A semantic-convention rename (http.method → http.request.method) can silently break your RED metrics — no errors, just gaps in dashboards and alerts. The OpenTelemetry Collector's schema processor fixes it: put it first in your pipeline and it normalizes attribute names no matter what each service emits. Migration mode writes BOTH the old and new names, so you get zero-downtime upgrades while queries keep working.
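As a rough illustration of the dual-write idea described above, the following minimal Python sketch renames attributes from a mapping and, in a migration mode, keeps the old key alongside the new one. It is not the schema processor's implementation or its configuration. The `normalize_attributes` function, the `RENAMES` table, and the `migration` flag are hypothetical names used only for this example.

```python
# Hypothetical rename table; only the http.method pair comes from the text above.
RENAMES = {"http.method": "http.request.method"}


def normalize_attributes(attrs: dict, migration: bool = False) -> dict:
    """Return a copy of attrs with old attribute names mapped to new ones.

    With migration=True the old name is kept next to the new one, so queries
    written against either name keep matching.
    """
    out = dict(attrs)
    for old, new in RENAMES.items():
        if old not in out:
            continue
        value = out[old]
        # Do not overwrite a value the service already emits under the new name.
        out.setdefault(new, value)
        if not migration:
            del out[old]
    return out


if __name__ == "__main__":
    span_attrs = {"http.method": "GET", "http.route": "/orders"}
    print(normalize_attributes(span_attrs))
    print(normalize_attributes(span_attrs, migration=True))
```

Returning a copy rather than mutating the input keeps the sketch easy to test; a real pipeline stage would more likely rewrite attributes in place.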

Open Standards Observability - Prometheus & OpenTelemetry

Modern applications are distributed, ephemeral and built from a dozen moving parts. To keep them reliable, you need real visibility: not just “is the server up?”, but “how is this request behaving, right now, across every component it touches?”. The good news is that the observability world has converged on a handful of open standards: Prometheus for metrics, OpenTelemetry for telemetry, plus battle-tested protocols like StatsD and NRPE.

Grafana Tempo: The distributed tracing journey to 3.0 (June 2026 Community Call)

Our distributed tracing journey from the inception of Tempo to 3.0.

If You Are Building a Startup from a Vibe-Coded App, Don't Skip This

Everyone is vibe coding products right now. But most applications are missing one crucial thing: observability. You can start this weekend. If you are turning your vibe-coded app into a real startup, observability should not be an afterthought.

Running the OpenTelemetry Collector as a Lambda

The OpenTelemetry Collector is usually deployed as a long-running process: a sidecar, a DaemonSet, an EC2 instance, a Docker container on my computer. It sits there listening for telemetry. That's fine when I want to send telemetry all day, but not when telemetry is rare. Like right now, when I have an agent defined on AgentCore, and it runs maybe a few times a week. Or my website that hardly sees any traffic. Can I run the OpenTelemetry Collector as a Lambda function?

Errors, traces, logs, metrics: when to reach for what

When should I reach for a log, a trace, or a metric? I hit that question constantly when I instrument code, and I watch coding agents hit it too. It sounds like it should be obvious. Errors, traces, logs, and metrics are the four kinds of telemetry most apps run on, four tools in one box, and they overlap enough that the honest answer is every developer’s favourite: it depends. You can stuff context into span attributes instead of logging it. You can count log events instead of emitting a metric.
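The last point, counting log events instead of emitting a metric, can be sketched with nothing but Python's standard `logging` module. This is a minimal illustration and not taken from the article; the `CountingHandler` class and the `checkout` logger name are hypothetical.

```python
import logging
from collections import Counter


class CountingHandler(logging.Handler):
    """Counts log records per level name, a crude log-derived metric."""

    def __init__(self) -> None:
        super().__init__()
        self.counts: Counter = Counter()

    def emit(self, record: logging.LogRecord) -> None:
        # Called once for every record that passes the logger's level filter.
        self.counts[record.levelname] += 1


if __name__ == "__main__":
    logger = logging.getLogger("checkout")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    handler = CountingHandler()
    logger.addHandler(handler)

    logger.info("order placed")
    logger.error("payment failed")
    logger.error("payment failed")

    print(dict(handler.counts))
```

The trade-off is the one the excerpt hints at: the count is only as good as the log levels and messages it is derived from, whereas a dedicated metric states its meaning explicitly.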

Your AI App Is Lying to You - Here's How to Fix That

You shipped your AI app. But do you have all the answers? Do you actually know which model ran, how many tokens it consumed, or why it stopped? This is what LLM observability gives you, and most AI engineers are skipping it entirely. I built an SOS detection app and used OpenTelemetry to get full visibility into every single call. Token usage, model version, finish reason, and cost per call all in one place, standardised across any provider. The OpenTelemetry GenAI docs show there is a lot more you can track than you think.
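To make cost per call concrete, here is a minimal Python sketch that derives it from token counts recorded as attributes. The attribute keys are modelled on the OpenTelemetry GenAI semantic conventions and should be checked against the current spec. The model name, the price table, and the `cost_per_call` helper are hypothetical and do not come from the video.

```python
# Hypothetical prices in USD per 1,000 tokens; real prices vary by provider.
PRICES_PER_1K = {
    "example-model": {"input": 0.50, "output": 1.50},
}


def cost_per_call(attrs: dict) -> float:
    """Estimate the cost of one LLM call from its recorded token usage."""
    price = PRICES_PER_1K[attrs["gen_ai.request.model"]]
    input_cost = attrs["gen_ai.usage.input_tokens"] / 1000 * price["input"]
    output_cost = attrs["gen_ai.usage.output_tokens"] / 1000 * price["output"]
    return input_cost + output_cost


if __name__ == "__main__":
    call = {
        "gen_ai.request.model": "example-model",
        "gen_ai.usage.input_tokens": 1200,
        "gen_ai.usage.output_tokens": 300,
        "gen_ai.response.finish_reasons": ["stop"],
    }
    reasons = call["gen_ai.response.finish_reasons"]
    print(f"{cost_per_call(call):.4f} USD, finish reasons: {reasons}")
```

Because the same attribute keys are used whatever the provider, one price table keyed by model name is enough to compare calls across providers.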