
The latest news and information on observability for complex systems and related technologies.

Un-observable AI is Un-trustworthy AI

Recently, someone talked Chipotle’s customer support agent into reversing a linked list, a task that has nothing to do with burritos. Screenshots circulated and people laughed, but underneath the joke sat a sharper question. If a production support agent will do that on a public channel, what else will it do that nobody is screenshotting? The bug is funny. The trust gap behind it is not.

Why CI/CD Pipelines Miss Runtime Failures

A CI/CD pipeline does four things: it builds code, runs tests against mocked dependencies, lints for style violations, and scans for known vulnerability patterns. What it cannot do is validate how that code behaves under real users, real service responses, and real runtime constraints that staging was never configured to reproduce. That entire class of failure clears every gate cleanly and surfaces only in production.

Kubernetes Monitoring: Datadog Alert to Lightrun Root Cause

Datadog Kubernetes monitoring tells an SRE team what failed, which pod failed, and when. It does so within seconds of the alert firing. The investigation then stalls at the same point every time: nothing in the dashboard layer can prove why a specific request behaved the way it did inside a running JVM at the moment of failure. Variable values, feature flag evaluations, and code branches are never captured.

Observability: Are You Measuring What Actually Matters?

Observability has always been important, and like any core capability in your business, its value needs to be understood. For years, the value of observability was predictable. It was uptime, error rates, MTTR, and likely tool consolidation. That was enough to show progress. These are foundational, table-stakes metrics, and they still matter, but they aren’t enough.

Why Your Agentic Workflow Succeeds and Still Gets It Wrong

Agentic workflows are reshaping how engineering teams operate: they fetch context, synthesize decisions, and ship results across systems without human intervention. But the same design that makes them powerful adds risk in production. Agents do not crash when they hit bad data; they synthesize around it, substituting a stale value, an empty page, or a missing field for the result they were supposed to capture.

The Next Evolution of Infrastructure Observability

Operational visibility is becoming increasingly important as infrastructure teams are asked to support AI initiatives, automation goals, cost accountability, modernization efforts, and growing operational complexity at the same time. Most are expected to do it without expanding headcount, introducing additional risk, or rebuilding the environment from scratch. Those expectations are changing the role of infrastructure operations.

Open Standards Observability - Prometheus & OpenTelemetry

Modern applications are distributed, ephemeral and built from a dozen moving parts. To keep them reliable, you need real visibility: not just “is the server up?”, but “how is this request behaving, right now, across every component it touches?” The good news is that the observability world has converged on a handful of open standards: Prometheus for metrics, OpenTelemetry for telemetry, plus battle-tested protocols like StatsD and NRPE.