Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Observabilty for complex systems and related technologies.

Graviton5 in Production at Honeycomb: Per-service Results From the m8g to m9g Migration

This is the fourth installment in the Graviton retrospective series we've been writing since 2021. The methodology is the same one I always reach for: hold the workload constant, run both generations on the same Kubernetes namespace concurrently, and let the per-pod numbers speak.

What is SRE Observability and Key Pillars You Should Know?

What happens when a critical service slows down, but nothing is technically “broken”? Most teams have monitoring in place. They know when something goes down. But when performance drops or issues spread across services, finding the real cause becomes slow and unclear. Engineering teams end up switching between dashboards, logs, and alerts just to understand what changed. This delays response and increases pressure on on-call teams. This is where SRE observability becomes essential.

It Can Only Goodhart Happen

When a measure becomes a target, it ceases to be a good measure. Charles Goodhart, 1975 You’ve probably read this quote in relation to any number of things over the years. People complaining about arbitrary metrics like PRs merged, lines of code produced, and now, token usage. But is the era of tokenmaxxing over before it even began? The rise of token leaderboards to the death of token leaderboards at companies like Amazon seem to have taken place in less than three months!

Running the OpenTelemetry Collector as a Lambda

The OpenTelemetry Collector is usually deployed as a long-running process: a sidecar, a DaemonSet, an EC2 instance, a docker container on my computer. It sits there listening for telemetry. That's fine when I want to send telemetry all day, but not when telemetry is rare. Like right now, when I have an agent defined on AgentCore, and it runs a few times a week maybe. Or my website that hardly sees any traffic. Can I run the OpenTelemetry Collector as a Lambda function?

MCP Servers Are Becoming a Core Interface Layer in Data Observability and Data Quality

Data observability has traditionally been built around human workflows. When data breaks, engineers are alerted, open dashboards, inspect lineage graphs, and manually trace the issue across pipelines. The system is designed for human investigation and interpretation. That model is now being challenged by the rise of AI agents in data operations. As organizations begin embedding AI into analytics, engineering, and decision-making workflows, observability is no longer just about explaining what happened - it must also enable systems to understand and act on it.

Why Engineers Don't Trust Autonomous AI - 4th Annual Observability Survey | Grafana Labs

The 2026 Observability Survey from Grafana Labs heard from over 1,300 engineers and leaders across 76 countries on the real-world role of AI in observability. The data reveals a sharp distinction between intelligence and autonomy — and a critical blind spot most teams have.
Sponsored Post

How APM fits into the modern observability stack

Most engineering teams don't have a data problem. They have an interpretation problem. Prometheus is running, logs are shipping to the aggregator, dashboards are green-and then a latency spike hits and the root cause takes 45 minutes to isolate. The data was there but the answer wasn't. That gap is where application performance monitoring (APM) operates. This article explores what APM adds to a modern observability stack, why relying on standalone tools leaves critical blind spots, and how teams can unify infrastructure data with application context for a complete operational picture.