The latest News and Information on Monitoring for Websites, Applications, APIs, Infrastructure, and other technologies.

Nobody Cares About Your MTTR

I’ve been in those late-night "war room" calls where, after hours of painstaking work, the team finally resolves a critical outage. The dashboards all turn green, a collective sigh of relief is shared, and the next day’s report highlights a victory: Mean time to resolution (MTTR) was reduced by 15% compared to the last major incident. It feels like a win.
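To make the arithmetic behind a figure like "MTTR was reduced by 15%" concrete, here is a minimal Python sketch. It assumes each incident is recorded as a (detected, resolved) timestamp pair. The function names `mean_time_to_resolution` and `percent_change` are hypothetical and not taken from the article.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average of (resolved - detected) over a list of incidents.

    `incidents` is an assumed shape: a non-empty list of
    (detected, resolved) datetime pairs.
    """
    durations = [resolved - detected for detected, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)


def percent_change(previous, current):
    """Relative change from `previous` to `current`, in percent.

    Both arguments are timedeltas; a negative result means MTTR went down.
    """
    return (current - previous) / previous * 100.0


if __name__ == "__main__":
    start = datetime(2025, 1, 1, 22, 0)
    # Two made-up incidents lasting 150 and 190 minutes (mean: 170 minutes).
    incidents = [
        (start, start + timedelta(minutes=150)),
        (start, start + timedelta(minutes=190)),
    ]
    current = mean_time_to_resolution(incidents)
    previous = timedelta(minutes=200)  # assumed MTTR of the earlier incident
    print(current, percent_change(previous, current))
```

With an assumed previous MTTR of 200 minutes and a current mean of 170 minutes, `percent_change` comes out at roughly -15, which is the kind of number the report in the teaser celebrates.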

Enhanced Icinga 2 Container Images

As some of you might have already noticed, we recently gave our official Icinga 2 container image builds a complete overhaul. These new images are currently available only as snapshot builds but will replace the existing stable images with the next Icinga 2 v2.16.0 release. In this blog post, we’ll walk you through the key changes and improvements that come with the new images, as well as the reasons behind these changes.

ObservabilityCON 2025 Keynote: Grafana Assistant GA and Full-Stack Observability in Grafana Cloud

Join Grafana Labs CEO Raj Dutt, CTO Tom Wilkie, and engineering leaders to kick off ObservabilityCON 2025 with the latest in AI-powered observability in Grafana Cloud. See how Grafana is making observability smarter, simpler, and more scalable. This ObservabilityCON 2025 keynote unveils: AI-powered observability → Grafana Assistant (GA) and Assistant Investigations (Public Preview). Observability at scale → The Adaptive Telemetry suite is now complete (Traces GA, Adaptive Profiles in Private Preview) plus BYOC for flexible, cost-efficient cloud deployment.

Top 9 LLM Observability Tools in 2025

Organizations are adding GenAI to their current and future architectures and product roadmaps, requiring Ops teams to ensure LLMs are accurate, fast, secure, and cost-efficient. LLM observability tools directly address these needs, helping identify and prevent common LLM errors and issues. LLM observability provides the telemetry data for this analysis. LLM observability tools trace requests end-to-end, evaluate outputs, and correlate quality with latency, cost, prompts, tools, and data sources.

Vibe Coding: Closing The Feedback Loop With Traceability

I have begun to truly embrace vibe coding over the last few months, using Cursor as my main code editor and Claude Sonnet 4 for my agent's LLM. It's an exciting time to be a developer: we get to experiment with something that promises to 100x our productivity while pioneering new workflows and strategies for using these tools. But, as most people who have done any extensive development with LLMs in a sufficiently large code base know, it's a bit like trying to herd scared cats.

ObservabilityCON 2025: A guide to all the announcements from Grafana Labs

Today at ObservabilityCON 2025 in London, we unveiled a number of exciting announcements and updates to Grafana Cloud that reimagine SaaS economics, simplify the complexity of running your observability stack at scale, and provide AI tooling that’s actually useful. (Root cause analysis via chatbot? Yes, please!) Check out the keynote to learn more about how we’re helping you do more with the open observability cloud, and read on for a quick recap of all the news from ObservabilityCON 2025.

AI-powered observability: Resolve incidents faster, reduce alert fatigue, and expand access

When an incident lands in your lap, you’ll often start with a lot of questions: Why is latency so high? What’s causing this outage? How much money are we losing at this very moment? The uncertainty (and the pressure to quickly find answers) has always been one of the more nerve-racking parts of being an on-call engineer, but it doesn’t have to be that way anymore.

Maximize data value and cut costs: Adaptive Telemetry for metrics, logs, traces, and profiles in Grafana Cloud

When it comes to observability, more data doesn’t always mean more clarity. In fact, as telemetry volumes grow, it only becomes more difficult to discern the signals from the noise and to keep overall costs in check. This is exactly why we built Adaptive Telemetry, a suite of features in Grafana Cloud that analyzes how your telemetry is used and then automatically recommends actions like aggregating, sampling, dropping, or reducing low-value data.

Complete guide to OpenTelemetry Tracing (with code examples)

Distributed tracing is an essential technique for monitoring modern, cloud-native applications. It provides a holistic view of a request's entire journey as it propagates through a multi-service architecture, making it invaluable for performance optimization and root cause analysis. But how do you generate and collect this trace data in a standardized, vendor-agnostic way? That's where OpenTelemetry comes in.