Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Observabilty for complex systems and related technologies.

What Is LLM Observability? For CFOs And Engineers, The Missing Layer Is Cost

You probably have Datadog. Maybe New Relic, maybe Dynatrace. Your observability stack has been solid for years — and you're still flying blind on AI cost. Here's why LLM observability needs a fourth pillar most tools skip, and how to build one that actually tells you what your models are costing you per request, per feature, per customer.

Moving Beyond SolarWinds: Building a Modern Observability Strategy

For years, platforms like SolarWinds have been a standard in IT environments. They helped teams answer a fundamental question: are systems up or down? That approach worked well when environments were more contained and predictable. The challenge is that most environments no longer operate that way. Hybrid infrastructure, cloud services, and tightly interconnected applications have changed what “visibility” needs to mean.
Sponsored Post

From Microsoft SCOM to Dashboards

System Center Operations Manager (SCOM) remains one of the most capable on-premises monitoring platforms for Microsoft environments. However, as IT operations evolve toward real-time observability and self-service insights, traditional SCOM reporting and consoles can feel restrictive. This whitepaper explores practical ways to extend and modernize your SCOM visualizations using today's leading dashboarding technologies - including SquaredUp, Grafana, Power BI, and Azure Workbooks.

No more monkey-patching: Better observability with tracing channels

Almost every production application uses a number of different tools and libraries,whether that’s a library to communicate with a database, a cache, or frameworks like Nest.js or Nitro. To be able to observe what’s going on in production, application developers reach out for Application Performance Monitoring (APM) tools like Sentry. But there’s an inherent problem: the performance data that APM tools need is most often not coming natively from the libraries themselves.

AI Observability in Grafana Cloud: A complete solution for monitoring your agentic workloads

The observability industry has developed great tools for using metrics, logs, traces, and profiles to monitor the cloud native applications that have dominated the last decade of software development. But when it comes to understanding what an AI system is actually doing, we’re often left reading raw conversations, guessing at quality, and reacting too late. And that’s a problem.

Introducing o11y-bench: an open benchmark for AI agents running observability workflows

Evaluating agents is hard. Verifying observability tasks is harder. Yes, AI agents have gotten dramatically and quantifiably better at coding and tool use, but observability presents a different kind of challenge. In a real incident, the hard part is rarely just writing a query. It's deciding which signal matters, figuring out whether a spike is noise or symptom, correlating metrics with logs and traces, and sometimes making a change in Grafana without breaking the dashboard another engineer depends on.

Bringing observability data hosting to the UK on AWS

UK organizations are increasingly required to design systems that account for data residency requirements, ensuring that operational data remains within national boundaries. Many teams already run their applications on AWS infrastructure in the UK, but telemetry data can still be processed outside the region, creating gaps in visibility. Datadog’s upcoming UK availability zone solves this by keeping telemetry data in the same region as the workloads that generate it.

Fast AI Feedback Loops with Honeycomb and OpenTelemetry

Are you writing agentic applications, but aren’t sure what the agents are doing? Finding out too late that you've blown the budget with super expensive models? Not sure where the agents are failing, and feeling a loss of control? Could they do better? Observability is the visibility you need to get the job done. Sending telemetry to Honeycomb explains what your agents are actually doing.

How to solve key site reliability engineering challenges

Modern site reliability engineering challenges stem from the difficult requirement of confirming why complex systems fail in ways staging cannot replicate. While observability tools signal failures, and AI SREs reason over data, they leave observability gaps regarding the actual state of running code. By utilizing runtime context, teams capture live execution data to accelerate production debugging, resolving incidents in minutes without requiring manual redeploy cycles.

How Observability Powers Autonomous IT in Hybrid Environments

Autonomous IT only works when observability gives it the context to act with confidence. On any given day, a mid-size enterprise generates tens of thousands of alerts across on-prem infrastructure, multiple clouds, SaaS tools, Internet dependencies, and AI workloads. Most of them don’t need a human. A few of them do. Telling the difference, fast enough to matter, is exactly where IT teams are losing ground.