Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Observabilty for complex systems and related technologies.

When Your Observability Literally Stops Traffic

Last week, a fleet of autonomous robotaxis in China suddenly stopped working—at scale. Over a hundred vehicles stalled across a city, stranding passengers in traffic and raising immediate concerns about safety, reliability, and trust in autonomous systems. This wasn’t just a bad day for self-driving cars. It was a distributed systems failure, one that happened in the physical world, not just in dashboards.

Uncertainty and Change Are Everywhere in Software Development

If you’re like everyone else who works in software development, it’s a good bet that almost every single thing that you thought you knew about your business and engineering has changed as a result of the advent of modern LLMs. How should you respond to these changes? How should you change how you and your team develop software?

Introducing OrionIQ: The End of Manual Observability

OrionIQ is Logz.io’s new agentic observability platform designed to move teams from detecting issues to resolving them automatically. As AI accelerates software development, operations remain manual: engineers still wake up at 2 a.m. to investigate alerts and rebuild context. OrionIQ uses AI agents to analyze real-time telemetry, investigate incidents, identify root causes, and take action across systems.

OpenTelemetry Collector + Uptrace: From Zero to Your First Traces

Learn how to set up the OpenTelemetry Collector and connect it to Uptrace for distributed tracing, metrics, and logs. This step-by-step guide walks you through installation, configuration, and sending your first telemetry data — perfect for beginners and anyone looking to level up their observability stack.

AI agent observability: The developer's guide to agent monitoring

Most "agent observability best practices" content reads like a compliance checklist from 2019 with "AI" pasted over "microservices." Implement comprehensive logging. Establish evaluation metrics. Create governance frameworks. Not a single line of code. No mention of what happens when your agent silently picks the wrong tool on turn 3 and you need to figure out why.

Operating agentic AI with Amazon Bedrock AgentCore and Datadog LLM Observability: Lessons from NTT DATA

This guest blog post is by Tohn Furutani, SRE Engineer at NTT DATA. Over the past year, the conversation around generative AI has shifted from single-shot use cases—such as summarization, Q&A, and chat interfaces—to agentic AI systems that can make decisions based on context, plan multistep actions, invoke tools, and adapt as conditions change.

LLM Cost Monitoring with OpenTelemetry

Teams running LLM applications in production face a cost problem that traditional APM tools were never designed to solve. CPU and memory costs are relatively predictable — a web service processing 1,000 requests per second costs roughly the same week over week. LLM API costs are not. A single user session can cost $0.01 or $5 depending on prompt length, model choice, conversation history, and how many retries happen inside your chain.

Top 5 Continuous Monitoring Tools and Why Runtime Context Is the Layer They Are Missing

Continuous monitoring tools track system health, performance, and behavior in real time across production environments. For a deeper understanding of how this fits into modern DevOps practices, see this guide on continuous monitoring and its impact on DevOps. They collect logs, metrics, and distributed traces across the infrastructure and application layers, giving engineering teams visibility into how their systems are running, where anomalies occur, and when something needs immediate attention.