The latest news and information on observability for complex systems and related technologies.

Advanced Python Logging: Mastering Configuration & Best Practices for Production

Python's logging system provides powerful tools for application monitoring, debugging, and maintenance. This comprehensive guide covers everything from basic setup to advanced implementation strategies, helping you build robust logging solutions for your Python applications.

How Much Should I Be Spending On Observability?

In 2018, I dashed off a punchy little blog post in which I observed that teams with good observability seemed to spend around 20-30% of their infra bill to get it. I also noted this was based on absolutely no data, only my own experiences and a bunch of anecdotes, heavily weighted towards startups and the mid-market tech sector. This post should have ridden off into the sunset years ago. To my horror, I have seen it referenced more in the past year than in all preceding years combined.

AI Agent Observability Explained: Key Concepts and Standards

AI agent observability has become a critical discipline for organizations deploying autonomous AI systems at scale. This guide explores the emerging standards and best practices for monitoring, analyzing, and improving AI agent performance in enterprise environments.

Elastic Observability 9.0/8.18: Elastic Distributions of OpenTelemetry (EDOT) now GA, LLM observability, and more

Elastic Observability 9.0/8.18 introduces several key capabilities, including the general availability of Elastic Distributions of OpenTelemetry (EDOT) and LLM observability. Elastic Observability 8.18 and 9.0 are available now on Elastic Cloud, the only Elasticsearch offering to include all of the new features in this latest release. You can also download the Elastic Stack and our cloud orchestration products, Elastic Cloud Enterprise and Elastic Cloud for Kubernetes, for a self-managed experience. What else is new in Elastic 9.0/8.18? Check out the 9.0/8.18 announcement post to learn more.

How to get started with Calico Observability features

Kubernetes, by default, adopts a permissive networking model where all pods can freely communicate unless explicitly restricted using network policies. While this simplifies application deployment, it introduces significant security risks. Unrestricted network traffic allows workloads to interact with unauthorized destinations, increasing the potential for cyberattacks such as Remote Code Execution (RCE), DNS spoofing, and privilege escalation.

AWS Lambda, OpenTelemetry, and Grafana Cloud: a guide to serverless observability considerations

In our increasingly serverless world, observability isn’t just a “nice to have”—it’s essential. Serverless functions such as AWS Lambda bring incredible benefits, but they also introduce complexities, especially around monitoring and debugging. In a previous article, I provided a quick, practical guide for sending AWS Lambda traces to Grafana Cloud using OpenTelemetry.

OpenTelemetry for AI Systems: Implementation Guide

AI systems, from machine learning models to Large Language Models (LLMs) and autonomous AI agents, introduce unique observability challenges. Their non-deterministic nature, complex dependencies, and specialized performance characteristics require thoughtful instrumentation approaches. OpenTelemetry has emerged as the leading standard for implementing observability across these systems.