The latest news and information on monitoring for websites, applications, APIs, infrastructure, and other technologies.

Why we open-sourced AURA: Infrastructure for production AI

Over the last year, I’ve talked to dozens of SRE teams about AI. The excitement is real, but conversations hit a wall when we get to production reality. How does an agent manage complex context without losing the plot? How does it avoid hallucinating relationships between signals? Who owns the orchestration logic that ties it all together? We realized the bottleneck wasn’t model intelligence. It was the lack of a reliable logic layer between the data and the model.

Grafana Alerting: faster rules, personalized filters, and an operations workspace

Alerts are only useful when you can quickly find and act on the right signal. That's why, over the past two years, we rebuilt Grafana Alerting’s UI to make it more reliable and efficient, especially at scale. The result: a faster, paginated alert rules page that handles tens of thousands of rules, with a powerful filter dropdown and saved searches so you can quickly get back to the views you care about most.

Tech Talk | Application management with Targeted Application Install for Victoria Experience

Apps create endless opportunities to leverage the strengths of the Splunk Cloud Platform. Until now, you could only install Splunk apps across every search head on a Splunk Cloud Platform Victoria Experience deployment. With Targeted Application Install (TAI), you now have fine-grained control over which search head groups run which apps.

System Datasets: From Alert Fatigue to Optimized Notifications

Alert fatigue rarely begins as a single mistake. It grows as systems scale, teams grow, and “just in case” monitoring becomes the default. A few extra alerts, another threshold, and soon the on-call channel becomes overwhelmed. Engineers get interrupted for noise or stop trusting pages; either way, real signals get missed. Reliability drops, and productivity quietly declines. Most teams respond tactically: tune thresholds, change notifications, suppress noise.

Fabrix.ai at Cisco Live 2026 Amsterdam

This post highlights the biggest Cisco AI Summit takeaways that came up again and again in Cisco Live conversations, and what they mean for teams operating AI in production. If you are following the broader AgentOps movement and the rise of agentic workflows, Fabrix.ai’s point of view is grounded in a core idea: AI agents create value only when they can be operated safely and consistently. A good starting point is here: Fabrix.ai’s approach to agentic.

What is a Real-Time Data Lake?

A data lake is a centralized data repository where structured, semi-structured, and unstructured data from a variety of sources can be stored in its raw format. Data lakes help eliminate data silos by acting as a single landing zone for data from multiple sources. But what's the difference between a traditional data lake and a real-time data lake? Some traditional data lakes use batch processing, which means processing and analyzing a collection of data that has accumulated over a specific timeframe. For example, payroll and billing systems that run on a weekly or monthly cycle might use batch processing.

Behind the magic of auto-instrumentation (Grafana OpenTelemetry Community Call)

You add the OpenTelemetry Java agent, restart your app, and like magic, observability appears. But is it really magic? What’s actually enabled by default? What telemetry should you expect to see? What’s missing? And what might you want to tweak, tune, or even turn off?

IT Cost Optimization Strategy: Eliminating Guesswork with Observability

IT organizations are being asked to reduce costs, manage risk, and maintain performance at the same time. Meanwhile, infrastructure complexity continues to grow, and vendor pricing changes are reshaping budget assumptions. Too often, an IT cost optimization strategy is shaped by incomplete data around sizing, licensing, refresh timing, and platform decisions. That uncertainty leads to overprovisioning, budget surprises, and reactive operations. Observability changes that equation.