Operations | Monitoring | ITSM | DevOps | Cloud

The latest News and Information on Monitoring for Websites, Applications, APIs, Infrastructure, and other technologies.

Unlock telemetry value with a well-planned data lake

Your SIEM only holds a slice of your telemetry. Your data lake holds the rest. We'll show you how to use that to your advantage for investigations, threat hunting, and reporting. Why your data lake beats your SIEM for investigations – Your SIEM keeps a short window of expensive, filtered data. Your data lake keeps everything. When something goes wrong, that difference matters more than you think Threat hunting without the handcuffs – Hunting across months of data in a SIEM is painful and costly. We'll show you how a well-planned lake makes broad, deep searches practical and affordable.

Teach Your AI Coding Agent to Instrument, Monitor, and Troubleshoot Infrastructure with netdata/skills

There’s a growing ecosystem of AI coding agents: Claude Code, Cursor, Copilot, Codex, Gemini CLI, Windsurf, and others. They’re good at writing code, but they don’t inherently know how to instrument that code for observability, configure monitoring infrastructure, or troubleshoot production systems using real telemetry data. That knowledge lives in documentation, runbooks, and the heads of your senior SREs.

The Productivity Tax of Repeat IT Failures in Technology Companies

Technology companies are being pushed to deliver faster outcomes while justifying growing investment in AI, SaaS, and digital infrastructure. But productivity does not improve just because new tools are deployed. It improves when employees can use those tools without the constant drag of slow devices, unstable applications, and fixes that do not fully solve the problem. That is the productivity tax of digital friction.

How to Create Your Own Plugins and Check Commands in Icinga 2

If you’ve been using Icinga 2 for a while, you probably know the built-in checks cover a lot of ground: disk space, CPU, memory, ping. But sooner or later you’ll run into something specific to your setup that no existing plugin handles. That’s where writing your own plugin comes in. The good news? It’s simpler than it sounds. Icinga 2 doesn’t care what language your plugin is written in. It just runs the script, reads the exit code, and displays the output. That’s it.

Honeycomb Canvas: The Multiplayer Workspace for the Agentic Era

Last week, we launched a major update to Canvas, our investigation workspace. The new Canvas has evolved from an AI co-pilot you chat with to a place where your whole team, human and agent, can work the same problem on the same surface. Auto-investigations begin the moment a trigger, SLO, or anomaly fires. Custom skills encode your team's runbooks so every agent investigates with your team's expertise built in.

Introducing Atatus Sensitive Data Classifier

Your logs know too much. Every debug statement, every traced request, every APM span can carry the risk of capturing something they shouldn't. A customer email. A JWT token. A credit card number. An API key that was never meant to leave your payment service. It doesn't look like a breach. There's no alert. Your observability platform just quietly accumulates sensitive data like indexed, replicated, and accessible to every engineer with log query access.

How we made a SQL query optimization agent 59% more accurate using autoresearch and LLM Observability

Without experiment infrastructure to help you test your LLM applications, every research session starts with the same questions: What have we tried previously? What were the numbers? Which prompt version produced that result? Why did we discard that approach? The answers live in scattered notes, terminal history, and half-remembered conversations. Each handoff between sessions loses context. In practice, iteration can slow down as teams get bogged down in testing and analysis.

How to audit and clean up monitors effectively

Alert fatigue and blind spots develop together. Monitoring stacks that generate noise while missing critical issues may have incomplete coverage or poorly configured alerts. As they grow reactively and without structured coverage assessment, both issues worsen. Teams will often add monitors when something breaks and tune thresholds when alerts become unbearable, but rarely audit their overall setup to see if it works.