
The latest News and Information on Monitoring for Websites, Applications, APIs, Infrastructure, and other technologies.

Reducing Alert Noise with Composite Alerts in Hosted Graphite

Traditional alerts are simple by design: if a metric crosses a threshold, fire an alert. While that simplicity makes alerts easy to configure, it also leads to alert noise, because single metrics rarely tell the full story and often trigger during non-actionable conditions. Hosted Graphite Composite Alerts solve this by allowing you to combine multiple alert conditions using logical expressions like AND (&&) and OR (||).
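To make the idea concrete, here is a minimal sketch of how combining conditions with AND and OR cuts noise. It is not Hosted Graphite's API or expression syntax; the metric names, thresholds, and the `composite_firing` function are hypothetical and chosen only for illustration.

```python
from typing import Dict


def composite_firing(values: Dict[str, float]) -> bool:
    """Evaluate a composite rule of the form (a && b) || c.

    All metric names and thresholds below are hypothetical.
    """
    # Individual conditions: each one alone would be a "traditional" alert.
    high_cpu = values["cpu_percent"] > 90.0
    high_errors = values["error_rate"] > 0.05
    queue_backed_up = values["queue_depth"] > 1000

    # High CPU alone is often non-actionable (for example, a batch job).
    # Require it to coincide with elevated errors, or fire on a backed-up
    # queue regardless of the other two conditions.
    return (high_cpu and high_errors) or queue_backed_up


# Hypothetical samples: a CPU spike alone, a CPU spike with errors, and a
# backed-up queue on an otherwise quiet system.
cpu_only = {"cpu_percent": 95.0, "error_rate": 0.01, "queue_depth": 10}
cpu_and_errors = {"cpu_percent": 95.0, "error_rate": 0.20, "queue_depth": 10}
queue_only = {"cpu_percent": 20.0, "error_rate": 0.00, "queue_depth": 5000}

for name, sample in [("cpu_only", cpu_only),
                     ("cpu_and_errors", cpu_and_errors),
                     ("queue_only", queue_only)]:
    print(name, composite_firing(sample))
```

With a single-threshold alert on CPU, the first sample would page someone; under the composite rule it stays quiet because no second signal confirms a problem.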

Why AI Automation for ITOps Needs Context Graphs

AI automation in ITOps fails because execution loses decision context: it remembers what it did, but not why. Context graphs turn incident history into durable execution memory that systems can actually reuse. Fixing an issue depends on what was tried last time, what failed, what worked, which exceptions were approved, and under what conditions. That information rarely lives in the system.

Green dashboards, red flags

A VP of Engineering (from a company I’m not allowed to name) told me recently: "You helped us find and fix real user-facing issues. Now we need to convince our CTO why that matters more than the standard SLOs and systems." Here's the thing: your CTO is not wrong to measure the systems and basic uptime, but that is only the baseline. Teams that stop there are trying to watch everything, yet they see nothing as it relates to users.

What is HEAL Monitoring Tool? A Comprehensive Guide for IT Leaders

Your organization has invested heavily in monitoring tools for application performance, infrastructure monitoring tools for servers and databases, log monitoring tools, network monitoring tools, and third-party monitoring tools for specific services. But the actual problem is that your IT team is drowning in the data those tools produce. A single production issue generates 30+ alerts across applications, databases, servers, and monitoring tools, creating an alert flood that buries the root cause.

When Things Go Wrong, Systems Should Help Humans - Not Fight Them

In the previous post, we explored how AI accelerates delivery and compresses the time between change and user impact. As velocity increases, knowing that something has gone wrong before users do becomes a critical capability. But detection is only the beginning. Once alerts fire and dashboards light up, humans still have to interpret what’s happening, make decisions under pressure, and act.

Easily Map Logs to OCSF with Datadog Observability Pipelines

Normalizing security logs into the Open Cybersecurity Schema Framework (OCSF) is often complex, manual, and time-consuming. With Datadog Observability Pipelines, you can transform logs into OCSF format, right in your own environment, before routing them to destinations like Splunk, CrowdStrike, and AWS Security Lake. This video shows how security teams can use Observability Pipelines to collect, process, and transform logs into OCSF format automatically.
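As a rough illustration of what such a normalization step involves, the sketch below maps a made-up login log onto a handful of OCSF-style fields by hand. It does not use or represent Datadog Observability Pipelines; the input field names and the `to_ocsf_authentication` function are hypothetical, and the numeric IDs reflect my reading of the OCSF Authentication class, so check them against the published schema before relying on them. Required OCSF attributes such as `metadata` are omitted for brevity.

```python
from datetime import datetime, timezone

# IDs as I understand them from the OCSF schema (verify before use):
# category 3 = Identity & Access Management, class 3002 = Authentication,
# activity 1 = Logon.
CATEGORY_UID = 3
CLASS_UID = 3002
ACTIVITY_LOGON = 1


def to_ocsf_authentication(raw: dict) -> dict:
    """Map a hypothetical raw login log to a partial OCSF-style event."""
    ts = datetime.fromisoformat(raw["timestamp"])
    if ts.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC.
        ts = ts.replace(tzinfo=timezone.utc)

    return {
        "category_uid": CATEGORY_UID,
        "class_uid": CLASS_UID,
        "activity_id": ACTIVITY_LOGON,
        # OCSF derives type_uid as class_uid * 100 + activity_id.
        "type_uid": CLASS_UID * 100 + ACTIVITY_LOGON,
        # OCSF timestamps are milliseconds since the Unix epoch.
        "time": int(ts.timestamp() * 1000),
        # 1 = Success, 2 = Failure.
        "status_id": 1 if raw["success"] else 2,
        "user": {"name": raw["user"]},
        "src_endpoint": {"ip": raw["src_ip"]},
    }


# Hypothetical vendor log line, already parsed into a dict.
raw_log = {
    "timestamp": "2026-01-01T00:00:00+00:00",
    "user": "alice",
    "src_ip": "203.0.113.7",
    "success": False,
}
print(to_ocsf_authentication(raw_log))
```

Writing and maintaining a mapping like this for every log source is the manual work the teaser describes as complex and time-consuming.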

Testing Icinga in a Homelab Setup With Nextcloud

If you want to get started with Icinga but don’t have a data center lying around, no worries. Icinga is a lightweight monitoring tool that works for both large infrastructures and small home labs. When I first explored Icinga during my first year as an apprentice, it was also my first real contact with monitoring tools. After completing the Icinga Fundamentals training, I wanted to experiment with hosts and services, but what should I monitor?

Key Financial Services Industry Trends Shaping 2026

The financial services industry is continuing its acceleration. AI is rolling out across the enterprise, and compliance expectations continue to diverge based on jurisdiction. It’s an unprecedented technology shift, to say the least, and pressure is being felt throughout the IT industry to catch up and remain resilient. Learn how Auvik provides financial institutions with full network visibility and monitoring that catches problems before they become outages.