The latest News and Information on Monitoring for Websites, Applications, APIs, Infrastructure, and other technologies.

Patterns for Deploying OpenTelemetry Collector at Scale

So, you've embraced OpenTelemetry, and it's been great. Pat yourself on the back. That single, vendor-neutral pipeline for your traces, metrics, and logs felt like the future. But now, the future is getting bigger. That simple OTel Collector configuration that worked perfectly for a few services is starting to show its limits as you scale. The data volume is climbing, reliability is becoming a concern, and you're wondering if that single collector instance is now a bottleneck waiting to happen.
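One widely used pattern for escaping the single-collector bottleneck is a two-tier agent/gateway layout: lightweight collectors on each node forward to a horizontally scaled gateway tier that handles batching and backend export. A minimal sketch of a gateway-tier config is below; the backend endpoint is a placeholder, and the limiter/batch values are illustrative, not recommendations.

```yaml
# Gateway-tier collector: receives OTLP from node-level agents,
# applies memory protection and batching, then forwards to the backend.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:            # shed load before the collector runs out of memory
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 20
  batch:                     # amortize per-request export overhead at high volume
    send_batch_size: 8192
    timeout: 5s

exporters:
  otlp:
    endpoint: backend.example.com:4317   # placeholder backend address

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```

Because the gateway tier is stateless for most pipelines, it can sit behind a standard load balancer and be scaled out independently of the per-node agents.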

Datadog Bits AI SRE: Your new teammate for on-call shifts

Bits AI SRE is an always-on SRE agent built to handle complex troubleshooting and late-night alerts. Developed against thousands of real-world incidents and powered by Datadog’s platform, Bits AI SRE analyzes your entire stack, tests hypotheses, and identifies root causes in minutes. Resolve faster, get back to sleep sooner, and give your on-call team the confidence and capacity they need.

Optimize Your Oracle Cloud (OCI) Spend with Datadog Cloud Cost Management

Support for Oracle Cloud Infrastructure (OCI) is now live in Datadog Cloud Cost Management. In this short demo, you'll learn how to:

- Get granular visibility into OCI cost and usage, by service, compartment, tag, and resource tier.
- Uncover savings opportunities by combining cost data with observability metrics like CPU, memory, and storage utilization.
- Set up anomaly monitors and budgets to avoid cost overruns, especially for high-risk workloads like AI and GPU training.

New Feature: Filter HTTP Pings by Keywords

Healthchecks.io can now classify HTTP pings from clients as start, success, or failure signals not only by URL suffixes (no suffix, /start, /fail, /{exit-status}) but also by looking for specific keywords or phrases in the HTTP request body. The content filtering feature was already available for email pings, and now it has been extended to HTTP pings as well.
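As a rough illustration of the idea, body-based classification amounts to scanning the request payload for configured keywords and falling back to the default signal when nothing matches. This is a hypothetical sketch, not Healthchecks.io's actual implementation; the precedence order between rules and the default of "success" are assumptions.

```python
def classify_ping(body: str, start_kw: str = "", success_kw: str = "",
                  failure_kw: str = "") -> str:
    """Classify an HTTP ping body as 'start', 'success', or 'failure'.

    Each keyword argument is a configured phrase to look for in the
    request body; an empty string means that rule is not configured.
    Failure is checked first (assumed precedence), and a body matching
    no rule is treated as a plain success ping.
    """
    if failure_kw and failure_kw in body:
        return "failure"
    if start_kw and start_kw in body:
        return "start"
    if success_kw and success_kw in body:
        return "success"
    return "success"  # no keyword matched: default signal


print(classify_ping("Backup FAILED: disk full", failure_kw="FAILED"))  # failure
print(classify_ping("nightly job starting", start_kw="starting"))      # start
```

Compared with URL-suffix routing, keyword matching lets a single endpoint absorb output from tools (cron jobs, backup scripts) whose requests you cannot easily reshape.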

Contextual, in-product guidance for every Grafana user: A closer look at Interactive Learning

As developer advocates at Grafana Labs, we’re always looking for new ways to help our users better understand and learn observability. You might remember our previous project that brought learning to life through an adventure-style game, and now we’re really excited to share something else we’ve been working on: Interactive Learning, a new way to get the technical help you need directly in Grafana.

Part 1: What If Data Wasn't Just the Fuel for AI but the Foundation of Everything It Knows?

Every breakthrough begins with a question. What if we looked beyond today’s tools, buzzwords, and hype and examined the design principles shaping tomorrow’s intelligent enterprises? The What If series explores those inflection points: moments where technology meets human judgment, where automation meets accountability, and where AI begins to resemble something more like understanding than output.

Better Together: Building the Self-Healing Enterprise

When technology slows, everything does. Guests wait to check in. Travelers queue at kiosks. Shoppers refresh the page, hoping the payment goes through. Every second of downtime costs companies millions and frustrates millions more. LogicMonitor and Catchpoint have been solving that problem from different sides: one focused on the systems and infrastructure that keep businesses running, the other on the experiences and performance that users actually feel.

A New Chapter: LogicMonitor + Catchpoint - A Personal Note from Mehdi

In 2008, I was sitting in my garage office with a simple but stubborn idea: the Internet deserved better. End users deserved better. Companies needed a way to truly understand what their customers were experiencing, not just what their servers were reporting. Digital Experience Monitoring wasn’t a category yet. But the need was unmistakable. That idea didn’t come from theory or ambition. It came from lived experiences.

Optimize Kubernetes cluster cost with Datadog Cluster Autoscaler

Running Kubernetes at scale almost always means paying for more compute than you need. To protect reliability, platform and application teams typically overprovision nodes early in development and keep scaling up as they add features and workloads. They are often reluctant to move to smaller or different instance types without a clear picture of how those changes will affect performance or availability. The result is a fleet of underutilized nodes that silently inflate your cloud bill.