

Training Foundation Models on a Trillion Data Points with Apache Iceberg

Training an AI foundation model on over a trillion data points sounds impossible to do without hammering your production systems. Here's how Datadog did it with Apache Iceberg for their time series forecasting model, TOTO. The key challenge: extracting massive historical observability data (metrics spanning years) and running incremental preprocessing pipelines without overwhelming production services. Iceberg solved this by providing schema governance, consistency guarantees, and seamless integration with ML tools like Ray and PyTorch.

OpenTelemetry Metrics with 5 Practical Examples

Picture this: your observability tool already nails the basics, like request rates, latency, and memory usage, but you need more insight. Think user churn rates, engagement spikes, or even how many carts get abandoned mid-checkout. That’s where OpenTelemetry steps in, providing a way to track those critical custom metrics with ease.

How Inkeep Monitors Their AI Agent Framework with SigNoz

AI agents are fundamentally different beasts to monitor than traditional applications. A single user request can trigger a cascade of 10+ internal operations: sub-agent transfers, tool executions, LLM calls, and API requests, each with unpredictable latency and failure modes. When something goes wrong (and with LLMs, things go wrong in creative ways), you need to see the entire execution flow to debug effectively.

Overcoming ClickHouse's JSON constraints to build a high-performance JSON log store

Customer log data is always messy. Being (and building!) an observability platform, we get to see all the beautiful, creative ways it can be messy, every single day. And yet our customers expect, quite fairly, I might add, perfect query results and peak performance. SigNoz is an open-source observability platform that can be your one-stop solution for logs, metrics, and traces.
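To make the messiness concrete, here is a toy illustration (pure Python, not SigNoz's implementation) of the core problem a JSON log store has to solve: log lines carry the "same" field with different types and shapes, yet queries need one consistently typed column.

```python
# Toy normalization of messy JSON logs into one typed schema.
# The field names and level codes are made-up examples.
import json

raw_logs = [
    '{"level": "error", "duration_ms": 125}',
    '{"level": "ERROR", "duration_ms": "125"}',  # number sent as a string
    '{"level": 3, "duration": {"ms": 125}}',     # different shape entirely
]

LEVEL_CODES = {3: "error"}  # hypothetical numeric severity mapping

def normalize(line: str) -> dict:
    rec = json.loads(line)
    level = rec.get("level")
    if isinstance(level, int):
        level = LEVEL_CODES.get(level, "unknown")
    duration = rec.get("duration_ms")
    if duration is None and isinstance(rec.get("duration"), dict):
        duration = rec["duration"].get("ms")
    # Every row comes out with the same column names and types.
    return {"level": str(level).lower(), "duration_ms": int(duration)}

rows = [normalize(line) for line in raw_logs]
```

A columnar store like ClickHouse has to make the same decisions, but at ingest time and at far larger scale, which is where the JSON-type constraints the article discusses come in.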

How to Track Cloud Costs in Real-Time Instead of Waiting Days

Tired of waiting days to see your AWS bill spike? Datadog solved this problem using Apache Iceberg to deliver real-time cloud cost visibility, updating every 15 minutes instead of waiting for billing data. Here's how it works: they sync a real-time resource inventory (EC2 instances, Kubernetes pods) into Iceberg tables, then use Trino to join those snapshots with unit pricing data. The result? FinOps teams can catch cost anomalies before they become budget disasters.
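The shape of that computation is simple enough to sketch: join the latest inventory snapshot to a unit-price table and aggregate. The instance types, prices, and team names below are made-up examples, and Datadog runs this as Trino SQL over Iceberg tables rather than in-process Python.

```python
# Toy version of the inventory-snapshot-to-unit-price join the article
# describes. All values are illustrative, not real AWS prices.

# 15-minute inventory snapshot: what is running right now.
inventory = [
    {"resource": "i-0abc", "type": "m5.xlarge", "team": "search"},
    {"resource": "i-0def", "type": "m5.xlarge", "team": "search"},
    {"resource": "i-0ghi", "type": "r5.2xlarge", "team": "ml"},
]

# Unit prices in dollars per hour (illustrative).
unit_price = {"m5.xlarge": 0.192, "r5.2xlarge": 0.504}

def cost_per_hour_by_team(snapshot, prices):
    # Join each running resource to its unit price, aggregate by team.
    totals: dict[str, float] = {}
    for res in snapshot:
        totals[res["team"]] = totals.get(res["team"], 0.0) + prices[res["type"]]
    return totals

totals = cost_per_hour_by_team(inventory, unit_price)
```

Re-running this against each fresh snapshot gives a run-rate estimate every 15 minutes, which is what lets anomalies surface days before the bill does.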

How Datadog Manages 50,000 Apache Iceberg Tables at Scale

Think managing a few database tables is hard? Try 50,000 production Iceberg tables storing petabytes of data with 8 million scans per day. In this clip, Datadog's platform team reveals the architecture choices behind their managed Iceberg implementation that serves hundreds of internal engineering teams.

Datadog at AWS re:Invent, Bits AI SRE, MCP Server, CloudPrem, and more | This Month in Datadog

Get a closer look at features we announced at AWS re:Invent in the latest episode of This Month in Datadog. Tune in for spotlights of Bits AI SRE, now generally available, and Datadog’s MCP Server, which connects AI agents to our platform by ingesting prompts and mapping them to Datadog resources and data. This Month in Datadog brings you the latest updates on our newest product features, announcements, resources, and events.

Datadog on Apache Iceberg

Historically, Datadog has relied on technologies like Snowflake and Apache Spark over raw Parquet files (lacking a consistent table structure) to power internal analytics and data science at scale. As usage grew across product teams, more features came to depend on data science teams, and our datasets expanded to include more telemetry data, these systems became complex to manage and govern, both technically and financially. The need for a more flexible and scalable solution led Datadog to adopt Apache Iceberg, an open source table format for data lakes that brings reliability and performance while remaining SQL-friendly.
Sponsored Post

Adding a CDN to a load balancer (for a much faster website)

Here at Raygun, we like to go fast. Really fast. That's what we do! When we see something that isn't zooming, we try to figure out how to make it go faster. So today, we're answering a simple (and relevant) question: how do we make our public site, raygun.com, much, much faster? The answer, at first glance, is simple: we serve it from a Content Delivery Network (CDN). But what if you have a load balancer serving your website, and you don't want to rebuild everything to serve from a CDN? Well, that's more complicated. Let's start by describing the issue.