This Month in Datadog - July 2025

In July’s episode of This Month in Datadog, we’re doing things differently by spotlighting the people behind the products you rely on. Jeremy is joined by Tristan Ratchford to discuss saving time and effort when you’re on call with Bits AI SRE, and by Kevin Hu to explore gaining visibility into datasets across the entire data lifecycle with Data Observability.

Datadog Disaster Recovery mitigates cloud provider outages

A loss of infrastructure and application observability can leave SRE and DevOps teams without insight into the real-time state of their production systems, forcing them to temporarily pause code deployments and limiting their ability to troubleshoot issues or respond to critical alerts. In modern cloud environments, where services are distributed and deeply interconnected, this lack of visibility can escalate quickly.

Bring high-performance observability to secure Kubernetes environments with Datadog's new CSI driver

In Kubernetes environments, applications often communicate with the Datadog Agent to send telemetry data such as custom metrics via DogStatsD or traces through Datadog APM. How this communication takes place depends on the communication mode set on the Datadog Cluster Agent's Admission Controller. With the sockets option, communication takes place through local inter-process communication via Unix domain sockets (UDS), whereas the service and default hostip options rely on network communication.
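
To make the difference between the two transports concrete, here is a minimal sketch in Python using only the standard library. It formats a metric in the DogStatsD datagram format and sends it either over a Unix domain socket or over UDP. The function names, the `mode` values ("uds" and "network"), and the socket path are illustrative assumptions, not part of the Admission Controller's configuration; in a real deployment the path and host would come from whatever the Admission Controller injects into the pod.

```python
import socket

# Assumed socket path for illustration; the real path depends on how the Agent is configured.
ASSUMED_SOCKET_PATH = "/var/run/datadog/dsd.socket"


def format_metric(name, value, metric_type="c", tags=None):
    """Build a DogStatsD datagram: <name>:<value>|<type>|#<tag1>,<tag2>."""
    payload = f"{name}:{value}|{metric_type}"
    if tags:
        payload += "|#" + ",".join(tags)
    return payload.encode("utf-8")


def send_metric(payload, mode="uds", socket_path=ASSUMED_SOCKET_PATH,
                host="127.0.0.1", port=8125):
    """Send one datagram over a Unix domain socket ("uds") or over UDP (any other mode).

    The mode names are hypothetical labels for this sketch.
    """
    if mode == "uds":
        # Local inter-process communication: no network stack involved.
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
        address = socket_path
    else:
        # Network communication to the Agent's DogStatsD port.
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        address = (host, port)
    try:
        sock.sendto(payload, address)
    finally:
        sock.close()
```

Calling `send_metric(format_metric("page.views", 1, tags=["env:prod"]))` would attempt the UDS path and raise an `OSError` if no socket exists at that location, which is one reason production clients usually wrap the send in error handling.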

This Month in Datadog: Bits AI SRE, Datadog Data Observability, and more

Datadog is constantly elevating the approach to cloud monitoring and security. This Month in Datadog updates you on our newest product features, announcements, resources, and events. To learn more about Datadog and start a free 14-day trial, visit the Datadog website. This month, we chat with two guests about Bits AI SRE and Datadog Data Observability.

AI Agents Console: Monitor the behavior and interactions of any AI agent in your stack

With Datadog's AI Agents Console, you can monitor the behavior and interactions of any AI agent that’s part of your enterprise stack, whether that’s a computer use agent like OpenAI’s Operator, an IDE agent like Cursor, a DevOps agent like GitHub Copilot, an enterprise business agent like Agentforce, or your internally built agents. You'll have full visibility into every agent's actions, insights into the security and performance of your agents, analytics on user engagement, and measurable business value from every agent, all in a centralized location.

New in APM

Datadog’s Latency Investigator for APM, now in Preview, automatically investigates hypotheses in the background, comparing historical traces and correlating change tracking, DBM, and profiling signals. This helps teams quickly isolate root causes and understand impact without combing through raw telemetry data. You can go from detection to resolution in a single workflow, and generate a pull request to apply a recommended fix, all without leaving Datadog.

Data Observability: Build confidence in the data life cycle

Datadog Data Observability provides a complete solution with quality checks (e.g., volume, row changes, freshness), custom SQL-based monitors, anomaly detection, column-level lineage across systems like Snowflake and Tableau, full pipeline visibility, and targeted alerts when data issues arise.

Why continuous profiling is the fourth pillar of observability

Developers have long used profilers to diagnose performance bottlenecks and improve the efficiency of their code. But a modern version of profiling, continuous profiling, is quietly redefining what profiling is and what it can do. By running nonstop in production with very low overhead, continuous profilers give teams always-on visibility into how their code behaves in the real world.

Debug live production issues with the Datadog Cursor extension

The Datadog Cursor Extension uses the Datadog remote MCP Server to give developers access to Datadog tools and observability data directly from within the Cursor IDE. The Cursor Extension enables you to view live variable values that your logpoints capture during execution, and you can use the Cursor Agent to identify the lines of code responsible for the issue at hand. The Datadog Cursor Extension is now available in Preview.

How Datadog Cloud Network Monitoring helps you move to a deny-by-default network egress policy at scale

When organizations first begin deploying workloads on Kubernetes, it's common for them to start with a permissive egress traffic policy that allows any workload to reach the internet. This approach can make it easier for teams to stay agile and to get services up and running in fast-moving environments. But as your Kubernetes footprint grows, it's important to minimize public internet access on a per-workload basis to improve your organization's security posture.
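
As a rough illustration of what deny-by-default egress means at the Kubernetes level, the sketch below builds two standard `NetworkPolicy` manifests as Python dictionaries: one that denies all egress for every pod in a namespace, and one that re-opens a single destination for selected workloads. This is a generic Kubernetes example rather than Datadog's method; the function names, namespace, labels, and CIDR are hypothetical, and enforcement assumes a network plugin that supports `NetworkPolicy`.

```python
import json


def deny_all_egress_policy(namespace, name="default-deny-egress"):
    """Return a NetworkPolicy manifest that denies all egress in a namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {},          # empty selector matches every pod in the namespace
            "policyTypes": ["Egress"],  # no egress rules are listed, so all egress is denied
        },
    }


def allow_egress_to_cidr(namespace, name, pod_labels, cidr, port):
    """Return a NetworkPolicy that lets matching pods reach one CIDR on one TCP port."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": pod_labels},
            "policyTypes": ["Egress"],
            "egress": [
                {
                    "to": [{"ipBlock": {"cidr": cidr}}],
                    "ports": [{"protocol": "TCP", "port": port}],
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical namespace, labels, and address range, for illustration only.
    print(json.dumps(deny_all_egress_policy("payments"), indent=2))
    print(json.dumps(
        allow_egress_to_cidr("payments", "allow-billing-api",
                             {"app": "checkout"}, "203.0.113.0/24", 443),
        indent=2,
    ))
```

The JSON output can be applied with `kubectl apply -f`, since Kubernetes accepts JSON manifests as well as YAML. Note that a blanket egress denial also blocks DNS lookups, so a real rollout typically adds an explicit allow rule for the cluster's DNS service.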

Bits AI Dev Agent: Automatically identify issues and generate code fixes

The Bits Dev Agent is an AI-powered coding assistant in Datadog designed to reclaim developer productivity by autonomously monitoring telemetry data, identifying key issues, and generating production-ready pull requests. Developers receive asynchronous, context-rich PRs with clear explanations, allowing them to shift their focus from troubleshooting to reviewing solutions and building better code.

Introducing Bits AI SRE, your AI on-call teammate

Bits AI SRE is your AI on-call teammate, built to autonomously investigate alerts and coordinate incident response. Integrated with Datadog, Slack, GitHub, Confluence, and more, Bits analyzes telemetry, reads documentation, and reviews recent deployments to determine the root cause of alerts—often before you’ve even opened your laptop. In fact, if you're using Datadog On-Call, you can view Bits’s findings right from your phone—so you’re always one step ahead, no matter where you are.

Datadog Incident Response: Unify remediation and communication

With Datadog's new AI voice agent in Incident Response, you can quickly get up to speed on the issue and start taking action directly from your phone. Handoff notifications make it easy to jump straight to the relevant context and quickly communicate with other responders. Finally, our status pages enable you to automatically update users on your remediation progress.

Monitor Lambda-hosted web apps with the Lambda Web Adapter integration

As organizations migrate their legacy web applications from containerized or server-based deployments to serverless environments, they often run into a critical compatibility challenge. Traditional web frameworks like Flask, Express, or Spring Boot are designed to run on persistent HTTP servers, not event-driven, stateless environments like AWS Lambda. The AWS Lambda Web Adapter bridges this gap by allowing teams to run web server-based applications inside Lambda with minimal changes.

Choosing the right OpenTelemetry Collector distribution

The OpenTelemetry (OTel) Collector plays a central role in collecting, processing, and exporting telemetry data. If you’re deploying the Collector in production, chances are you’ve reached for the otelcol-contrib distribution. It’s the easiest, most flexible, and most documented distribution, used in nearly every demo and getting-started guide. But here’s the catch: It’s not actually recommended for production use.

Missing container-layer metadata: Why it happens and what you can do

Container image layers provide valuable insight into what goes into a container, including which packages were installed, what commands were run, and where vulnerabilities might live. The metadata associated with these image layers is essential for debugging, optimizing image size, and managing security risks. However, key container-layer metadata fields such as digest, size, and created_by are sometimes missing, which can disrupt important tasks.
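
One simple way to spot the problem is to scan layer records for absent fields. The sketch below assumes each layer is available as a Python dictionary keyed by the field names mentioned above; the record shape and the function name are hypothetical, since the actual structure depends on the tool that produced the metadata.

```python
# Field names taken from the text above; the record shape is an assumption.
REQUIRED_FIELDS = ("digest", "size", "created_by")


def missing_layer_fields(layers):
    """Map each layer index to the required fields that are absent or empty.

    A size of 0 is treated as present, because empty layers are legitimate.
    """
    report = {}
    for index, layer in enumerate(layers):
        missing = [field for field in REQUIRED_FIELDS
                   if layer.get(field) in (None, "")]
        if missing:
            report[index] = missing
    return report
```

Layers with complete metadata are left out of the report, so an empty dictionary means nothing is missing.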

A look back at DASH 2025

DASH 2025 brought the Datadog community together like never before. During our biggest event yet, thousands of attendees gathered at the North Javits Center in New York City for two and a half days of content, learning, and community, where they deepened their knowledge and connected with peers. Here's a quick look back at some of the highlights from this year's DASH.

Proactively troubleshoot with synthetic testing and distributed tracing

As your application grows in complexity, identifying the root cause of issues becomes increasingly difficult. Many monitoring strategies make this even harder by siloing frontend and backend data. To effectively troubleshoot problems that spread across your app, you need visibility not just into each part of your stack, but also into how these parts interact.

Monitor agents built on Amazon Bedrock with Datadog LLM Observability

As large language models (LLMs) grow more powerful, organizations are deploying agentic AI applications to tackle complex, multi-step tasks. With Amazon Bedrock Agents, developers can orchestrate these agents to manage tasks such as triggering serverless functions, calling APIs, accessing knowledge bases, and maintaining contextual conversations—all while breaking down complex user requests or tasks into manageable steps.

Beyond Metrics: How We Reimagined Incident Response with RUM

When your monitoring tools and logs tell you everything's fine, but users can't access critical healthcare services, where do you look? Our team discovered that Real User Monitoring (RUM) isn't just for tracking page load times and user journeys – it's a powerful incident response tool that can uncover issues traditional monitoring misses entirely.

Datadog named Leader in 2025 Gartner Magic Quadrant for Observability Platforms

We are thrilled to announce that, for the fifth consecutive year, Datadog has been named a Leader in the 2025 Gartner Magic Quadrant for Observability Platforms. We believe that this recognition reflects our continued focus on helping customers observe, secure, and act on everything that matters across their technology stack.

Here's how to add business data to logs from retail endpoints | Datadog Tips & Tricks

Some sources simply do not generate data-rich logs. Retail endpoints that are older or run on proprietary services, for example, very often produce logs without the kinds of data that are needed to perform useful business analytics. So, what can you do?

Troubleshoot root causes with GitHub commit and ownership data in Error Tracking

When an error occurs, developers need to act quickly. But too often, they’re left searching through stack traces without enough context to understand what happened, who owns the code, or what change may have introduced the issue. This slows down triage, creates inefficient handoffs, and takes time away from building new features.

Monitor your LiteLLM AI proxy with Datadog

As organizations rapidly scale their use of large language models (LLMs), many teams are adopting LiteLLM to simplify access to a diverse set of LLM providers and models. LiteLLM provides a unified interface through both an SDK and proxy to speed up development, centralize control, and optimize LLM-powered workflows. But introducing a proxy layer adds abstraction, making it harder to understand how requests are processed.

Reduce your mean time to repair with the Datadog mobile app

For on-call engineers responding to alerts, every minute counts. Faster incident response means faster mitigation, reduced downtime, and better customer experience. But even the most finely tuned, meticulously detailed alerts can leave responders scrambling for more information. In order to effectively triage and investigate incidents and set remediation in motion, responders need data to help them contextualize alerts.

How we created a single app to automate repetitive tasks with Datadog Workflow Automation, Datastore, and App Builder

For many organizations, scaling up their systems means incorporating new tools to build out infrastructure, optimize code performance and security, improve communication, and track cost changes. While these changes are necessary to support an increased workload, they often result in a situation where even the most basic tasks involve switching between multiple platforms.

Why GovRAMP-authorized observability matters for state, local, and education IT teams

Building on our FedRAMP Moderate authorization and our “In Process” status for FedRAMP High, Datadog for Government is now "In Process" for GovRAMP High Authorization, giving agencies a unified observability platform that meets the toughest public-sector security bars.