Latest Posts

Trace Distributed Map states for AWS Step Functions with Datadog

Jun 25, 2025 By Abhinav Vedmala In Datadog

AWS Step Functions offers the Distributed Map state, enabling you to coordinate massively parallel workloads within your serverless applications. With this feature, a single Step Functions execution can fan out into up to 10,000 parallel workflows, making it possible to process millions of items efficiently. This capability unlocks new possibilities for large-scale data processing, such as image transformation, log ingestion, or batch analytics.

Datadog + OpenAI: Codex CLI integration for AI-assisted DevOps

Jun 12, 2025 By Reilly Wood In Datadog

We are exploring how we can help on-call engineers troubleshoot incidents more effectively by providing the OpenAI Codex agent with access to real-time observability data in terminals. We've developed an integration and new tool visualizations that connect OpenAI's Codex CLI to the new Datadog MCP server. In this post, we'll share what we've been experimenting with: enabling an AI agent to retrieve production metrics, logs, and incidents from Datadog in real time and act on that context.

Optimize and troubleshoot AI infrastructure with Datadog GPU Monitoring

Jun 10, 2025 By Anjali Thatte In Datadog

As organizations bring more AI and LLM workloads into production, the underlying GPU infrastructure that supports them becomes even more critical to keeping them fast, reliable, and scalable. Inefficient GPU resource usage, for instance, can lead to longer runtimes and reduced throughput, negatively impacting overall model performance. Additionally, idle and underutilized GPUs can quickly drive up costs.

Datadog MCP Server: Connect your AI agents to Datadog tools and context

Jun 10, 2025 By Bowen Chen In Datadog

As development teams adopt AI-powered tools and build services that make use of AI agents, they want to extend their AI capabilities to incorporate familiar tools and observability data. However, AI agents struggle with regular API endpoints: they frequently fail to parse complex nested JSON hierarchies or handle errors incorrectly. As a result, these agents often fail to retrieve relevant results.

Automatically identify issues and generate fixes with Bits AI Dev

Jun 10, 2025 By Mike Leach In Datadog

Developers lose hours each week to a familiar troubleshooting loop: chase down telemetry across dashboards, decipher vague errors, and juggle alerts to find the signal worth fixing. Production issues, performance regressions, and security vulnerabilities all demand attention, but they often come with little context for taking action.

Improve performance and reliability with Proactive App Recommendations

Jun 10, 2025 By Yoann Robin In Datadog

As your organization grows, you may operate in increasingly complex environments and manage more services and larger teams to maintain them. Evolution like this can lead to an explosion of telemetry data from across your stack, including metrics, traces, logs, and frontend interactions. The benefit of greater visibility is often outweighed by the challenge of acting on the data you collect, and you can easily fall behind on implementing the fixes your services require to operate reliably and efficiently.

Ensure trust across the entire data life cycle with Datadog Data Observability

Jun 10, 2025 By Nicholas Thomson In Datadog

As data systems grow more complex and data becomes even more business-critical, teams struggle to detect and resolve issues that impact data quality, reliability, and, ultimately, trust. Engineers have to rely on manual checks and ad hoc SQL queries to catch data quality issues—often after teams relying on the data have noticed something has gone wrong.

Accelerate Oracle Cloud Infrastructure monitoring with Datadog OCI QuickStart

Jun 10, 2025 By Natalie Wilkinson In Datadog

Datadog’s Oracle Cloud Infrastructure integration enables you to collect metrics and logs from your entire OCI stack and monitor them within a single platform alongside other third-party technologies. Datadog’s new OCI QuickStart is a fully managed, single-flow setup experience that helps you monitor your OCI infrastructure and applications in just a few clicks.

Create and monitor LLM experiments with Datadog

Jun 10, 2025 By Tom Sobolik In Datadog

To efficiently optimize your LLM application before pushing to production, you need a comprehensive testing and evaluation framework. By running experiments, you can optimize prompts, fine-tune temperature and other key parameters, test complex agent architectures, and understand how your application may respond to atypical, complex, or adversarial inputs. However, it can be difficult to manage your experiment runs and aggregate the results for meaningful analysis.

Introducing Bits AI SRE, your AI on-call teammate

Jun 10, 2025 By Kai Xin Tai In Datadog

Getting paged pulls engineers away from meaningful work, yet incident response in many organizations remains manual, reactive, and draining. An alert fires and teams scramble to find the root cause, relying on siloed knowledge, incomplete context, and a few on-call experts who are already stretched thin. The rise of AI coding agents has only intensified this challenge: As teams ship code faster with less human oversight, production systems grow increasingly complex and harder to understand.
