Operations | Monitoring | ITSM | DevOps | Cloud

April 2024

This Month in Datadog: Bits AI for Incident Management, KSPM, New Observability Pipelines, and more

Datadog is constantly elevating the approach to cloud monitoring and security. This Month in Datadog updates you on our newest product features, announcements, resources, and events. To learn more about Datadog and start a free 14-day trial, visit Cloud Monitoring as a Service | Datadog. This month, we put the Spotlight on Bits AI for Incident Management.

Best practices for monitoring managed ML platforms

Machine learning (ML) platforms such as Amazon Sagemaker, Azure Machine Learning, and Google Vertex AI are fully managed services that enable data scientists and engineers to easily build, train, and deploy ML models. Common use cases for ML platforms include natural language processing (NLP) models for text analysis and chatbots, personalized recommendation systems for e-commerce web applications and streaming services, and predictive business analytics.

Best practices for monitoring ML models in production

Regardless of how much effort teams put into developing, training, and evaluating ML models before they deploy, their functionality inevitably degrades over time due to several factors. Unlike with conventional applications, even subtle trends in the production environment a model operates in can radically alter its behavior. This is especially true of more advanced models that use deep learning and other non-deterministic techniques.

Lessons learned from running a large gRPC mesh at Datadog

Datadog’s infrastructure comprises hundreds of distributed services, which are constantly discovering other services to network with, exchanging data, streaming events, triggering actions, coordinating distributed transactions involving multiple services, and more. Implementing a networking solution for such a large, complex application comes with its own set of challenges, including scalability, load balancing, fault tolerance, compatibility, and latency.

Access Datadog privately and monitor your Google Cloud Private Service Connect usage

Private Service Connect (PSC) is a Google Cloud networking product that enables you to access Google Cloud services, third-party partner services, and company-owned applications directly from your Virtual Private Cloud (VPC). PSC helps your network traffic remain secure by keeping it entirely within the Google Cloud network, allowing you to avoid public data transfer and save on egress costs. With PSC, producers can host services in their own VPCs and offer a private connection to their customers.

Control your log volumes with Datadog Observability Pipelines

Modern organizations face a challenge in handling the massive volumes of log data—often scaling to terabytes—that they generate across their environments every day. Teams rely on this data to help them identify, diagnose, and resolve issues more quickly, but how and where should they store logs to best suit this purpose? For many organizations, the immediate answer is to consolidate all logs remotely in higher-cost indexed storage to ready them for searching and analysis.

Aggregate, process, and route logs easily with Datadog Observability Pipelines

The volume of logs generated from modern environments can overwhelm teams, making it difficult to manage, process, and derive measurable value from them. As organizations seek to manage this influx of data with log management systems, SIEM providers, or storage solutions, they can inadvertently become locked into vendor ecosystems, face substantial network costs and processing fees, and run the risk of sensitive data leakage.

Dual ship logs with Datadog Observability Pipelines

Organizations often adjust their logging strategy to meet their changing observability needs for use cases such as security, auditing, log management, and long-term storage. This process involves trialing and eventually migrating to new solutions without disrupting existing workflows. However, configuring and maintaining multiple log pipelines can be complex. Enabling new solutions across your infrastructure and migrating everyone to a shared platform requires significant time and engineering effort.

A closer look at our navigation redesign

Helping our users gain end-to-end visibility into their systems is key to the Datadog platform— to achieve this, we offer over 20 products and more than 700 integrations. However, with an ever-expanding, increasingly diverse catalog, it’s more important than ever that users have clear paths for quickly finding what they need.

Recapping Datadog Summit London 2024

In the last week of March 2024, Datadog hosted its latest Datadog Summit in London to celebrate our community. As Jeremy Garcia, Datadog’s VP of Technical Community and Open Source, mentioned during his welcome remarks, London is the first city that has seen two Datadog Summits, with the first one in 2018. It was great to be able to see how our community there has grown over the past six years.

And What About my User Experience?

Monitoring backend signals has been standard practice for years, and tech companies have been alerting their SRE and software engineers when API endpoints are failing. But when you’re alerted about a backend issue, it’s often your end users who are directly affected. Shouldn’t we observe and alert on this user experience issues early on? As frontend monitoring is a newer practice, companies often struggle to identify signals that can help them pinpoint user frustrations or performance problems.

What is an Anomaly? Avoiding False Positives in Watchdog Detected Anomalies

In 2018 Datadog released Watchdog to proactively detect anomalies on your observability data. But what defines an anomaly? How do you avoid false positives? At Datadog Summit London 2024, Nils Bunge, product manager at Datadog, shared the story of the creation of the first Datadog AI feature (Watchdog Alert), what we learned from it and how we applied those lessons to all the added AI functionalities across the years.

Stay up to date on the latest incidents with Bits AI

Since the release of ChatGPT, there’s been growing excitement about the potential of generative AI—a class of artificial intelligence trained on pre-existing datasets to generate text, images, videos, and other media—to transform global businesses. Last year, we released our own generative AI-powered DevOps copilot called Bits AI in private beta. Bits AI provides a conversational UI to explore observability data using natural language.

Monitor SQS with Data Streams Monitoring

Datadog Data Streams Monitoring (DSM) provides detailed visibility into your event-driven applications and streaming data pipelines, letting you easily track and improve performance. We’ve covered DSM for Kafka and RabbitMQ users previously on our blog. In this post, we’ll guide you through using DSM to monitor applications built with Amazon Simple Queue Service (SQS).

Datadog on Site Reliability Engineering #shorts #datadog #observability

There are many different ways to implement Site Reliability Engineering (SRE). From team structures to roles and responsibilities to planning and prioritization flows, there’s no golden path for how to organize things. As Datadog has shifted from a startup to a quickly-growing public company, we’ve seen our own SRE practice evolve. With over 22,000 customers sending trillions of data points each day, keeping Datadog reliable is critical to our business.

Empower engineers to take ownership of Google Cloud costs with Datadog

Google Cloud provides a wide range of services and tools to help engineering teams reduce the complexity of migrating and deploying applications in the cloud. As engineering teams work to improve the performance, reliability, and security of their applications, they also need to be conscious of cloud costs. But engineers often don’t have access to cost data, or they only see cost data in monthly reports.

Filter and correlate logs dynamically using Subqueries

Logs provide valuable information that can help you troubleshoot performance issues, track usage patterns, and conduct security audits. To derive actionable insights from log sources and facilitate thorough investigations, Datadog Log Management provides an easy-to-use query editor that enables you to group logs into patterns with a single click or perform reference table lookups on-the-fly for in-depth analysis.