
Latest Posts

Save up to 14 percent CPU with continuous profile-guided optimization for Go

We are excited to release our tooling for continuous profile-guided optimization (PGO) for Go. You can now reduce the CPU usage of your Go services by up to 14 percent by adding a single command before the go build step in your CI pipeline. You will also need to supply a DD_API_KEY and a DD_APP_KEY in your environment. Please check our documentation for more details on setting this up securely.
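As a sketch, that pre-build step might look like the following; the module path reflects our datadog-pgo tooling, but the service and env tags are placeholders, so check the documentation for the exact invocation for your setup:

```shell
# Before `go build`: pull recent production CPU profiles from Datadog and
# write a default.pgo file in the main package directory, which the Go
# toolchain (1.21+) picks up automatically at build time.
# Requires DD_API_KEY and DD_APP_KEY in the environment.
go run github.com/DataDog/datadog-pgo@latest "service:my-service env:prod" ./cmd/my-service/default.pgo

go build ./cmd/my-service
```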

Manage incidents seamlessly with the Datadog Slack integration

Modern, distributed application architectures pose particular challenges when it comes to coordinating incident management. DevOps, SREs, and security teams—often spread out across separate locations and time zones, and equipped with limited knowledge of each other’s services—must work quickly to collaboratively triage, troubleshoot, and mitigate customer impact.

Aggregate, correlate, and act on alerts faster with AIOps-powered Event Management

Maintaining service availability is a challenge in today’s complex cloud environments. When a critical incident arises, the underlying cause can be buried in a sea of alerts from interconnected services and applications. Central operations teams often face an overload of disparate alerts, causing confusion, delayed incident response, alert fatigue, and redundant resolution efforts. These issues can negatively impact revenue and customer experience, especially during an outage.

Track changes in your containerized infrastructure with Container Image Trends

Datadog’s Container Images view provides key insights into every container image used in your environment, helping you quickly detect and remediate security and performance problems that can affect multiple containers in your distributed system. In addition to having a snapshot of the performance of your container fleet, it’s also critical to understand large-scale trends in security posture and resource utilization over time.

Best practices for monitoring managed ML platforms

Machine learning (ML) platforms such as Amazon SageMaker, Azure Machine Learning, and Google Vertex AI are fully managed services that enable data scientists and engineers to easily build, train, and deploy ML models. Common use cases for ML platforms include natural language processing (NLP) models for text analysis and chatbots, personalized recommendation systems for e-commerce web applications and streaming services, and predictive business analytics.

Best practices for monitoring ML models in production

Regardless of how much effort teams put into developing, training, and evaluating ML models before they deploy, model performance inevitably degrades over time due to several factors. Unlike with conventional applications, even subtle trends in the production environment a model operates in can radically alter its behavior. This is especially true of more advanced models that use deep learning and other non-deterministic techniques.
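One common, model-agnostic way to quantify this kind of degradation is to compare a feature's production distribution against its training-time baseline. The Population Stability Index (PSI) sketch below is a generic illustration rather than tooling from the post, and all of the data is synthetic:

```python
import numpy as np

def psi(expected, actual, bins=10):
    # Population Stability Index: measures how far a production sample
    # ("actual") has drifted from a baseline sample ("expected").
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the proportions to avoid division by zero and log(0).
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 10_000)  # feature at training time
drifted = rng.normal(0.5, 1.0, 10_000)   # production: subtle mean shift

print(psi(baseline, baseline[:5000]))  # near zero: same distribution
print(psi(baseline, drifted))          # noticeably larger: drift
```

A common rule of thumb treats PSI below 0.1 as stable and above 0.25 as significant drift, which is why even a half-standard-deviation shift like the one above is worth alerting on.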

Lessons learned from running a large gRPC mesh at Datadog

Datadog’s infrastructure comprises hundreds of distributed services, which are constantly discovering other services to network with, exchanging data, streaming events, triggering actions, coordinating distributed transactions involving multiple services, and more. Implementing a networking solution for such a large, complex application comes with its own set of challenges, including scalability, load balancing, fault tolerance, compatibility, and latency.

Access Datadog privately and monitor your Google Cloud Private Service Connect usage

Private Service Connect (PSC) is a Google Cloud networking product that enables you to access Google Cloud services, third-party partner services, and company-owned applications directly from your Virtual Private Cloud (VPC). PSC helps your network traffic remain secure by keeping it entirely within the Google Cloud network, allowing you to avoid public data transfer and save on egress costs. With PSC, producers can host services in their own VPCs and offer a private connection to their customers.

Control your log volumes with Datadog Observability Pipelines

Modern organizations face a challenge in handling the massive volumes of log data—often scaling to terabytes—that they generate across their environments every day. Teams rely on this data to help them identify, diagnose, and resolve issues more quickly, but how and where should they store logs to best suit this purpose? For many organizations, the immediate answer is to consolidate all logs remotely in higher-cost indexed storage to ready them for searching and analysis.

Aggregate, process, and route logs easily with Datadog Observability Pipelines

The volume of logs generated from modern environments can overwhelm teams, making it difficult to manage, process, and derive measurable value from them. As organizations seek to manage this influx of data with log management systems, SIEM providers, or storage solutions, they can inadvertently become locked into vendor ecosystems, face substantial network costs and processing fees, and run the risk of sensitive data leakage.