
Latest Posts

Monitor Ray applications and clusters with Datadog

Ray is an open source compute framework that simplifies the scaling of AI and Python workloads for on-premises and cloud clusters. Ray integrates with popular libraries, data stores, and tools within the machine learning (ML) ecosystem, including Scikit-learn, PyTorch, and TensorFlow. This gives developers the flexibility to scale complex AI applications without making changes to their existing workflows or AI stack.

Track service provider outages with IsDown and Datadog

When your apps and infrastructure rely on dozens of third-party providers for key functionality, it’s important to closely track their outages. If a service you rely on goes down, you need to move quickly to limit the outage’s impact on your users. IsDown provides a detailed status page aggregator and uptime monitoring for all your third-party dependencies.

Monitor your chaos engineering experiments with Steadybit's offering in the Datadog Marketplace

Steadybit is a software reliability platform that uses chaos engineering and fault injection to help organizations improve the stability and performance of their applications. By allowing customers to simulate turbulent scenarios in a controlled environment, Steadybit enables you to identify and mitigate potential system issues to reduce downtime and improve resilience.

A deep dive into CPU requests and limits in Kubernetes

In a previous blog post, we explained how containers’ CPU and memory requests can affect how they are scheduled. We also introduced some of the effects CPU and memory limits can have on applications, assuming that CPU limits were enforced by the Completely Fair Scheduler (CFS) quota. In this post, we are going to dive a bit deeper into CPU and share some general recommendations for specifying CPU requests and limits.
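As a rough illustration of the CFS quota mentioned above, the sketch below shows how a container's CPU request and limit are commonly translated into cgroup v1 settings. It assumes the default 100 ms CFS period and the usual 1024-shares-per-core convention; the function names are hypothetical and are not taken from the post or from any Kubernetes API.

```python
# Minimal sketch, assuming cgroup v1 and the default CFS period of 100 ms.
# All names here are illustrative, not part of Kubernetes itself.

CFS_PERIOD_US = 100_000   # assumed default CFS period (100 ms)
SHARES_PER_CPU = 1024     # conventional cpu.shares value for one full core
MIN_SHARES = 2            # smallest cpu.shares value assumed to be allowed
MIN_QUOTA_US = 1_000      # smallest quota assumed to be allowed (1 ms)


def parse_millicores(cpu: str) -> int:
    """Convert a Kubernetes-style CPU quantity ("500m", "2", "0.5") to millicores."""
    if cpu.endswith("m"):
        return int(cpu[:-1])
    return int(round(float(cpu) * 1000))


def cpu_shares(request_millicores: int) -> int:
    """Relative weight derived from the CPU request; only matters under contention."""
    return max(MIN_SHARES, request_millicores * SHARES_PER_CPU // 1000)


def cfs_quota_us(limit_millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """CPU time (in microseconds) the container may use per CFS period."""
    if limit_millicores == 0:
        return 0  # treated here as "no limit set"
    return max(MIN_QUOTA_US, limit_millicores * period_us // 1000)


if __name__ == "__main__":
    request, limit = parse_millicores("250m"), parse_millicores("500m")
    print("cpu.shares:", cpu_shares(request))
    print("cfs_quota_us:", cfs_quota_us(limit), "per", CFS_PERIOD_US, "us period")
```

Under these assumptions, a limit of `500m` lets the container run for half of each 100 ms period before it is throttled, regardless of how idle the node is, while the request only sets a relative weight that comes into play when CPUs are contended.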

Highlights from AWS re:Invent 2023

Whether or not you made the journey to this year’s re:Invent, plenty of great announcements get lost amid an action-packed week of keynotes, breakouts, expo hall demos, and networking sessions. No need to worry: we’re always happy to be a big part of the re:Invent experience and share our observations with you.

Introducing CoTerm, your collaborative terminal for pair programming and debugging

For too long, engineers have had to piece together an unwieldy combination of tools to collaboratively debug and resolve incidents while pair programming in real time. These activities normally require developers to work individually through a terminal, but the patchwork solutions that allow teams to work together in terminals all have significant drawbacks.

Monitor Amazon S3 Express One Zone with Datadog

Amazon Simple Storage Service (S3) now offers a high-performance storage class, S3 Express One Zone, that delivers consistent single-digit millisecond data access for your most latency-sensitive applications. Designed for your most frequently accessed datasets, S3 Express One Zone replicates and stores your data within a single AWS Availability Zone, scales to process millions of requests per minute, and uses hardware and software optimized for low latency.

Govern your infrastructure resources with the Datadog Resource Catalog

As an administrator of an expanding, highly distributed infrastructure, you may be responsible for overseeing thousands of on-premises and cloud resources from multiple providers, governed under dozens of accounts by a complex nest of RBAC rules. To query all these resources for purposes such as compliance audits and access management, you may be required to write custom scripts and painstakingly sift through data across disparate tools.

Monitor and improve your CI/CD on AWS CodePipeline with Datadog CI Visibility

CI/CD services such as AWS CodePipeline enable developers to automate and accelerate the process of building, testing, and deploying code. But with the speed, scale, and complexity of the modern software development life cycle, even small performance regressions or increases in failure rates in your CI system can quickly snowball, slowing or even halting releases and causing cost overruns.

Enhance your troubleshooting workflow with Container Images in Datadog Container Monitoring

Containers are powerful tools for scaling and deploying your applications, but with so many components pulled from different sources, there’s a greater potential for issues within them to go undetected. As a result, you need to monitor every layer of your containerized environments for vulnerabilities and performance problems—from your application to your container images.