Operations | Monitoring | ITSM | DevOps | Cloud

Datadog

Paving the Road for Proactive Reliability

At Expedia Group, Kaushik Patel and Nikos Katirtzis have thousands of engineers and micro-services. Heterogeneity in terms of infrastructure and technologies used over the years created inefficiencies and posed the need for a set of automated best practices for our engineering teams. Over the past 2 years, using a data-driven approach, we’ve worked on creating a set of platforms that helps teams to adopt good reliability practices, including chaos engineering, release safety, or automatic failover between cloud regions. In this talk Kaushik and Nikos will cover the platforms they’ve built, including how they used data to drive their investment decisions.

Detect Java code-level issues with Seagence and Datadog

In Java applications, concurrency issues can be difficult to reproduce and debug. Because work is scheduled nondeterministically across threads, the conditions that have led to an error in one execution of the program may not trigger the same issue the next time around. Exceptions that are silently handled—also known as swallowed exceptions—can also be challenging to debug because they typically do not leave any trace in the logs.

Quickly remediate issues in your Azure applications with Datadog Workflow Automation

Datadog Workflow Automation speeds up incident response and remediation for DevOps, SRE, and security teams by enabling them to automatically run predefined task sequences whenever specific alerts or security signals are triggered. After the feature’s initial release in 2023, Datadog is now excited to announce a significant expansion of its Workflow Automation capabilities with Azure actions, allowing engineers to create automated workflows for their Azure resources for the first time.

Improve your shift-left observability with the Datadog Service Catalog

Your applications are only as powerful as they are iterable. To keep up with their rapidly changing production environments, your teams need reliable CI/CD systems that implement best practices—including build and test automation, flaky test management, and deployment management. By optimizing their CI/CD pipelines, your teams can build their apps more efficiently, deploy them more safely, and catch bugs and security vulnerabilities before they make it to production.

How Toyota is using Datadog and AI/ML to invent new ways for humans to be more mobile #datadog

Toyota is best known for making great cars and trucks, and as a leader in technology and mobility, they are on a mission to build a better future where everyone has the freedom to move. By partnering with Datadog, Toyota is taking advantage of the latest AI/ML to innovate and invent new ways for humans to be more mobile, while future proofing Toyota’s tech stack.

Investigate your log processing with the Datadog Log Pipeline Scanner

Large-scale organizations typically collect and manage millions of logs a day from various services. Within these orgs, many different teams may set up processing pipelines to modify and enrich logs for security monitoring, compliance audits, and DevOps. Datadog Log Pipeline let you ingest logs from your entire stack, parse and enrich them with contextual information, add tags for usage attribution, generate metrics, and quickly identify log anomalies.

Scaling Up, One Network Bottleneck at a Time #shorts #datadog

Processing data at scale involves moving packets through a network—but what happens when that network isn't cooperative? Anatole Beuzon, a Software Engineer at Datadog, discusses how he investigated and resolved network issues in Datadog’s larger data-processing apps and how you can apply these same methods to your own production workloads.

Monitor Ray applications and clusters with Datadog

Ray is an open source compute framework that simplifies the scaling of AI and Python workloads for on-premise and cloud clusters. Ray integrates with popular libraries, data stores, and tools within the machine learning (ML) ecosystem, including Scikit-learn, PyTorch, and TensorFlow. This gives developers the flexibility to scale complex AI applications without making changes to their existing workflows or AI stack.