
August 2023

Monitor Google Cloud Vertex AI with Datadog

Vertex AI is Google’s platform offering AI and machine learning computing as a service—enabling users to train and deploy machine learning (ML) models and AI applications in the cloud. In June 2023, Google added generative AI support to Vertex AI, so users can test, tune, and deploy Google’s large language models (LLMs) for use in their applications.
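For context, calling one of these hosted models from application code is only a few lines with the Vertex AI Python SDK. The sketch below is illustrative only: the project ID, region, and prompt are placeholders, and it assumes the google-cloud-aiplatform package and the PaLM text model that Vertex AI exposed when generative AI support launched.

```python
# Illustrative sketch only: project, location, and prompt are placeholders.
# Assumes the google-cloud-aiplatform package, which provides the vertexai SDK.
import vertexai
from vertexai.preview.language_models import TextGenerationModel

vertexai.init(project="my-gcp-project", location="us-central1")

model = TextGenerationModel.from_pretrained("text-bison")
response = model.predict(
    "Summarize the benefits of monitoring ML workloads.",
    temperature=0.2,
    max_output_tokens=256,
)
print(response.text)
```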

This Month in Datadog: DASH 2023 Recap, featuring Bits AI, Single-Step APM Instrumentation, and more

Datadog is constantly elevating its approach to cloud monitoring and security. This Month in Datadog updates you on our newest product features, announcements, resources, and events. This month, we’re recapping DASH 2023.

Datadog On Mobile Software Development

Understanding the health and user experience of your mobile application is critical for avoiding user frustration, understanding application crashes, and reducing the mean time to resolution for bugs. To help with that task, Datadog has a mobile monitoring solution that allows developers to better understand and improve their applications. But what should you take into account when building mobile observability SDKs? How can we gather the right telemetry without affecting the underlying application?

Visualize service ownership and application boundaries in the Service Map

The complexity of microservice architectures can make it hard to determine where an application’s dependencies begin and end and who manages which ones. This can pose a variety of challenges both in the course of day-to-day operations and during incidents. Lacking a clear picture of the ownership and interplay of your services can impede accountability and cause application development, incident investigations, and onboarding processes to become prolonged and haphazard.

Generative AI and Observability Automation - Sajid Mehmood & Michael Gerstenhaber

One of the biggest challenges in observability is separating the signal from the noise. As artificial intelligence (AI) tools become more powerful and accessible, they have generated a lot of buzz around the role of AI in the performance and reliability of our technical systems and the teams that build and operate them. In this fireside chat, Michael Gerstenhaber (Datadog VP of Product) and Sajid Mehmood (Datadog VP of Engineering) will sift through the hype to chat about what generative AI and large language models (LLMs) will really mean for the future of observability and how they can benefit your teams today.

Efficiency and Effectiveness

With unlimited money, most technology problems become easy to solve. But how do you design, build, and operate large-scale, performant systems without breaking the bank? In this session, Chandru Subramanian (Director of Engineering, Runtime Efficiency at Datadog) and Neil Innes (Sr. Engineering Manager, DevOps at FanDuel) will discuss how they balance efficiency and effectiveness to save money while also meeting key goals.

Right Size, Right Performance, Right Time

It’s been said that “premature optimization is the root of all evil.” On the other hand, many engineers have also had to work with software riddled with so much technical debt and inefficiency that optimization is practically impossible and a complete rewrite is required. So when is the right time? In this panel session, we’ll talk with engineering leaders and architects about their approach to software optimization, when to do it, and how to design systems that scale and stay performant.

CTO Fireside Chat

Building large scale technical systems is hard, but building and scaling high performing technical organizations is even more difficult. In this session, Datadog Co-founder and CTO Alexis Lê-Quôc will sit down with Prashant Pandey, Head of Engineering at Asana, to discuss their approach to engineering leadership. They’ll share the hard-learned lessons from their long careers to help you cultivate better technical teams, covering topics ranging from staying in tune with new technologies and enabling innovation to shipping modern ML- and AI-based features and scaling teams.

The Darkside of GraphQL

GraphQL is a query language for APIs that provides a powerful and efficient way to query and manipulate data. As powerful and versatile as GraphQL is, its downside is that it can be vulnerable to certain security threats. In this presentation, we will discuss the security vulnerabilities associated with GraphQL, from the basics to more advanced threats, and how to best protect against them. After this presentation, attendees will have a better understanding of security vulnerabilities in GraphQL, as well as an understanding of the steps needed to protect against them.
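As a concrete illustration of one common threat, an attacker can send an abusively deep, self-referential query that forces the server into an enormous amount of resolver work. The sketch below is hypothetical and framework-agnostic: it simply counts brace nesting to reject over-deep queries before they reach the schema. Real deployments should lean on the depth- and complexity-limiting features of their GraphQL server library.

```python
# Hypothetical illustration: a crude depth check that rejects abusively nested
# GraphQL queries before they reach the resolver layer.
MAX_DEPTH = 5

DEEPLY_NESTED_QUERY = """
query {
  author { posts { author { posts { author { posts { title } } } } } }
}
"""

def query_depth(query: str) -> int:
    """Return the maximum '{' nesting depth of a query string."""
    depth = max_depth = 0
    for ch in query:
        if ch == "{":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == "}":
            depth -= 1
    return max_depth

if query_depth(DEEPLY_NESTED_QUERY) > MAX_DEPTH:
    print("rejected: query nesting exceeds the allowed depth")
```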

Innovating with Faster, Safer Experimentation

Experimentation is the key to innovation. But experiments come with risks, not just of failure, but of wasted time, effort, and money. I’ll share the experimental approach that NTT DOCOMO, Japan’s largest wireless provider, takes to build digital products that customers love. I’ll also present examples from experiments we performed on NTT DOCOMO’s Smart-life website that improved the user experience and significantly increased conversion rates. In this session, you’ll learn how to reduce the risk of experiments and iterate faster to improve your services.

Container Security Fundamentals - Linux Namespaces (Part 4): The User Namespace

In this video we continue our examination of Linux namespaces by looking at some details of how the user namespace can be used to de-couple the user ID inside a container from the user ID on the host, allowing a container to run as the root user without the risks of being root on the host. To learn more, read our blog on Datadog’s Security Labs site.
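To make the idea concrete, here is a minimal, Linux-only sketch (assumptions noted in the comments) that uses Python's ctypes to call unshare(CLONE_NEWUSER) and then writes the uid_map and gid_map files so that UID 0 inside the new namespace maps back to the unprivileged user who launched the process.

```python
# Linux-only sketch of the user-namespace mapping described above.
# Run as an ordinary, unprivileged user: inside the new namespace the process
# sees itself as root (uid 0), but on the host it is still the original user.
import ctypes
import os

CLONE_NEWUSER = 0x10000000
libc = ctypes.CDLL(None, use_errno=True)

host_uid, host_gid = os.getuid(), os.getgid()

if libc.unshare(CLONE_NEWUSER) != 0:
    raise OSError(ctypes.get_errno(), "unshare(CLONE_NEWUSER) failed")

# An unprivileged process must disable setgroups before it may write gid_map.
with open("/proc/self/setgroups", "w") as f:
    f.write("deny")
with open("/proc/self/gid_map", "w") as f:
    f.write(f"0 {host_gid} 1")
with open("/proc/self/uid_map", "w") as f:
    f.write(f"0 {host_uid} 1")

print("uid inside the namespace:", os.getuid())  # 0 ("root"), but only in here
```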

Key questions to ask when setting SLOs

Many organizations rely on service level objectives (SLOs) to help them gauge the reliability of their products. By setting SLOs that define clear and measurable reliability targets, businesses can ensure they are delivering positive end-user experiences to their customers. Clearly defined SLOs also make it much easier for businesses to understand what tradeoffs they may have to make in order to deliver those specific experiences.
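One of the most useful questions to ask is how much unreliability a target actually leaves you, that is, your error budget. A minimal back-of-the-envelope sketch follows; the 99.9% target and 30-day window are example numbers only.

```python
# Example numbers only: a 99.9% success-rate SLO over a 30-day window.
slo_target = 0.999
window_days = 30

error_budget = 1 - slo_target                        # fraction of requests allowed to fail
budget_minutes = error_budget * window_days * 24 * 60

print(f"Error budget: {error_budget:.2%} of requests")
print(f"~{budget_minutes:.0f} minutes of full downtime tolerated per {window_days} days")
```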

How to monitor CoreDNS with Datadog

In Part 1 of this series, we introduced you to the key metrics you should be monitoring to ensure that you get optimal performance from CoreDNS running in your Kubernetes clusters. In Part 2, we showed you some tools you can use to monitor CoreDNS. In this post, we’ll show you how you can use Datadog to monitor metrics, logs, and traces from CoreDNS alongside telemetry from the rest of your cluster, including the infrastructure it runs on.

Tools for collecting metrics and logs from CoreDNS

In Part 1 of this series, we looked at key metrics you should monitor to understand the performance of your CoreDNS servers. In this post, we’ll show you how to collect and visualize these metrics. We’ll also explore how CoreDNS logging works and show you how to collect CoreDNS logs to get even deeper visibility into your Deployment.
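CoreDNS publishes these metrics in Prometheus format through its prometheus plugin (enabled in the default Kubernetes Corefile), typically on port 9153. A quick way to sanity-check that the endpoint is reachable might look like the sketch below; the pod address is a placeholder, and the counter names shown match recent CoreDNS versions.

```python
# Quick sanity check of the CoreDNS metrics endpoint. The pod IP below is a
# placeholder; port 9153 is the default for CoreDNS's prometheus plugin.
import urllib.request

METRICS_URL = "http://10.0.0.10:9153/metrics"

with urllib.request.urlopen(METRICS_URL, timeout=5) as resp:
    for line in resp.read().decode().splitlines():
        # Print a couple of the request/response counters discussed in Part 1.
        if line.startswith(("coredns_dns_requests_total", "coredns_dns_responses_total")):
            print(line)
```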

Key metrics for CoreDNS monitoring

CoreDNS is an open source DNS server that can resolve requests for internet domain names and provide service discovery within a Kubernetes cluster. CoreDNS is the default DNS provider in Kubernetes as of v1.13. Though it can be used independently of Kubernetes, this series will focus on its role in providing Kubernetes service discovery, which simplifies cluster networking by enabling clients to access services using DNS names rather than IP addresses.
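For example, a client pod never needs to know a Service's ClusterIP; it can resolve a stable DNS name that CoreDNS answers for. A tiny illustration, assuming the default cluster.local domain and placeholder Service and namespace names:

```python
# Run inside a cluster pod: resolve a Service by its DNS name instead of its IP.
# "web" and "prod" are placeholder Service and namespace names.
import socket

service_ip = socket.gethostbyname("web.prod.svc.cluster.local")
print("Service resolves to:", service_ip)
```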

SRE in Transition: From Startup to Enterprise

"Startups are defined by “ship or die”. As a result, SRE teams at a startup should be focused on enabling product engineers to ship features as quickly as possible. As your startup transitions from “we’ll run out of money in the next 18 months” to “we have more than 1000 engineers”, how should the SRE organization evolve and provide the best value through that transition (including booting one up if you don’t have one)? I will discuss specific ways the organization needs to evolve to meet this challenge, how the SRE org can advocate for and support this change (both in direct actions and in “influence”), and how the overhang of startup technical and cultural debt can make this shift more challenging (but also more necessary).

From On-call to Non-call: Resolving Incidents Before They Even Happen

Artificial intelligence has captured the attention of the world, with tools like ChatGPT and large language models (LLMs) driving the conversation. But you don’t need to wait for the future or new features powered by LLMs to start working smarter—the tech industry has been investing in intelligent, automated tools for years and they’re ready for production now. In this talk, you’ll learn how the engineering teams at Toyota Connected use tools like Datadog Watchdog, Anomaly Detection, and Workflows to make our lives easier and keep our platform stable.

From Solution to Startup

Before Datadog was a widely adopted SaaS platform, it was a tool developed to solve our founders’ own monitoring needs. As technology-oriented people, we often build solutions for our own problems, then discover those problems are widespread. But how do you know when your solution should be something more? In this panel session, we’ll talk with tech startup founders to hear their stories and advice for turning tools into businesses.

Easily test and monitor your mobile applications with Datadog Mobile Application Testing

Effective mobile application testing that meets all the requirements of modern quality assurance can be challenging. Not only do teams need to create tests that cover a range of different device types, operating system versions, and user interactions—including swipes, gestures, touches, and more—they also have to maintain the infrastructure and device fleets necessary to run these tests.

Store and analyze high-volume logs efficiently with Flex Logs

The volume of logs that organizations collect from all over their systems is growing exponentially. Sources range from distributed infrastructure to data pipelines and APIs, and different types of logs demand different treatment. As a result, logs have become increasingly difficult to manage. Organizations must reconcile conflicting needs for long-term retention, rapid access, and cost-effective storage.

DASH 2023: Guide to Datadog's newest announcements

This year at DASH, we announced new products and features that enable your teams to get complete visibility into their AI ecosystem, utilize LLMs for efficient troubleshooting, take full control of petabytes of observability data, optimize cloud costs, and more. With Datadog’s new AI integrations, you can easily monitor every layer of your AI stack. And Bits AI, our new DevOps copilot, helps speed up the detection and resolution of issues across your environment.

Send your logs to multiple destinations with Datadog's managed Log Pipelines and Observability Pipelines

As your infrastructure and applications scale, so does the volume of your observability data. Managing a growing suite of tooling while balancing the need to mitigate costs, avoid vendor lock-in, and maintain data quality across an organization is becoming increasingly complex. With a variety of installed agents, log forwarders, and storage tools, the mechanisms you use to collect, transform, and route data should be able to evolve and adjust to your growth and meet the unique needs of your team.

Integration roundup: Monitoring your AI stack

Integrating AI, including large language models (LLMs), into your applications enables you to build powerful tools for data analysis, intelligent search, and text and image generation. There are a number of tools you can use to leverage AI and scale it according to your business needs, with specialized technologies such as vector databases, development platforms, and discrete GPUs being necessary to run many models. As a result, optimizing your system for AI often leads to upgrading your entire stack.

Enhance code reliability with Datadog Quality Gates

Maintaining the quality of your code becomes increasingly difficult as your organization grows. Engineering teams need to release code quickly while still finding a way to enforce best practices, catch security vulnerabilities, and prevent flaky tests. To address this challenge, Datadog is pleased to introduce Quality Gates, a feature that automatically halts code merges when they fail to satisfy your configured quality checks.

Quickstart network investigations with NPM's story-centric UX

Datadog Network Performance Monitoring (NPM) gives you visibility into all the communication that takes place between the network components in your environment, including hosts, processes, containers, clusters, zones, regions, and VPCs. As organizations scale, and as their networks grow in complexity, the massive volume of network data to be monitored can become overwhelming. Knowing precisely what network data to surface to resolve issues within these larger environments can be a challenge.