Datadog on gRPC

Datadog

Oct 6, 2022

Datadog, the observability platform used by thousands of companies, is made up of hundreds of services that communicate over the network using gRPC, an RPC framework. That makes gRPC a critical component of Datadog’s reliability.

As teams investigated incidents in their services, they discovered that some of them were gRPC-related. But were there common patterns in those incidents? Could we use them to learn more about gRPC and how to use it better?

Over the past year, an engineering squad with members from different teams was formed to study gRPC-related incidents and share the lessons learned. They wrote a set of best practices for all engineering teams to follow, along with common libraries that implement them.

In this session, Ara Pulido, Staff Developer Advocate, will chat with Anthonin Bonnefoy, Senior Software Engineer in the Core Resilience team, and Antoine Tollenaere, Team Lead in the Networking team. Both were part of this squad, and they will share their investigation of the incidents and the gRPC best practices they came up with to avoid similar incidents in the future.

By the end of the session, you will have a better understanding of the internals of gRPC and of how to use it more effectively at your organization.

00:00 - Introduction

03:58 - Introduction to gRPC

10:00 - gRPC Observability

11:56 - gRPC Stack

18:16 - Load Imbalance

24:26 - IP Recycling Problem

30:06 - Proper Scale-Outs Detection

33:36 - Silent Connection Drop

44:46 - Takeaways

46:56 - Q&A