Datadog on gRPC
Datadog, the observability platform used by thousands of companies, is made up of hundreds of services that communicate over the network using gRPC, an RPC framework. That makes gRPC a critical component of Datadog’s reliability.
As teams investigated incidents involving their services, they discovered that some of them were gRPC-related. But were there common patterns to those incidents? Could we use them to learn more about gRPC and how to use it better?
During the past year, an engineering squad with members from different teams was formed to study gRPC-related incidents and share lessons learned. They wrote a set of best practices for all engineering teams to follow, along with common libraries that implement them.
In this session, Ara Pulido, Staff Developer Advocate, will chat with Anthonin Bonnefoy, Senior Software Engineer on the Core Resilience team, and Antoine Tollenaere, Team Lead on the Networking team, both of whom were part of this squad. They will share their investigation of the incidents and the gRPC best practices they came up with to avoid similar incidents in the future.
By the end of the session, you will have a better understanding of the internals of gRPC and how to implement it more effectively at your organization.
00:00 - Introduction
03:58 - Introduction to gRPC
10:00 - gRPC Observability
11:56 - gRPC Stack
18:16 - Load Imbalance
24:26 - IP Recycling Problem
30:06 - Proper Scale-Outs Detection
33:36 - Silent Connection Drop
44:46 - Takeaways
46:56 - Q&A