Datadog on Chaos Engineering
As you scale your applications, staying resilient to underlying network failures, resource constraints introduced by other applications, and spikes in traffic becomes much harder, even with thorough testing and processes. Chaos engineering is a discipline that encourages experimenting in production by injecting controlled failures into a system, both to understand how it reacts under those conditions and to improve its reliability.
In this session, Ara Pulido, Technical Evangelist, talks with Tay Nishimura and Joris Bonnefoy, both site reliability engineers on the Chaos Engineering team at Datadog, about how chaos engineering is done at Datadog and the in-house tooling the team has built over the past few years to enable more robust testing of Datadog's rapidly growing enterprise systems.
By the end of the session, you will have a better understanding of what chaos engineering is, how it can help your organization, and what you need to get started.
00:00 - Introduction to the episode
04:10 - Introduction to Chaos Engineering
07:33 - Chaos Engineering at Datadog
12:19 - The chaos-controller project
16:02 - chaos-controller usage examples
20:02 - Introduction to chaos-controller internals
21:34 - chaos-controller internals: CPU disruptions
25:38 - chaos-controller internals: Network disruptions
35:08 - Large-Scale Gamedays
43:25 - Q&A