Improving a Distributed System Post-Incident Julius Zerwick Failover Conf 2020

Improving a Distributed System Post-Incident Julius Zerwick Failover Conf 2020

May 5, 2020

This talk was presented at Failover Conf on April 21, 2020.

In this session, we will dive into a case study of how a team can recover & improve a distributed system after a major incident. Distributed systems are more prone to failure than other systems due to their incredible complexity and scale, and incidents are a fact of life with these systems.

This year, my team faced a week long incident for our IP address management system which impacted out customers. From this incident, we had had to reevaluate our system's performance & overhaul several keys areas of our codebase, as well as improve our monitoring, testing processes, database interactions, and reliability. Viewers will learn about these improvements and how they can apply them to their own systems to achieve greater reliability and performance.

Additionally, viewers will learn how to effectively leverage monitoring practices to uncover inefficiencies in their system, tips for creating a testing process to properly stress your system before deploying to production, and how to rally a team together during a high-pressure incident.