Iterating on Observability: Reproducing Production Errors in Isolation
Jeremy Brett, Charter Communications
Robert Gladmon, Charter Communications
The orchestration required to support Charter’s 32 million customers is expansive. Many teams operate hundreds of microservices that support dozens of customer-facing applications, drawing on data from nearly 100 back-office systems. All of these systems are in constant flux as teams fix bugs, add features, and deploy changes. Although APM already provides valuable insight for resolving issues, the team at Charter needed to go deeper to keep moving toward its target of zero production incidents. What if developers could attach a debugger and step through all of these systems to see what really happened?
In this session, Robert Gladmon and Jeremy Brett will share how Charter built Traffic Catch and Release, a tool that allows developers to walk through production transactions and reproduce system behavior in isolation. You will hear how they are able to step through their complex systems to fix issues in minutes instead of days.
They will also show how Catch and Release goes beyond bug fixing. By replaying thousands of transactions against a service, developers can spot problems before changes reach production, giving them more time to innovate and master their craft.