Have you ever been paged for a critical issue and started troubleshooting only to find an obvious drop in requests that weren’t caught by a static threshold? Or a significant increase in a metric that didn’t cross a static threshold? Or even, evidence of warning alerts triggered long ago that should have enabled someone to resolve the issue and prevent it from causing business impact, but instead was ignored in the massive alert volume received by the team?
As systems become increasingly complex, we’ve seen the growth of engineering tools to abstract away and manage the complexity. But often our tools are “opinionated” and the default actions or settings may not align with how our systems are intended to work or how we think they work. Chaos Engineering is a good way to not only test your applications, but also the tools you use to build them.