Here’s a common situation that plagues many development teams. You run an application through your CI/CD pipeline and all of the tests pass, which is great. But when you deploy it to a live target environment, the application does not function as expected. You can’t always predict what will happen when your application goes live. The solution?
It’s 2 AM and you’re paged while you’re still awake. How quickly can you find what you need to fix the latest mistake? When the incident begins, it might be affecting only a single service. But while your brain boots, the coffee is poured, and the docs are read, the incident is escalating to other services and teams, and you might not see their alerts if they fall outside your scope of ownership.
AWS offers over 200 products. When you design a solution around a number of them, such as a microservices-based system, monitoring those services is challenging enough. Troubleshooting and resolving a problem in the least amount of time once it appears is a daunting task. Building a monitorable system requires a deep understanding of the failure domains of its critical components, which is a tall order for a fairly complex system.