How to use Prometheus to efficiently detect anomalies at scale
When you investigate an incident, context is everything. Let’s say you’re working on-call and get pinged in the middle of the night. You open the alert and it sends you to a dashboard where you recognize a latency pattern. But is the spike normal for that time of day? Is it even relevant? Next thing you know, you’re expanding the time window and checking other related metrics as you try to figure out what’s going on. That’s not to say you won’t find the answers.