How a Production Outage Was Caused Using Kubernetes Pod Priorities
On Friday, July 19, Grafana Cloud experienced a ~30min outage in our Hosted Prometheus service. To our customers who were affected by the incident, I apologize. It’s our job to provide you with the monitoring tools you need, and when they are not available we make your life harder. We take this outage very seriously. This blog post explains what happened, how we responded to it, and what we’re doing to ensure it doesn’t happen again.