The latest news and information on Service Reliability Engineering and related technologies.
This blog post is adapted from my talk at SREcon EMEA 2023; the original slides are available here. Status pages are a simple yet underutilized element of incident communication. Done well, they’re a low-lift way to keep your customers and stakeholders informed when incidents impact them. But without a solid approach, updating a status page can easily become a tedious, often neglected task during incidents. In this post, we’ll cover some tips for getting your status page right.
How to monitor serverless async jobs from Google Cloud Functions with Prometheus Pushgateway and Levitate using the push model.
Understanding the limitations and challenges of scaling Prometheus in modern cloud-native environments. Here we delve into long-term retention, downsampling, high availability, and other challenges.