Taking down (and restoring) the Raygun ingestion API
In a world where Software as a Service (SaaS) products are integral to daily life, maintaining uninterrupted service for end-users is paramount. However, stuff happens. When it does, our most valuable response (other than restoring service ASAP) is to review the series of events that led up to the incident and learn from them. On August 25th, 2023, at 7:02 AM NZT, Raygun experienced a significant incident that impacted our API ingestion cluster, leading to an outage lasting approximately 1 hour and 15 minutes. While this wasn't fun for anyone involved, this incident did prove to be a valuable learning experience, shedding light on the importance of infrastructure management and resilience.