The latest News and Information on Service Reliability Engineering and related technologies.
The Suez Canal has been big news over the last couple of weeks. We wondered how a Site Reliability Engineer (SRE) might conduct a postmortem on what happened with the Ever Given, and what that might mean if a comparable incident occurred at a modern tech company.
Distributed systems such as microservices have defined software engineering over the last decade. The majority of advancements have been in increasing resilience, flexibility, and rapidity of deployment at increasingly larger scales. For streaming giant Netflix, the migration to a complex cloud based microservices architecture would not have been possible without a revolutionary testing method known as fault injection. With tools like chaos monkey, Netflix employs a cutting edge testing toolkit.
By adding new complexity to reliability engineering and making physical war rooms a thing of the past, COVID-19 has imposed permanent changes on incident management. Here’s how SREs can respond.
A selection of live questions and answers from the audience of our recent webinar on how site reliability engineers can best leverage intelligent observability to monitor SLIs and SLOs, prioritize reliability over functionality, and more.