We explain how a DevOps team is structured, the roles and responsibilities within the team, and the balance between the needs of individual contributors and the needs of the team.
Looking into DevOps automation? We explain how automation can improve your process, how to prioritize which tasks to automate, best practices, and how to avoid common mistakes.
A few weeks ago we released episode two of our ongoing webinar series, SRE: From Theory to Practice. In this series, we break down a challenge facing SREs through an open and honest discussion. Our topic this episode was “What’s difficult about incident command?” When things go wrong, who is in charge? And what does it feel like to fill that role?
Wondering about severity vs. priority? We explain severity and priority and discuss their differences and their impact on the incident management process.
At first glance, people tend to think that incidents are cut-and-dried, relatively objective occurrences. But if you look closely, incidents are highly varied, often require unique handling, and often defy clear answers to something as seemingly simple as knowing when they even start.
Since 2015, Lex Neva has been publishing SRE Weekly. If you’re interested enough in reading about SRE to have found this post, you’re probably familiar with it. If not, there are a lot of great articles to catch up on! Lex selects around 10 entries from across the internet for each issue, focusing on everything from SRE best practices to the socio- side of systems to major outages in the news. I had always figured Lex must be among the most well-read people in SRE, and likely #1.
We live in the cloud era, where your services don’t live on machines in your garage, but are spread across huge data centers around the world. Cloud providers can help meet increasing demands for reliability; for example, they offer dynamic resource allocation that can handle usage spikes. At the same time, going cloud native means not having a physical server onsite that you can fiddle with, which introduces its own challenges.
In this video, our Solutions Engineer walks you through the Reliability Insights view in Blameless. Discover how to create custom data dashboards. You might start with MTTX metrics, but what other metrics are reliability teams following closely? We'll show you how to get those set up in Blameless.
Reliability is more important than ever. As users depend on services more and more, and competition in every sector grows, a great digital experience becomes the baseline for expectations, not the ceiling. It’s crucial to invest in making your software reliable enough to keep customers happy. But what does investing in reliability look like?
During an incident, time is fungible. At some points it seems to go way too fast, and at others it seems to take an eternity for a command to complete. More important, however, is how it feels to be in an incident. It’s a heightened state of being, where any and every piece of information could be “the one” that helps crack open what is really going on. Likewise, there is an inherent distrust of incoming information.