SRE

The latest news and information on Site Reliability Engineering and related technologies.

How to Analyze Contributing Factors Blamelessly

Mar 16, 2021 By Emily Arnott In Blameless

SRE advocates addressing problems blamelessly. When something goes wrong, don’t try to determine who is at fault. Instead, look for systemic causes. Adopting this approach has many benefits, from the practical to the cultural. Your system will become more resilient as you learn from each failure. Your team will also feel safer when they don’t fear blame, leading to more initiative and innovation. Learning everything you can from incidents is a challenge.

It's all Chaos! And it Makes for Resilience at Scale

Mar 15, 2021 By Emily Arnott In Blameless

Chaos engineering is a practice where engineers simulate failure to see how systems respond. This helps teams proactively identify and fix preventable issues. It also helps teams prepare responses to the types of issues they cannot prevent, such as sudden hardware failure. The goal of chaos engineering is to improve the reliability and resilience of a system. As such, it is an essential part of a mature SRE solution.

How to Build an SRE Team with a Growth Mindset

Mar 9, 2021 By Emily Arnott In Blameless

The biggest benefit of SRE isn’t always the processes or tools, but the cultural shift. Building a blameless culture can profoundly change how your organization functions. Your SRE team should be your champions for cultural development. To drive change, SREs need to embody a growth mindset. They need to believe that their own abilities and perspectives can always grow, and encourage this mindset across the organization.

Observability vs. Monitoring for DevOps Professionals

Mar 9, 2021 By Will Cappelli In Moogsoft

What precisely are the requirements of a DevOps practitioner, as opposed to an SRE, legacy developer, or operations manager? And do those specific requirements require a different approach to monitoring?

How We Built and Use Runbook Documentation at Blameless

Mar 8, 2021 By Alicia Li and Lucas Bartroli In Blameless

Even if you don’t notice, you are executing runbooks every day, all the time. When you have an incident in your day-to-day operations, you follow a series of ordered and connected steps to solve it. For instance, if you lose your internet connection, you will follow a series of steps to resolve that issue: This could be different depending on your method, but you get the idea.

SRE as Organizational Transformation: Lessons from Activist Organizers

Mar 3, 2021 By Chris Hendrix In Blameless

In the software industry’s recent past, the biggest disruptive wave was Agile methodologies. While Site Reliability Engineering is still early in its adoption, those of us who experienced the disruptive transformation of Agile see the writing on the wall: SRE will impact everyone. Any kind of major transformation like this requires a change in culture, which is a catch-all term for changing people’s principles and behaviors.

SRE Survey 2021: Where do we go from here

Mar 2, 2021 By JP Blaho In Catchpoint

What a difference a year makes. In a matter of 365 days, the entire planet stared down at uncertainty, and while most of the world is far from recovered, we are starting to see a time where some level of normalcy will return. But what will this look like? How will the past year transform our social interactions, our time out of the house, and how we conduct business?

SRE2AUX: How Flight Controllers were the first SREs

Mar 2, 2021 By Geoff White In Blameless

In the beginning, there were flight controllers. These were a strange breed. In the early days of the US Manned Space Program, most American households, regardless of class or race, knew the names of the astronauts: John Glenn, Alan Shepard, Neil Armstrong. The manned space program was a unifying force of national pride. But no one knew the names of the anonymous men and, later, women who got the astronauts to orbit, to the moon, and, most importantly, back to Earth.

With SRE, failing to plan is planning to fail

Feb 26, 2021 By Ayelet Sachto In Google Operations

People sometimes think that implementing Site Reliability Engineering (or DevOps for that matter) will magically make everything better. Just sprinkle a little bit of SRE fairy dust on your organization and your services will be more reliable, more profitable, and your IT, product and engineering teams will be happy. It’s easy to see why people think this way. Some of the world’s most reliable and scalable services run with the help of an SRE team, Google being the prime example.

SREview Issue #10 February 2021

Feb 23, 2021 By Blameless Community In Blameless

Is love in the air? We think so. While we don’t have chocolate or flowers for you, we have something just as sweet. Here are some of the most exciting Tweets, content, and events happening in the SRE and resilience engineering community this February.
