Blameless

Structuring Your Teams for Software Reliability

Feb 12, 2020 By Hannah Culver In Blameless

How well positioned is your team to ship reliable software? What are the different roles in engineering that impact reliability, and how do you optimize the ratio of software engineers to SREs to DevOps engineers within teams? These questions can be hard to answer in a quantifiable way, but projecting different scenarios using systems thinking can help. Will Larson’s blog post Modeling Reliability does just that, and serves as inspiration for this article.

How to Network Effectively as an SRE

Feb 4, 2020 By Hannah Culver In Blameless

For many SREs, networking prompts a response similar to going to the dentist. You know you should do it, but you don’t really want to. But networking is much less like a root canal and more like a regular teeth cleaning; you may not want to go, but once you’re there, it’s not so bad. In fact, you may walk away feeling good knowing that you’ve done something that helps future you.

New Postmortems Design and Commenting Functionality

Jan 29, 2020 By Blameless In Blameless

One of the most important steps in an incident’s lifecycle is the postmortem. It provides an essential time to reflect on what happened, what could have been done better, and how to build more resilience into a system. But we consistently hear from engineers that incredible toil is typically involved in coordinating stakeholders to write good postmortems.

2020 SRE Predictions

Jan 28, 2020 By Hannah Culver In Blameless

It’s a new year, so what will 2020 have in store for SRE? Here’s our two cents: SRE adoption will only continue to grow. However, the practice and culture shift, rather than the role, will take priority in 2020. More people (not just SREs) will have a reliability mindset, shifting reliability left through the software lifecycle. SLIs, SLOs, and error budget policies will become common practice to make this shift actionable.

What Are Service-Level Objectives? Lessons Learned

Jan 21, 2020 By Emily Arnott In Blameless

Service-Level Objectives, or SLOs, are internal goals for the essential metrics of a service, such as uptime or response speed. We’re probably familiar with this definition, but what is the value of setting these goals? We’ll take a look at SLOs as both a powerful safety net and a tool to inform the allocation of engineering resources, while also considering the cultural learnings of SLO adoption.

5 Best Practices on Nailing Postmortems

Dec 26, 2019 By Hannah Culver In Blameless

Reading about postmortem best practices can sometimes be quite different from seeing them in action. Postmortems are like snowflakes; no two will ever look the same. There isn’t a definitive template for success that will work in every situation, but there are some practices and procedures when writing postmortems that can help. Here are five practices that can boost the effectiveness of your postmortems, with examples of postmortems or procedures that demonstrate these methods.

An SRE Carol

Dec 18, 2019 By Emily Arnott In Blameless

We’re probably all familiar with Dickens’ story of Scrooge and the Three Ghosts of Christmas, written all the way back in 1843. What we may not know is that ghosts providing visions and teaching lessons is still common practice today! Let’s look into the carol of an ambitious, but unreliable, tech CEO.

Why I Joined Blameless - Simone Salman

Dec 11, 2019 By Emily Arnott In Blameless

My name is Simone Salman, and I’ve been working as a software engineer at Blameless since May 2019. In the spirit of thanks as we’re approaching the holidays, I wanted to reflect on my time at Blameless thus far, and share a few things about the culture that I’m especially grateful for.

Building Reliability Through Culture with Veteran Google SRE, Steve McGhee

Dec 10, 2019 By Steve McGhee In Blameless

Which of the following three scenarios do you experience the most when a new incident occurs? For many teams, incidents unfortunately fall into scenario 1, with some classes of incidents catching them by surprise. It’s astonishing that despite the vast amount of time we spend working on and thinking about our systems, we seem to have very little control over them. If we can’t predict where the next incidents will come from, then we will be forever stuck in a reactive cycle of repair.

Improving Postmortem Practices with Veteran Google SRE, Steve McGhee

Nov 26, 2019 By Blameless In Blameless

For many SREs, Google’s 99.999% availability seems like an untouchable dream. If anything, getting out of pager hell is already worth celebrating with all your coworkers, friends, and family on the moon. How can teams climb out of it? How can you get to a stage where you have time to proactively prevent incidents, and enter a mental state of calm and control? The rope out of pager hell is woven from a thorough and rigorous postmortem process.
