SRE

The latest news and information on Site Reliability Engineering and related technologies.

Overview of Incident Lifecycle in SRE

Feb 23, 2021 By Biju Chacko In Squadcast

Incidents that disrupt services are unavoidable, but every breakdown is an opportunity to learn and improve. Our latest blog is a deep dive into best practices to follow across the lifecycle of an incident, helping teams build a sustainable and reliable product, the SRE way. As the saying goes, “Every problem we face is a blessing in disguise”.

QA Engineers, This is How SRE will Transform your Role

Feb 22, 2021 By Emily Arnott In Blameless

When implementing SRE, almost every role within your IT organization will change. One of the biggest transformations will be in your Quality Assurance teams. A common misconception is that SRE “replaces” QA. People believe SLOs and other SRE best practices render the traditional role of QA engineering obsolete, as testing and quality shift left in the SDLC. This leads to QA teams resisting SRE adoption.

Getting Started as an SRE? Here are 3 Things You Need to Know.

Feb 17, 2021 By Emily Arnott In Blameless

We live in the era of reliability. The most important feature of a service is how dependable it is in the eyes of a user. Companies are hiring with this in mind. In a 2019 LinkedIn article, site reliability engineers were listed as the second most promising career in the United States. But how do you get started as an SRE? This blog post looks at three things you need to know. SRE is a multifaceted role. You will contribute to an organization's code base, policy, culture, and more.

4 Things you Need to Know about Writing Better Production Readiness Checklists

Feb 16, 2021 By Emily Arnott In Blameless

When we think of reliability tools, we may overlook the humble checklist. While tools like SLOs represent the cutting edge of SRE, checklists have been recommended in many industries such as surgery and aviation for almost a century. Checklists owe this long and widespread adoption to their usefulness, and they can also help limit errors when deploying code to production. This blog post covers four things to know about writing them. Production checklists should be holistic.

4 Tips on Preparing for a [Great] Failure

Feb 9, 2021 By Emily Arnott In Blameless

The most essential lesson of SRE is that failure is inevitable. This shouldn’t be a cause for despair. SRE shows how embracing failure is empowering. By celebrating failure, you can accelerate development and foster a culture of learning. Rather than hoping to prevent failure, SRE prepares you to respond well to it. It can be difficult, if not impossible, to anticipate where failure will occur in complex systems given unknown unknowns.

On Not Being a Cog in the Machine

Feb 9, 2021 By Fred Hebert In Honeycomb

This is my first week here as the first dedicated SRE for Honeycomb, and in a welcoming gesture, I was asked if I wanted to write a blog post about my first impressions and what made me decide to join the team. I’ve got a ton of personal reasons for joining Honeycomb that may not be worth being all public about, but after thinking for a while, I realized that many of the things I personally found interesting could point towards attitudes that result in better software elsewhere.

Communication Tool Down? Here are 3 Ways to Handle it

Feb 8, 2021 By Emily Arnott In Blameless

On January 4th, 2021, the communication service Slack suffered a major outage. Teams working remotely found their primary communication method unavailable. The incident lasted over 4 hours, during which some customers had intermittent or delayed service, and others had no service at all. It was a reminder that even the most established tools are susceptible to downtime. This is a core lesson of SRE: that failure is inevitable.

Beginners Guide to Incident Postmortems

Feb 7, 2021 By Camille Hodoul In Rootly

Successful and blameless postmortems can turn incidents into a gift of learning and prevent repeat mistakes.

"I'm Just Doing my Job," An SRE Myth

Feb 2, 2021 By Darrell Pappa In Blameless

"Sorry, but I'm just doing my job." I heard this recently from a customer service representative. What they were saying made sense (after all, we don’t have total control over our work environments), but it felt wrong. As a customer, I was left dissatisfied with our interaction. However, the representative assured me that they were simply following protocol. This got me thinking: can established practices and protocols sometimes get in the way of excellent customer experience?

Who Else Wants to Increase Development Velocity?

Jan 26, 2021 By Emily Arnott In Blameless

Implementing SRE is fundamentally about shifting culture, but it often means adding new tooling and processes to your team's workflows to support that cultural change. Teams add new steps and checks to incident response procedures. Incident responders write retrospectives and create new meetings to review them. Engineers consult new tools like monitoring dashboards and SLOs. In other words, SRE creates another layer of consideration in development and operations.
