Latest Posts

4 Things you Need to Know about Writing Better Production Readiness Checklists

Feb 16, 2021 By Emily Arnott In Blameless

When we think of reliability tools, we may overlook the humble checklist. While tools like SLOs represent the cutting edge of SRE, checklists have been recommended in many industries such as surgery and aviation for almost a century. Checklists owe this long and widespread adoption to their usefulness, and they can also help limit errors when deploying code to production. In this blog post, we’ll cover four points, starting with the first: production checklists should be holistic.

4 Tips on Preparing for a [Great] Failure

Feb 9, 2021 By Emily Arnott In Blameless

The most essential lesson of SRE is that failure is inevitable. This shouldn’t be a cause for despair. SRE shows how embracing failure is empowering. By celebrating failure, you can accelerate development and foster a culture of learning. Rather than hoping to prevent failure, SRE prepares you to respond well to it. It can be difficult, if not impossible, to anticipate where failure will occur in complex systems given unknown unknowns.

Communication Tool Down? Here are 3 Ways to Handle it

Feb 8, 2021 By Emily Arnott In Blameless

On January 4th, 2021, the communication service Slack suffered a major outage. Teams working remotely found their primary communication method unavailable. The incident lasted over 4 hours, during which some customers had intermittent or delayed service, and others had no service at all. It was a reminder that even the most established tools are susceptible to downtime. This is a core lesson of SRE: failure is inevitable.

"I'm Just Doing my Job," An SRE Myth

Feb 2, 2021 By Darrell Pappa In Blameless

"Sorry, but I'm just doing my job." I heard this recently from a customer service representative. What they were saying made sense (after all, we don’t have total control over our work environments), but it felt wrong. As a customer, I was left dissatisfied with our interaction. However, the representative assured me that they were simply following protocol. This got me thinking: can established practices and protocols sometimes get in the way of excellent customer experience?

Who Else Wants to Increase Development Velocity?

Jan 26, 2021 By Emily Arnott In Blameless

Implementing SRE is fundamentally about shifting culture, but it often means adding new tooling and processes to your team's workflows to support that cultural change. Teams add new steps and checks to incident response procedures. Incident responders write retrospectives and create new meetings to review them. Engineers consult new tools like monitoring dashboards and SLOs. In other words, SRE creates another layer of consideration in development and operations.

Have a Cloud Transition you can be Proud Of

Jan 25, 2021 By Emily Arnott In Blameless

In the reliability era, many services are migrating from in-house servers to the cloud. The cloud model allows your service to capitalize on the benefits of large hosting providers such as AWS, Microsoft Azure, or Google Cloud. These servers can be more reliable than in-house servers for several reasons. However, cloud providers present their own risks and challenges as well. Teams will want to take advantage of the benefits while accounting for these limitations.

The Secret of Communicating Incident Retrospectives

Jan 19, 2021 By Emily Arnott In Blameless

In the world of SRE, incidents are unplanned investments in reliability. Why? Because they are valuable opportunities to learn and grow. This perspective can be difficult to communicate to other stakeholders. Some may be upset about the cost incurred or the affected customers. Others might not understand why incidents happen in the first place. It is important to show how the lessons of an incident are relevant to each stakeholder role.

SREview Issue #9 January 2021

Jan 19, 2021 By Blameless Community In Blameless

New year, new SRE! We’ve said goodbye to 2020 and hello to 2021. Here are some of the most exciting tweets, content, and events in the SRE and resilience engineering community so far this year.

Top Reliability and Scaling Practices from Experts at Citrix, Greenlight Financial, and Incognia

Jan 18, 2021 By Blameless Community In Blameless

Downtime costs more than dollars. It also costs customer happiness and trust. So how do teams maximize reliability while scaling? Tooling, communication, observability, and more all play into a complete reliability strategy. In a recent industry leaders’ roundtable hosted by Blameless, top experts discussed best practices for responding to incidents, scaling for reliability, and engineering with the customer in mind.

This Is the Most Underappreciated Skill for SREs

Jan 12, 2021 By Emily Arnott In Blameless

Delivering great software and sustainable systems is a team sport. Without the support of all stakeholders, adoption initiatives often fail. In successful initiatives, SREs are responsible for bringing together all resources and team members to help resolve reliability-related issues. But bringing these resources together takes much more effort than people think. SREs engage in lots of glue work to ensure these collaborative efforts happen.
