Latest Posts

4 Things you Need to Know about Writing Better Production Readiness Checklists

When we think of reliability tools, we may overlook the humble checklist. While tools like SLOs represent the cutting edge of SRE, checklists have been recommended in many industries such as surgery and aviation for almost a century. Checklists owe this long and widespread adoption to their usefulness. They can also help limit errors when deploying code to production. In this blog post, we’ll cover: Production checklists should be holistic.

4 Tips on Preparing for a [Great] Failure

The most essential lesson of SRE is that failure is inevitable. This shouldn’t be a cause for despair. SRE shows how embracing failure is empowering. By celebrating failure, you can accelerate development and foster a culture of learning. Rather than hoping to prevent failure, SRE prepares you to respond well to it. It can be difficult, if not impossible, to anticipate where failure will occur in complex systems given unknown unknowns.

Communication Tool Down? Here are 3 Ways to Handle it

On January 4th, 2021, the communication service Slack suffered a major outage. Teams working remotely found their primary communication method unavailable. The incident lasted over 4 hours, during which some customers had intermittent or delayed service, and others had no service at all. It was a reminder that even the most established tools are susceptible to downtime. This is a core lesson of SRE: failure is inevitable.

"I'm Just Doing my Job," An SRE Myth

"Sorry, but I'm just doing my job." I heard this recently from a customer service representative. What they were saying made sense (after all, we don’t have total control over our work environments), but it felt wrong. As a customer, I was left dissatisfied with our interaction. However, the representative assured me that they were simply following protocol. This got me thinking: can established practices and protocols sometimes get in the way of excellent customer experience?

Who Else Wants to Increase Development Velocity?

Implementing SRE is fundamentally about shifting culture, but it often means adding new tooling and processes to your team's workflows to support that cultural change. Teams add new steps and checks to incident response procedures. Incident responders write retrospectives and create new meetings to review them. Engineers consult new tools like monitoring dashboards and SLOs. In other words, SRE creates another layer of consideration in development and operations.

Have a Cloud Transition you can be Proud Of

In the reliability era, many services are migrating from in-house servers to the cloud. The cloud model allows your service to capitalize on the benefits of large hosting providers such as AWS, Microsoft Azure, or Google Cloud. These servers can be more reliable than in-house servers for several reasons. However, as with all things, cloud providers present their own risks and challenges as well. Teams will want to take advantage of the benefits while accounting for these limitations.

The Secret of Communicating Incident Retrospectives

In the world of SRE, incidents are unplanned investments in reliability. Why? Because they are valuable opportunities to learn and grow. This perspective can be difficult to communicate to other stakeholders. Some may be upset about the cost incurred or the affected customers. Others might not understand why incidents happen in the first place. It is important to show how the lessons of an incident are relevant to each stakeholder role.

Top Reliability and Scaling Practices from Experts at Citrix, Greenlight Financial, and Incognia

Downtime costs more than dollars. It also costs customer happiness and trust. So how do teams maximize for reliability while scaling? Tooling, communication, observability, and more all play into a complete reliability strategy. In a recent industry leaders’ roundtable hosted by Blameless, top experts discussed best practices for responding to incidents, scaling for reliability, and how to engineer with the customer in mind.

This Is the Most Underappreciated Skill for SREs

Delivering great software and sustainable systems is a team sport. Without the support of all stakeholders, adoption initiatives often fail. In successful initiatives, SREs are responsible for bringing together all resources and team members to help resolve reliability-related issues. But getting together these resources takes much more effort than people think. SREs engage in lots of glue work to ensure these collaborative efforts happen.