Latest Posts

Getting SRE Buy-in from C-Levels for Error Budgets and SLOs, Part 3

You now have postmortems properly implemented, automated, and well-structured. You’re generating reports and data automatically based on all your incidents. Two levels of management have agreed to your SRE buy-in efforts. That is a huge accomplishment! If you’re here, you’re gaining great traction adopting SRE best practices, but the battle is not won yet. The hardest, but also the most strategically important, effort will be proving to your C-levels why they should buy into SRE.

Thought Leadership Panel: What is a "real" SRE?

Blameless recently had the privilege of hosting SRE leaders Craig Sebenik, David Blank-Edelman, and Kurt Andersen to discuss how SREs can approach work as done vs. work as imagined, how to define SRE and DevOps and the complementary nature of the two, the ethics of purchasing packaged versions of open source software, and more. The transcript below has been lightly edited, and if you’re interested in watching the full panel, you can do so here.

Getting SRE Buy-in from a VP or Director for Automated Metrics and Continuous Learning, Part 2

After getting managerial approval for incident management, your SRE buy-in program is well underway. How can you prove that it’s effective, and that adopting more best practices is necessary? In part 2 of this blog series, we’re going to share how to convince a VP or director to invest in additional SRE practices to strategically improve business results: automated metrics and continuous learning.

Getting SRE Buy-in from a Manager or Lead for Incident Response, Part 1

Adopting SRE best practices can be difficult, especially when you need approval from managers, VPs, CTOs, and everything in between. In this blog series, we will walk you through how to come up with a winning pitch for each level of leadership to ensure that SRE buy-in will succeed in your organization. Let’s start at the beginning with your team lead or manager.

Resilience in Action, Episode 1: Narratives in Incidents with Lorin Hochstein

Resilience in Action is a podcast about all things resilience, from SRE to software engineering, to how it affects our personal lives, and more. Resilience in Action is hosted by Blameless Staff SRE Amy Tobey. Amy has been an SRE and DevOps practitioner since before those names existed. She cares deeply about her community of SREs and wants to take what she’s learned over the 20+ years of her career to help others. In our very first episode, Amy chats with Netflix software engineer Lorin Hochstein.

Technology Innovation Snapshot: How Blameless Accelerates Team Performance

In the March edition of Digital Enterprise Journal’s Technology Innovation Snapshot, Blameless was listed alongside 11 other companies as a promising vendor. Blameless is honored to be alongside companies such as Gremlin, Catchpoint, and Moogsoft, and excited about the future DEJ sees for the SRE space.

How SREs Can Embrace Resilience During Crises

Blameless recently had the privilege of hosting SRE leaders Liz Fong-Jones, Dave Rensin, and Alex Hidalgo to discuss how SREs can embrace resilience during a pandemic, and how the principles of SRE intersect with global trends. The transcript below has been lightly edited, and if you’re interested in watching the full panel, you can do so here.

Best Practices for Pragmatic Incident Command

The goal of this piece is to provide some practical advice on how teams can coordinate and respond to complex, dynamic incidents. After all, incidents are unplanned investments that surface valuable learnings for improvement. For the purposes of this blog, we define incidents as situations where there is a need for coordination among multiple people working on the same problem. There will be incidents where this is not the case.

SRE for Business Continuity in the Face of Uncertainty

No, it won’t be possible to continue operating business-as-usual. For the foreseeable future, teams across the world will be dealing with cutbacks, infrastructure instability, and more. However, with SRE best practices, your team can embrace resilience and adapt through this difficult time.

Our Top 5 On-Call Practices

On-call: you may see it as a necessary evil. When responding to incidents quickly can make or break your reputation, designating people across the team to be ready to react at all hours of the day is a necessity, but often creates immense stress while eating into personal lives. It isn’t a surprise that many engineers have horror stories about the difficulty of carrying a pager around the clock. But does on-call have to be so dreadful? We think not.