
Latest Posts

How SLOs Enable Fast, Reliable Application Delivery

Application delivery gets harder each day as complexity rises, users expect services to be always available, and teams face growing pressure to innovate. Service level objectives, or SLOs, can help. In this blog post, we’ll discuss how SLOs are the key to modern application delivery, how to manage and measure them, the importance of observability for your SLO solution, and how to begin the journey to reliable application delivery today.

What is a Kubernetes Operator and Why it Matters for SRE

Kubernetes is an open-source platform that manages containerized workloads and services, including their deployment and configuration. Announced by Google in 2014, with version 1.0 released in 2015, Kubernetes is now maintained by the Cloud Native Computing Foundation. Since its release, it has become a worldwide phenomenon. The majority of cloud native companies use it, SaaS vendors offer commercial prebuilt versions, and there’s even an annual convention!

Here are the Metrics you Need to Understand Operational Health

In recent polls we’ve conducted with engineers and leaders, we’ve found that around 70% of participants counted MTTA and MTTR among their main metrics. 20% of participants cited looking at planned versus unplanned work, and 10% said they currently look at no metrics. While MTTA and MTTR are good starting points, they're no longer enough. As complexity rises, it can be difficult to gain insights into your services’ operational health.

Resilience in Action, E5: Tammy Bryant and Eric Roberts on the Importance of Glue Work

Resilience in Action is a podcast about all things resilience, from SRE to software engineering, to how it affects our personal lives, and more. Resilience in Action is hosted by Blameless Staff SRE Amy Tobey. Amy has been an SRE and DevOps practitioner since before those names existed. She cares deeply about her community of SREs and wants to take what she’s learned over the 20+ years of her career to help others.

Choosing the Right SRE Tools

Implementing SRE practices and culture can be challenging. Fortunately, there are a variety of tools for each aspect of SRE: monitoring, SLOs and error budgeting, incident management, incident retrospectives, alerting, chaos engineering, and more. In this blog post, we’ll talk about what to look for in SRE tools, and how they’ll help you on your journey to reliability excellence.

Look Upstream to Solve your Team's Reliability Issues

In “Upstream,” Dan Heath explores a variety of problems, ranging from homelessness to high school graduation rates to the state of sidewalks in different neighborhoods within the same city. In each of these examples, Heath discusses how upstream thinking decreased downstream work. Upstream thinking is characterized by proactive, collective action to improve outcomes, rather than reaction after an issue has already occurred.

The Importance of Reliability Engineering

If you’ve spent any time in tech circles lately, there are three letters you’ve surely heard: SRE. Site Reliability Engineering is the defining movement in tech today. Giants like Google and Amazon market their ability to provide reliable service and startups are now investing in reliability as an early priority. But what makes reliability engineering so important?

Improving Postmortems from Chores to Masterclass with Paul Osman

At our 2019 Blameless Summit, Paul Osman spoke about how to take postmortems, or incident retrospectives, to a new level. The following transcript has been lightly edited for clarity. Slides from this talk are available here. Paul Osman: I lead the SRE team at Under Armour. Who here knows about Under Armour as a tech company? Does anybody think about Under Armour as a tech company? Under Armour makes athletic attire, shirts and shoes.

How to Bring Operational Experience to your Development with GitHub's Lauren Rubin

At the 2019 Blameless Summit, Lauren Rubin spoke about how to bring operational expertise to development teams. The following transcript has been lightly edited for clarity. Lauren Rubin: I was going to ask for a show of hands of how many people here who are on call right this minute right now. I am actually on call right this minute. I like to live dangerously. If my phone beeps, the specific noise that means I have been paged, I'm sorry, I am going to look at it.