
Latest Posts

SREview Issue #14 June 2021

Hoping you're headed towards a fun summer season and some time without masks. Let's avoid a new kind of tan-line! This newsletter shares useful industry content and an exciting Blameless product announcement. Find our fave tweets and events in the SRE and resilience engineering community. We're hiring! Check out the job openings here.

Complete Guide to Service Level Objectives (SLOs) That Work

Wondering what Service Level Objectives (SLOs) are? In this article, we will explain service level objectives and how they relate to SLAs, SLIs, and error budgets. A Service Level Objective (SLO) is a reliability target that is measured by a Service Level Indicator (SLI) and sometimes serves as a safeguard for a Service Level Agreement (SLA). SLOs represent customer happiness and guide the development team’s velocity.
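As a rough illustration of how an SLI measures an SLO, here is a minimal Python sketch. It assumes a request-based availability SLI (good requests divided by total requests) and a 99.9% target; the function names and the numbers are hypothetical and are not taken from the article.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Return the fraction of requests that were served successfully."""
    if total_requests == 0:
        # Assumption: a window with no traffic is treated as fully available.
        return 1.0
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """An SLO is met when the measured SLI is at or above the target."""
    return sli >= slo_target


# Hypothetical numbers for one measurement window.
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
print(f"SLI: {sli:.4%}  meets 99.9% SLO: {meets_slo(sli, 0.999)}")
```

An SLA, where one exists, would typically be set looser than the SLO, so that breaching the SLO acts as an early warning before the SLA is at risk.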

Here's what SLIs AREN'T

SLIs, or service level indicators, are powerful metrics of service health. They’re often built up from simpler metrics collected from the system. SLIs transform lower-level machine data into something that captures user happiness. Your organization might already have processes with this same goal. Techniques like real-time telemetry and synthetic data also build metrics that meaningfully represent service health.

Are you an MS Teams shop? We've got you covered with Blameless Incident Resolution

We have an exciting announcement. Blameless is providing early access to our Microsoft Teams integration. SRE and engineering teams can now resolve incidents faster without leaving the comfort of their favorite messaging tool. With the Blameless incident resolution product, Microsoft Teams users can now reduce toil in routine incident response processes through automation, codify processes with checklists, and craft retrospectives with the ‘add to timeline’ command.

Error Budgets Explained (And How to Make One for Your Team)

Wondering what error budgets (EBs) are and how they are useful? We explain what they are, how they are defined, and how they can help your team. An error budget is the amount of acceptable unreliability a service can have before customer happiness is impacted. If a service is well within its budget, the developers can take more risks in their releases. If not, developers need to make safer choices.
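The arithmetic behind an error budget can be sketched in a few lines of Python. This is only an illustration under assumed conditions: an event-based budget derived from an SLO target, with hypothetical function names, a 99.9% target, and made-up traffic figures.

```python
def error_budget(slo_target: float, total_events: int) -> float:
    """Number of bad events the SLO tolerates over the window."""
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    budget = error_budget(slo_target, total_events)
    if budget == 0:
        # Assumption: a 100% target (or no traffic) leaves no budget to spend.
        return 0.0
    return 1.0 - bad_events / budget


# Hypothetical window: 1,000,000 requests against a 99.9% SLO allows about 1,000 failures.
remaining = budget_remaining(slo_target=0.999, total_events=1_000_000, bad_events=250)
print(f"Budget remaining: {remaining:.1%}")

# A team might release more freely while the value is comfortably positive
# and switch to safer changes as it approaches zero.
```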

The 7 SRE Principles [And How to Put Them Into Practice]

Whether you're just adopting SRE or optimizing your current processes, we can help. We’ll explain the 7 key principles of SRE and how to put them into practice. So, what are the SRE principles? The fundamental SRE principles are: SRE is a method that operates through principles. Instead of prescribing specific solutions, it guides you with best practices. These SRE principles help organizations decide what's best for them. Once you understand the principles, you can apply them in many areas.

What do site reliability engineers do?

Are you considering adopting SRE? We will explain the roles and responsibilities of an SRE team within your organization, and how to start building one. So what does an SRE team do? An SRE team is responsible for building software that improves the resiliency of systems, implementing fixes, responding to incidents, and automating processes whenever possible. Site reliability engineering is a holistic practice that incorporates various types of work.

Blameless Runbook Documentation is Now Generally Available!

At Blameless, our mission is to provide teams with the tools they need to operationalize SRE and embrace a culture of resilience. We help teams automate toil and adopt best practices across integrated incident management, comprehensive retrospectives, service level objectives, reliability insights, and more. We are very excited to announce that Blameless Runbook Documentation is now generally available for all customers.

Resilience in Action Episode 7: Killing Ops with Tony Hansmann

Resilience in Action is a podcast about all things resilience, from SRE to software engineering, to how it affects our personal lives, and more. Resilience in Action is hosted by Kurt Andersen. Kurt is a practitioner and an active thought leader in the SRE community. He speaks at major DevOps & SRE conferences and publishes his work through O'Reilly in quintessential SRE books such as Seeking SRE, What is SRE?, and 97 Things Every SRE Should Know.