Blameless

What's the Difference between Observability and Monitoring?

Wondering what the difference is between observability and monitoring? In this post, we explain how they are related, why they are important, and some suggested tools that can help. The difference between observability and monitoring is that observability is the ability to understand a system’s state from its outputs, often referred to as understanding the “unknown unknowns”.

What is a Blameless Postmortem?

Do blameless retrospectives (or postmortems) help your team? We will explain what they are, whether they really work, and how to do them right. A blameless postmortem (or retrospective) is a post-incident document that helps teams figure out why an incident happened and brainstorm how to improve the process to prevent similar incidents from happening again. In most engineering organizations, everyone agrees that in complex systems, failure is inevitable.

Error Budgets That Work for You. Plus Support for New Relic Metrics and NR Query Language

Did you know that error budget policy is the key to making SLOs actionable? In fact, Twitter’s engineering team did not successfully adopt SLOs until they introduced error budgets. SLOs enable teams to quantify customer happiness, and error budgets enable teams to make data-backed tradeoffs between reliability and feature velocity. We believe that teams optimizing for reliability must adopt both.

Elephant in the Blameless War Room: Accountability

We’ve always advocated that every company can benefit from a blameless culture. Fostering a blameless culture can profoundly boost your organization in powerful ways, from employee retention to developer velocity and innovation. However, there’s an elephant in the room when we talk about blamelessness with executives: accountability. When things go wrong, people still need to get fired, right?

Resilience in Action E8: Vanessa Yiu on Crafting Enterprise Architecture

Resilience in Action is a podcast about all things resilience, from SRE to software engineering, to how it affects our personal lives, and more. Resilience in Action is hosted by Kurt Andersen. Kurt is a practitioner and an active thought leader in the SRE community. He speaks at major DevOps & SRE conferences and publishes his work through O'Reilly in quintessential SRE books such as Seeking SRE, What is SRE?, and 97 Things Every SRE Should Know.

SREview Issue #14 June 2021

Hoping you're headed towards a fun summer season and some time without masks. Let's avoid a new kind of tan-line! This newsletter shares useful industry content and an exciting Blameless product announcement. Find our fave tweets and events in the SRE and resilience engineering community. We're hiring! Check out the job openings here.

Complete Guide to Service Level Objectives (SLOs) That Work

Wondering what Service Level Objectives (SLOs) are? In this article, we will explain service level objectives and how they relate to SLAs, SLIs, and error budgets. A Service Level Objective (SLO) is a reliability target that is measured by a Service Level Indicator (SLI) and sometimes serves as a safeguard for a Service Level Agreement (SLA). SLOs represent customer happiness and guide the development team’s velocity.
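
To make the relationship between these terms concrete, here is a minimal sketch in Python. It assumes a simple ratio-style SLI (good events divided by total events) and treats the error budget as the share of events the SLO allows to fail. The function names and the sample numbers are illustrative assumptions, not taken from any Blameless product or from the article itself.

```python
def sli(good_events: int, total_events: int) -> float:
    """Ratio-style SLI: the fraction of events that were 'good'."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    return good_events / total_events


def error_budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent).

    slo_target is a fraction such as 0.999 for a 99.9% SLO.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    allowed_bad = (1 - slo_target) * total_events  # failures the SLO tolerates
    actual_bad = total_events - good_events        # failures observed so far
    return 1 - actual_bad / allowed_bad


if __name__ == "__main__":
    # Hypothetical numbers: 1,000,000 requests, 500 of them failed.
    print(f"SLI: {sli(999_500, 1_000_000):.4%}")
    print(f"Budget remaining: {error_budget_remaining(0.999, 999_500, 1_000_000):.1%}")
```

With a 99.9% target, one million requests allow roughly 1,000 failures, so 500 observed failures leave about half of the budget unspent.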

Here's what SLIs AREN'T

SLIs, or service level indicators, are powerful metrics of service health. They’re often built up from simpler metrics that are monitored from the system. SLIs transform lower level machine data into something that captures user happiness. Your organization might already have processes with this same goal. Techniques like real-time telemetry and using synthetic data also build metrics that meaningfully represent service health.

Are you an MS Teams shop? We've got you covered with Blameless Incident Resolution

We have an exciting announcement. Blameless is providing early access to our Microsoft Teams integration. SRE and engineering teams can now resolve incidents faster without leaving the comfort of their favorite messaging tool. With the Blameless incident resolution product, Microsoft Teams users can now reduce toil in routine incident response processes through automation, codify processes with checklists, and craft retrospectives with the ‘add to timeline’ command.