Operations | Monitoring | ITSM | DevOps | Cloud

SRE

The latest News and Information on Service Reliability Engineering and related technologies.

What are MTTx Metrics Good For? Let's Find Out.

Data helps best-in-class teams make the right decisions. Analyzing your system’s metrics shows you where to invest time and resources. A common type of metric is Mean Time to X, or MTTx. These metrics detail the average time it takes for something to happen. The “x” can represent events or stages in a system’s incident response process. Yet, MTTx metrics rarely tell the whole story of a system’s reliability.

Having On-call Nightmares? Runbooks can Help you Wake Up.

You aren't sure how long you've been here, but the view outside the window sure is soothing. Before you can fully take in your surroundings, a siren rips you back into the conscious world. Slowly, you begin to piece together that you exist, and you are on call. The ringing, much louder now, pierces through your skull as you begin to open your bleary eyes. You turn over your pillow, grab your phone, and click through the PagerDuty notification.

How Netflix Uses Fault Injection To Truly Understand Their Resilience

Distributed systems such as microservices have defined software engineering over the last decade. The majority of advancements have been in increasing resilience, flexibility, and rapidity of deployment at increasingly larger scales. For streaming giant Netflix, the migration to a complex cloud based microservices architecture would not have been possible without a revolutionary testing method known as fault injection. With tools like chaos monkey, Netflix employs a cutting edge testing toolkit.

So you Want an SRE Tool. Do you Build, Buy, or Open Source?

As your organization’s reliability needs grow, you may consider investing in SRE tools. Tooling can make many processes more efficient, consistent, and repeatable. When you decide to invest in tooling, one of the major decisions is how you’ll source your tools. Will you buy an out-of-the-box tool, build one in-house, or work with an open source project? This is a big decision. Switching methods half-way through adoption is costly and can cause thrash.

How to Analyze Incidents Better with the Right Metrics

An important SRE best practice is analyzing and learning from incidents. When an incident occurs, you shouldn’t think of it as a setback, but as an opportunity to grow. Good incident analysis involves building an incident retrospective. This document will contain everything from incident metrics to the narrative of those involved. These metrics aren’t the whole story, but they can help teams make data-driven decisions. But choosing which metrics are best to analyze can be difficult.