SRE

The latest news and information on Site Reliability Engineering and related technologies.

[Webinar] Unlock self-service infrastructure monitoring with the Sensu Integration Catalog

Introducing the Sensu Integration Catalog, a marketplace-like UX for simplifying new user onboarding and deploying production-ready monitoring in a matter of minutes. The Sensu Integration Catalog is also an open marketplace that new and existing users can contribute to by sharing Sensu configurations. Backed by an industry-leading monitoring-as-code solution, Sensu provides new users with a point-and-click interface to get started quickly, while facilitating DevOps and SRE automation best practices.

Are your SLOs realistic? How to analyze your risks like an SRE

Setting up Service Level Objectives (SLOs) is one of the foundational tasks of Site Reliability Engineering (SRE) practices, giving the SRE team a target against which to evaluate whether or not a service is running reliably enough. The inverse of your SLO is your error budget — how much unreliability you are willing to tolerate.
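
As a rough illustration of the error budget arithmetic, here is a minimal Python sketch. The function names and the 30-day window are assumptions made for this example; they do not come from the article.

```python
def error_budget(slo: float) -> float:
    """Return the error budget as a fraction: 1 minus the SLO target."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1], e.g. 0.999")
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Convert the error budget into minutes of downtime over a window.

    Assumes a time-based availability SLO measured over `window_days` days.
    """
    minutes_in_window = window_days * 24 * 60
    return error_budget(slo) * minutes_in_window


if __name__ == "__main__":
    # Hypothetical target: 99.9% availability over a 30-day window.
    print(f"{allowed_downtime_minutes(0.999, 30):.1f} minutes")
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43.2 minutes of tolerated downtime.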

How to Achieve Measurable Reliability Results

Reliability is more important than ever. As users depend on services more and more, and competition in every sector grows, a great digital experience becomes the baseline for expectations, not the ceiling. It’s crucial to invest in making your software reliable enough to keep customers happy. But what does investing in reliability look like?

The Reverse Red Herring

During an incident, time is fungible. At points it seems to go way too fast, and at others a command seems to take an eternity to complete. More important, however, is how it feels to be in an incident. It’s a heightened state of being, where any and every piece of information could be “the one” that helps crack open what is really going on. Likewise, there is an inherent distrust of incoming information.

NewsKit API: The journey of building reliability into our systems at News UK

Going from a small start to serving billions of requests per month is never an easy path. Stoyan Yanev, Principal Engineer, and Krasimir Petrov, Senior Software Engineer at News UK, will show how they built their infrastructure and the decisions and compromises that had to be made along the way. The talk will center on the NewsKit API and the importance of reliability before opening up a discussion among the group.

How To Reduce Technical Debt

Technical debt is the implied cost of the additional work that is required when a team chooses a quick, easy solution that is limited, instead of a more well-thought-out, higher-quality solution that would take longer. Essentially, it’s what happens when teams prioritize speed over quality. Examples of technical debt include untested code, unreadable code, dead code, duplicated code, or outdated documentation.

Objectively Speaking: Understanding the Power of Objectives

Objectives help monitor different aspects of your services and systems such as latencies, error rates, PRs that are open, the age of a bug, and more. These are all examples of things that can drift away from what we think is good, and capturing what we think is good is essentially what an objective does. Objectives help us to define what ‘good’ looks like.

How Do You Measure Technical Debt?

Technical debt is one of the trade-offs today’s software teams make to speed up development, which in turn shortens time to market. That is mission-critical for most start-ups. Instead of dwelling on implementation details, or trying to cover edge cases that may affect a small fraction of end users at an early stage of development, agile teams prioritize early and continuous delivery.

Post-Incident Review | Why It's Important & How It's Done

Curious about the post-incident review process? We give a complete explanation of post-incident reviews and why they are important, and discuss best practices. What is a post-incident review? A post-incident review is an evaluation of the incident response process. The goal of the process is to produce clear actions that improve the incident response process and help prevent further incidents.