Operations | Monitoring | ITSM | DevOps | Cloud

SRE

The latest News and Information on Service Reliability Engineering and related technologies.

Sponsored Post

Scaling Site Reliability Engineering Teams the Right Way

Most SRE teams eventually reach a point in their existence where they appear unable to meet all the demands placed upon them. This is when these teams may need to scale. However, it's important to understand that increasing team capacity is not the same as increasing the number of people on the team. Let's unpack what scaling a team is all about, what are the indicators, what are steps you can take, and how you know if you're done.

SRE vs. DevOps vs. Platform Engineering: What's The Difference?

SRE, DevOps and Platform Engineering are important concepts in today's world of software development. There are dedicated teams to manage these areas, each with a unique primary focus, set of responsibilities, tools and metrics used to gauge their performance requirements. This article explains SRE, DevOps, and Platform Engineering, including similarities and differences, and, most importantly, how these teams help streamline modern software development, delivery, and maintenance processes.

2023 SRE Report

Now in its fifth year, The SRE Report has become the trusted source of trends and insights for reliability-as-a-feature practices. This year in partnership with Blameless, the report contains special contributions from Adrian Cockcroft and Steve McGhee and highlights findings from a global community of reliability practitioners, including SREs, managers, architects, and executives. As ever, we found some familiar trends and some thought-provoking anti-patterns.

Install Prometheus on Kubernetes: Tutorial & Examples

As one of the most popular open-source Kubernetes monitoring solutions, Prometheus leverages a multidimensional data model of time-stamped metric data and labels. The platform uses a pull-based architecture to collect metrics from various targets. It stores the metrics in a time-series database and provides the powerful PromQL query language for efficient analysis and data visualization.

Incident Response Guide

Site reliability engineering (SRE) is a critical discipline that focuses on ensuring the continuous availability and performance of modern systems and applications. One of the most vital aspects of SRE is incident response, a structured process for identifying, assessing, and resolving system incidents that can lead to downtime, revenue loss, and brand reputation damage.