The latest News and Information on Service Reliability Engineering and related technologies.

Who should define Reliability - Engineering, or Product?

May 11, 2023 By Piyush Verma In Last9

Whoever owns Reliability should define its parameters. But who owns the Reliability of a Product? Engineering? Product Management? Or the Customer Success team?

What do self-driving cars tell us about Site Reliability Engineering?

May 9, 2023 By Mohan Dutt Parashar In Last9

From Robocars to Reliability: SRE with self-driving cars, mapping out where the Observability space stands in relation to self-driving cars.

Squadcast's Improved Slack (V2) Integration | Better Collaboration & Incident Management | Squadcast

May 5, 2023 By Squadcast In Squadcast

This video will give you an overview of the latest improvements supported by the Squadcast-Slack integration, which we hope will help in better collaboration and Incident Management.

Observability-OSS vs Paid vs Managed OSS

May 3, 2023 By Satyajeet Jadhav In Last9

The Reliability industry needs a managed answer, free of vendor lock-in, to spiraling costs, high cardinality, and the toil of managing a TSDB.

Scaling Site Reliability Engineering Teams the Right Way

Apr 28, 2023 By Biju Chacko In Squadcast

Most SRE teams eventually reach a point where they appear unable to meet all the demands placed upon them. This is when these teams may need to scale. However, it's important to understand that increasing team capacity is not the same as increasing the number of people on the team. Let's unpack what scaling a team is all about: what the indicators are, what steps you can take, and how you know when you're done.

Learnings integrating jmxtrans

Apr 25, 2023 By Saurabh Hirani In Last9

JMX metrics give solid insights into the workings of your application. Integrating them with Levitate (our time series data warehouse) required us to jump through some hoops with vmagent.

Install Prometheus on Kubernetes: Tutorial & Examples

Apr 20, 2023 By Squadcast Community In Squadcast

As one of the most popular open-source Kubernetes monitoring solutions, Prometheus leverages a multidimensional data model of time-stamped metric data and labels. The platform uses a pull-based architecture to collect metrics from various targets. It stores the metrics in a time-series database and provides the powerful PromQL query language for efficient analysis and data visualization.

DevOps vs. SRE

Apr 20, 2023 By Sematext In Sematext

What is the difference between DevOps and SRE? In short, DevOps should be an all-encompassing term for connecting the development team and operations team. However, DevOps tends to focus more on Deployment, whereas SRE focuses on Reliability.

View Video

What is SRE?

Apr 18, 2023 By Sematext In Sematext

SRE stands for Site Reliability Engineering and focuses on making sure your systems are always up and running. SRE teams are very similar to DevOps but have a few noticeable differences.

View Video

Incident Response Guide

Apr 17, 2023 By Squadcast Community In Squadcast

Site reliability engineering (SRE) is a critical discipline that focuses on ensuring the continuous availability and performance of modern systems and applications. One of the most vital aspects of SRE is incident response, a structured process for identifying, assessing, and resolving system incidents that can lead to downtime, revenue loss, and brand reputation damage.
