SRE

The latest news and information on Site Reliability Engineering and related technologies.

Non-Abstract Large System Design (NALSD): The Ultimate Guide

Jan 13, 2024 By Anjali Udasi In Zenduty

Non-Abstract Large System Design (NALSD) is an approach where intricate systems are crafted with precision and purpose. It holds particular importance for Site Reliability Engineers (SREs) due to its inherent alignment with the core principles and goals of SRE practices. It improves the reliability of systems, allows for scalable architectures, optimizes performance, encourages fault tolerance, streamlines the processes of monitoring and debugging, and enables efficient incident response.

Prometheus Federation Scaling Prometheus Guide

Jan 10, 2024 By Tripad Mishra In Last9

We discuss the nuances of federation in Prometheus and address Prometheus scaling challenges, along with alternatives to Prometheus federation.

SLOs with Prometheus done wrong, wrong, wrong, wrong, then right

Jan 10, 2024 By Last9 In Last9

We have Carson Anderson, Sr. DevOps Engineer at Weave HQ, talking about how they implemented SLOs using Prometheus, what went wrong, and how they fixed it. This talk was given at the "Last9 of Reliability" Discord community on 13th December. Talk description: First things first: yes, it really did take us 5 tries to implement our SLOs with Prometheus. While that may seem embarrassing, we are very happy to be able to share our SLO journey so that we can hopefully help you avoid the same mistakes.

Introducing Squadcast's Intelligent Alert Grouping and Snooze Notifications

Jan 8, 2024 By Rahul Jagdish In Squadcast

Maintaining system reliability amidst a deluge of alerts remains a formidable challenge for complex infrastructure environments. To address this need, Squadcast is happy to introduce Intelligent Alert Grouping, developed based on in-depth discussions and feedback from our enterprise customers. This solution is designed to streamline Incident Management, ensuring that Incident Response teams can focus on what truly matters.

The SRE Report 2024 Reveals State of Site Reliability Engineering

Jan 8, 2024 By Catchpoint In Catchpoint

Annual Report by Catchpoint Reveals New Insights into Control, Learning from Incidents, Artificial Intelligence and Beyond.

The SRE Report 2024: Essential Considerations for Readers

Jan 8, 2024 By Leo Vasiliou In Catchpoint

If you Google “What is the shortest complete sentence in American English?”, you may get “I am” as the first answer. However, “Go” is also considered a grammatically correct sentence and is shorter than “I am”.

How Squadcast's Workflows Enhance Incident Management Automation?

Jan 5, 2024 By Chitra Bisht In Squadcast

One of the daily challenges for Incident Response teams is the pressure to resolve incidents swiftly and effectively. However, manual processes often hinder this objective, leading to delays, oversight, and potential miscommunication. In this blog, we’ll learn the practical aspects of workflow automation in Incident Management using Squadcast, exploring how it streamlines processes, eliminates manual tasks, and enhances overall efficiency.

How to Calculate and Minimize Downtime Costs

Jan 5, 2024 By Anjali Udasi In Zenduty

Downtime is an unwelcome reality. But, beyond the immediate disruption, outages carry a significant financial burden, impacting revenue, customer satisfaction, and brand reputation. For SREs and IT professionals, understanding the cost of downtime is crucial to mitigating its impact and building a more resilient infrastructure.

Why your monitoring costs are high

Jan 4, 2024 By Aniket Rao In Last9

If you want to bring down your monitoring costs, you need to shake off the decision paralysis in engineering.

Runbook vs Playbook: What's the difference?

Dec 29, 2023 By Chitra Bisht In Squadcast

What's the difference between a runbook and a playbook? We'll end this confusion once and for all today. If you find yourself worrying about forgetting the detailed process of the incident your team just resolved, you're not alone. This is where documentation like runbooks and playbooks comes into play. Runbooks and playbooks serve as organizational guides, providing essential information and instructions for teams to navigate tasks and processes effectively. They not only help your team help themselves but also free up your time for your ever-growing to-do list.
