SRE

The latest news and information on Site Reliability Engineering and related technologies.

How to Set up SLOs and Configure SLIs in Squadcast | Tracking Error Budget & Burn Rates | Squadcast

This video will help you define and monitor Service Level Objectives for your services, and set up and track error budget burn rates in Squadcast. A Service Level Objective (SLO) is a reliability target, measured by a Service Level Indicator (SLI), and sometimes serves as a safeguard for a Service Level Agreement (SLA). SLOs represent customer happiness and guide the development team’s velocity.
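As a rough illustration of the terms above, an error budget is the share of requests an SLO allows to fail, and a burn rate compares the observed failure rate against that allowance. The sketch below is generic arithmetic, not Squadcast's implementation; the function names `error_budget` and `burn_rate` are hypothetical and chosen only for this example.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of events allowed to fail under the SLO (e.g. 0.999 leaves 0.001)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the allowed error rate.

    A value of 1.0 means the budget is being spent exactly as fast as the
    SLO permits; a value above 1.0 means it will run out early.
    """
    budget = error_budget(slo_target)
    if total_events == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    observed_error_rate = bad_events / total_events
    return observed_error_rate / budget


# Hypothetical window: 20 failed requests out of 10,000 against a 99.9% SLO.
rate = burn_rate(bad_events=20, total_events=10_000, slo_target=0.999)
print(f"burn rate: {rate:.2f}")
```

In this made-up window the service fails at about twice the rate the SLO allows, so a budget meant to last a full period would be used up in roughly half of it.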

5 Best practices for developing a culture of continuous improvement

How do you create a great engineering team? Exclusively hire brilliant, tenured computer science PhDs. There, we solved it. You can skip the next 400 words. (I can hear my college professor in my head saying “Humor might not be your strong suit.”) Building a great engineering team isn’t easy. Understatement of the year. It’s not even a problem to be solved per se. We need to think about it as preparation to solve an infinite set of constantly evolving problems.

Suppression Rules in Squadcast | Minimise Alert fatigue | Suppress Non-Actionable Alerts | Squadcast

This video talks about Alert suppression in Squadcast. Alert Suppression helps you avoid alert fatigue by suppressing notifications for non-actionable alerts. Squadcast will suppress the incidents that match any of the Suppression Rules you create for your Services. These incidents will go into the Suppressed state and you will not get any notifications for them.
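The "match any rule" behavior described above can be pictured with a small sketch. This is not Squadcast's API or rule syntax: the `SuppressionRule` class, its `field` and `contains` attributes, and the incident dictionary shape are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SuppressionRule:
    """Hypothetical rule: matches when `contains` appears in the named incident field."""
    field: str
    contains: str

    def matches(self, incident: dict) -> bool:
        return self.contains in str(incident.get(self.field, ""))


def is_suppressed(incident: dict, rules: list[SuppressionRule]) -> bool:
    """An incident is suppressed if it matches ANY of the service's rules."""
    return any(rule.matches(incident) for rule in rules)


# Hypothetical rules for one service.
rules = [
    SuppressionRule(field="message", contains="heartbeat"),
    SuppressionRule(field="description", contains="scheduled maintenance"),
]

incident = {"message": "heartbeat check recovered", "description": ""}
state = "suppressed" if is_suppressed(incident, rules) else "triggered"
print(state)
```

Incidents that match no rule keep their normal state and notify as usual.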

Announcing our improved Schedules & On-Call Rotations

Hey folks! We are super excited to announce that our schedules feature has gone through a bit of an update. Well, more than a bit 🙂. We’ve gone through the feature with a fine-toothed comb and introduced a bunch of UI and functional improvements which we hope will help you achieve one thing: set up, edit and manage your on-call schedules at scale in a matter of minutes. (Yes, that was three things, but it was tough to condense it to ONE thing.)

SRE Report 2023: Findings From the Field - Toil

Toil. Few other words have the same visceral impact for SREs as their four-letter nemesis: toil. Although pretty much everyone recognizes and agrees that toil is bad, the term is frequently misused in everyday conversation. In common English usage, toil is defined as “long strenuous fatiguing labor”. As a term of art in the SRE profession, “toil” has several very specific characteristics which distinguish it from other sorts of work that people spend time on.

[SRE: From Theory to Practice] What's difficult about problem detection?

In this episode of FTTP, Kurt Andersen and Matt Davis are joined by Joanna Mazgaj and Laura Nolan to talk about the implications of and considerations for problem detection. Watch the full episode and hear them share personal stories about the types of challenges you might face. Ultimately, how do we explain and address the socio-technical concepts behind problem detection?

[SRE: From Theory to Practice] What's difficult about incident command?

Welcome back to our mini series of fireside chats with SRE experts talking about the realities of their day-to-day. Episode 2 gets intimate — What’s difficult about incident command? We invited Alyson van Hardenberg, Engineering Manager at Honeycomb.io, and Varun Pal, Staff SRE at Procore, to chat with Jake Englund and Matt Davis from the Blameless team. Watch the full conversation where they cover everything from methodologies and technical expertise to the human and social aspects of reliability engineering.

Using Tagging and Routing Rules in Squadcast | Incident Classification | Event Tagging | Squadcast

Event Tagging is a rule-based auto-tagging system that lets you define customized tags based on incident payloads. These tags are automatically assigned to incidents when they are triggered. This video explains how to create Tagging rules for efficient Incident Classification.
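To make the idea of payload-based tagging concrete, here is a minimal sketch of a rule evaluator. It does not reflect Squadcast's actual rule language: the rule representation (a predicate paired with the tags it assigns), the `apply_tagging_rules` function, and the payload keys are all hypothetical.

```python
from typing import Callable

# Hypothetical rule shape: a predicate over the payload, plus the tags it assigns.
TagRule = tuple[Callable[[dict], bool], dict[str, str]]


def apply_tagging_rules(payload: dict, rules: list[TagRule]) -> dict[str, str]:
    """Collect the tags from every rule whose predicate matches the payload.

    Rules are evaluated in order, so a later rule overrides an earlier one
    when both set the same tag key.
    """
    tags: dict[str, str] = {}
    for predicate, rule_tags in rules:
        if predicate(payload):
            tags.update(rule_tags)
    return tags


# Hypothetical rules keyed on made-up payload fields.
rules: list[TagRule] = [
    (lambda p: p.get("severity") == "critical", {"priority": "P1"}),
    (lambda p: "db" in p.get("host", ""), {"team": "database"}),
]

payload = {"severity": "critical", "host": "db-primary-01"}
print(apply_tagging_rules(payload, rules))
```

Tags produced this way could then drive classification or routing decisions, which is the purpose the blurb above describes.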

Adding Incident Watchers in Squadcast | Incident Notifications and Updates | Squadcast

This video talks about Squadcast's Incident Watchers feature. In Squadcast, any user or stakeholder can subscribe to an incident and act as a Watcher for it. Incident Watchers can choose to receive notifications for all the updates of an incident. This allows any user or stakeholder to act as an observer of the incident, even if they are not an active responder. You can also customize your watch options for the incident and receive notifications only for the updates you select.

SRE Vs. DevOps: A Simple Breakdown Of The Differences

You know this already. Regardless of your size, you must keep up with technological developments in your industry — and, increasingly, in other industries, even those that seem unrelated. Embracing disruption can enable you to increase your market share, revenue, and profit margins. Delegating some development and operations responsibilities to Site Reliability Engineering (SRE) experts allows developers to innovate and create new solutions faster.