SRE

The latest news and information on Site Reliability Engineering and related technologies.

Announcing our improved Schedules & On-Call Rotations

Feb 7, 2023 By Nakul Shetty In Squadcast

Hey folks! We are super excited to announce that our schedules feature has gone through a bit of an update. Well, more than a bit 🙂. We’ve gone through the feature with a fine-toothed comb and introduced a bunch of UI and functional improvements that we hope will help you achieve one thing: set up, edit, and manage your on-call schedules at scale in a matter of minutes. (Yes, that was three things, but it was tough to condense it to ONE thing.)

SRE Report 2023: Findings From the Field - Toil

Feb 7, 2023 By Kurt Andersen In Catchpoint

Toil. Few other words have the same visceral impact for SREs as their four-letter nemesis: toil. Although pretty much everyone agrees that toil is bad, the term is frequently misused in casual conversation. In common English usage, toil is defined as “long strenuous fatiguing labor”. As a term of art in the SRE profession, “toil” has several very specific characteristics that distinguish it from other kinds of work people spend time on.

[SRE: From Theory to Practice] What's difficult about problem detection?

Feb 7, 2023 By Blameless In Blameless

In this episode of FTTP, Kurt Andersen and Matt Davis are joined by Joanna Mazgaj and Laura Nolan to talk about the implications of and considerations for problem detection. Watch the full episode and hear them share personal stories about the types of challenges you might face. Ultimately, how do we explain and address the socio-technical concepts behind problem detection?

[SRE: From Theory to Practice] What's difficult about incident command?

Feb 7, 2023 By Blameless In Blameless

Welcome back to our miniseries of fireside chats with SRE experts talking about the realities of their day-to-day. Episode 2 gets intimate: what’s difficult about incident command? We invited Alyson van Hardenberg, Engineering Manager at Honeycomb.io, and Varun Pal, Staff SRE at Procore, to chat with Jake Englund and Matt Davis from the Blameless team. Watch the full conversation, where they cover everything from methodologies and technical expertise to the human and social aspects of reliability engineering.

Using Tagging and Routing Rules in Squadcast | Incident Classification | Event Tagging | Squadcast

Feb 7, 2023 By Squadcast In Squadcast

Event Tagging is a rule-based auto-tagging system that lets you define custom tags based on incident payloads; the tags are automatically assigned to incidents when they are triggered. This video explains how to create tagging rules for efficient incident classification.

Adding Incident Watchers in Squadcast | Incident Notifications and Updates | Squadcast

Feb 6, 2023 By Squadcast In Squadcast

This video covers Squadcast's Incident Watchers feature. In Squadcast, any user or stakeholder can subscribe to an incident and act as a Watcher. Incident Watchers can choose to receive notifications for all updates to an incident, which lets them observe the incident even if they are not active responders. You can also customize your watch options and receive notifications only for the updates you select.

SRE Vs. DevOps: A Simple Breakdown Of The Differences

Feb 3, 2023 By CloudZero In CloudZero

You know this already. Regardless of your size, you must keep up with technological developments in your industry — and, increasingly, in other industries, even those that seem unrelated. Embracing disruption can enable you to increase your market share, revenue, and profit margins. Delegating some development and operations responsibilities to Site Reliability Engineering (SRE) experts allows developers to innovate and create new solutions faster.

SRE Principles for Edge Management and Improving Resiliency Using the Best of Kubernetes

Feb 3, 2023 By Kirti Apte and Gabry (Maria Gabriella) Brodi In VMware Tanzu

This post was co-written by Kirti Apte and Gabry (Maria Gabriella) Brodi. Over the last couple of years, customers have been adopting Kubernetes and microservice-based application deployment models for various technology and business reasons. In fact, customers are now increasingly looking to the next set of use cases, which include applications across multiple clouds as well as edge clouds.

Blameless Announces New Opsgenie Integration to Help Engineers Simplify and Speed Incident Management Workflow

Feb 1, 2023 By Blameless In Blameless

Assemble the Right Team to Resolve Incidents Fast by Integrating Alerting and Service Catalog Functions.

Announcing: Blameless + OpsGenie Integration

Feb 1, 2023 By Aaron Lober In Blameless

In the opening moments of an engineering incident, the most important aspect of a response plan is speed. Getting out of the gate quickly by leveraging automation to assemble the team can save precious moments during a critical engineering incident and make the difference between happy and unhappy customers downstream. This is why we’re excited to announce the integration of Blameless with OpsGenie.
