SRE

The latest news and information on Site Reliability Engineering and related technologies.

Blameless CommsAssist - 3 Tips on Making Incident Communication Easy

Jan 25, 2024 By Emily Arnott In Blameless

When you’re in the thick of an incident, communication is both essential and challenging. A wide variety of stakeholders need timely updates on the situation in order to respond effectively. At the same time, breaking away from the actual diagnosis and resolution work to send these updates can massively slow progress.

Read Post

Failing an AWS Availability Zone with Reliably's AI Assistant

Jan 24, 2024 By Reliably In Reliably

In this video, we're going to ask Reliably Assistant to create an experiment that takes down a full AWS AZ. #chaosengineering #aws #reliability

View Video

Ask Reliably Assistant to build and run a chaos engineering experiment on Kubernetes

Jan 24, 2024 By Reliably In Reliably

In this video, we're asking Reliably Assistant to create an experiment that adds latency between two Kubernetes services.

View Video

How Squadcast Helps With Flapping Alerts

Jan 23, 2024 By Chitra Bisht In Squadcast

Often we receive a series of alerts that get auto-resolved within a short period of time. Such alerts are called flapping or transient alerts. In this blog, we'll explore the Auto Pause transient alert (APTA) feature, which detects flapping alerts and temporarily pauses incident notifications, thereby reducing alert fatigue.
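As a rough illustration of the idea of flapping detection, the sketch below counts how often an alert resolves quickly within a look-back window and signals a pause once that count crosses a threshold. It is not Squadcast's APTA implementation: the class name `FlapDetector`, its fields, and all threshold values are hypothetical choices made for this example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FlapDetector:
    """Hypothetical detector: an alert is 'flapping' if it resolves quickly, repeatedly."""

    max_open_seconds: float = 120.0  # assumed: "short-lived" means resolved within 2 minutes
    window_seconds: float = 900.0    # assumed: look back over the last 15 minutes
    min_flaps: int = 3               # assumed: this many quick resolves counts as flapping
    _flaps: dict = field(default_factory=dict)

    def record(self, alert_key: str, triggered_at: float, resolved_at: float) -> bool:
        """Record one trigger/resolve cycle; return True if notifications should pause."""
        history = self._flaps.setdefault(alert_key, deque())
        # Only cycles that resolved quickly count as flaps.
        if resolved_at - triggered_at <= self.max_open_seconds:
            history.append(resolved_at)
        # Drop quick resolves that have aged out of the look-back window.
        while history and resolved_at - history[0] > self.window_seconds:
            history.popleft()
        return len(history) >= self.min_flaps


detector = FlapDetector()
first = detector.record("cpu-high", triggered_at=0, resolved_at=30)
second = detector.record("cpu-high", triggered_at=60, resolved_at=90)
third = detector.record("cpu-high", triggered_at=120, resolved_at=150)
```

With these assumed thresholds, the third quick resolve of the same alert is the one that would trigger a pause. A real system would also need to decide how long the pause lasts and when to resume notifications.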

Read Post

Simplifying Service Dependency With Squadcast's Service Graph

Jan 22, 2024 By Chitra Bisht In Squadcast

Microservices are fantastic for agility and innovation, but the trade-off is complex service management and ownership. With hundreds of interconnected services, troubleshooting and incident response can become a blocker. The traditional siloed approach to service ownership and the growing number of deployments make service management even more complex.

Read Post

Understanding Cardinality with Levitate's Cardinality Explorer

Jan 22, 2024 By Last9 In Last9

Predicting the future is hard, especially with metrics-based monitoring systems, because metrics cardinality can snowball. This matters because high cardinality degrades query performance. Having visibility into what’s happening now, and workflows to manage cardinality, is crucial, because the answers depend on the quality of the questions a system allows you to ask. Questions one may have include…
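To illustrate how cardinality can snowball, here is a minimal sketch that estimates the worst-case number of time series for a single metric as the product of the distinct values of each label. The label names and counts are made up for the example, and the figure is an upper bound that assumes every combination of label values actually occurs.

```python
from math import prod


def worst_case_series(label_values: dict[str, int]) -> int:
    """Upper bound on series count: product of distinct values per label."""
    return prod(label_values.values())


# Hypothetical label set for one metric.
labels = {"service": 50, "endpoint": 20, "status_code": 5}
before = worst_case_series(labels)  # 50 * 20 * 5

# Adding one more label multiplies the bound by its number of values.
labels["pod"] = 100
after = worst_case_series(labels)   # 50 * 20 * 5 * 100

print(before, after)
```

Each new label multiplies the bound rather than adding to it, which is why a single high-variance label can have an outsized effect.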

View Video

Does Every Incident Need a Retrospective? Here's What the Experts Have to Say

Jan 17, 2024 By Ryan McDonald In Rootly

Every quarter, we host a roundtable discussion centered around the challenges encountered by incident responders at the world’s leading organizations. These discussions are lightly facilitated and vendor-agnostic, with a carefully curated group of experts. Everyone brings their own unique perspective and experience to the group as we dive deep into the real-world challenges incident responders are facing today.

Read Post

From Amazon to Apple: Key Strategies for Operational Excellence in Tech

Jan 17, 2024 By Blameless In Blameless

This session will be guided by three Blameless leaders: Jim Gochee, CEO, with a history at New Relic and Apple; Ken Gavranovic, COO and an Amazon best-selling author with experience at Cox, Web.Com, and Unqork; and Lee Atchison, Chief Reliability Officer, noted for his work on Amazon BeanStalk and as the author of "Architecting for Scale," with roles at AWS, HP, and New Relic.

View Video

8 Strategies for Reducing Alert Fatigue

Jan 16, 2024 By Anjali Udasi In Zenduty

Site Reliability Engineers (SREs) and DevOps teams often deal with alert fatigue. It sets in when you receive so many alerts that it's hard to keep up, which makes it tougher to respond quickly and adds stress on top of existing responsibilities. According to a study, 62% of participants noted that alert fatigue played a role in employee turnover, while 60% reported that it resulted in internal conflicts within their organization.

Read Post

The Catchpoint 2024 SRE Report - Five Key Takeaways

Jan 16, 2024 By Emily Arnott In Blameless

SRE is a relatively new discipline in tech, having emerged into the mainstream only in the 2010s. It has been rapidly adopted by a widening variety of organizations, whose practices are constantly evolving. For the last six years, Catchpoint has run a survey to take the temperature of the latest developments and trends. Check out the full report here, and read on for our analysis of five key takeaways.

Read Post