Operations | Monitoring | ITSM | DevOps | Cloud

SRE

The latest News and Information on Service Reliability Engineering and related technologies.

Beyond the Blue Screen: Insights from the Microsoft-CrowdStrike Incident

In the wake of the Microsoft-CrowdStrike incident on July 19, 2024, Squadcast community has been actively reflecting on the lessons learned from this disruptive event. This global outage, affecting 8.5 million Windows machines, has served as a critical case study for incident management and operational resilience.

Implementing SLOs in Microservices: A Comprehensive Guide to Reliability and Performance

Microservices are revolutionizing modern enterprise architectures. They allow businesses to scale quickly and innovate without the constraints of monolithic systems. However, this transformation isn't without its challenges. Maintaining reliability across a web of interconnected services can be complex. Each microservice is a vital component, and a single failure can disrupt the entire system.

On-Call Rotations and Schedules: A Guide for 2024

In an increasingly connected world where businesses operate around the clock, the importance of having an effective on-call system cannot be stressed enough. With technological advances and the expectation of immediate attention to business-critical issues, creating a reliable on-call rotation and schedule is essential for ensuring operational continuity. This comprehensive guide will walk you through the various aspects of on-call rotations and schedules that you need to consider for 2024.

A Day in the Life of a Mezmo SRE

What keeps an SRE at the top of his game? I had an insightful conversation with Jon Duarte, a Site Reliability Engineer (SRE) at Mezmo and he walked me through his role and the various tasks he manages on a typical day. Here’s Jon offering a brief glimpse into the challenges he faces, the thought processes behind his approach, and the innovative solutions SREs come up with.
Sponsored Post

9 Critical Challenges in Enterprise Incident Management (And How to Overcome Them)

In an era where businesses are deeply intertwined with complex digital ecosystems, robust enterprise incident management has attained utmost importance. With businesses relying heavily on complex, interconnected systems, the stakes are high when things go wrong. According to PagerDuty's State of Digital Operations 2024 report, 65% of organizations experienced an increase in total incidents over the past year, with an average cost of $3,936 per minute of downtime for enterprise companies.

Creating Effective SLO Dashboards: A Comprehensive Guide

In modern software engineering, the concept of Service Level Objectives (SLOs) has become a cornerstone of reliable service delivery. SLOs define the acceptable level of service that a system must deliver, serving as a benchmark for both internal teams and external users. However, setting SLOs is only half the battle; effectively tracking and managing these objectives is crucial to ensure that services remain within the desired thresholds. This is where SLO dashboards come into play.

Customer Advisory Boards: How to Make Them Work

If you’ve been wondering about setting up a Customer Advisory Board (CAB) at your company, you’re not alone. Many companies, including our product team here at Zenduty, have found them incredibly valuable for getting direct input from clients, shaping product roadmaps, and building stronger relationships. Let’s dive into what makes a CAB effective, drawing from some real-world experiences shared by some of the best in the business.