SRE

The latest News and Information on Service Reliability Engineering and related technologies.

System Reliability Metrics: A Comparative Guide to MTTR, MTBF, MTTD, and MTTF

In the ever-evolving landscape of technology, where systems and applications play a pivotal role in our daily lives, ensuring their reliability has become a critical concern for organizations. Unforeseen incidents and downtime can lead to significant financial losses, damage to reputation, and decreased customer satisfaction. In the realm of incident management and site reliability engineering (SRE), understanding and leveraging key reliability metrics is essential.

Reliability At Your Fingertips | Squadcast

Reliability Automation Platform from Squadcast! Squadcast helps global teams streamline Incident Management with a unified platform for on-call and incident response. We help teams at over 500 businesses around the world automate tasks, get notified of critical events, and work together to resolve incidents and minimize the impact on the business. Key Features of Our Reliability Automation Platform.

How Organizations Hire SREs: Laterals or Internal?

Securing reliable system operation necessitates building a formidable Site Reliability Engineering (SRE) team. However, a critical strategic decision confronts every organization: do we cultivate SRE talent internally or venture into the external talent pool? Both approaches possess distinct advantages and disadvantages, each impacting the composition, skillset, and overall effectiveness of the SRE team.

Role of Human Oversight in AI-Driven Incident Management and SRE

In the fast-paced landscape of technology, AI-driven Incident Management and Site Reliability Engineering (SRE) have emerged as critical components in ensuring the seamless functioning of digital systems. AI algorithms are increasingly employed to detect, diagnose, and resolve incidents with unprecedented speed and efficiency, revolutionizing the traditional approaches to reliability.

Blameless CommsAssist - 3 Tips on Making Incident Communication Easy

When you’re in the thick of an incident, communication is both essential and challenging. A wide variety of stakeholders will need timely updates on the situation in order to respond effectively. At the same time, breaking away from the actual diagnosis and resolution work to send these updates can massively slow progress.

How Squadcast Helps With Flapping Alerts

Often we receive a series of alerts that get auto-resolved within a short period of time. Such alerts are called flapping or transient alerts. In this blog, we'll explore the Auto Pause Transient Alerts (APTA) feature, which detects flapping alerts and temporarily pauses incident notifications, thereby reducing alert fatigue.

Simplifying Service Dependency With Squadcast's Service Graph

Microservices are fantastic for agility and innovation, but the trade-off is complex service management and ownership. With hundreds of interconnected services, troubleshooting and Incident Response can become a potential blocker. The traditional siloed approach to service ownership and the increasing number of deployments make service management more complex.

Understanding Cardinality with Levitate's Cardinality Explorer

Predicting the future is hard, especially with metrics-based monitoring systems, because metrics cardinality can snowball. This matters because high cardinality degrades query performance. Having visibility into what’s happening now, and workflows to manage cardinality, is crucial, because the answers depend on the quality of the questions a system allows you to ask. The questions one may have are:

Does Every Incident Need a Retrospective? Here's What the Experts Have to Say

Every quarter, we host a roundtable discussion centered around the challenges encountered by incident responders at the world’s leading organizations. These discussions are lightly facilitated and vendor-agnostic, with a carefully curated group of experts. Everyone brings their own unique perspective and experience to the group as we dive deep into the real-world challenges incident responders are facing today.

From Amazon to Apple: Key Strategies for Operational Excellence in Tech

Jim Gochee, CEO of Blameless with a history at New Relic and Apple, Ken Gavranovic, COO of Blameless and an Amazon Best Selling Author with experiences at Cox, Web.Com, and Unqork, and Lee Atchison, Chief Reliability Officer at Blameless, noted for his work on Amazon BeanStalk and as the author of "Architecting for Scale," with roles at AWS, HP, and New Relic, will guide this session.