Latest News

What SREs Can Learn from Capt. Sully: When to Follow Playbooks

Mar 4, 2022 By Andre King In Rootly

When are you smarter than your playbooks, and when are your playbooks smarter than you? That’s a question that engineers rarely step back to consider. The rational, disciplined parts of our minds tell us that the playbooks we are supposed to follow were carefully designed and tested, and that we should stick to them at all costs.

Incident Response Lifecycle | A Complete Explanation

Mar 3, 2022 By Emily Arnott In Blameless

Wondering about the incident response lifecycle? We explain what it is, and how each phase helps lead to effective incident resolution. What is the incident response lifecycle? The incident response lifecycle is an organization’s framework for responding to an incident that disrupts service. The incident response lifecycle contains the following phases.

Golden Signals - Monitoring from first principles

Mar 2, 2022 By Safeer CM In Squadcast

Building a successful monitoring process for your application is essential for high availability. In the first part of this three-part blog series, Safeer discusses the four key SRE Golden Signals for metrics-driven measurement, and the role they play in the overall context of monitoring. Monitoring is the cornerstone of operating any software system or application effectively. The more visibility you have into the software and hardware systems, the better you can serve your customers. It tells you whether you are on the right track and, if not, by how much you are missing the mark.

Kubernetes Health Check Using Probes

Mar 2, 2022 By Squadcast Community In Squadcast

Kubernetes is an open source container orchestration platform that significantly simplifies an application's creation and management. Distributed systems like Kubernetes can be hard to manage, as they involve many moving parts and all of them must work for the system to function. Even if a small part breaks, it needs to be detected, routed and fixed. These actions also need to be automated. Kubernetes allows us to do that with the help of readiness and liveness probes.
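
As a rough illustration of the application side of these probes, here is a minimal sketch in Python. It assumes HTTP probes; the paths `/healthz` and `/readyz`, the `ready` flag, and the helper names are hypothetical choices for this example and do not come from the post. Kubernetes treats an HTTP probe response with a status of at least 200 and below 400 as success, so the readiness endpoint answers 503 until the application marks itself ready.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading

# Hypothetical probe paths; the paths are whatever the probe config points at.
LIVENESS_PATH = "/healthz"
READINESS_PATH = "/readyz"

# Set by the application once startup work (caches, connections) is done.
ready = threading.Event()


def probe_status(path: str, is_ready: bool) -> int:
    """Return the HTTP status code a probe request to `path` should get."""
    if path == LIVENESS_PATH:
        # The process is up and able to answer, so liveness passes.
        return 200
    if path == READINESS_PATH:
        # 503 tells the readiness probe not to route traffic here yet.
        return 200 if is_ready else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(probe_status(self.path, ready.is_set()))
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        # Silence per-request logging; probes fire every few seconds.
        pass


def serve(port: int = 8080) -> None:
    """Serve probe endpoints until the process exits (blocking call)."""
    HTTPServer(("", port), ProbeHandler).serve_forever()
```

Calling `serve()` in a background thread and `ready.set()` once initialization finishes would give the liveness and readiness probes something to query; the port and paths must match whatever the Pod spec declares.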

Postmortems Now Called Retrospectives in Blameless

Mar 2, 2022 By Blameless In Blameless

Something big happened at Blameless this month: our “Postmortem” feature was renamed “Retrospective”. To the naysayers, I suppose you’re thinking, “This seems trivial. Different teams call it different names anyway, so why bother making the change?” First, let me say thank you for reading our blog, and I hope you read this one through to the end. Now, allow me to explain our reasoning and why we’re excited about this update.

Alert Fatigue in SRE: What It Is & How To Avoid It

Mar 1, 2022 By Emily Arnott In Blameless

Wondering about alert fatigue? We describe what it is, how it affects software development teams, and how to avoid it. What is alert fatigue? Alert fatigue is the phenomenon of employees becoming desensitized to alert messages because of the overwhelming volume of alerts and the number of false positives they receive. The risk with alert fatigue is that important information will be overlooked or ignored.

Quickly troubleshoot application errors with Error Reporting

Feb 28, 2022 By Eyamba Ita In Google Operations

Are you familiar with the four golden signals of Site Reliability Engineering (SRE): latency, traffic, errors, and saturation? Whether you’re a developer or an operator, you’ve likely been responsible for collecting, storing, or analyzing the data associated with these concepts. Much of this data is captured in application and infrastructure logs, which provide a rich history of what is happening behind the scenes in your workloads.
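
To make the four signals concrete, here is a small Python sketch that derives one number for each from a window of request records. The `Request` record, the choice of a nearest-rank 95th percentile for latency, counting 5xx responses as errors, and using CPU usage as the saturation measure are all assumptions made for this illustration, not anything prescribed by the post.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    # Hypothetical record shape for one served request.
    duration_ms: float
    status: int


def golden_signals(
    requests: List[Request],
    window_s: float,
    cpu_used: float,
    cpu_capacity: float,
) -> Dict[str, float]:
    """Summarize latency, traffic, errors, and saturation for one window."""
    if not requests:
        raise ValueError("need at least one request in the window")

    # Latency: nearest-rank 95th percentile of request durations.
    durations = sorted(r.duration_ms for r in requests)
    p95_index = math.ceil(0.95 * len(durations)) - 1

    # Errors: share of requests that ended in a server-side (5xx) status.
    errors = sum(1 for r in requests if r.status >= 500)

    return {
        "latency_p95_ms": durations[p95_index],
        # Traffic: requests per second over the window.
        "traffic_rps": len(requests) / window_s,
        "error_rate": errors / len(requests),
        # Saturation: fraction of the (assumed) CPU capacity in use.
        "saturation": cpu_used / cpu_capacity,
    }
```

In practice these values would come from a metrics system rather than an in-memory list; the sketch only shows how each signal maps to a simple calculation.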

Traditional vs Modern Incident Response

Feb 24, 2022 By Kristijan Mitevski In Squadcast

An incident is an event (network outage, system failure, data breach, etc.) that can lead to loss of, or disruption to, an organization's operations, services or functions. Incident response is an organization’s effort to detect, analyze and correct the hazards caused by an incident. In most cases, the term incident response refers to security incidents. Sometimes incident response and incident management are used more or less interchangeably.
