
Latest News

What SREs Can Learn from Capt. Sully: When to Follow Playbooks

When are you smarter than your playbooks, and when are your playbooks smarter than you? That’s a question that engineers rarely step back to consider. The rational, disciplined parts of our minds tell us that the playbooks we are supposed to follow were carefully designed and tested, and that we should stick to them at all costs.

Incident Response Lifecycle | A Complete Explanation

Wondering about the incident response lifecycle? We explain what it is, and how each phase helps lead to effective incident resolution. What is the incident response lifecycle? The incident response lifecycle is an organization’s framework for responding to an incident that disrupts service. The incident response lifecycle contains the following phases.

Golden Signals - Monitoring from first principles

Building a successful monitoring process for your application is essential for high availability. In the first of this three-part blog series, Safeer discusses the four key SRE Golden Signals for metrics-driven measurement and the role they play in the overall context of monitoring. Monitoring is the cornerstone of operating any software system or application effectively. The more visibility you have into your software and hardware systems, the better you can serve your customers. It tells you whether you are on the right track and, if not, by how much you are missing the mark.

Kubernetes Health Check Using Probes

Kubernetes is an open source container orchestration platform that significantly simplifies creating and managing applications. Distributed systems like Kubernetes can be hard to manage, as they involve many moving parts, all of which must work for the system to function. When even a small part breaks, the failure needs to be detected, routed and fixed, and these actions need to be automated. Kubernetes allows us to do that with the help of readiness and liveness probes.
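As a rough illustration of what an application might expose for HTTP-based probes, here is a minimal Python sketch. The paths `/healthz` and `/ready`, the `AppState` class, and the `probe_status` helper are assumptions made for this example, not names taken from the article or required by Kubernetes. An HTTP probe counts as successful when the response status is at least 200 and below 400.

```python
from http.server import BaseHTTPRequestHandler


class AppState:
    """Hypothetical application state shared with the probe handler."""

    def __init__(self, ready: bool = False) -> None:
        # True once startup work (cache warm-up, DB connection) has finished.
        self.ready = ready


def probe_status(path: str, state: AppState) -> int:
    """Return the HTTP status code a probe request to `path` should receive."""
    if path == "/healthz":
        # Liveness: the process is up and able to answer requests.
        return 200
    if path == "/ready":
        # Readiness: only accept traffic once initialization is complete.
        return 200 if state.ready else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    """Answers GET requests on the assumed probe paths."""

    state = AppState()

    def do_GET(self) -> None:
        self.send_response(probe_status(self.path, self.state))
        self.end_headers()
```

To try it locally, the handler can be served with `http.server.HTTPServer(("", 8080), ProbeHandler).serve_forever()`. A failing liveness probe leads Kubernetes to restart the container, while a failing readiness probe only removes the pod from service endpoints until it recovers.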

Postmortems Now Called Retrospectives in Blameless

Something big happened at Blameless this month: our “Postmortem” feature was renamed “Retrospective”. To the naysayers, I suppose you’re thinking, “This seems trivial. Different teams call it different names anyway, so why bother making the change?” First, let me say thank you for reading our blog, and I hope you read this one through to the end. Now, allow me to explain our reasoning and why we’re excited about this update.

Alert Fatigue in SRE: What It Is & How To Avoid It

Wondering about alert fatigue? We describe what it is, how it affects software development teams, and how to avoid it. What is alert fatigue? Alert fatigue is the phenomenon of employees becoming desensitized to alert messages because of the overwhelming volume of alerts and the number of false positives they receive. The risk with alert fatigue is that important information will be overlooked or ignored.

Quickly troubleshoot application errors with Error Reporting

Are you familiar with the four golden signals of Site Reliability Engineering (SRE): latency, traffic, errors, and saturation? Whether you’re a developer or an operator, you’ve likely been responsible for collecting, storing, or analyzing the data associated with these concepts. Much of this data is captured in application and infrastructure logs, which provide a rich history of what is happening behind the scenes in your workloads.
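To make the four signals concrete, here is a minimal Python sketch that derives them from a window of request records. The `Request` type, the `golden_signals` function, and the use of CPU usage as the saturation measure are assumptions chosen for illustration. Real systems usually compute these in a metrics backend, not in application code.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    ok: bool            # False if the request failed


def golden_signals(
    requests: List[Request],
    window_seconds: float,
    cpu_used: float,
    cpu_capacity: float,
) -> Dict[str, float]:
    """Summarize latency, traffic, errors, and saturation for one window."""
    if not requests:
        raise ValueError("need at least one request in the window")

    # Latency: 95th percentile using the nearest-rank method.
    durations = sorted(r.duration_ms for r in requests)
    rank = max(1, math.ceil(0.95 * len(durations)))
    p95 = durations[rank - 1]

    failed = sum(1 for r in requests if not r.ok)

    return {
        "latency_p95_ms": p95,
        "traffic_rps": len(requests) / window_seconds,  # requests per second
        "error_rate": failed / len(requests),           # fraction that failed
        "saturation": cpu_used / cpu_capacity,          # assumed proxy: CPU
    }
```

For example, `golden_signals([Request(120.0, True), Request(480.0, False)], window_seconds=1.0, cpu_used=1.0, cpu_capacity=4.0)` reports two requests per second with half of them failing.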

Traditional vs Modern Incident Response

An incident is an event (network outage, system failure, data breach, etc.) that can lead to loss of, or disruption to, an organization's operations, services or functions. Incident response is an organization’s effort to detect, analyze and correct the hazards caused by an incident. In most cases, the term incident response refers to security incidents. Incident response and incident management are sometimes used more or less interchangeably.