SRE

The latest news and information on Site Reliability Engineering and related technologies.

Analytics in Squadcast | Incident Management | On-call | SRE | Squadcast

Analyzing incident data plays a key role in doing SRE well. Squadcast's Analytics Dashboard helps you analyze the performance of your organization or team over a given time period. It also gives you more insight into past outages that affected your systems.
Sponsored Post

Classifying Severity Levels for Your Organization

Major outages are bound to occur in even the most well-maintained infrastructure and systems. Being able to quickly classify an incident's severity level allows your on-call team to respond more effectively. Imagine a scenario where your on-call team is getting critical alerts every 15 minutes, user complaints are piling up on social media, and, since your platform is inoperative, revenue losses are mounting every minute. How do you go about getting your application back on track? This is where understanding incident severity and priority can be invaluable. In this blog, we look at severity levels and how they can improve your incident response process.

Site Reliability Engineering (SRE) explained

Google has introduced so many innovations that it’d be impossible to list them all. And we’re not just talking about the obvious things like search engine algorithms or nearly ubiquitous programs and apps (Google Maps, Docs, Gmail), or even self-driving cars. Today, we’re going to talk about one such innovation: Site Reliability Engineering. In a nutshell, SRE is a practical framework for software development that improves on even giants like DevOps. Wait, what?

Managing the Looker ecosystem at scale with SRE and DevOps practices

Many organizations struggle to create data-driven cultures where each employee is empowered to make decisions based on data. This is especially true for enterprises with a variety of systems and tools in use across different teams. If you are a leader, manager, or executive focused on how your team can leverage Google's SRE practices or wider DevOps practices, you are definitely in the right place!

Setting up Runbooks in Squadcast | SRE Best Practices | Squadcast

A Runbook is a compilation of routine procedures and operations that are documented for reference while working on a critical incident. It is sometimes also referred to as a Playbook. In this video, learn how to create, attach, and reference Runbooks, and how to mark progress toward incident resolution with them.

Introduction to reliability management

Ensuring your digital customer experiences are exceptional is a goal of any modern business. However, managing the reliability of ever more complex applications is a challenge. Developers are releasing new capabilities in fast-moving sprints and the business wants maximum velocity with minimal risk. SRE teams create a structure of continuous improvement that focuses on ensuring the application is reliable above all else.

Introducing Our Newest Integration with ServiceNow

Blameless just released a new integration with ServiceNow’s incident management ticketing solution. If you are a modern DevOps team moving towards SRE practices and you want to shorten the time to incident resolution through streamlined, automated workflows, this is worth investigating.

How Retrospective Data Enhances Reliability Insights

When things go wrong, we try to learn for the next time. Every incident should be a learning opportunity to make your system more reliable for the future. Luckily, with Blameless Reliability Insights, you can see patterns in incidents at a glance, right out of the box. In fact, the ability to tag incidents makes reliability data even more helpful by allowing you to collect granular details about reliability, especially as they pertain to your unique business needs.

Top 12 Site Reliability Engineering (SRE) Tools

Ben Treynor Sloss, then VP of Engineering at Google, coined the term “Site Reliability Engineering” in 2003. Site Reliability Engineering, or SRE, aims to build and run scalable and highly available systems. The philosophy behind Site Reliability Engineering is that developers should treat errors as opportunities to learn and improve. SRE teams constantly experiment and try new things to enhance their support systems.

Why SREs Need to Embrace Chaos Engineering

Reliability and chaos might seem like opposite ideas. But, as Netflix learned in 2010, introducing a bit of chaos—and carefully measuring the results of that chaos—can be a great recipe for reliability. Although most software is created in a tightly controlled environment and carefully tested before release, the production environment is harsher and much less controlled.