SRE

The latest news and information on Site Reliability Engineering and related technologies.

10 Reasons You Need A Service Level Agreement & Why It's Important

May 26, 2022 By Mbaoma Mary In Reliably

A Service Level Agreement (SLA) consists of many service commitments. It is an essential part of a contract between two or more parties to outsource software development or software support, specifying the duties, quality, and type of service a company will provide to a customer for a fee.

5 Key Requirements of Modern Enterprise Monitoring & Observability Platforms

May 25, 2022 By Heather Miller In Circonus

Monitoring is an essential function of enterprise SRE teams and a critical component of business service deliverability. Its importance has only grown as enterprise environments and technologies continue to evolve at a rapid pace. Unfortunately, traditional monitoring is no longer enough.

SRE: From Theory to Practice | What's difficult about incident command

May 24, 2022 By Emily Arnott In Blameless

A few weeks ago we released episode two of our ongoing webinar series, SRE: From Theory to Practice. In this series, we break down a challenge facing SREs through an open and honest discussion. Our topic this episode was “what’s difficult about incident command?” When things go wrong, who is in charge? And what does it feel like to do that role?

Shift Left Reliability meetup - May | Fifteen minutes or bust

May 20, 2022 By Reliably In Reliably

There is a yawning gap opening up between the best and the rest — the elite top few percent of engineering teams are making incredible gains year on year in velocity, reliability and human compatibility, whilst the bottom 50% are actually losing ground. The loss has nothing to do with engineering ability. Take an engineer out of an elite-performing team and place them in the bottom 50%, and they become subpar too; take an engineer out of a mediocre team and embed them in an elite team, and they are pulling their weight within the year.

Severity vs. Priority | Understanding the Differences

May 19, 2022 By Myra Nizami In Blameless

Wondering about severity vs. priority? We explain severity and priority and discuss their differences and their impact on the incident management process.

Is It Really An Incident?

May 18, 2022 By Kurt Andersen In Blameless

At first glance, people tend to think that incidents are cut-and-dried, relatively objective occurrences. But if you look closely, incidents are highly varied, often require unique handling, and often defy clear answers to something as seemingly simple as knowing when they even start.

A Chat with Lex Neva of SRE Weekly

May 17, 2022 By Emily Arnott In Blameless

Since 2015, Lex Neva has been publishing SRE Weekly. If you’re interested enough in reading about SRE to have found this post, you’re probably familiar with it. If not, there are a lot of great articles to catch up on! Lex selects around 10 entries from across the internet for each issue, focusing on everything from SRE best practices to the socio- side of systems to major outages in the news. I had always figured Lex must be among the most well-read people in SRE, and likely #1.

The Journey Of Building Reliability And Scaling Your Systems

May 14, 2022 By Stoyan Yanev In Reliably

Starting small and scaling your systems to serve billions of requests per month is never an easy path, so how do you build an infrastructure whilst making the right decisions and compromises for your services? Choosing the right technology stack and ensuring your CI/CD pipeline is reliable are two key steps towards this, both of which we will explore.

What Does It Mean To Build Resilient Service Applications?

May 14, 2022 By Yan Cui In Reliably

Resilience is the capacity to recover quickly from difficulties. It is not about preventing failures, but about being able to recover from them quickly. As Amazon’s CTO Werner Vogels famously said, ‘everything fails all the time’. It’s a fact of life that failures will inevitably happen, but what we can do is build applications that can withstand different kinds of failures. For example, in a data center, hardware is going to fail all the time.

What SREs Can Learn from the Atlassian Nightmare Outage of 2022

May 13, 2022 By Weihan Li In Rootly

What happens when the tools and services you depend on to drive Site Reliability Engineering turn out to be susceptible to reliability failures of their own? That’s the question that teams at about 400 businesses have presumably had to ask themselves this month in the wake of a major outage in Atlassian Cloud.
