The latest news and information on Site Reliability Engineering and related technologies.

What SREs Can Learn from the Atlassian Nightmare Outage of 2022

May 13, 2022 By Weihan Li In Rootly

What happens when the tools and services you depend on to drive Site Reliability Engineering turn out to be susceptible to reliability failures of their own? That’s the question that teams at about 400 businesses have presumably had to ask themselves this month in the wake of a major outage in Atlassian Cloud.

How Sumo SREs manage and monitor SLOs as Code with OpenSLO

May 10, 2022 By Drew Horn In Sumo Logic

At Nobl9’s annual SLOconf—the first conference dedicated to helping SREs quantify the reliability of their applications through service level objectives (SLOs)—Sumo Logic shared our contribution of slogen to the OpenSLO community, as well as our commitment to OpenSLO as an emerging standard for expressing SLOs as Code. slogen is an open source, SLO-as-code CLI tool based on the OpenSLO specification.

[Webinar] Unlock self-service infrastructure monitoring with the Sensu Integration Catalog

May 5, 2022 By Sensu In Sensu

Introducing the Sensu Integration Catalog, a marketplace-like UX that simplifies new-user onboarding and deploys production-ready monitoring in a matter of minutes. The Sensu Integration Catalog is also an open marketplace that new and existing users can contribute to by sharing Sensu configurations. Backed by an industry-leading monitoring-as-code solution, Sensu provides new users with a point-and-click interface to get started quickly, while facilitating DevOps and SRE automation best practices.

Are your SLOs realistic? How to analyze your risks like an SRE

May 4, 2022 By Ayelet Sachto In Google Operations

Setting up Service Level Objectives (SLOs) is one of the foundational tasks of Site Reliability Engineering (SRE) practices, giving the SRE team a target against which to evaluate whether or not a service is running reliably enough. The inverse of your SLO is your error budget — how much unreliability you are willing to tolerate.
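
As a rough illustration of that relationship, the error budget is simply the complement of the SLO target. The sketch below is not taken from the linked post; the function names and the 30-day window are assumptions chosen for the example.

```python
def error_budget(slo_target: float) -> float:
    """Return the tolerated failure fraction for an SLO target such as 0.999.

    Hypothetical helper for illustration: the budget is 1 minus the target.
    """
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in the range (0, 1]")
    return 1.0 - slo_target


def allowed_downtime_minutes(slo_target: float, window_days: int = 30) -> float:
    """Convert the error budget into minutes of downtime over a window.

    The 30-day default window is an assumption; use whatever window
    the SLO is actually measured over.
    """
    minutes_in_window = window_days * 24 * 60
    return error_budget(slo_target) * minutes_in_window


if __name__ == "__main__":
    # A 99.9% availability SLO over 30 days leaves roughly 43 minutes of budget.
    print(f"{allowed_downtime_minutes(0.999):.1f} minutes")
```

Under these assumptions, raising the target from 99.9% to 99.99% cuts the budget by a factor of ten, which is one way to check whether a proposed SLO is realistic given how long incidents typically last.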

SRE vs DevOps: What's The Difference?

Apr 22, 2022 By Stephen Watts In Splunk

Whether you’ve heard of or fully jumped on the DevOps or SRE bandwagon, you may have also wondered how the two relate. What’s the difference? Are they really just different ways of looking at the same problem? The term DevOps hit the market first, but SRE wasn’t too far behind. And though they have different origin stories, they both focus on autonomy, automation, and iteration. So why do these paradigms exist? And why do we need both? Let’s look at this further.

Site Reliability Chats (Apr 20, 2022)

Apr 20, 2022 By Gremlin In Gremlin

In this episode, Julie and Jason share updates on the Atlassian outage, a new incident at Cerner, and problems at the IRS. They also cover post-incident investigations from Cloudflare and Datadog.

Site Reliability Chats (Apr 13, 2022)

Apr 13, 2022 By Gremlin In Gremlin

In this episode, Julie and Jason cover recent outages of the Dutch NS trains, American Express, and the ongoing, long-running incident at Atlassian. In positive news, they cover the acquisitions of Puppet by Perforce and ChaosNative by Harness, and Grafana Labs' Series D funding.

The Pros and Cons of Embedded SREs

Apr 12, 2022 By Quentin Rousseau In Rootly

To embed or not to embed: That is the question. At least, that’s one of the questions that companies have to answer as they decide how to implement Site Reliability Engineering. They can either embed SREs into existing teams, or they can build a new, separate SRE team. Both approaches have their pros and cons. The right strategy for your company or team depends, of course, on your needs and priorities.

Freshdesk + Squadcast: Enabling Streamlined Incident Response for Enterprises

Apr 5, 2022 By Nir Sharma In Squadcast

Freshdesk is a cloud-based customer service platform that gives enterprises a centralized, ticket-based help desk across multiple channels, including email, phone, chat, and social media. Squadcast is an incident management platform that integrates with major monitoring, ChatOps, and project management tools to provide a centralized place for reliability.

Site Reliability Engineering: An Imperative in Today's Enterprise IT

Apr 5, 2022 By Pepperdata In Pepperdata

Site reliability engineering (SRE) is fast becoming an essential aspect of modern IT operations, particularly in highly scaled, big data environments. As businesses and industries shift to digital and embrace new IT infrastructures and technologies to remain operational and competitive, IT teams increasingly need a new approach to balancing the launch of new systems and features against ensuring that they are intuitive, reliable, and friendly for end users.
