SRE

The latest news and information on Site Reliability Engineering and related technologies.

How Netflix Uses Fault Injection To Truly Understand Their Resilience

Apr 6, 2021 By Thomas Russell In Coralogix

Distributed systems such as microservices have defined software engineering over the last decade. Most of the advances have been in resilience, flexibility, and speed of deployment at ever larger scales. For streaming giant Netflix, the migration to a complex cloud-based microservices architecture would not have been possible without a revolutionary testing method known as fault injection. With tools like Chaos Monkey, Netflix employs a cutting-edge testing toolkit.
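
As a rough illustration of the idea behind fault injection, the sketch below wraps a call to a dependency so that a configurable fraction of calls fail on purpose, then checks that the caller still returns something useful. This is not Netflix's tooling or Chaos Monkey's API; every name here (`with_fault_injection`, `fetch_recommendations`, the fallback titles) is hypothetical and chosen only for the example.

```python
import random


class InjectedFault(Exception):
    """Raised in place of a real dependency failure."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls raise InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def fetch_recommendations(user_id):
    # Hypothetical downstream service call, stubbed out for the example.
    return [f"title-{user_id}-{i}" for i in range(3)]


def recommendations_with_fallback(fetch, user_id):
    # The behavior under test: degrade to a static list instead of failing.
    try:
        return fetch(user_id)
    except InjectedFault:
        return ["popular-1", "popular-2"]


# Inject failures into about 30% of calls and confirm every call still answers.
rng = random.Random(42)
flaky_fetch = with_fault_injection(fetch_recommendations, 0.3, rng)
results = [recommendations_with_fallback(flaky_fetch, 7) for _ in range(100)]
fallbacks = sum(1 for r in results if r[0].startswith("popular"))
print(f"{fallbacks} of {len(results)} calls were served by the fallback")
```

The point of the experiment is the assertion you make afterwards: every call returned a non-empty result even though some dependency calls were forced to fail.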

So you Want an SRE Tool. Do you Build, Buy, or Open Source?

Apr 5, 2021 By Emily Arnott In Blameless

As your organization’s reliability needs grow, you may consider investing in SRE tools. Tooling can make many processes more efficient, consistent, and repeatable. When you decide to invest in tooling, one of the major decisions is how you’ll source your tools. Will you buy an out-of-the-box tool, build one in-house, or work with an open source project? This is a big decision. Switching methods halfway through adoption is costly and can cause thrash.

Product Update: Upgrade to Exporting your Retrospectives

Apr 2, 2021 By Blameless Community In Blameless

Blameless is excited to announce an enhancement to our Incident Retrospective tool! The Export feature now allows for customizable retrospectives.

How SREs Can React to COVID-19's Impact on Incident Management

Apr 2, 2021 By Quentin Rousseau In Rootly

By adding new complexity to reliability engineering and making physical war rooms a thing of the past, COVID-19 has imposed permanent changes on incident management. Here’s how SREs can respond.

How to Analyze Incidents Better with the Right Metrics

Mar 30, 2021 By Emily Arnott In Blameless

An important SRE best practice is analyzing and learning from incidents. When an incident occurs, you shouldn’t think of it as a setback, but as an opportunity to grow. Good incident analysis involves building an incident retrospective. This document will contain everything from incident metrics to the narrative of those involved. These metrics aren’t the whole story, but they can help teams make data-driven decisions. But choosing which metrics are best to analyze can be difficult.
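
As one example of an incident metric, the sketch below computes a mean time to resolve from detection and resolution timestamps. The teaser does not say which metrics the post recommends, so this metric is an assumption picked for illustration, and the incident timestamps are made up.

```python
from datetime import datetime, timedelta

# Hypothetical incidents as (detected_at, resolved_at) pairs.
incidents = [
    (datetime(2021, 3, 1, 9, 0), datetime(2021, 3, 1, 9, 30)),     # 30 minutes
    (datetime(2021, 3, 8, 14, 0), datetime(2021, 3, 8, 15, 30)),   # 90 minutes
    (datetime(2021, 3, 20, 22, 15), datetime(2021, 3, 20, 23, 15)),  # 60 minutes
]


def mean_time_to_resolve(incidents):
    """Average duration between detection and resolution."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


print(mean_time_to_resolve(incidents))
```

A single average like this hides a lot, which is why it belongs next to the narrative in a retrospective, not in place of it.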

Coffee Break Webinar Series: Intelligent Observability for SRE

Mar 30, 2021 By David Conner In Moogsoft

A selection of live questions and answers from the audience of our recent webinar on how site reliability engineers can best leverage intelligent observability to monitor SLIs and SLOs, prioritize reliability over functionality, and more.

SRE Thought Leader Panel: SRE Adoption as Organizational Transformation

Mar 25, 2021 By Blameless In Blameless

SRE adoption can be difficult. It’s more than just new tooling; it requires a change of process and mindset as well. So how can we go about convincing our organizations that SRE is worthwhile? How can we drive this change? Learn from experts who have done this in our latest SRE Thought Leader Panel, “SRE Adoption as Organizational Transformation.” Panelists include: Kurt Andersen, SRE Architect at Blameless; Vanessa Yiu, Executive Director, Enterprise Architecture at Goldman Sachs; Tony Hansmann, Former Global CTO at Pivotal Software, Inc.; and Chris Hendrix (Host), Staff Software Engineer at Blameless.

SREview Issue #11 March 2021

Mar 23, 2021 By Blameless Community In Blameless

Is it spring yet? Or spring still? Time sure is strange nowadays. At least we have a ton to look forward to in the next few weeks! Here are some of the most exciting Tweets, content, and events happening in the SRE and resilience engineering community this month.

A Day in the Life: Intelligent Observability at Work with a Super SRE

Mar 23, 2021 By Helen Beal In Moogsoft

After we’d fixed Aparna’s network issue, James came to see me at my desk. Masks on, socially distanced and all that, but it was nice to have some face-to-face time. James is cool – that dry British humor and not your classic IT Ops dude. He’s been here forever and mentored me when the CIO, Charlie, hired me as the first SRE here a year or so ago. I lucked out really.

How to Scale for Reliability and Trust

Mar 22, 2021 By Blameless Community In Blameless

As more people depend on your product, reliability expectations tend to grow. For a service to continue succeeding, it has to be one customers can rely upon. At the same time, as you bring on more customers, the technical demands put on your service increase as well. Dealing with both the increased expectations and challenges of reliability as you scale is difficult. You’ll need to maintain your development velocity and build customer trust through transparency.
