SRE

The latest news and information on Site Reliability Engineering and related technologies.

Most frequently asked questions surrounding Google's Cloud Operations Sandbox

Jul 29, 2021 By Nir Sharma In Squadcast

Cloud Operations Sandbox serves as a simulation tool for budding SREs to learn best practices from Google and apply them to real cloud services. In this blog, we have compiled a list of FAQs surrounding the use of Google's Cloud Operations Sandbox. The Google SRE sandbox provides an easy way to get started with the core skills you need to become an SRE.

What are the Four Golden Signals?

Jul 29, 2021 By Blameless In Blameless

SRE’s Golden Signals are four key metrics used to monitor the health of your service and underlying systems. We will explain what they are, and how they can help you improve service performance.
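The teaser above does not name the four signals. As they are commonly defined in Google's SRE book, they are latency, traffic, errors, and saturation. The following is a minimal sketch of how they might be computed over one monitoring window, not the post's own method. The `Request` type, the `golden_signals` function, and the choice of CPU utilization as the saturation measure are hypothetical names and assumptions introduced for illustration.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Request:
    """One served request (hypothetical record shape, for illustration)."""
    duration_ms: float  # time taken to serve the request
    status: int         # HTTP status code returned


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    if not values:
        raise ValueError("percentile of empty sequence")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(
    requests: Sequence[Request],
    window_seconds: float,
    cpu_utilization: float,
) -> dict:
    """Summarize the four golden signals for one monitoring window.

    Assumptions: errors are HTTP 5xx responses, latency is reported as
    the 95th percentile, and saturation is approximated by a CPU
    utilization figure (0.0 to 1.0) supplied by the caller.
    """
    total = len(requests)
    failed = sum(1 for r in requests if r.status >= 500)
    latency: Optional[float] = (
        percentile([r.duration_ms for r in requests], 95) if total else None
    )
    return {
        "latency_p95_ms": latency,                        # latency
        "traffic_rps": total / window_seconds,            # traffic
        "error_rate": failed / total if total else 0.0,   # errors
        "saturation": cpu_utilization,                    # saturation
    }
```

In practice these values usually come from a metrics system rather than raw request lists, and the right saturation measure depends on which resource constrains the service.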

Reliability Matters. Blameless is Growing with Series B $30M Funding

Jul 27, 2021 By Lyon Wong In Blameless

When Blameless started in 2018, the team set out on a mission to help all engineers achieve reliability with less toil and risk. Three years in, that mission has become more important than ever. What has changed is the rate of SRE adoption: SRE is now the fastest-growing team and practice inside engineering. This represents a clear recognition of the many upsides that an SRE practice brings with its combination of continuous learning, velocity, and resilience.

How to Notify Your Team of Errors: Email vs. Slack vs. PagerDuty

Jul 26, 2021 By LogDNA In Mezmo

Site Reliability Engineering (SRE) and Operations (Ops) teams heavily rely on notifications. We use them to know what’s going on with application workloads and how applications are performing. Notifications are critical to ensuring SREs and Ops teams can resolve errors and reduce downtime. They’re also crucial when monitoring environments — not only when running in production but also during the dev-test or staging phase.

What's the Difference between Observability and Monitoring?

Jul 21, 2021 By Blameless In Blameless

Wondering what the difference is between observability and monitoring? In this post, we explain how they are related, why they are important, and some suggested tools that can help. The difference between observability and monitoring is that observability is the ability to understand a system’s state from its outputs, often referred to as understanding the “unknown unknowns”.

When You Do DevSecOps, Don't Forget the SREs

Jul 21, 2021 By Quentin Rousseau In Rootly

It's time to break down the silos separating SREs from security engineers.

SRE's Guide to Chaos & Observability

Jul 20, 2021 By Gremlin In Gremlin

Today’s distributed, cloud-based environments are incredibly complex. Not only does each component depend on many others, but modern systems are also highly dynamic—changing frequently as teams push new code or make updates to infrastructure. Taming this complexity to ensure reliability requires end-to-end observability to understand how components depend on each other. Additionally, proactive Chaos Engineering combined with AI-driven observability lets you uncover “unknown unknowns” that impact how your system will respond to different failure scenarios.

Upcoming trends in DevOps and SRE

Jul 15, 2021 By Biju Chacko In Squadcast

DevOps and SRE are domains with rapid growth and frequent innovation. With this blog you can explore the latest trends in DevOps and SRE and stay ahead of the curve. The past decade has seen widespread adoption of DevOps methodologies in software development. Unsurprisingly, as the needs of users change, DevOps techniques have evolved as well. In this blog we will look at the trends that are most likely to have a significant impact in the coming years.

De-Siloing Incident Management: How to Make Reliability Engineering Everyone's Job

Jul 15, 2021 By JJ Tang In Rootly

Four best practices for breaking down silos and establishing a culture of shared responsibility toward reliability.

Pragmatic Incident Response: 3 Lessons Learned from Failures

Jul 15, 2021 By Robert Ross In FireHydrant

In my experience as an SRE, I've learned some valuable lessons about how to respond to and learn from incidents. Declare and run retros for the small incidents. It's less stressful, and action items become much more actionable. Decrease the time it takes to analyze an incident. You'll remember more, and will learn more from the incident. Alert on pain felt by people, not computers. The only reason we declare incidents at all is because of the people on the other side of them.
