SRE

The latest News and Information on Service Reliability Engineering and related technologies.

Top 13 Site Reliability Engineer (SRE) Tools

The role and responsibilities of a site reliability engineer (SRE) may vary depending on the size of the organization. For the most part, a site reliability engineer handles multiple tasks and projects at once, so the tools most SREs use reflect their ever-evolving responsibilities. A typical SRE is busy automating, cleaning up code, upgrading servers, and continually monitoring dashboards for performance, so the toolbelt keeps growing.

How Important is SaaS Reliability? 90% of Business Leaders Say "Very Important"

A couple of weeks back, Blameless attended SaaStr 2021, the go-to event for business Go-to-Market (GTM) teams, which has been running since 2012. Our decision to sponsor was made in early 2020. Back then, we had no idea how long the pandemic would last or that it would be a full 18 months before we’d be able to do a physical event.

Site Reliability Engineering: Top SRE Tools As Voted On By SREs

Catchpoint is proud to present the top SRE tools as voted on by SREs. In our fourth annual SRE Survey, compiled in partnership with VMware Tanzu Observability and DevOps Institute, we simply asked, “What are a few tools that every SRE should have available in their toolbelt?” Today, we are excited to share the findings with you. While some of the answers were not strictly tools, the analysis gives us valuable insight into the mindset of an SRE.

What SREs Can Learn from Facebook's Largest Outage

Facebook’s October 2021 outage was the type of event that gives SREs nightmares: A series of critical business apps crashed in minutes and remained unavailable for hours, disrupting more than 3.5 billion users around the world and costing about 60 million dollars. As incidents go, this was a pretty big one.

What is a Site Reliability Engineer (SRE)?

A site reliability engineer, or SRE, is a role that encompasses aspects of both software engineering and operations/infrastructure. It also encompasses a strategy and set of practices and principles across service offerings and is closely tied to DevOps and operations. The term site reliability engineering first came into existence at Google in 2003 when a site reliability team was created. At that time, the team was made up of software engineers.

4 xMatters Use Cases That May Surprise You

xMatters is part technology, part service reliability, and a little bit of magic. If you’ve spent time on the xMatters website, you’ll likely have seen a number of valuable use cases for the platform: it can alert SREs when there’s a website outage, accelerate product development for DevOps teams, and manage on-call schedules and alerts for support teams.

Incident Response: A Step-by-Step Guide to Managing Incidents

Looking into Incident Response? We explain incident response, the end-to-end process, the teams involved, and the steps to take to avoid friction and slowdowns. The goal is to manage the incident as efficiently as possible in order to restore the service to its expected operational state.