Operations | Monitoring | ITSM | DevOps | Cloud

SRE

The latest News and Information on Service Reliability Engineering and related technologies.

Getting AWS CloudTrail alerts via SNS Endpoint

Logging and auditing have been an essential part of troubleshooting application and infrastructure performance. You can instantly spot areas of risk to ensure quick correction and prevention of issues. In this blog, we will explore the AWS CloudTrail service and discuss how integrating it with Squadcast can help you route alerts to the right users for quick and efficient incident response. Let's get started.
Sponsored Post

Simplifying SLO and Error Budget tracking for SRE teams

Service level objectives (SLOs), and the subsequent service level indicators (SLIs) are the foundation to establishing a strong SRE culture and how they promote accountability, trust and timely innovation. We are on a mission to simplify SLO and Error Budget tracking and with that aim in mind, we have added the SLO Tracker feature to the Squadcast platform. SLO Tracker seeks to provide a simple and effective way to keep track of your error budget burn rate without the hassle of configuring and aggregating multiple data sources.

5 Tips If You're the 1st SRE Hire by Instacart's First SRE

Site Reliability Engineers (SREs) have a considerable set of tasks to juggle no matter where they work or how long their company has had an SRE practice. But if you’re the very first SRE to join an organization – as many SREs are these days, given that the SRE trend is trickling down into smaller and smaller companies – you face a special group of challenges. You may find it difficult to get buy-in for SRE from other technical teams.

SRE: From Theory to Practice | What's difficult about incident command

A few weeks ago we released episode two of our ongoing webinar series, SRE: From Theory to Practice. In this series, we break down a challenge facing SREs through an open and honest discussion. Our topic this episode was “what’s difficult about incident command?” When things go wrong, who is in charge? And what does it feel like to do that role?

A Chat with Lex Neva of SRE Weekly

Since 2015, Lex Neva has been publishing SRE Weekly. If you’re interested enough in reading about SRE to have found this post, you’re probably familiar with it. If not, there’s a lot of great articles to catch up on! Lex selects around 10 entries from across the internet for each issue, focusing on everything from SRE best practices to the socio- side of systems to major outages in the news. ‍ I had always figured Lex must be among the most well-read people in SRE, and likely #1.

What SREs Can Learn from the Atlassian Nightmare Outage of 2022

What happens when the tools and services you depend on to drive Site Reliability Engineering turn out to be susceptible to reliability failures of their own? That’s the question that teams at about 400 businesses have presumably had to ask themselves this month in the wake of a major outage in Atlassian Cloud.