
SRE

The latest news and information on Site Reliability Engineering (SRE) and related technologies.

SRE Leaders Panel: Testing in Production

Blameless recently had the privilege of hosting some fantastic leaders in the SRE and resilience community for a panel discussion. Our panelists discussed testing in production, how feature flagging and testing make it possible, and how to get managers on board with testing in production. The transcript below has been lightly edited, and if you're interested in watching the full panel, you can do so here.

SRE + Honeycomb: Observability for Service Reliability

As a Customer Advocate, I talk to a lot of prospective Honeycomb users who want to understand how observability fits into their existing Site Reliability Engineering (SRE) practice. While I have enough familiarity with the discipline to get myself into trouble, I wanted to learn more about what SREs do in their day-to-day work so that I'd be better able to help them determine whether Honeycomb is a good fit for their needs.

How to Build Your SRE Team

As you implement SRE practices and culture at your organization, you’ll realize everyone has a part to play. From engineers setting SLOs, to management upholding the virtue of blamelessness, to marketing teams conducting retrospectives on email campaigns, there’s no part of an organization that doesn’t benefit from the SRE mentality.

What is a Kubernetes Operator and Why it Matters for SRE

Kubernetes is an open-source platform that orchestrates containerized workloads and services, managing their deployment and configuration. Released by Google in 2015, Kubernetes is now maintained by the Cloud Native Computing Foundation. Since its release, it has become a worldwide phenomenon: the majority of cloud native companies use it, SaaS vendors offer commercial prebuilt versions, and there's even an annual convention!

8 Tips for SRE Wellness

When planning the SRE from Home virtual event last month, one of the central themes was wellness and the need for self-care for SREs, especially during this time of crisis. Knowing how stressful an SRE's day can be, and combining that with a global pandemic and new working conditions, we knew we needed programming around SRE and IT wellness for SRE from Home. We're all looking for ways to maintain a healthy work-life balance, but hearing about it from peers was especially important.

Choosing the Right SRE Tools

Implementing SRE practices and culture can be challenging. Fortunately, there are a variety of tools for each aspect of SRE: monitoring, SLOs and error budgeting, incident management, incident retrospectives, alerting, chaos engineering, and more. In this blog, we’ll talk about what to look for in an SRE tool, and how they’ll help you on your journey to reliability excellence.

Nishant Singh shares his thoughts on being an SRE

Nishant Singh is an SRE at LinkedIn based in Bangalore. He currently builds and maintains applications that improve the site's overall MTTD (mean time to detect) and MTTR (mean time to recover). He likes to build services and play with the latest technologies. Before LinkedIn, Nishant worked as a DevOps engineer for a few companies in the security and e-commerce domains, where he was primarily responsible for building infrastructure, deployment pipelines, and security.
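For readers less familiar with these two metrics, here is a minimal sketch of how MTTD and MTTR are often computed from incident timestamps. It assumes a hypothetical `Incident` record with `started`, `detected`, and `recovered` fields and measures both metrics from the moment the failure began; teams differ on exact start and end points (some measure MTTR from detection instead), so treat this as one illustrative convention.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record; field names are assumptions for illustration.
    started: datetime    # when the failure actually began
    detected: datetime   # when monitoring or a person noticed it
    recovered: datetime  # when service was restored


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean time to detect: average of (detected - started)."""
    return timedelta(seconds=mean((i.detected - i.started).total_seconds() for i in incidents))


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to recover: average of (recovered - started), under this sketch's convention."""
    return timedelta(seconds=mean((i.recovered - i.started).total_seconds() for i in incidents))


# Two made-up incidents, purely for illustration.
incidents = [
    Incident(datetime(2020, 5, 1, 10, 0), datetime(2020, 5, 1, 10, 5), datetime(2020, 5, 1, 10, 45)),
    Incident(datetime(2020, 5, 3, 14, 0), datetime(2020, 5, 3, 14, 15), datetime(2020, 5, 3, 15, 15)),
]
print(mttd(incidents))  # 0:10:00
print(mttr(incidents))  # 1:00:00
```

Lowering MTTD shrinks the first interval (faster detection), which in turn tends to pull MTTR down, since recovery cannot start before someone knows there is a problem.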

"Things get SREious": SRE from Home Recap

With SREcon not happening this year and the world turned upside down by COVID-19, we set out to hold a virtual event to bring SREs together to share their experiences of what has changed. Last week's SRE from Home was exactly that. With 1,900 registrants, 20 lively Slack channels, six illuminating and entertaining talks from a diverse range of experts in the field, and our #askanSRE panel answering attendees' questions with candid generosity, it was an amazing, jam-packed day.

Evan Niedojadlo from Peddle shares his thoughts on being an SRE

Evan Niedojadlo is an SRE at Peddle based in Austin, TX. He is currently on a small team and works across the SRE, Ops, and Security areas of the organization. In his free time, he enjoys building communities, reading, music, helping others learn, and being outside.