Battle-Tested Reliability Strategies - Incidentally Reliable with Abhishek Ghosh

Aug 16, 2024

We dive into the trenches with Abhishek Ghosh, a veteran who led SRE teams at Pinterest and now leads them at Cribl. He shares gripping war room stories from Pinterest, strategies for maintaining uptime, insights into the role of AI in observability, and more. Discover the future of SRE and learn how to navigate the challenges of digital reliability. Tune in to gain valuable lessons from one of the industry's leading experts.

Exclusively on The Incidentally Reliable podcast, which is made by SREs for SREs and hosted by Zenduty.

https://www.zenduty.com/podcast/
Zenduty is an advanced incident management platform that gives you greater control and automation over the incident management lifecycle.
Learn more at www.zenduty.com

00:00 - His journey from Microsoft to Google, and how it led him to head SRE at Pinterest and now at Cribl

06:12 - Site reliability practices at Pinterest: What does the Pinner uptime metric mean? What is meaningful availability?

10:12 - What does it mean for SREs to be data-driven? What are the north star metrics for SREs?

13:00 - What are the user and business impacts of incidents? How do you measure the cost of an incident?

17:00 - Build vs. Buy

20:10 - Contributing to Open Source

22:30 - Cribl

32:00 - Role of customers in evolving SRE & DevOps systems

35:03 - Balance between quantitative and qualitative data

39:26 - Human vs. AI in Observability

44:42 - Is Alert Fatigue real?

48:15 - Work-Life Balance: The role of a leader?

53:13 - War Room Stories