Battle-Tested Reliability Strategies - Incidentally Reliable with Abhishek Ghosh
We dive into the trenches with Abhishek Ghosh, a veteran who has led SRE teams at Pinterest and now at Cribl. He shares gripping war room stories from Pinterest, strategies for maintaining uptime, insights into the role of AI in observability, and more! Discover the future of SRE and learn how to navigate the challenges of digital reliability. Tune in to gain valuable lessons from one of the industry's leading experts.
Exclusively on The Incidentally Reliable podcast, which is made by SREs for SREs and hosted by Zenduty.
https://www.zenduty.com/podcast/
Zenduty is an advanced incident management platform that gives you greater control and automation over the incident management lifecycle.
Learn more at www.zenduty.com
00:00 - His journey from Microsoft to Google, and how it shaped his path to heading SRE at Pinterest and now at Cribl
06:12 - Site reliability practices at Pinterest - What does the Pinner uptime metric mean? What is meaningful availability?
10:12 - What does it mean for SREs to be data-driven? What are north star metrics for SREs?
13:00 - What are the user and business impacts of incidents? How do you measure the cost of an incident?
17:00 - Build vs. Buy
20:10 - Contributing to Open Source
22:30 - Cribl
32:00 - Role of customers in evolving SRE & DevOps systems
35:03 - Balance between quantitative and qualitative data
39:26 - Human vs. AI in Observability
44:42 - Is Alert Fatigue for real?
48:15 - Work-Life Balance: The role of a leader?
53:13 - War Room Stories