Operations | Monitoring | ITSM | DevOps | Cloud

Code Is Cheap, Reliability Isn't: Owning Production in the AI era w/ Swizec Teller

In this episode, Swizec Teller, author of the bestselling Scaling Fast, makes a bold claim: code is cheap, reliability is not. As AI coding tools accelerate feature development, the real competitive advantage shifts to operating systems reliably in production. We explore the hidden complexity of SRE work, the addictive nature of agentic coding, and why ownership — not automation — remains at the core of modern software engineering.

Democratizing Reliability: Giving Non-Engineers Real Operational Power with Dileshni Jayasinghe

Many companies don’t invest in incident management until something goes wrong. commonsku took a different path. In this episode of Humans of Reliability, Sylvain sits down with Dileshni Jayasingha, VP of Technology at commonsku, to talk about what it really takes to introduce incident management in a mature, profitable SaaS that had never formalized it. From rolling out observability and incident tooling to practicing internal status updates before going public, Dileshni shares how her team built the right muscles before they were forced to.

99%+ Accuracy on a Moving Target: Model Deprecation and Reliability with Not Diamond

Shipping systems powered by LLMs would be hard enough if the models stayed the same. But in reality, they don’t. Models get updated and deprecated at a pace traditional software wouldn’t. All while teams are still expected to hit reliability targets that look a lot like traditional SLAs.

It's Never Different This Time: LLM Reliability Without the Hype with Julien Simon

In this episode, Julien Simon, longtime voice in the open-source ML world, reminds us that even in the era of GenAI, reliability fundamentals haven’t changed. Julien breaks down why calling “the same model” from different providers can produce wildly different results, how deployment choices introduce hidden variability, and why reliability teams need to think of LLM systems as distributed systems.

How Rootly works with Slack | An end-to-end demo.

Rootly is the AI-native on-call and incident management platform that helps you resolve incidents faster, improve system resilience, and streamline on-call operations. It’s your always-on SRE copilot that automates root cause analysis and identifies patterns that drive continuous improvement—trusted by thousands of companies like LinkedIn, NVIDIA, Replit, Elastic, Canva, Clay, Tripadvisor, and Grammarly.