Untested AI is unshippable AI, with Laurie Voss of Arize

Jun 18, 2026

Most AI applications in production right now were shipped on vibes. A developer ran a few queries, liked what they saw, and pushed to prod. Laurie Voss argues that's the core reason so many AI products feel broken, and the fix is simpler to name than it is to do: write the tests.

Laurie is Head of Developer Relations at Arize AI. He's spent his career watching developer behavior at scale, and he's now building the case for rigorous eval practices in AI engineering. In this episode, he argues that evals aren't a specialized ML concept but a natural extension of the software testing discipline that most engineers already understand.

Rob and Laurie dig into what makes evals hard (non-determinism, the need for LLM-as-judge), how to keep costs manageable without sacrificing coverage, and the emerging pattern of capability evals that drive software improvement without any human in the loop. They also get into context engineering, context graphs, and whether software development is a narrow enough domain for agents to one-shot it anytime soon.

Topics covered:

  • Why "evals are just tests" is the reframe that unlocks adoption
  • LLM-as-judge: how it works and how to tune it
  • Regression evals vs. capability evals, and why the distinction matters
  • How to run evals cost-effectively without going to Opus for everything
  • Context graphs as a compression strategy for large domains
  • Whether non-technical builders can ship reliable software with agents today

Subscribe wherever you get your podcasts!