Untested AI is unshippable AI, with Laurie Voss of Arize

CircleCI

Jun 18, 2026

Most AI applications in production right now were shipped on vibes. A developer ran a few queries, liked what they saw, and pushed to prod. Laurie Voss argues that's the core reason so many AI products feel broken, and the fix is simpler to name than it is to do: write the tests.

Laurie is Head of Developer Relations at Arize AI. He's spent his career watching developer behavior at scale, and he's now building the case for rigorous eval practices in AI engineering. In this episode, he contends that evals aren't a specialized ML concept but a natural extension of the software testing discipline that most engineers already understand.

Rob and Laurie dig into what makes evals hard (non-determinism, the need for LLM-as-judge), how to keep costs manageable without sacrificing coverage, and the emerging pattern of capability evals that drive software improvement without any human in the loop. They also get into context engineering, context graphs, and whether software development is a narrow enough domain for agents to one-shot it anytime soon.

Topics covered: