Beyond Transcription: How Synthetic Rooms Are Solving Voice AI's Hardest Open Problem

By Ethan Brooks

Dec 18, 2024

OpsMatters

Since Google researchers introduced the Transformer architecture in the 2017 paper “Attention Is All You Need,” language models have become dramatically better at processing and generating text. After ChatGPT’s public release in late 2022, the ambition around voice systems expanded quickly. Meeting software, clinical scribes, speech-infrastructure companies, and voice-agent startups all began chasing a larger idea: spoken conversation could become structured, searchable, and actionable.

By 2024, voice AI was attracting an influx of money and talent, with ElevenLabs raising an $80 million Series B at a $1.1 billion valuation in January. The next month, Abridge raised $150 million at a reported $850 million valuation to turn clinical conversations into medical notes. Across healthcare, AI scribe companies raised roughly $800 million over the course of the year. Around them, meeting assistants like Granola, infrastructure companies like Deepgram, and voice-agent startups like Vapi and Retell became part of the new vocabulary of AI.

But beneath the funding rounds and product launches was a stubborn technical bottleneck: voice AI could only become useful if it could first answer a simple question: who is speaking?

That task is called diarization: determining “who spoke when.” In a clean recording, where people take turns and speak close to separate microphones, diarization is comparatively easy. In a real conference room, however, people interrupt each other, voices arrive from different distances and angles, and a single microphone has to convert a messy physical scene into an orderly record of speakers and words.
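To make “who spoke when” concrete, here is a minimal sketch of what a diarization result might look like as data. The `Segment` type and the `overlap_regions` helper are hypothetical names chosen for illustration, not part of any particular product or library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One stretch of speech attributed to one speaker (illustrative type)."""
    speaker: str
    start: float  # seconds
    end: float    # seconds


def overlap_regions(segments):
    """Return (start, end, speakers) spans where two or more speakers are active."""
    # Every segment boundary is a point where the set of active speakers can change.
    times = sorted({t for s in segments for t in (s.start, s.end)})
    regions = []
    for a, b in zip(times, times[1:]):
        active = sorted({s.speaker for s in segments if s.start <= a and s.end >= b})
        if len(active) >= 2:
            regions.append((a, b, active))
    return regions


# A toy meeting: speaker B starts talking one second before A finishes.
segments = [
    Segment("A", 0.0, 4.0),
    Segment("B", 3.0, 6.0),
    Segment("C", 6.5, 8.0),
]
print(overlap_regions(segments))
```

The overlapped span between seconds 3 and 4 is exactly the kind of region the article describes as hard: one microphone, two simultaneous voices.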

For Bar Mazuz, that gap became an obsession.

Mazuz, a veteran of Unit 8200, the Israeli military’s elite cyber-intelligence unit, had spent his early career in offensive cyber and vulnerability research. He left the military in 2023, just as large language models were moving from research breakthrough to product platform. After a brief period working on computer vision and AI agents, he found himself drawn toward voice AI. At the time, he was working at a company where meetings took up much of the day, and the company had begun using AI notetakers. What bothered him was what happened after the meeting: the systems could transcribe speech, but when multiple people spoke, they struggled to reliably separate who had said what. As Mazuz put it, “nowadays ChatGPT has its VoiceAI agent... but it sucks at discerning between speakers, especially ones that speak on top of each other.”

The Missing Data

For years, diarization research followed a familiar pattern: build speaker embeddings, then cluster them. Systems based on x-vectors, ECAPA-TDNN-style embeddings, and related methods performed well on benchmarks such as VoxConverse and DIHARD.
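The embed-then-cluster pattern can be sketched in a few lines. This is a toy illustration under loud assumptions: the three-dimensional vectors below stand in for real speaker embeddings (which would come from a trained model and have hundreds of dimensions), and the greedy threshold clustering is a simplified stand-in for the clustering methods production systems use. The function names and the 0.8 threshold are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cluster_embeddings(embeddings, threshold=0.8):
    """Greedy clustering: join the most similar cluster, or start a new one."""
    centroids = []  # running sum of member vectors (direction is what matters)
    labels = []
    for e in embeddings:
        best, best_sim = None, threshold
        for i, c in enumerate(centroids):
            sim = cosine(e, c)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            centroids.append(list(e))
            labels.append(len(centroids) - 1)
        else:
            centroids[best] = [a + b for a, b in zip(centroids[best], e)]
            labels.append(best)
    return labels


# One made-up embedding per short window of audio, in time order.
window_embeddings = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.0],
    [0.0, 1.0, 0.1],
    [0.1, 0.9, 0.0],
    [1.0, 0.0, 0.1],
]
print(cluster_embeddings(window_embeddings))  # cluster index per window
```

Each window gets exactly one label here, which hints at why this family of methods struggles when two people talk at once: an overlapped window yields a blended vector that belongs cleanly to neither cluster.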

However, real rooms exposed their limits. A single microphone has to handle interruptions, overlapping speech, shifting distance, background noise, and reverberation. A person leaning back can begin to sound like a different speaker, and two people talking at once can collapse into an acoustic blur. Newer end-to-end diarization systems pointed toward a more direct solution: learn the structure of the conversation itself. But those systems have to be trained. To separate speakers in chaotic real audio, researchers needed chaotic real audio with precise speaker labels.

But that data did not exist. Real conversations are private, offices are sensitive, and meetings are difficult to label precisely when people talk over one another.

Manufacturing a Room

In early 2024, Mazuz started with audio the internet had in abundance: produced, multi-speaker speech. Podcasts, lectures, audiobooks, interviews, and similar sources contain clean voices where one speaker is usually dominant at a time. The open web is full of this kind of material, but it is far less full of raw conference-room audio with reliable speaker labels. “The further we dove into this, the more it made sense,” said Mazuz. “Real conversations and meetings are private and meetings are difficult to label when people talk over one another. The dataset to train on did not exist. So we made our own.”

The next step was the real invention. Instead of waiting for a perfect diarization dataset to appear, he built a room-acoustics simulator to create one. It was conceptually similar to Pyroomacoustics-style simulation, but pushed into industrial-scale dataset generation.

The simulator generated virtual rooms with different wall, floor, and ceiling materials. It placed speakers and microphones in randomized positions. It modeled acoustic propagation: reverberation, delay, frequency-response coloration, and the way a voice changes as it bounces through physical space before reaching a microphone.
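The randomization described above can be sketched roughly as follows. This is not Mazuz’s simulator, whose internals the article does not describe; it is a simplified illustration that assumes a rectangular room, one absorption coefficient per surface group, Sabine’s formula for reverberation time, and only the direct sound path for delay and attenuation. All function names and parameter ranges are hypothetical.

```python
import math
import random

SPEED_OF_SOUND = 343.0  # metres per second, approximate value at room temperature


def sample_room(rng):
    """Draw random room dimensions (metres) and surface absorption coefficients."""
    dims = (rng.uniform(3.0, 10.0), rng.uniform(3.0, 8.0), rng.uniform(2.4, 4.0))
    absorption = {
        "walls": rng.uniform(0.05, 0.5),
        "floor": rng.uniform(0.05, 0.5),
        "ceiling": rng.uniform(0.05, 0.5),
    }
    return dims, absorption


def sabine_rt60(dims, absorption):
    """Estimate reverberation time with Sabine's formula: RT60 = 0.161 * V / A."""
    lx, ly, lz = dims
    volume = lx * ly * lz
    wall_area = 2.0 * (lx * lz + ly * lz)
    floor_area = lx * ly  # the ceiling has the same area
    total_absorption = (
        wall_area * absorption["walls"]
        + floor_area * (absorption["floor"] + absorption["ceiling"])
    )
    return 0.161 * volume / total_absorption


def random_position(dims, rng, margin=0.3):
    """Place a speaker or microphone at least `margin` metres from every surface."""
    return tuple(rng.uniform(margin, d - margin) for d in dims)


def direct_path(source, mic):
    """Delay (seconds) and inverse-distance gain of the direct sound path."""
    distance = math.dist(source, mic)
    delay_s = distance / SPEED_OF_SOUND
    gain = 1.0 / max(distance, 0.1)  # clamp to avoid blow-up at tiny distances
    return delay_s, gain


rng = random.Random(0)  # fixed seed so a generated scene can be reproduced
dims, absorption = sample_room(rng)
mic = random_position(dims, rng)
speakers = [random_position(dims, rng) for _ in range(3)]
print("RT60 estimate (s):", round(sabine_rt60(dims, absorption), 2))
for pos in speakers:
    print(direct_path(pos, mic))
```

A fuller simulator would also model the reflections themselves, for example with the image-source method, rather than summarizing them in a single reverberation time.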

The output was a single-channel audio stream that sounded like a microphone sitting in a real room. But unlike a real office recording, every synthetic second came with perfect ground truth. Mazuz knew which virtual speaker had produced which sound because the system had generated the scene.
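The reason synthetic audio comes with perfect labels can be shown with a small sketch: because the mixing code itself decides where each speaker’s samples go, it can record the answer as it builds the mixture. The function below is a hypothetical illustration, not the article’s actual pipeline, and it uses short lists of numbers in place of real (already reverberated) speech.

```python
def mix_with_labels(utterances, frame_len):
    """Sum utterances into one channel and record who is active in each frame.

    utterances: list of (speaker, start_sample, samples) tuples.
    frame_len:  label resolution, in samples per frame.
    """
    total = max(start + len(samples) for _, start, samples in utterances)
    mixture = [0.0] * total
    n_frames = -(-total // frame_len)  # ceiling division
    active = [set() for _ in range(n_frames)]
    for speaker, start, samples in utterances:
        for i, value in enumerate(samples):
            mixture[start + i] += value                  # what the microphone hears
            active[(start + i) // frame_len].add(speaker)  # what we know to be true
    return mixture, [sorted(s) for s in active]


# Speaker B starts before speaker A has finished, creating an overlap.
utterances = [
    ("A", 0, [0.5] * 6),
    ("B", 4, [0.25] * 6),
]
mixture, labels = mix_with_labels(utterances, frame_len=2)
print(labels)
```

The model only ever sees `mixture`, while `labels` serves as the training target, including the overlapped frame where both speakers are active.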

Synthetic data gave Mazuz a way to create labeled rooms like these at scale. It turned the room from an obstacle into a training environment: one where distance, overlap, echo, noise, and speaker position could be varied endlessly, while the correct answer remained known. The result was a model trained not just to recognize voices, but to infer the structure of a conversation from the acoustic traces a real microphone would receive.

The promise of voice AI has always been bigger than transcription. It is the possibility that spoken work, from meetings to medical visits to customer calls, can become as usable as written text without losing the human context that gives it meaning. Mazuz’s approach points to a practical path forward: when the real world does not provide the data needed to solve a problem, simulation can create the conditions for progress. If diarization is the layer that turns raw speech into structured conversation, synthetic rooms are a way to train that layer at the scale real meetings cannot provide. By manufacturing rooms, conversations, and acoustic complexity at scale, synthetic data may help voice systems move from hearing words to understanding conversations. For voice AI, that distinction could define the next stage of the market.
