Voice AI is Not About Speech. It is About Latency and Turn-Taking.

What we learned building voice agents that hold 30-minute conversations.


60-Second Summary

Most voice AI demos sound impressive for 60 seconds. Try a 30-minute session. That is where the engineering shows. Voice AI is a latency pipeline: audio in, speech-to-text, LLM reasoning, text-to-speech, audio out. Every component adds delay. If the total exceeds one second, the conversation feels broken. Then add barge-in handling, turn-taking, context management across dozens of turns, and you have a real engineering problem. We build voice agents on LiveKit and this is what we have learned about making them work beyond the demo.

The Real Problem: One Second

Human conversation has a natural turn-taking rhythm. Research shows the average gap between speakers is about 200 milliseconds. People notice latency above 500ms. Above one second, they think the system is broken and start repeating themselves — which triggers a new STT transcription and creates a loop of confusion.

Your total latency budget for a voice agent is under one second. Here is what eats that budget:

| Component | Typical Range | Target | What Determines It |
|---|---|---|---|
| STT (speech-to-text) | 100-400ms | <200ms | Model size, streaming vs batch, provider |
| LLM (first token) | 200-1200ms | <400ms | Model size, prompt length, provider load |
| TTS (first byte) | 100-400ms | <200ms | Voice model, streaming, synthesis quality |
| Network RTT | 20-150ms | <100ms | WebRTC, server location, hops |
| **Total** | **420-2150ms** | **<900ms** | |

The LLM is the biggest bottleneck. A 400ms first-token time with streaming output is workable. An 800ms first-token time is not. This is why model selection for voice agents is different from text agents: you optimize for time-to-first-token, not quality-per-dollar.
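To keep that budget honest in production, we track per-stage timings and check p95 against the targets above. Here is a minimal sketch of such a tracker using only the Python standard library; the `LatencyTracker` name and stage keys are illustrative, and summing per-stage p95s deliberately overstates the true end-to-end p95, which makes the check conservative.

```python
import statistics
from collections import defaultdict

# Per-stage p95 targets in ms, mirroring the table above.
BUDGET_MS = {
    "stt": 200,
    "llm_first_token": 400,
    "tts_first_byte": 200,
    "network_rtt": 100,
}
TOTAL_BUDGET_MS = 900

class LatencyTracker:
    """Collects per-turn stage timings and reports p95 against budget."""

    def __init__(self) -> None:
        self.samples = defaultdict(list)

    def record(self, stage: str, ms: float) -> None:
        self.samples[stage].append(ms)

    def p95(self, stage: str) -> float:
        # quantiles(n=20) yields 19 cut points; index 18 is the 95th
        # percentile. Needs at least two samples per stage.
        return statistics.quantiles(self.samples[stage], n=20)[18]

    def report(self) -> None:
        total = 0.0
        for stage, budget in BUDGET_MS.items():
            p95 = self.p95(stage)
            total += p95
            print(f"{stage:16} p95={p95:6.0f}ms target={budget}ms "
                  f"{'OK' if p95 <= budget else 'OVER'}")
        # Summing per-stage p95s is an upper bound on end-to-end p95,
        # so a pass here is a conservative pass.
        print(f"{'total':16} p95={total:6.0f}ms target={TOTAL_BUDGET_MS}ms "
              f"{'OK' if total <= TOTAL_BUDGET_MS else 'OVER'}")
```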

Architecture: The Voice Pipeline

We build on LiveKit’s agent framework. Here is how the pieces fit together:
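Audio arrives over WebRTC, passes through VAD gating and streaming STT, the transcript goes to the LLM, and the response streams back out through TTS. The sketch below shows that loop in simplified, framework-agnostic form; the `vad`, `stt`, `llm`, and `tts` objects and their methods are hypothetical stand-ins, not LiveKit's actual API. The point it illustrates is real, though: every stage streams, so downstream work starts before upstream work finishes, and that overlap is where most of the latency budget is won back.

```python
async def voice_pipeline(mic, speaker, vad, stt, llm, tts):
    """One conversational turn: mic -> VAD -> STT -> LLM -> TTS -> speaker.

    All objects are hypothetical stand-ins for your provider SDKs.
    """
    # 1. VAD gates the mic: only speech frames reach STT, and the VAD's
    #    end-of-utterance signal tells STT when the user is done.
    speech_frames = vad.gate(mic.frames())  # async iterator of audio frames

    # 2. Streaming STT: the final transcript lands ~100-200ms after the
    #    user stops speaking, because partials were processed all along.
    transcript = await stt.transcribe_stream(speech_frames)

    # 3. Streaming LLM, chunked into sentences so TTS can start on the
    #    first sentence while later ones are still being generated.
    async def sentences():
        buffer = ""
        async for token in llm.stream(transcript):
            buffer += token
            if buffer.rstrip().endswith((".", "!", "?")):
                yield buffer.strip()
                buffer = ""
        if buffer.strip():
            yield buffer.strip()

    # 4. Streaming TTS: play the first audio bytes as soon as they exist.
    async for sentence in sentences():
        async for chunk in tts.synthesize_stream(sentence):
            await speaker.play(chunk)
```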

Barge-In: The Feature Nobody Gets Right the First Time

Barge-in is when the user starts talking while the agent is still speaking. In a human conversation, this is natural — interruptions carry meaning. In a voice AI system, it is a race condition.

When the user interrupts, you need to:

  • detect new speech via VAD while TTS audio is playing,
  • cancel the in-flight TTS stream,
  • flush the audio buffer so the user does not hear stale audio,
  • capture the new user utterance cleanly, and
  • send it through STT without contamination from the agent’s own audio.

Getting this wrong means the agent either ignores interruptions (frustrating) or triggers on its own audio output (catastrophic: it starts talking to itself).

What we initially got wrong: our barge-in threshold was too sensitive. Background noise — a door closing, someone coughing — would cancel the agent mid-sentence. We tuned the VAD energy threshold and added a minimum speech duration (300ms) before triggering barge-in. Too short and noise cancels the agent. Too long and real interruptions feel ignored. 300ms was our sweet spot, but it depends on the environment.
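Here is a sketch of that debounced barge-in logic. The `vad.speech_started()` / `vad.speech_ended()` and `playback` interfaces are hypothetical stand-ins for whatever your VAD and TTS playback layers provide; the 300ms constant is the value we settled on.

```python
import asyncio

MIN_SPEECH_MS = 300  # below this, treat detected speech as noise

async def barge_in_monitor(vad, playback):
    """Cancel agent speech only after sustained user speech."""
    while True:
        await vad.speech_started()   # energy threshold crossed
        if not playback.is_active():
            continue                 # agent is silent: normal turn, not a barge-in

        # Debounce: require MIN_SPEECH_MS of continuous speech before
        # cancelling, so a cough or a closing door does not cut the agent off.
        try:
            await asyncio.wait_for(vad.speech_ended(),
                                   timeout=MIN_SPEECH_MS / 1000)
            continue                 # speech ended too quickly: classify as noise
        except asyncio.TimeoutError:
            pass                     # still talking after 300ms: real interruption

        playback.cancel()            # stop the in-flight TTS stream
        playback.flush_buffer()      # drop queued audio so no stale speech plays
```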

Long Sessions: Where Everything Drifts

A 2-minute demo is easy. A 30-minute intake interview is a different problem entirely. At turn 40, you are managing:

  • Context pressure. The full transcript of a 30-minute conversation is 5,000-8,000 tokens. That is a significant chunk of your context window. We use rolling summarization (sketched after this list): every 10 turns, summarize the conversation so far into a structured summary, and replace the raw history. This loses some nuance but keeps reasoning quality high.
  • Topic tracking. Users jump between topics. The agent needs to know whether “the contract” refers to the one discussed at minute 5 or minute 25. We maintain a topic stack in the conversation state.
  • Persona drift. Over many turns, the agent’s tone and behavior shift as the original system prompt gets diluted by conversation content. We re-inject key persona instructions every N turns as a “reminder” in the system context.
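A minimal sketch of the rolling summarization from the first bullet, assuming a hypothetical `llm.complete(prompt) -> str` single-shot call:

```python
SUMMARIZE_EVERY = 10  # turns between compression passes

def compress_history(history: list[dict], llm) -> list[dict]:
    """Fold the raw transcript into a structured summary every N turns.

    `llm.complete(prompt) -> str` is a hypothetical single-shot call.
    """
    if len(history) < SUMMARIZE_EVERY:
        return history

    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    summary = llm.complete(
        "Summarize this conversation as structured notes: goals, "
        "decisions made, open questions, user preferences.\n\n" + transcript
    )
    # Replace the raw turns with one compact message; new turns accumulate
    # on top of it until the next compression pass.
    return [{"role": "system", "content": f"Conversation so far:\n{summary}"}]
```

Persona re-injection works the same way: every N turns, append the key persona instructions back into the system context alongside the summary.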

Where Voice Pipelines Break

Two Failures That Taught Us the Most

Failure 1: The agent that talked to itself. Early in development, our echo cancellation was insufficient. The agent’s TTS output was picked up by the microphone, transcribed by STT, and sent back to the LLM as user input. The agent then responded to its own words, creating a loop that consumed a full context window in under 30 seconds. The fix was aggressive echo cancellation in the WebRTC layer plus a semantic similarity check: if the STT output is >90% similar to the last TTS output, discard it.
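The second half of that fix is cheap to implement. Here is a sketch using Python’s standard-library `difflib` as the similarity measure; it is lexical rather than truly semantic, which is usually enough because the echo is a near-verbatim transcription of our own output. Swap in embedding similarity if your STT mangles the echo too much.

```python
from difflib import SequenceMatcher

ECHO_THRESHOLD = 0.9  # >90% similar to our own last utterance -> discard

def _normalize(text: str) -> str:
    return " ".join(text.lower().split())

def is_echo(stt_text: str, last_tts_text: str) -> bool:
    """True if the new transcription is probably the agent's own voice."""
    if not last_tts_text:
        return False
    ratio = SequenceMatcher(None, _normalize(stt_text),
                            _normalize(last_tts_text)).ratio()
    return ratio > ECHO_THRESHOLD

# Usage: drop the echo before it ever reaches the LLM.
# if is_echo(transcript, agent.last_tts_text):
#     return  # discard; do not append to history
```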

Failure 2: The 30-minute context collapse. A pilot user ran a 45-minute session. Around minute 30, the agent started contradicting its earlier responses. It forgot a critical piece of information from minute 8. Root cause: we were using a simple sliding window for context, and the early turns — which contained the most important configuration — had been pushed out. Our fix was structured summarization with an “important facts” section that persists regardless of window position. We extract key facts after every 5 turns and pin them in the system context.
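A sketch of the pinned-facts mechanism, assuming the same hypothetical `llm.complete` call as above. The key property is that the facts block is rebuilt into the system context on every turn, so it survives no matter what summarization or window eviction drops.

```python
EXTRACT_EVERY = 5  # turns between fact-extraction passes

class PinnedFacts:
    """Facts that must survive summarization and window eviction."""

    def __init__(self) -> None:
        self.facts: list[str] = []

    def extract(self, recent_turns: list[dict], llm) -> None:
        """Pull durable facts out of the last few turns."""
        transcript = "\n".join(f"{m['role']}: {m['content']}"
                               for m in recent_turns)
        response = llm.complete(
            "List durable facts from this exchange (names, dates, amounts, "
            "decisions), one per line. Reply NONE if there are none.\n\n"
            + transcript
        )
        for line in response.splitlines():
            fact = line.strip()
            if fact and fact != "NONE" and fact not in self.facts:
                self.facts.append(fact)

    def system_block(self) -> str:
        # Rebuilt into the system context on every turn, so these facts
        # persist regardless of window position.
        return ("Important facts (do not forget):\n"
                + "\n".join(f"- {f}" for f in self.facts))
```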

Decisions and Trade-offs

| Decision | What We Chose | What We Gave Up |
|---|---|---|
| LLM for voice | Smaller, faster model (optimized for TTFT) | Reasoning depth. For complex questions mid-call, we route to a larger model and play a filler phrase (sketched below). |
| Barge-in threshold | 300ms minimum speech before cancel | Snappy interrupt response. But false-positive interrupts were worse for UX. |
| Context management | Rolling summarization every 10 turns + pinned facts | Some conversational nuance. But we keep accurate recall of key information across 45+ minute sessions. |
| TTS provider | ElevenLabs streaming with voice cloning | Lowest possible latency (some providers are ~50ms faster). But voice quality and naturalness won. |
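The filler-phrase routing from the first row is a timeout race: start the response stream, and if the first token has not arrived within a threshold, play a pre-synthesized filler while you wait. A minimal sketch, with hypothetical `llm.stream` and `play_audio` interfaces and an illustrative 600ms threshold:

```python
import asyncio

FILLER_AFTER_MS = 600  # illustrative: no first token by then -> buy time

async def respond_with_filler(llm, prompt, play_audio, filler_clip):
    """Race the LLM's first token against a filler-phrase timer.

    `llm.stream` yields tokens; `play_audio` plays a pre-synthesized clip.
    Both are hypothetical stand-ins.
    """
    stream = llm.stream(prompt)
    first_token_task = asyncio.ensure_future(anext(stream))

    done, _ = await asyncio.wait({first_token_task},
                                 timeout=FILLER_AFTER_MS / 1000)
    if not done:
        # The LLM is slow: say "let me think about that" so the user
        # hears something inside the latency budget, then keep waiting.
        await play_audio(filler_clip)

    yield await first_token_task
    async for token in stream:
        yield token
```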

Your Voice AI Checklist

  • Total round-trip latency measured and under 1 second at p95
  • Barge-in handling tested: user interrupt mid-agent-speech works cleanly without audio artifacts
  • Echo cancellation prevents the agent from responding to its own TTS output
  • Context management strategy for sessions over 15 minutes: summarization + pinned facts
  • VAD threshold tuned for target environment (quiet office vs noisy call center)
  • Fallback audio for slow LLM responses: filler phrase or “let me think about that”
  • Per-session cost tracking: a 30-minute call has a real token budget

Key Takeaways

  • Voice AI is a latency engineering problem. Your budget is under one second total. The LLM first-token time is the biggest variable.
  • Barge-in handling is the feature that separates a demo from a product. Get the threshold wrong and you have either an agent that ignores users or one that interrupts itself.
  • Long sessions (30+ minutes) require explicit memory management: rolling summarization, pinned facts, and persona re-injection. Sliding window alone fails.
  • Echo cancellation is non-optional. Without it, the agent will talk to itself and burn through your token budget in seconds.
  • Model selection for voice is about time-to-first-token, not benchmark scores. A faster, smaller model often delivers better voice UX than a larger, smarter one.

References

  1. LiveKit Agents Documentation — Agent framework we build on
  2. Deepgram Streaming STT — Low-latency speech recognition
  3. ElevenLabs TTS API — Streaming text-to-speech
  4. Silero VAD — Voice activity detection we use
  5. WebRTC Standard — Real-time communication transport
  6. Stivers et al. — Universals in Turn-Taking — Research on 200ms conversational gap
  7. LiveKit Architecture Overview
