The Real Problem: One Second
Human conversation has a natural turn-taking rhythm. Research shows the average gap between speakers is about 200 milliseconds. People notice latency above 500ms. Above one second, they think the system is broken and start repeating themselves — which triggers a new STT transcription and creates a loop of confusion.
Your total latency budget for a voice agent is under one second. Here is what eats that budget:
| Component | Typical Range | Target | What Determines It |
| --- | --- | --- | --- |
| STT (speech-to-text) | 100-400ms | <200ms | Model size, streaming vs. batch, provider |
| LLM (first token) | 200-1200ms | <400ms | Model size, prompt length, provider load |
| TTS (first byte) | 100-400ms | <200ms | Voice model, streaming, synthesis quality |
| Network RTT | 20-150ms | <100ms | WebRTC, server location, hops |
| **Total** | **420-2150ms** | **<900ms** | |
The LLM is the biggest bottleneck. A 400ms first-token time with streaming output is workable. An 800ms first-token time is not. This is why model selection for voice agents is different from text agents: you optimize for time-to-first-token, not quality-per-dollar.
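Because the budget hinges on time-to-first-token, it is worth measuring TTFT directly rather than trusting provider benchmarks. A minimal sketch, using a stand-in async stream in place of a real streaming LLM client (the client and its delays here are hypothetical):

```python
import asyncio
import time

async def fake_llm_stream(first_token_delay: float, n_tokens: int = 5):
    """Stand-in for a streaming LLM client (hypothetical)."""
    await asyncio.sleep(first_token_delay)
    for i in range(n_tokens):
        yield f"tok{i}"
        await asyncio.sleep(0.01)

async def time_to_first_token(stream) -> float:
    """Return seconds elapsed until the first token arrives."""
    start = time.monotonic()
    async for _ in stream:
        return time.monotonic() - start
    raise RuntimeError("stream produced no tokens")

async def main():
    ttft = await time_to_first_token(fake_llm_stream(0.2))
    print(f"TTFT: {ttft * 1000:.0f}ms")

asyncio.run(main())
```

Run this against each candidate model at your real prompt length; TTFT grows with prompt size, so a model that hits 300ms on a short prompt can miss the budget at turn 40.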
Architecture: The Voice Pipeline
We build on LiveKit’s agent framework. Here is how the pieces fit together:
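Conceptually, the pipeline is three concurrent stages connected by queues: audio frames flow into STT, transcripts into the LLM, and replies into TTS. The sketch below shows the shape of that wiring with placeholder stages; it is not LiveKit's actual API, which handles this plumbing for you:

```python
import asyncio

async def stt_stage(audio_q, text_q):
    # Streaming STT: consume audio frames, emit transcripts (placeholder).
    while (frame := await audio_q.get()) is not None:
        text_q.put_nowait(f"transcript({frame})")
    text_q.put_nowait(None)  # propagate end-of-stream

async def llm_stage(text_q, reply_q):
    # LLM: turn each transcript into a reply (placeholder).
    while (utterance := await text_q.get()) is not None:
        reply_q.put_nowait(f"reply-to({utterance})")
    reply_q.put_nowait(None)

async def tts_stage(reply_q, speaker):
    # TTS: synthesize reply text to audio and play it (placeholder).
    while (reply := await reply_q.get()) is not None:
        speaker.append(f"audio({reply})")

async def run_pipeline(frames):
    audio_q, text_q, reply_q = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    speaker = []
    for f in frames:
        audio_q.put_nowait(f)
    audio_q.put_nowait(None)  # end-of-stream sentinel
    await asyncio.gather(
        stt_stage(audio_q, text_q),
        llm_stage(text_q, reply_q),
        tts_stage(reply_q, speaker),
    )
    return speaker
```

The point of the stage/queue shape is that each stage can start work before the previous one finishes its full output, which is what keeps the end-to-end latency close to the sum of first-token times rather than the sum of total times.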

Barge-In: The Feature Nobody Gets Right the First Time
Barge-in is when the user starts talking while the agent is still speaking. In a human conversation, this is natural — interruptions carry meaning. In a voice AI system, it is a race condition.
When the user interrupts, you need to: detect new speech via VAD while TTS audio is playing, cancel the in-flight TTS stream, flush the audio buffer so the user does not hear stale audio, capture the new user utterance cleanly, and send it through STT without contamination from the agent’s own audio. Getting this wrong means the agent either ignores interruptions (frustrating) or triggers on its own audio output (catastrophic — it starts talking to itself).
What we initially got wrong: our barge-in threshold was too sensitive. Background noise — a door closing, someone coughing — would cancel the agent mid-sentence. We tuned the VAD energy threshold and added a minimum speech duration (300ms) before triggering barge-in. Too short and noise cancels the agent. Too long and real interruptions feel ignored. 300ms was our sweet spot, but it depends on the environment.
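The minimum-speech-duration gate described above can be sketched as a simple counter over VAD frames; the frame size and energy threshold here are illustrative values, not the ones from our production tuning:

```python
MIN_SPEECH_MS = 300      # tune per environment; our sweet spot
FRAME_MS = 10            # duration of one VAD frame (illustrative)
ENERGY_THRESHOLD = 0.5   # hypothetical normalized energy gate

def should_barge_in(energies) -> bool:
    """Return True once sustained speech exceeds MIN_SPEECH_MS.

    `energies` is the per-frame energy sequence captured while the
    agent is speaking.
    """
    speech_ms = 0
    for e in energies:
        if e >= ENERGY_THRESHOLD:
            speech_ms += FRAME_MS
            if speech_ms >= MIN_SPEECH_MS:
                return True   # real interruption: cancel TTS, flush buffer
        else:
            speech_ms = 0     # a gap resets the counter, so short noise
                              # bursts (door, cough) never trigger barge-in
    return False
```

A 100ms cough above the threshold resets at the next quiet frame and never fires; 300ms of continuous speech does.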
Long Sessions: Where Everything Drifts
A 2-minute demo is easy. A 30-minute intake interview is a different problem entirely. At turn 40, you are managing:
- Context pressure. The full transcript of a 30-minute conversation is 5,000-8,000 tokens. That is a significant chunk of your context window. We use rolling summarization: every 10 turns, summarize the conversation so far into a structured summary, and replace the raw history. This loses some nuance but keeps reasoning quality high.
- Topic tracking. Users jump between topics. The agent needs to know whether “the contract” refers to the one discussed at minute 5 or minute 25. We maintain a topic stack in the conversation state.
- Persona drift. Over many turns, the agent’s tone and behavior shift as the original system prompt gets diluted by conversation content. We re-inject key persona instructions every N turns as a “reminder” in the system context.
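Rolling summarization and persona re-injection can share one piece of conversation state. A minimal sketch, with a placeholder in place of the real LLM summarization call and a hypothetical re-injection cadence:

```python
SUMMARIZE_EVERY = 10   # fold raw history into the summary every 10 turns
PERSONA_EVERY = 8      # re-inject persona every N turns (hypothetical N)

def summarize(turns):
    """Stand-in for an LLM summarization call (hypothetical)."""
    return f"summary-of-{len(turns)}-turns"

class ConversationState:
    def __init__(self, persona: str):
        self.persona = persona
        self.summary = ""   # rolling summary of older turns
        self.turns = []     # raw recent turns
        self.turn_count = 0

    def add_turn(self, turn: str):
        self.turns.append(turn)
        self.turn_count += 1
        if len(self.turns) >= SUMMARIZE_EVERY:
            # Replace raw history with a structured summary.
            prior = [self.summary] if self.summary else []
            self.summary = summarize(prior + self.turns)
            self.turns = []

    def build_context(self):
        ctx = [("system", self.persona)]
        if self.summary:
            ctx.append(("system", f"Conversation so far: {self.summary}"))
        if self.turn_count and self.turn_count % PERSONA_EVERY == 0:
            # Periodic reminder counters persona drift.
            ctx.append(("system", f"Reminder: {self.persona}"))
        ctx += [("user", t) for t in self.turns]
        return ctx
```

The trade-off from the text shows up directly here: the summary call discards nuance, but the context stays small and the persona stays anchored.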
Where Voice Pipelines Break

Two Failures That Taught Us the Most
Failure 1: The agent that talked to itself. Early in development, our echo cancellation was insufficient. The agent’s TTS output was picked up by the microphone, transcribed by STT, and sent back to the LLM as user input. The agent then responded to its own words, creating a loop that consumed a full context window in under 30 seconds. The fix was aggressive echo cancellation in the WebRTC layer plus a semantic similarity check: if the STT output is >90% similar to the last TTS output, discard it.
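The similarity guard can be sketched with a plain string-similarity ratio; a production system might use embeddings instead, but the discard logic is the same:

```python
from difflib import SequenceMatcher

ECHO_SIMILARITY_THRESHOLD = 0.9  # >90% similar to last TTS output => echo

def is_echo(stt_text: str, last_tts_text: str) -> bool:
    """Discard STT output that is near-identical to what the agent just said."""
    if not last_tts_text:
        return False
    ratio = SequenceMatcher(None, stt_text.lower(), last_tts_text.lower()).ratio()
    return ratio > ECHO_SIMILARITY_THRESHOLD
```

Note this is a second line of defense: echo cancellation in the WebRTC layer should catch the audio itself, and the text check catches whatever leaks through.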
Failure 2: The 30-minute context collapse. A pilot user ran a 45-minute session. Around minute 30, the agent started contradicting its earlier responses. It forgot a critical piece of information from minute 8. Root cause: we were using a simple sliding window for context, and the early turns — which contained the most important configuration — had been pushed out. Our fix was structured summarization with an “important facts” section that persists regardless of window position. We extract key facts after every 5 turns and pin them in the system context.
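The pinned-facts fix amounts to keeping an extraction pass outside the sliding window. A sketch, with a trivial placeholder standing in for the LLM fact-extraction call:

```python
WINDOW = 20          # raw turns kept verbatim (illustrative size)
EXTRACT_EVERY = 5    # extraction cadence from the fix above

def extract_key_facts(turns):
    """Stand-in for an LLM fact-extraction call (placeholder heuristic)."""
    return [t for t in turns if t.startswith("FACT:")]

def build_context(system_prompt, all_turns, pinned):
    # Every EXTRACT_EVERY turns, pull key facts from the newest turns.
    if len(all_turns) % EXTRACT_EVERY == 0:
        for fact in extract_key_facts(all_turns[-EXTRACT_EVERY:]):
            if fact not in pinned:
                pinned.append(fact)
    # Pinned facts persist even after the turns that produced them
    # slide out of the raw window.
    return [system_prompt,
            "Important facts: " + "; ".join(pinned),
            *all_turns[-WINDOW:]]
```

With a simple sliding window, a fact stated at turn 1 is gone by turn 25; with the pinned block, it survives regardless of window position.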
Decisions and Trade-offs
| Decision | What We Chose | What We Gave Up |
| --- | --- | --- |
| LLM for voice | Smaller, faster model (optimized for TTFT) | Reasoning depth. For complex questions mid-call, we route to a larger model and play a filler phrase. |
| Barge-in threshold | 300ms minimum speech before cancel | Snappy interrupt response. But false-positive interrupts were worse for UX. |
| Context management | Rolling summarization every 10 turns + pinned facts | Some conversational nuance. But we keep accurate recall of key information across 45+ minute sessions. |
| TTS provider | ElevenLabs streaming with voice cloning | Lowest possible latency (some providers are ~50ms faster). But voice quality and naturalness won. |
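The filler-phrase fallback from the first row can be sketched as a timeout race against the first token; the 500ms cutoff here is a hypothetical value, not a measured one:

```python
import asyncio

FILLER_AFTER_S = 0.5  # play a filler if the first token is later than this

async def respond_with_filler(llm_stream, play):
    """Stream the reply; if the first token is late, play a filler phrase."""
    it = llm_stream.__aiter__()
    first = asyncio.ensure_future(it.__anext__())
    done, _ = await asyncio.wait({first}, timeout=FILLER_AFTER_S)
    if not done:
        # Filler buys time and signals the agent is still "alive".
        play("Let me think about that...")
    play(await first)          # first real token (blocks if still pending)
    async for tok in it:       # rest of the stream plays normally
        play(tok)
```

A fast model never triggers the filler; a slow routed-to model does, and the user hears continuity instead of dead air.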
Your Voice AI Checklist
- Total round-trip latency measured and under 1 second at p95
- Barge-in handling tested: user interrupt mid-agent-speech works cleanly without audio artifacts
- Echo cancellation prevents the agent from responding to its own TTS output
- Context management strategy for sessions over 15 minutes: summarization + pinned facts
- VAD threshold tuned for target environment (quiet office vs noisy call center)
- Fallback audio for slow LLM responses: filler phrase or “let me think about that”
- Per-session cost tracking: a 30-minute call has a real token budget


