When you build a chatbot, latency is a UX detail. When you build a voice agent that answers phone calls, latency is the product. A human picks up the phone, says "Hi, do you have availability on Friday?" — and every millisecond before your agent responds is silence that screams robot.
I run a voice AI platform that answers real business calls. Here's what production taught me that no demo ever would.
The budget is smaller than you think
A natural conversational gap is 200–500ms. Past about 800ms, callers start saying "hello?" Past 1.5s, they hang up or talk over the agent. Your entire pipeline — speech-to-text, LLM, text-to-speech, network hops — has to fit inside that window.
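Now add up a naive pipeline: STT finalization (300ms), LLM time-to-first-token (400–900ms), TTS time-to-first-byte (200–400ms), plus telephony transport. That is 900–1,600ms before transport is even counted, so the best case already lands past the point where callers start saying "hello?" You're over budget before you've done anything clever.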
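To make the arithmetic concrete, here is a minimal sketch that totals the stage figures quoted above. The dictionary keys and the function name are made up for illustration, and telephony transport is left out because no number is given for it.

```python
# Per-stage latency ranges in milliseconds, copied from the figures in the text.
# Telephony transport is omitted: the text gives no number for it.
STAGES_MS = {
    "stt_finalization": (300, 300),
    "llm_time_to_first_token": (400, 900),
    "tts_time_to_first_byte": (200, 400),
}

HELLO_THRESHOLD_MS = 800  # roughly where callers start saying "hello?"


def total_range(stages):
    """Return (best_case, worst_case) total latency across all stages."""
    best = sum(low for low, _ in stages.values())
    worst = sum(high for _, high in stages.values())
    return best, worst


best, worst = total_range(STAGES_MS)
print(f"naive pipeline: {best}ms best case, {worst}ms worst case")
print(f"over the {HELLO_THRESHOLD_MS}ms mark by at least {best - HELLO_THRESHOLD_MS}ms")
```

Even with every stage hitting its best number, the naive total is already past the 800ms mark.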
What actually works
Stream everything. Never wait for a full LLM response before starting TTS. Sentence-chunk the stream and synthesize as you go. The caller hears the first words while the model is still writing the last ones.
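Sentence chunking can be sketched as a small generator, assuming the LLM client hands you text fragments as they arrive. The function name and the regex boundary rule are illustrative assumptions, not any particular vendor's API.

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: ., ! or ? followed by whitespace ends a sentence.
# This is naive and will also split after abbreviations such as "Dr."
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as the token stream contains them."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever is left when the stream ends.
    if buffer.strip():
        yield buffer.strip()


if __name__ == "__main__":
    stream = ["Sure", ", we", " have", " Friday", " open.", " What", " time", " works", "?"]
    for sentence in sentence_chunks(stream):
        print("send to TTS:", sentence)
```

Each yielded sentence can go to TTS immediately. One trade-off of this boundary rule: a sentence is only released once the next token arrives with its leading whitespace, so the very last sentence waits for the stream to end.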
Pick models by time-to-first-token, not benchmark scores. A slightly dumber model that starts speaking in 300ms beats a brilliant one that thinks for two seconds. For most receptionist-style turns, you don't need frontier reasoning — you need fast and grounded.
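Comparing models this way only needs a timer around the first item of the stream. The sketch below assumes each candidate can be wrapped in a zero-argument callable that returns an iterator of tokens; `fake_model_stream` is a stand-in, not a real client.

```python
import time
from typing import Callable, Iterable, Iterator


def time_to_first_token_ms(start_stream: Callable[[], Iterable[str]]) -> float:
    """Start a token stream and return milliseconds until the first token arrives."""
    started = time.perf_counter()
    for _ in start_stream():
        return (time.perf_counter() - started) * 1000.0
    raise ValueError("stream produced no tokens")


def fake_model_stream(delay_s: float = 0.05) -> Iterator[str]:
    """Stand-in for a real streaming client: waits, then yields tokens."""
    time.sleep(delay_s)
    yield "Sure"
    yield ", one moment."


if __name__ == "__main__":
    print(f"TTFT: {time_to_first_token_ms(fake_model_stream):.0f}ms")
```

Run it many times per model and from the region you will deploy in; a single measurement says very little.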
Cut the RAG tax. Retrieval adds a hop before generation. Cache aggressively, keep knowledge bases small and well-chunked, and don't retrieve at all for turns that obviously don't need it ("What are your hours?" after you've already injected hours into the system prompt).
Handle barge-in like a human. Callers interrupt. If your agent keeps talking over them, the illusion dies instantly. You need to detect speech, stop TTS playback fast, and — harder — decide what the agent "already said" so the conversation state stays coherent.
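The "already said" problem can be approximated if you record the audio duration of each synthesized chunk and know how long playback ran before the interruption. This is a sketch under those assumptions; `SpokenChunk` and `heard_before_interrupt` are illustrative names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpokenChunk:
    text: str
    duration_ms: int  # length of the synthesized audio for this chunk


def heard_before_interrupt(chunks: List[SpokenChunk], played_ms: int) -> str:
    """Return the text of chunks that finished playing before the caller cut in."""
    heard = []
    elapsed = 0
    for chunk in chunks:
        elapsed += chunk.duration_ms
        if elapsed > played_ms:
            break  # this chunk was cut off partway, so treat it as unheard
        heard.append(chunk.text)
    return " ".join(heard)


if __name__ == "__main__":
    reply = [
        SpokenChunk("We have two openings on Friday.", 1800),
        SpokenChunk("One at 10am and one at 3pm.", 2000),
        SpokenChunk("Would you like me to book one?", 1700),
    ]
    # Caller started talking 2.5 seconds into playback.
    print(heard_before_interrupt(reply, played_ms=2500))
```

Treating a partially played chunk as unheard is a deliberately conservative choice: the agent may repeat itself, but it will not assume the caller heard something that was cut off. Only the returned text would then be written into the conversation history.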
Fillers are legitimate engineering. A well-timed "Sure, let me check that…" buys you 1.5 seconds of LLM time and feels more human, not less. Restaurants put bread on the table for a reason.
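A filler only helps if it plays when the real answer is late, and stays silent when the answer is fast. Here is a minimal asyncio sketch of that timing; `get_reply`, `speak`, and the 0.4 second threshold are assumptions for illustration.

```python
import asyncio
from typing import Awaitable, Callable


async def respond_with_filler(
    get_reply: Callable[[], Awaitable[str]],
    speak: Callable[[str], Awaitable[None]],
    filler: str = "Sure, let me check that.",
    filler_after_s: float = 0.4,
) -> str:
    """Speak the reply; if it is not ready within filler_after_s, speak a filler first."""
    reply_task = asyncio.create_task(get_reply())
    # Wait briefly for the reply without cancelling it on timeout.
    done, _ = await asyncio.wait({reply_task}, timeout=filler_after_s)
    if not done:
        await speak(filler)  # the reply keeps generating while the filler plays
    reply = await reply_task
    await speak(reply)
    return reply


if __name__ == "__main__":
    async def slow_reply() -> str:
        await asyncio.sleep(1.0)  # pretend the LLM is thinking
        return "Yes, Friday at 3pm is open."

    async def speak(text: str) -> None:
        print("agent:", text)

    asyncio.run(respond_with_filler(slow_reply, speak))
```

`asyncio.wait` is used instead of `asyncio.wait_for` because a timeout in `wait_for` cancels the task, and here the reply should keep running in the background.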
The unglamorous truth
Most of my latency wins came from boring places: keeping connections warm, pinning regions so audio doesn't cross an ocean twice, measuring p95 instead of demo-day p50, and treating every vendor's "typical latency" claims as fiction until I'd graphed them myself.
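The p50 versus p95 gap is easy to see with the standard library. The samples below are invented purely to illustrate the shape of the problem (mostly fast turns, a couple of slow ones); they are not measurements from any real system.

```python
import statistics
from typing import Sequence


def percentile(samples_ms: Sequence[float], pct: int) -> float:
    """Return the pct-th percentile (1-99) using linear interpolation."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cuts[pct - 1]


# Made-up response latencies in milliseconds: 18 quick turns and 2 slow ones.
samples_ms = [300] * 18 + [1500, 2000]

p50 = percentile(samples_ms, 50)
p95 = percentile(samples_ms, 95)
print(f"p50: {p50:.0f}ms  p95: {p95:.0f}ms")
```

A dashboard showing only the median would call this pipeline fast, while one caller in twenty is sitting through well over a second of silence.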
If you're building voice AI: instrument first, then optimize. The stage you think is slow is rarely the one that is.