The Voice AI Latency Playbook: How to Hit Sub-Second Turns

Voqal TeamJune 11, 2026

To hit sub-second voice turns, you cut latency at every stage of the pipeline and overlap the stages instead of running them in sequence. A naive voice agent runs speech-to-text, then the LLM, then tools, then text-to-speech end to end and lands around 1,000ms or worse. A tuned one streams partial transcripts into the model, streams model tokens into TTS, pools its connections, primes its prompt cache, and prewarms before the user speaks. Done right, a warm turn finishes in under a second and the conversation stops feeling like a bad international call.

This is the playbook we use to engineer voice for production. It is written for CTOs and platform engineers who own latency budgets, not demos.

Why sub-second is the bar, not a vanity metric

Human conversation runs on a clock most people never notice. Analyzing turn-taking across ten languages, researchers found the average gap between one speaker finishing and the next starting is roughly [200 milliseconds](https://pmc.ncbi.nlm.nih.gov/articles/PMC4464110/) — and that pattern holds across every culture studied. The catch: producing a spoken response actually [takes over 600ms](https://pmc.ncbi.nlm.nih.gov/articles/PMC4464110/), so humans predict the end of your turn and start preparing their reply before you finish.

Machines do not get to predict. They wait, process, and speak. That is why latency is the single biggest driver of whether a voice agent feels intelligent. Conversational interfaces are widely understood to lose perceived intelligence once the response gap exceeds the ~200ms turn-taking threshold; beyond it, users interrupt, repeat themselves, or disengage. Most production agents miss this badly — humans take turns at 200–300ms gaps while most voice agents lag at 800–1500ms because they sit and wait for silence.

Sub-second is not perfectionism. It is the floor for a conversation that feels human. If you want the architectural context for why this is a systems problem and not a model problem, see from transcription to agents: the voice layer architecture.

Where the milliseconds actually go

Every voice turn is a relay race across five stages. Here is a realistic breakdown for a cloud agent before optimization, drawn from vendor benchmarks:

StageWhat it doesTypical latency
Endpointing / turn detectionDeciding the user actually stopped talking150–500ms
Speech-to-text (STT)Audio → text184–509ms P50
LLM (time to first token)Reasoning + first response token~500ms target, up to 5s cold
Tool / action callsHitting your APIs, MCP servers, databases200ms–2s+
Text-to-speech (TTS)First audio byte40–280ms TTFB
Network round tripDevice ↔ data center50–150ms

Add those naively and you are well over a second. A commonly cited reference equation puts [STT at 200ms, LLM at 500ms, TTS at 150ms, network at 50ms, and processing at 100ms — about 1,000ms total](https://sayna.ai/blog/sub-second-voice-agent-latency-practical-architecture-guide). The job is to attack each row and to stop adding them sequentially.

Two non-obvious truths up front. First, in most agents the LLM is the primary source of latency, so that is where the leverage is. Second, network often accounts for 20–40% of total inference latency — geography quietly taxes every turn.

The latency budget: a step-by-step method

Treat latency like a financial budget. Allocate, measure, and cut against a target. Here is the process:

1. Set the target at the perceived gap, not the sum of stages. Your goal is end-of-user-speech to first-audio-out under ~1,000ms warm, ideally 300–500ms. Anything over the [200ms turn-taking threshold](https://hamming.ai/resources/voice-ai-latency-whats-fast-whats-slow-how-to-fix-it) erodes perceived intelligence. 2. Measure P95, not just median. Tail latency is what users remember. AssemblyAI notes [P95 is often more telling than the median](https://www.assemblyai.com/blog/best-api-models-for-real-time-speech-recognition-and-transcription) — their Universal-3 Pro streaming sits at ~150ms P50 but 534ms P95. 3. Instrument every stage separately. You cannot optimize a number you cannot see. Log endpointing, STT final, LLM TTFT, tool duration, TTS TTFB, and network per turn. 4. Attack the LLM first. It is usually the biggest line item. A median [TTFT of ~500ms is the practical design target](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4464110/) for real-time voice. 5. Overlap the stages. Stream, do not relay. This is the single highest-leverage change. 6. Kill the cold start. Warm turns and cold turns are different animals; design so the user only ever sees warm ones. 7. Co-locate the pipeline. STT, LLM, and TTS should live in the same data center; spreading them across regions adds needless round trips.

Tactic 1: Stream everything, sequence nothing

The biggest win is structural. Instead of waiting for each stage to finish, you pipeline them. Streaming ASR feeds partial transcripts to the LLM, which streams tokens to TTS, while TTS streams audio to the client — each stage starts before the previous one ends, collapsing the end-to-end gap on a good connection.

Streaming STT is table stakes: real-time transcription returns partial results in 200–500ms, and Deepgram advertises sub-300ms transcription out of the box. On the output side, modern TTS streams the first audio byte fast — Cartesia targets sub-100ms TTFB, and Deepgram's Aura-2 targets ~200ms. The user hears speech start while the model is still generating the rest of the sentence.

Whether you stitch this together yourself or buy it as a single layer is the classic build vs. buy decision for an in-house voice assistant — and the pipelining is exactly the part that is deceptively hard to get right.

Tactic 2: Get turn detection right

Latency you create by waiting too long for silence is self-inflicted. Endpointing decides the user stopped talking — fast enough to feel responsive, but not so fast it truncates them. VAD-only detection is simple but tends to add more latency; model-based and semantic turn detection close the gap to ~300ms without cutting users off mid-thought.

Barge-in matters just as much: when the user starts talking, the agent must stop. A natural-feeling barge-in budget is under 150ms from end-of-user-speech to TTS flush. Get this wrong and the agent talks over people — which feels worse than being slow.

Tactic 3: Prompt caching for LLM TTFT

Since the LLM dominates the budget, the cache is your best friend. Anthropic reports prompt caching can reduce latency by up to 85% for long prompts, with cache reads priced at 0.1x the base input cost. Caching cuts the prefill computation, so TTFT for a cached prompt can drop from ~5 seconds to under 200ms. The gains scale with prompt length — long system prompts with tool schemas and tenant context are exactly the kind of thing you want to cache.

The trick is that a cache only helps if it is warm. The first turn after a cold start still pays the full prefill. Which leads to the next tactic.

Tactic 4: Connection pooling and prewarming

Cold starts are the silent killer. Opening a fresh connection to an MCP server or tool backend, then listing tools, then making the first call can cost several seconds — and that is before the model runs. The fix is to keep connections alive in a pool keyed by user/tenant, so a returning turn reuses an open, authenticated connection instead of paying the handshake again.

Prewarming takes it further: open the connection and prime the prompt cache at app launch, before the user ever taps the mic. By the time they speak, the expensive setup is already done and the first real turn is a warm turn. This is core to how a voice-to-actions SDK hits sub-second turns in production — the architecture, not the model, is doing the work. This is also why architecture determines mobile payment conversion: a slow first turn loses the user before they finish the task.

Tactic 5: Co-locate and shorten the network path

Geography is latency. Round trips between cloud data centers and devices average 50–150ms, and regional or metro data centers can pull that down to 10–50ms RTT. Telnyx reports sub-200ms audio round-trip through regional deployment. Keep STT, LLM, TTS, and your tool backends in the same region as your users, and route by request prefix so production traffic never bounces to a far data center.

Tactic 6: Parallelize tools and trim the response

If a turn needs three API calls, fire them concurrently, not in a loop. And do not make the model narrate before acting — the render-spec / server-driven UI pattern lets the agent emit a short spoken answer plus a structured widget payload in one pass, so the UI updates without a second round trip.

For Arabic and other non-English voice, model and STT choices change the math materially — see the best Arabic voice speech-to-text APIs for 2026 and the complete Arabic voice SDK guide. Building this dialect-aware, low-latency stack in-house is the reason a voice SDK integration ships in days while building Arabic voice in-house takes 6 months.

Frequently asked questions

What is a good end-to-end latency target for a voice agent?

Aim for end-of-user-speech to first-audio-out under ~1,000ms on a warm turn, ideally 300–500ms. Natural conversation requires sub-500ms, and crossing the ~200ms turn-taking threshold starts eroding how intelligent the agent feels.

Which stage of the voice pipeline is slowest?

Usually the LLM. In a full STT+LLM+TTS bot, the LLM is typically the primary source of latency. That is why prompt caching and warm connections give the biggest returns.

Does prompt caching really cut latency that much?

Yes, for long prompts. Anthropic reports up to 85% latency reduction, and a cached prompt's TTFT can drop from ~5s to under 200ms. The cache must be warm to help — prime it before the user speaks.

Why is my first turn always slow but later turns fast?

Cold start. The first turn pays for opening connections, listing tools, and a cold prompt cache; later turns reuse warm ones. Connection pooling plus launch-time prewarming hides this so the user only ever sees warm turns.

Should I measure median or P95 latency?

Both, but P95 is what users remember. AssemblyAI notes P95 is often more telling than the median — a model can show ~150ms P50 and 534ms P95. Optimize the tail.

How much does network location matter?

A lot. Network can be 20–40% of total inference latency. Co-locate STT, LLM, and TTS in one region near your users; metro data centers reach 10–50ms RTT.

The takeaway

Sub-second voice is an architecture decision, not a model upgrade. Stream instead of sequence, tune turn detection, cache the prompt, pool and prewarm connections, and keep the pipeline in one region. Latency is dominated by the seams between stages and by cold starts — close those and a warm turn lands under a second.

If you are deciding whether voice belongs in your product at all, read when voice actually works in mobile apps and when it doesn't and the business case for voice ROI in mobile apps. To build on a stack that ships sub-second warm turns out of the box, see the Voqal docs or join the waitlist.

Related articles