The short answer
A voice feature is working when users finish what they came to do, quickly, and come back. That collapses into seven measurable signals: task completion rate, intent accuracy, latency, containment/deflection, retention, error/abandon rate, and CSAT. No single one is sufficient. Containment without CSAT hides forced resolutions; intent accuracy without fallback rate flatters a model that fails silently on everything outside its scope. The job of a voice analytics program is to pair these metrics so the failure modes surface before a dashboard makes them invisible.
This post is the instrumentation playbook: what each metric means, what to benchmark it against, how to log it, and the traps that make a good-looking number lie. If you are still deciding whether to build at all, start with the business case for voice ROI in mobile apps and when voice actually works in mobile apps (and when it doesn't). This is about proving it once it ships.
The seven metrics, in priority order
Measure these in this order, because a failure in any metric contaminates the ones below it. A misclassified intent corrupts completion; high latency drives abandonment before completion is even possible.
1. Task completion rate (TCR): the percentage of sessions where the user achieved their actual goal (checked a balance, sent a payment, booked the slot), not just where the bot replied. This is the north star. Everything else is diagnostic.
2. Intent accuracy: the percentage of turns where the system correctly identified what the user wanted. Always report it *alongside fallback rate*, because [reporting intent accuracy without fallback rate](https://www.balto.ai/blog/kpis-for-voice-ai-agents-in-contact-centers/) lets a model score 95% on recognized intents while routinely failing on the 20% of conversations outside its scope.
3. Latency: time-to-first-audio and turn-to-turn gap. Human conversation runs on a ~200ms inter-turn rhythm; cross that budget and the interaction stops feeling like a conversation.
4. Containment / deflection: the share of sessions resolved end-to-end without a human or a fallback to manual UI. The headline cost-savings number, and the most-gamed.
5. Retention: do users come back and use voice again. The honest verdict on whether the feature earned its place on the screen.
6. Error / abandon rate: ASR word error rate, fallback frequency, and the percentage of sessions the user quit mid-flow.
7. CSAT: explicit satisfaction, the human check on every metric above.
Metric definitions and targets
| Metric | Definition | Target / benchmark |
|---|---|---|
| Task completion rate | % of sessions where the user's goal was achieved end-to-end | 70–90% for well-bounded tasks; 40–60% for broad FAQ scope |
| Intent accuracy | % of turns where intent was correctly classified | 90–97% on bounded use cases (booking, order status, balance) |
| Word error rate (ASR) | (Substitutions + Deletions + Insertions) / reference words | <10% clean speech; 15–30% acceptable in noise |
| Time-to-first-audio | Latency from end-of-speech to first response audio | <800ms good; <500ms best-in-class; ~200ms = human rhythm |
| Containment rate | % of sessions resolved without human/manual handoff | 70–90% mature; 20–40% early deployments |
| Deflection rate | % of would-be support contacts never reaching the queue | ~50% industry standard; 80%+ top performers |
| Retention (repeat use) | % of users who use voice again within N days | Track trend; flat or rising = product-market fit |
| Abandon rate | % of sessions quit mid-flow | <6% (telephony avg ~5.91%); lower is better |
| CSAT | Explicit satisfaction score for voice interactions | 78% average for AI support; 85%+ for leaders |
Targets vary by scope. A narrow, action-driven assistant should sit at the high end of every band; a wide-open Q&A bot will not. That gap is exactly why architecture determines mobile payment conversion: a voice-to-actions design that executes confirmed tasks measures completion natively, while a transcription-only design can only measure that words were heard. And targets should be calibrated to your market: see the real voice UI conversion data from banking, delivery, and e-commerce apps in MENA before importing a generic benchmark.
Task completion rate: the only metric that survives scrutiny
Task completion is the percentage of sessions where the user actually accomplished their goal. Contact-center frameworks frame this as whether the user resolved an issue or completed a purchase, not whether the assistant produced a fluent reply. The distinction matters because a bot can answer confidently and still leave the job undone.
Instrument it by defining a completion event per intent. For a voice-to-actions SDK, the natural anchor is the executed action: a payment confirmed, a transfer submitted, a booking written. Log `intent → confirmed → executed → result` as a single correlated trace keyed by session ID. TCR is then `executed_success / sessions_with_intent`. If your stack only emits transcripts, you cannot compute this honestly, which is the practical argument in what is a voice-to-actions SDK.
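As a minimal sketch of that calculation, assuming events arrive as `(session_id, stage, ok)` tuples: the tuple shape, the sample data, and the function name are illustrative, not any particular SDK's event schema.

```python
from collections import defaultdict

# Hypothetical event records: (session_id, stage, ok). Stage names follow the
# intent -> confirmed -> executed -> result trace described above.
EVENTS = [
    ("s1", "intent", True), ("s1", "confirmed", True),
    ("s1", "executed", True), ("s1", "result", True),
    ("s2", "intent", True), ("s2", "confirmed", True),
    ("s2", "executed", True), ("s2", "result", False),  # action ran but failed
    ("s3", "intent", True),                             # user quit before confirming
    ("s4", "greeting", True),                           # no intent ever detected
]


def task_completion_rate(events):
    """TCR = sessions with a successful executed action / sessions with an intent."""
    stages = defaultdict(dict)
    for session_id, stage, ok in events:
        stages[session_id][stage] = ok

    with_intent = [s for s in stages.values() if "intent" in s]
    if not with_intent:
        return 0.0
    completed = sum(1 for s in with_intent if s.get("executed") and s.get("result"))
    return completed / len(with_intent)


print(f"TCR: {task_completion_rate(EVENTS):.1%}")  # TCR: 33.3%
```

Only `s1` counts as complete. `s4` never produced an intent, so it is excluded from the denominator instead of being scored as a failure.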
Intent accuracy: never report it alone
Intent recognition accuracy benchmarks at [90–97% on well-bounded use cases](https://hamming.ai/resources/voice-agent-evaluation-metrics-guide) like appointment booking or order status, because a misclassified intent corrupts everything downstream. But the number is dangerous in isolation. Always pair it with fallback rate (how often the system gives up or routes to a catch-all) and out-of-scope rate. A model that handles 80% of traffic at 95% accuracy and silently fails the rest is not a 95% system.
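The arithmetic behind that last claim can be sketched as follows. The function and its `out_of_scope_handled` parameter are assumptions for illustration: here it means the fraction of out-of-scope turns the system recognizes and routes gracefully.

```python
def effective_intent_accuracy(in_scope_share: float,
                              in_scope_accuracy: float,
                              out_of_scope_handled: float = 0.0) -> float:
    """Share of all turns handled correctly, whether in scope or not.

    out_of_scope_handled is the fraction of out-of-scope turns that the system
    detects and routes gracefully (an assumed definition, not a standard one).
    """
    for value in (in_scope_share, in_scope_accuracy, out_of_scope_handled):
        if not 0.0 <= value <= 1.0:
            raise ValueError("all inputs must be fractions between 0 and 1")
    out_of_scope_share = 1.0 - in_scope_share
    return (in_scope_share * in_scope_accuracy
            + out_of_scope_share * out_of_scope_handled)


# The example from the text: 80% of traffic in scope at 95% accuracy,
# with the remaining 20% failing silently.
effective = effective_intent_accuracy(0.80, 0.95)
print(f"headline 95%, effective {effective:.0%}")  # headline 95%, effective 76%
```

Raising `out_of_scope_handled` is what a good fallback path buys you: the headline accuracy stays put, but the effective number moves.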
Under the hood, intent accuracy depends on ASR. WER is still the gold standard for ASR accuracy, calculated as (substitutions + deletions + insertions) / reference words. Watch the benchmark-to-production gap: the same API has been measured at 92% accuracy on clean headsets, 78% in conference rooms, and 65% on mobile calls with noise. Test on the audio your users actually produce, on the move and in noise, not on studio reads. For non-English deployments this gap is wider still, which is why dialect and accent coverage gets its own treatment in the Arabic voice SDK guide.
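For reference, the WER formula can be computed with a word-level edit distance. This is a minimal sketch that assumes simple lowercasing and whitespace tokenization; production scoring usually also normalizes punctuation and number formats.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("send fifty dollars to sam", "send fifteen dollars to sam"))  # 0.2
```

One wrong word out of five gives a WER of 0.2, which reads as 80% word accuracy under the common "accuracy = 1 - WER" convention.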
Latency: the budget that breaks everything else
Latency is the metric users feel before they can rate anything. Research across ten languages found an average inter-turn gap of around 200ms in human conversation, and that sets the bar. The practical tiers:
- Sub-500ms — best-in-class, approaches human rhythm.
- Sub-800ms — good; the enterprise standard for conversational flow.
- Above 1,500ms — user experience degrades sharply and latency correlates directly with rising abandonment and lower CSAT.
Measure time-to-first-audio (TTFA), not just total response time, because the user's patience clock starts at end-of-speech. Decompose the budget: endpointing and ASR typically eat 150–300ms, TTS another 100–200ms, leaving the LLM only a few hundred milliseconds to keep total TTFA under a second. Log each stage as a span so you can see which one blew the budget. The full decomposition and mitigation tactics live in the sub-second voice AI latency guide.
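A sketch of the per-stage check, assuming the stages run sequentially (no streaming overlap) so that TTFA is the sum of the spans. The stage names and the per-stage budgets are assumptions picked to add up to the 800ms target, not measured values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    stage: str
    duration_ms: float


# Hypothetical per-stage budgets in milliseconds, summing to the 800ms target.
BUDGET_MS = {"endpointing_asr": 300, "llm": 300, "tts": 200}
TTFA_TARGET_MS = 800


def ttfa_report(spans):
    """Sum the spans of one turn and name every stage that exceeded its budget."""
    total = sum(span.duration_ms for span in spans)
    over_budget = {
        span.stage: span.duration_ms - BUDGET_MS[span.stage]
        for span in spans
        if span.stage in BUDGET_MS and span.duration_ms > BUDGET_MS[span.stage]
    }
    return {
        "ttfa_ms": total,
        "within_target": total <= TTFA_TARGET_MS,
        "over_budget": over_budget,
    }


turn = [Span("endpointing_asr", 250), Span("llm", 620), Span("tts", 150)]
print(ttfa_report(turn))
```

For this sample turn the report shows 1,020ms total, with the LLM stage 320ms over its budget. If your pipeline streams LLM tokens into TTS, the stages overlap and you would measure TTFA directly from end-of-speech to first audio instead of summing.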
Containment and deflection: the cost story, and the trap
Containment is the share of sessions resolved end-to-end without escalation; deflection is the share of would-be contacts that never reach the queue. Most chatbots start at 20–40% containment, and mature implementations reach 70–90%. On deflection, most teams hit 20–30%, ~50% is the commonly cited industry benchmark, and top performers reach 80%+.
The trap: most teams [track containment alone and call it done](https://www.balto.ai/blog/kpis-for-voice-ai-agents-in-contact-centers/), which is exactly how deployments end up *force-resolving* calls that should have escalated. A 90% containment rate paired with a falling CSAT is not a win; it is a queue of frustrated users you stopped counting. Split escalation into planned vs. forced, and read containment only next to CSAT and repeat-contact rate. For the support-specific framing, see voice AI customer support deflection.
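One way to keep containment next to its checks is to compute them from the same session records. This sketch assumes each session is labeled with an outcome of `contained`, `planned_escalation` (a handoff the flow is designed to make), or `forced_escalation` (a handoff after the assistant failed); those labels and field names are illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Session:
    outcome: str                      # "contained", "planned_escalation", "forced_escalation"
    satisfied: Optional[bool] = None  # None when the user skipped the survey
    repeat_contact: bool = False      # user came back about the same issue


def share(numerator, denominator):
    return numerator / denominator if denominator else 0.0


def containment_report(sessions):
    """Containment alongside the numbers that keep it honest."""
    contained = [s for s in sessions if s.outcome == "contained"]
    rated = [s for s in contained if s.satisfied is not None]
    forced = sum(1 for s in sessions if s.outcome == "forced_escalation")
    return {
        "containment": share(len(contained), len(sessions)),
        "forced_escalation": share(forced, len(sessions)),
        "contained_csat": share(sum(1 for s in rated if s.satisfied), len(rated)),
        "contained_repeat_contact": share(
            sum(1 for s in contained if s.repeat_contact), len(contained)
        ),
    }


SESSIONS = (
    [Session("contained", satisfied=True)] * 5
    + [Session("contained", satisfied=False, repeat_contact=True)] * 2
    + [Session("contained", repeat_contact=True)]
    + [Session("planned_escalation", satisfied=True)]
    + [Session("forced_escalation", satisfied=False)]
)
print(containment_report(SESSIONS))
```

In this sample, 80% containment looks healthy until you see that three of the eight contained sessions came back with the same issue.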
Retention: the honest verdict
Every operational metric can be tuned; retention is the one users vote on with their behavior. Track repeat-use rate (share of users who invoke voice again within 7 or 30 days), sessions per active user, and the trend over time. A useful nuance from the research: users forgive failures and keep using voice for simple tasks even while trust on complex tasks is still being repaired. So segment retention by task complexity. Retention on high-value actions (payments, transfers) is the signal that the feature is load-bearing, not a novelty. This is the long-horizon argument behind voice-first as the next platform shift, and it is the metric that ultimately decides when voice actually works in mobile apps.
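Repeat-use rate can be sketched as below, assuming a mapping from user ID to the dates that user invoked voice. The data shape and function name are hypothetical; to segment by task complexity, build one such mapping per segment.

```python
from datetime import date, timedelta


def repeat_use_rate(usage, window_days):
    """Share of users who invoke voice again within window_days of their first use.

    usage maps a user ID to the dates that user invoked voice (hypothetical shape).
    """
    window = timedelta(days=window_days)
    active = {user: sorted(set(days)) for user, days in usage.items() if days}
    if not active:
        return 0.0
    returned = sum(
        1
        for days in active.values()
        if any(days[0] < later <= days[0] + window for later in days[1:])
    )
    return returned / len(active)


USAGE = {
    "u1": [date(2025, 1, 1), date(2025, 1, 4)],   # back within a week
    "u2": [date(2025, 1, 1), date(2025, 1, 20)],  # back within a month
    "u3": [date(2025, 1, 2)],                     # never returned
}
print(repeat_use_rate(USAGE, 7), repeat_use_rate(USAGE, 30))
```

Here one of three users returns within 7 days and two of three within 30. In practice, exclude users whose window has not fully elapsed yet, or recent cohorts will look worse than they are.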
Error and abandon rate: how it fails matters more than that it fails
Graceful failure is a feature. Track fallback rate, ASR error rate (WER on production audio), and abandon rate, the percentage of sessions the user quits mid-flow. Telephony abandonment averaged about 5.91% in 2024, and abandonment chips away at NPS, raises churn risk, and inflates repeat-contact volume. Latency spikes are a leading cause, so correlate your abandon events with the latency span that preceded them. WER alone is not enough: it does not tell you how errors affect usability, so weight errors on the words that carry intent (amounts, account names, action verbs) over filler.
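One possible way to weight errors is a weighted edit distance in which mistakes on intent-bearing words cost more. This is a sketch of one such scheme, not a standard metric: the 3x weight and the intent-word list are assumptions you would tune per domain.

```python
def weighted_wer(reference, hypothesis, intent_words, intent_weight=3.0):
    """Edit distance where errors on intent-bearing words cost intent_weight.

    Normalized by the total weight of the reference words. The weighting
    scheme is an illustrative assumption, not a standard definition.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    def weight(word):
        return intent_weight if word in intent_words else 1.0

    # First row: cost of inserting each hypothesis word against an empty reference.
    prev = [0.0]
    for hyp_word in hyp:
        prev.append(prev[-1] + weight(hyp_word))

    for ref_word in ref:
        curr = [prev[0] + weight(ref_word)]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = 0.0 if ref_word == hyp_word else weight(ref_word)
            curr.append(min(
                prev[j] + weight(ref_word),      # deletion of a reference word
                curr[j - 1] + weight(hyp_word),  # insertion of a hypothesis word
                prev[j - 1] + substitution,      # substitution or match
            ))
        prev = curr
    return prev[-1] / sum(weight(word) for word in ref)


INTENT_WORDS = {"send", "fifty", "sam"}
REFERENCE = "please send fifty dollars to sam"
dropped_filler = weighted_wer(REFERENCE, "send fifty dollars to sam", INTENT_WORDS)
wrong_amount = weighted_wer(REFERENCE, "please send fifteen dollars to sam", INTENT_WORDS)
print(round(dropped_filler, 3), round(wrong_amount, 3))
```

Both hypotheses contain exactly one word error, so plain WER scores them identically. With weighting, the wrong amount scores three times worse than the dropped "please", which matches how much each error would hurt the user.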
CSAT: the human check
CSAT is the explicit satisfaction score, and it keeps the rest of the dashboard honest. The [industry-average CSAT for AI support agents is now 78%, with leaders above 85%](https://www.usefini.com/guides/best-ai-support-tool-containment-csat-benchmarking), and chatbot CSAT typically runs 10–15 points below live-agent CSAT, an acceptable gap on Tier-1 volume and a red flag on anything needing judgment. Collect it with a single post-session prompt, segment by intent, and watch the gap between CSAT and containment. When containment rises while CSAT falls, you are force-resolving.
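The rising-containment, falling-CSAT pattern can be flagged per intent with a simple period-over-period comparison. A minimal sketch, assuming each period is summarized as `intent -> (containment, csat)` fractions; the intent names, the sample numbers, and the two-point `min_shift` threshold are all illustrative.

```python
def force_resolving_intents(previous, current, min_shift=0.02):
    """Intents whose containment rose while CSAT fell, each by at least min_shift.

    previous and current map intent -> (containment, csat) as fractions.
    """
    flagged = []
    for intent, (containment_now, csat_now) in current.items():
        if intent not in previous:
            continue  # no baseline to compare against
        containment_before, csat_before = previous[intent]
        containment_rose = containment_now - containment_before >= min_shift
        csat_fell = csat_before - csat_now >= min_shift
        if containment_rose and csat_fell:
            flagged.append(intent)
    return sorted(flagged)


LAST_MONTH = {"check_balance": (0.80, 0.82), "dispute_charge": (0.55, 0.74)}
THIS_MONTH = {"check_balance": (0.84, 0.83), "dispute_charge": (0.70, 0.61)}
print(force_resolving_intents(LAST_MONTH, THIS_MONTH))  # ['dispute_charge']
```

Segmenting matters here: blended across both intents, containment simply looks like it improved.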
How to instrument and benchmark
- One correlated trace per session. Emit a session ID through ASR, intent, action, and result so every metric joins on the same key. Without this, TCR and containment are guesses.
- Span the latency budget. Log endpointing, ASR, LLM, and TTS as separate timed spans; alert on TTFA, not just totals.
- Pair every metric with its check. Containment with CSAT; intent accuracy with fallback rate; WER with intent-word weighting.
- Benchmark on production audio. Replay real mobile-in-noise clips, not studio reads, given the 2.8–5.7x benchmark-to-production WER degradation.
- Segment everything by intent and complexity. A blended average hides the simple-task wins and the complex-task failures.
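To illustrate the last point in the list, here is a sketch of completion rate segmented by intent and complexity next to the blended figure. The `(intent, complexity, completed)` record shape and the sample numbers are assumptions for illustration.

```python
from collections import defaultdict

BLENDED = ("all", "all")


def completion_by_segment(sessions):
    """Completion rate per (intent, complexity) segment, plus the blended rate.

    sessions is an iterable of (intent, complexity, completed) records.
    """
    counts = defaultdict(lambda: [0, 0])  # segment -> [completed, total]
    for intent, complexity, completed in sessions:
        for segment in (BLENDED, (intent, complexity)):
            counts[segment][0] += int(completed)
            counts[segment][1] += 1
    return {segment: done / total for segment, (done, total) in counts.items()}


SAMPLE = (
    [("check_balance", "simple", True)] * 9
    + [("check_balance", "simple", False)]
    + [("dispute_charge", "complex", True)] * 2
    + [("dispute_charge", "complex", False)] * 8
)
for segment, rate in completion_by_segment(SAMPLE).items():
    print(segment, f"{rate:.0%}")
```

The blended 55% describes neither segment: the simple task completes 90% of the time and the complex one 20%.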
If you want the metrics computed for you out of the box, a voice-to-actions SDK emits completion and confirmation events natively; you can add a voice assistant to any app in a day and have these traces flowing from the first session. See the docs for the event schema, or join the waitlist.
FAQ
What is the single most important voice metric?
Task completion rate. It is the only metric that directly answers "did the user get what they came for." Everything else is diagnostic for why completion is high or low. Anchor it to an executed action, not a fluent reply.
What is a good latency target for a voice feature?
Under 800ms time-to-first-audio for natural flow, with sub-500ms as best-in-class against the ~200ms human inter-turn rhythm. Above roughly 1,500ms, abandonment climbs and CSAT drops sharply. Measure from end-of-speech, not from request start.
Why shouldn't I report intent accuracy on its own?
Because a model can score 95% on the intents it recognizes while silently failing on everything out of scope. Always pair intent accuracy with fallback rate and out-of-scope rate so the blind spots are visible.
What containment rate should I expect?
Early deployments land at 20–40%; mature, well-bounded ones reach 70–90%. But never read containment without CSAT and forced-escalation rate, or you will reward force-resolving calls that should have escalated.
How do I measure ASR quality realistically?
Use word error rate, but compute it on production audio (mobile, in noise) rather than clean studio reads, since the same model can drop from ~92% to ~65% accuracy across conditions. Weight errors on intent-bearing words like amounts and account names.
Does CSAT really matter if my other numbers look good?
Yes, it is the human check on a gamed dashboard. The classic failure pattern is rising containment with falling CSAT, which means you are force-resolving sessions. AI support averages ~78% CSAT, leaders 85%+; track the trend and the gap to live-agent CSAT.