How LLMs Finally Made Voice Assistants Work

Short answer: Voice assistants didn't fail because microphones or speech recognition were bad. They failed because the layer that turned words into actions was a brittle, hand-written intent grammar that broke the moment a user phrased something the engineers hadn't anticipated. Large language models replaced that brittle layer with open-ended intent understanding, structured tool calling, and multi-step reasoning. That single shift — from matching utterances against a fixed list to reasoning about what a user wants and calling the right function — is what finally made voice-to-actions viable in production apps.

I've spent the last few years building voice into mobile apps, and the difference between the pre-LLM era and now is not incremental. It's the difference between a demo and a product. Here's the full story.

The pre-LLM era: voice assistants ran on grammars

If you used Siri or Alexa before ~2023 and walked away thinking "this only works if I say the magic words," your instinct was correct. Classic voice assistants were built as intent classifiers plus slot fillers. The pipeline looked like this: speech-to-text produced a transcript, an NLU model matched that transcript to one of a fixed set of intents, and then a slot filler extracted the variables that intent needed (a date, a contact, a song name). The way NLP powers assistants like Siri and Alexa — intent recognition feeding slot extraction — is exactly where the brittleness lived.

The problem is that every intent had to be defined in advance, and every phrasing had to map cleanly onto it. Research on voice interface failures documents a category researchers literally call "Intent Pattern Match Failure" — where the utterance pattern for an intent requires users to specify slot values in a specific syntax, and anything outside that syntax simply fails ([study on VUI failures and frustration](https://arxiv.org/pdf/2002.03582)). When the system couldn't match, you got the infamous "I'm sorry, I don't understand."

Users adapted, and the adaptations tell the whole story. A study of [dialogue repair in virtual assistants](https://www.frontiersin.org/journals/robotics-and-ai/articles/10.3389/frobt.2024.1356847/full) found people resorted to hyperarticulation (over-enunciating), simplification, starting over with a fresh utterance, settling for a "good enough" result, or just quitting. A broader [scoping review of conversational breakdown repair](https://dl.acm.org/doi/fullHtml/10.1145/3640794.3665558) found that breakdowns directly decreased satisfaction, trust, and willingness to keep using the system — and that initial failures often led users to abandon a feature after a single try. That's the real legacy of the grammar era: not that voice was impossible, but that it was fragile, and fragility trains people to stop trusting it.

Why grammars couldn't scale

Combinatorial explosion. Every new capability meant authoring new intents, sample utterances, and slot schemas by hand. Coverage never caught up with the infinite ways people phrase things.
No reasoning. "Move $50 from savings to checking, but only if rent already cleared" isn't an intent. It's two intents and a condition. Grammars can't compose.
No graceful failure. A missed match returned nothing useful — there was no fallback that could reason about partial understanding.
No real connection to actions. Even when intent was recognized, wiring it to an actual API was bespoke glue code per capability.

What LLMs actually unlocked

Three capabilities arrived, roughly in sequence, and each one removed a load-bearing constraint of the grammar era.

1. Open intent understanding

LLMs don't match against a fixed intent list. They interpret arbitrary natural language and infer goals from messy, incomplete, contextual input. The thing that broke classic assistants — a user saying something nobody scripted — is the thing LLMs are built for. This is why voice-to-actions is fundamentally different from transcription: transcription gives you text; an LLM gives you understood intent you can act on.

2. Tool calling (the real unlock)

The pivotal moment was June 2023, when OpenAI announced function calling. As Simon Willison explained at the time, you send the model JSON schemas describing your functions, and it returns a structured JSON object naming the function to call and the arguments to call it with. Suddenly the model wasn't just talking — it was emitting machine-executable actions. Function calling is the mechanism that makes LLMs interact with real systems: it analyzes natural language, extracts intent, and produces a structured call. That is precisely the job the old slot filler did — except it generalizes to phrasings and combinations nobody pre-defined.

This is the heart of what a voice-to-actions SDK is: the LLM turns spoken intent into a tool call, the tool executes against your backend, and the result comes back as something the user can see and confirm.

3. Reasoning and multi-step orchestration

Modern models don't just pick one function — they chain them, handle conditions, and recover from partial information. They can ask a clarifying question instead of failing silently. This is what makes genuinely agentic voice flows possible: a single spoken request can fan out into a sequence of tool calls with reasoning between each step.

4. A standard for connecting tools

The last piece was integration sprawl. Wiring each model to each tool was custom work — exactly the per-capability glue that doomed grammars. In November 2024 Anthropic introduced the Model Context Protocol, an open standard where tools publish what they can do and any agent discovers and calls them through a consistent JSON-RPC interface. Adoption was fast enough that OpenAI adopted it in its Agents SDK in March 2025. For voice specifically, MCP lets a voice agent reach any backend without bespoke integration — the missing standard that grammars never had.

Era by era: what changed and what was still missing

Era	What it could do	Core limitation
Command era (pre-2015)	Fixed wake-word commands: timers, alarms, "call Mom"	Exact phrasing required; zero flexibility
Intent-grammar era (~2015–2022)	Intent classification + slot filling for scripted domains	Intent-pattern-match failures; no composition or reasoning
Generative chat era (2023)	Open natural-language understanding, fluent responses	Could talk but couldn't reliably act on your systems
Tool-calling era (mid-2023+)	Structured function calls from natural language	Per-tool integration glue; latency in the cascade
Agentic + MCP era (2024–now)	Multi-step reasoning + standardized tool access	Reliability and latency engineering at production scale

Why voice-to-actions is viable now (and wasn't before)

The convergence of open intent, tool calling, reasoning, and a tool standard is why this is finally a real product category rather than a research demo. A few forces made the timing right:

1. Intent is no longer hand-authored. The LLM generalizes to phrasings you never anticipated — the exact failure mode that killed the grammar era. 2. Actions are first-class. Tool calling turns understood intent into executable, type-safe calls against your backend, not freeform text you have to re-parse. 3. Latency dropped into the human range. In 2025, [speech-to-speech models reached sub-200ms latency](https://www.turing.com/resources/voice-llm-trends), and infrastructure work — [co-locating ASR, LLM, and TTS](https://telnyx.com/resources/voice-ai-agents-compared-latency) — is closing the gap to the 300–500ms window humans expect in conversation. Voice that lags feels broken; voice that responds feels alive. 4. The market validated it. Y Combinator reported a 70% rise in vertical voice AI startups between winter and fall 2024. Voice stopped being a novelty and became infrastructure — part of the broader platform shift toward voice-first interaction.

The honest caveat: LLMs aren't magic

The new constraints are reliability and latency, not understanding. When Amazon rebuilt Alexa with generative AI, it described the hard part as "getting LLMs to orchestrate APIs reliably." Leaked reports flagged the new Alexa deflecting answers, giving long or inaccurate responses, and struggling with latency. The lesson: the model unlocked the capability, but a production voice product still needs disciplined engineering — confirmation steps for risky actions, server-driven rendering of results, and tight latency budgets. That's an architecture problem, and architecture determines whether voice actually converts.

This is why we render results as server-driven UI specs instead of trusting the model to free-form everything, and why knowing when voice actually works in mobile apps — and when it doesn't matters more than ever. The capability is real; using it well is the work.

What this means for builders

If you tried to build voice five years ago and gave up on the intent-grammar treadmill, the ground has genuinely shifted. You no longer maintain an ever-growing list of intents and sample utterances. You define the actions your app can take — as tools — and let the model handle the infinite space of how users ask. That inversion is the whole game. It's also why a voice interface can now beat a chatbot for many mobile flows, and why the ROI case for voice in mobile apps finally pencils out. For multilingual products, the same open-intent capability is what makes a robust Arabic voice SDK possible without hand-authoring dialect grammars.

FAQ

Why did Siri and Alexa feel so limited for so long?

Because they ran on intent grammars: a fixed set of intents plus slot filling. Anything outside the scripted patterns triggered an intent-pattern-match failure, which is why users learned to over-enunciate, simplify, or give up entirely, as documented in dialogue-repair research.

What specifically did LLMs change?

Three things: open intent understanding (no fixed intent list), tool/function calling that turns natural language into structured, executable actions, and multi-step reasoning that can compose actions and ask clarifying questions instead of failing silently.

Is tool calling the same as the old slot filling?

It does the same job — extracting structured arguments from language — but it generalizes. Function calling works on phrasings and combinations nobody pre-defined, where slot fillers only worked inside the exact patterns engineers scripted.

Does this mean LLM voice assistants are flawless now?

No. The remaining hard problems are reliability and latency, not understanding. Amazon's own Alexa rebuild hit reliability and latency challenges. Production voice needs confirmation gates, server-driven rendering, and careful latency budgeting on top of the model.

Why does latency matter so much for voice?

Human conversation expects a response within roughly 300–500ms, and delays beyond 500ms feel unnatural. Latency compounds across speech-to-text, the LLM, and text-to-speech, so the whole architecture has to be engineered for speed, not just the model.

How do I add voice-to-actions to my app today?

Define your app's actions as tools, connect them through a standard like MCP, and let an LLM map spoken intent to those tools. That's the model Voqal's SDK is built on — read the docs or join the waitlist to get started.

Voqal is a voice-to-actions SDK that drops into any mobile app: spoken intent in, real actions out, results rendered as native UI. See the documentation or join the waitlist.