Build vs Buy: The True Cost of an In-House Voice Assistant

Verdict first: for the overwhelming majority of product teams, building an in-house voice assistant is a false economy. The visible work (wiring up a speech API and an LLM) is maybe 10% of the job. The other 90% is dialect tuning, intent resolution, action orchestration, UI for every state, RTL/i18n, latency engineering, security and confirmation flows, and the maintenance and on-call tail that never ends. A voice-to-actions SDK collapses that to a drop-in integration measured in minutes to days, not quarters. Build only when voice is your product. Otherwise, buy.

This post is written for founders and operators who have to defend the line item. It is honest about the rare cases where building is correct, and it gives you a cost framework you can take into a planning meeting.

The trap: the demo is the easy 10%

Any competent engineer can wire a speech-to-text API to an LLM and get a voice demo working in an afternoon. That demo is what fools teams into estimating "a few weeks." The demo doesn't handle a noisy room, a code-switched sentence, a half-finished utterance, a network blip mid-turn, a user who interrupts the assistant, or an Arabic dialect the model was never trained on. It doesn't render a confirmation card before moving money. It doesn't degrade gracefully when the STT vendor has a slow day.

Production voice is a systems problem, not a feature. Here is what "production" actually contains.

STT selection and dialect tuning

You don't get to pick one speech-to-text vendor and move on. You have to evaluate several against your users' actual audio — accents, background noise, domain vocabulary — because per-vendor accuracy varies wildly by language and acoustic conditions. Then you tune. For English this is annoying. For Arabic it is a research project: dialects differ in phonology, morphology, and lexicon; diacritics are usually absent; users code-switch into English mid-sentence; and annotated training data is scarce (Arabic ASR: Challenges and Progress). A generic STT endpoint will mis-hear dialectal Arabic constantly, and "just pick a better model" is months of evaluation, not a config flag. We go deeper on this in why building Arabic voice in-house takes 6 months and the complete Arabic voice SDK guide.
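To make "evaluate several against your users' actual audio" concrete, here is a minimal sketch of a vendor bake-off scored by word error rate (WER). The vendor names and sample transcripts are hypothetical, and whitespace tokenization is a simplifying assumption: dialectal Arabic would likely need text normalization (diacritics, letter variants) before scoring.

```python
from typing import Dict, List, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words
    # consumed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def rank_vendors(
    references: List[str], vendor_outputs: Dict[str, List[str]]
) -> List[Tuple[str, float]]:
    """Return (vendor, mean per-utterance WER) pairs, best vendor first."""
    scores = []
    for vendor, hypotheses in vendor_outputs.items():
        if len(hypotheses) != len(references):
            raise ValueError(f"{vendor}: transcript count does not match references")
        wers = [word_error_rate(r, h) for r, h in zip(references, hypotheses)]
        scores.append((vendor, sum(wers) / len(wers)))
    return sorted(scores, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical human-verified transcript and two vendors' outputs.
    references = ["send money to ahmed"]
    vendor_outputs = {
        "vendor_a": ["send honey to"],
        "vendor_b": ["send money to ahmed"],
    }
    print(rank_vendors(references, vendor_outputs))
    # [('vendor_b', 0.0), ('vendor_a', 0.5)]
```

The scoring code is the small part. The months go into collecting and transcribing a test set that represents your real users, then rerunning it every time a vendor changes a model.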

Intent and NLU

Getting clean text is half the battle. Now you have to turn "can you send Ahmed the usual" into a structured, parameterized action with the right entities resolved. Modern LLMs make this far easier than the old intent-classifier era — but only if you invest in the prompt contract, tool schemas, disambiguation, and the long tail of "the model confidently did the wrong thing." That long tail is where the real engineering months go.
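As a rough illustration of the "prompt contract" and disambiguation work, the sketch below validates a model-proposed tool call instead of trusting it. Everything here is an assumption made for the example: the JSON shape of the model output, the `TOOL_SCHEMAS` table, the tool names, and the contact list.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List, Union

# Hypothetical tool schema: tool name -> required parameter names and types.
TOOL_SCHEMAS: Dict[str, Dict[str, type]] = {
    "send_payment": {"recipient": str, "amount": float, "currency": str},
    "check_balance": {"account": str},
}


@dataclass
class Action:
    tool: str
    params: Dict[str, Any]


@dataclass
class Clarification:
    question: str


def parse_tool_call(raw: str, contacts: List[str]) -> Union[Action, Clarification]:
    """Turn raw model output into a validated Action, or ask the user a question."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return Clarification("Sorry, could you rephrase that?")
    if not isinstance(call, dict):
        return Clarification("Sorry, could you rephrase that?")

    tool = call.get("tool")
    schema = TOOL_SCHEMAS.get(tool) if isinstance(tool, str) else None
    if schema is None:
        return Clarification("I can't do that yet. What else can I help with?")

    raw_params = call.get("params")
    if not isinstance(raw_params, dict):
        raw_params = {}

    params: Dict[str, Any] = {}
    for name, expected in schema.items():
        value = raw_params.get(name)
        # JSON has a single number type, so accept ints where a float is expected.
        if expected is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
        if not isinstance(value, expected):
            return Clarification(f"What {name} should I use?")
        params[name] = value  # parameters outside the schema are dropped

    if "recipient" in params:
        wanted = params["recipient"].lower()
        matches = [c for c in contacts if wanted in c.lower()]
        if not matches:
            return Clarification(f"I could not find {params['recipient']}. Who do you mean?")
        if len(matches) > 1:
            return Clarification(f"Which {params['recipient']} do you mean?")
        params["recipient"] = matches[0]

    return Action(tool, params)
```

With contacts `["Ahmed Ali", "Sara Noor"]`, a call naming "ahmed" resolves to "Ahmed Ali"; with two Ahmeds in the list, the function returns a clarifying question rather than guessing. Resolving "the usual" into an amount would need the user's history, which this sketch leaves out.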

Action orchestration

Understanding intent is useless unless it safely does something. That means a tool/function-calling layer, connections to your backend systems, parameter validation, idempotency, error recovery when a downstream call fails, and a confirmation step before anything irreversible. This is a backend subsystem in its own right.
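A minimal sketch of that layer might look like the following. It shows three of the pieces named above: a confirmation gate for irreversible tools, idempotency keyed by a caller-supplied token, and error recovery that leaves a failed call retryable. The `Orchestrator` class, the `IRREVERSIBLE_TOOLS` set, and the result dictionaries are hypothetical names chosen for illustration.

```python
from typing import Any, Callable, Dict

# Hypothetical set of tools that must never run without explicit confirmation.
IRREVERSIBLE_TOOLS = {"send_payment"}


class Orchestrator:
    def __init__(self, handlers: Dict[str, Callable[..., Any]]) -> None:
        self._handlers = handlers
        # Completed results by idempotency key (in memory for this sketch).
        self._results: Dict[str, Dict[str, Any]] = {}

    def execute(
        self,
        tool: str,
        params: Dict[str, Any],
        idempotency_key: str,
        confirmed: bool = False,
    ) -> Dict[str, Any]:
        # A retried request with the same key returns the stored result
        # instead of running the side effect a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        if tool not in self._handlers:
            return {"status": "error", "reason": "unknown_tool"}
        if tool in IRREVERSIBLE_TOOLS and not confirmed:
            return {"status": "needs_confirmation", "tool": tool, "params": params}
        try:
            output = self._handlers[tool](**params)
        except Exception as exc:
            # Failures are not stored, so the caller may retry with the same key.
            return {"status": "error", "reason": str(exc)}
        result = {"status": "ok", "output": output}
        self._results[idempotency_key] = result
        return result
```

A production version would persist the idempotency store and decide what to do when a downstream call fails after partially executing. Neither is covered here.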

UI for every state

Voice has more states than people expect: idle, listening, thinking, speaking, interrupted (barge-in), error, retry, confirming. Each needs visual feedback, and the assistant needs to render answers — balances, lists, cards, charts — not just speak. Hard-coding a screen per answer type does not scale; a server-driven render-spec approach is what lets the backend describe UI the client renders generically. Building that yourself is a meaningful frontend investment on top of everything above.
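One way to keep those states honest is an explicit state machine that rejects transitions the UI has no feedback for. The sketch below uses the eight states listed above; the transition table itself is an assumption for illustration, and a real product would likely allow a different set of moves.

```python
from enum import Enum


class VoiceState(Enum):
    IDLE = "idle"
    LISTENING = "listening"
    THINKING = "thinking"
    SPEAKING = "speaking"
    INTERRUPTED = "interrupted"
    ERROR = "error"
    RETRY = "retry"
    CONFIRMING = "confirming"


# Assumed transition table: each state maps to the states it may move to.
ALLOWED_TRANSITIONS = {
    VoiceState.IDLE: {VoiceState.LISTENING},
    VoiceState.LISTENING: {VoiceState.THINKING, VoiceState.IDLE, VoiceState.ERROR},
    VoiceState.THINKING: {VoiceState.SPEAKING, VoiceState.CONFIRMING, VoiceState.ERROR},
    VoiceState.SPEAKING: {VoiceState.IDLE, VoiceState.INTERRUPTED, VoiceState.ERROR},
    VoiceState.INTERRUPTED: {VoiceState.LISTENING},  # barge-in goes back to listening
    VoiceState.CONFIRMING: {VoiceState.THINKING, VoiceState.IDLE},
    VoiceState.ERROR: {VoiceState.RETRY, VoiceState.IDLE},
    VoiceState.RETRY: {VoiceState.LISTENING, VoiceState.IDLE},
}


class VoiceSession:
    def __init__(self) -> None:
        self.state = VoiceState.IDLE

    def transition(self, target: VoiceState) -> VoiceState:
        """Move to target, or raise if the move is not in the table."""
        if target not in ALLOWED_TRANSITIONS[self.state]:
            raise ValueError(
                f"illegal transition: {self.state.value} -> {target.value}"
            )
        self.state = target
        return self.state
```

Each state then maps to exactly one piece of visual feedback, and an illegal jump such as idle straight to speaking fails loudly in testing instead of leaving the user looking at a stale screen.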

RTL and i18n

If you serve Arabic (or Hebrew, Farsi, Urdu), right-to-left layout, bidirectional text mixing, mirrored components, and locale-correct number/date formatting are not optional polish — they are correctness. Bolting RTL onto a UI built LTR-first is a painful retrofit, and bidirectional text rendering has its own edge cases.
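A small taste of those edge cases: even deciding whether a mixed Arabic and English string should be laid out right-to-left needs a rule. The sketch below uses a simplified "first strong character" heuristic, similar in spirit to the paragraph-direction rule in the Unicode Bidirectional Algorithm, built on Python's standard `unicodedata` module. The `base_direction` helper and the left-to-right fallback are assumptions for illustration, and the sketch ignores directional isolates and overrides.

```python
import unicodedata


def base_direction(text: str) -> str:
    """Return 'rtl' or 'ltr' based on the first strongly directional character."""
    for char in text:
        bidi_class = unicodedata.bidirectional(char)
        if bidi_class in ("R", "AL"):  # Hebrew-style and Arabic-style letters
            return "rtl"
        if bidi_class == "L":  # Latin and other left-to-right letters
            return "ltr"
    # Digits, spaces, and punctuation are not strong; fall back to an assumed default.
    return "ltr"
```

A sentence that opens in Arabic and switches to English is laid out right-to-left, while one that opens with a number followed by Arabic is still right-to-left because digits carry no strong direction. Layout mirroring and locale-aware number and date formatting are separate problems on top of this.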

Latency engineering

A voice turn that takes eight seconds feels broken. Hitting a sub-three-second warm turn means connection pooling to your backend and tool servers, prompt-cache warming, prewarming the pipeline at app launch, streaming partial responses, and caching read-heavy tool calls. In our own measurements, latency is dominated by external services — a cold connection or a prompt-cache miss can add several seconds — and engineering it down to feel instant is ongoing work, not a one-time fix.
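Of the techniques above, caching read-heavy tool calls is the easiest to show in a few lines. The following is a minimal time-to-live cache sketch; the `TTLCache` class, the 30-second lifetime in the usage note, and the injectable clock are illustrative assumptions, not a description of any particular SDK.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Cache results of read-only tool calls for a short, fixed lifetime."""

    def __init__(
        self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_fetch(self, key: Hashable, fetch: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # fresh hit: skip the slow external call
        value = fetch()  # miss or expired: pay the full latency once
        self._entries[key] = (now, value)
        return value
```

With `TTLCache(ttl_seconds=30)`, a second balance lookup inside the same half-minute returns immediately instead of waiting on the backend. This only suits read-only calls; anything that changes state should never be served from a cache.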

Security and confirmation

The moment voice can move money or change account state, you inherit a security surface: tiered confirmation (tap-to-confirm vs biometric for high-risk actions), proof-of-possession so requests can't be replayed, session security, and the discipline that the assistant must never name or leak the auth mechanism in its spoken reply. Get this wrong and a voice feature becomes a liability.
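To show what "requests can't be replayed" involves, here is a simplified sketch that signs each request with a timestamp and a one-time nonce, then rejects stale, tampered, or repeated requests. It deliberately swaps in an HMAC over a shared per-device secret for brevity; real proof-of-possession schemes typically use an asymmetric key held on the device. The `ReplayGuard` class, the message layout, and the 30-second window are assumptions for illustration.

```python
import hashlib
import hmac
import time
from typing import Optional, Set

MAX_SKEW_SECONDS = 30  # assumed freshness window


def sign_request(key: bytes, body: str, nonce: str, timestamp: int) -> str:
    """Sign timestamp, nonce, and body together so none can be swapped out."""
    # Assumes the nonce contains no '.' (for example, a hex string).
    message = f"{timestamp}.{nonce}.{body}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


class ReplayGuard:
    def __init__(self, key: bytes) -> None:
        self._key = key
        # A real service would expire nonces once they leave the freshness window.
        self._seen_nonces: Set[str] = set()

    def verify(
        self,
        body: str,
        nonce: str,
        timestamp: int,
        signature: str,
        now: Optional[int] = None,
    ) -> bool:
        current = int(time.time()) if now is None else now
        if abs(current - timestamp) > MAX_SKEW_SECONDS:
            return False  # too old or too far in the future
        expected = sign_request(self._key, body, nonce, timestamp)
        if not hmac.compare_digest(expected, signature):
            return False  # wrong key or tampered body
        if nonce in self._seen_nonces:
            return False  # the same signed request was already accepted
        self._seen_nonces.add(nonce)
        return True
```

Tiered confirmation sits on top of a check like this: low-risk actions might need only a tap, while high-risk ones require a biometric prompt before the signed request is sent at all.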

Maintenance and on-call

None of the above is build-once. STT and LLM vendors change models, deprecate endpoints, and have bad days. Dialect accuracy drifts. New tools get added. Someone has to own the pager when voice breaks at 2am. This recurring cost is the line item teams most consistently forget, and it never goes to zero.

The cost comparison

The numbers below are illustrative ranges, not quotes. They assume a US senior engineer at a fully loaded cost of roughly $150k–$200k per year (base plus benefits and overhead; base salaries land near $104k–$138k depending on source), and usage-based API costs in the published range for speech-to-text (roughly $0.003–$0.036 per minute) plus LLM tokens and TTS. Independent estimates put a custom voice agent anywhere from $20k to $300k+ depending on scope. Plug in your own loaded rate.
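If you want to plug in your own loaded rate, the arithmetic is simple enough to sketch. The function names and the 100,000 minutes per month volume are hypothetical. Note that the bare minimum of the stated ranges (one engineer, four months, $150k) works out to about $50k; an upfront figure near $120k corresponds to something like two engineers for four months at $180k, which is one assumed combination among many.

```python
def build_cost(engineers: int, months: float, loaded_annual_rate: float) -> float:
    """Upfront engineering cost: people x months x monthly loaded rate."""
    return engineers * months * loaded_annual_rate / 12


def annual_maintenance(fte_fraction: float, loaded_annual_rate: float) -> float:
    """Yearly cost of the maintenance and on-call tail."""
    return fte_fraction * loaded_annual_rate


def monthly_stt_cost(minutes: float, rate_per_minute: float) -> float:
    """Usage-based speech-to-text spend for one month."""
    return minutes * rate_per_minute


if __name__ == "__main__":
    print(build_cost(1, 4, 150_000))  # 50000.0
    print(build_cost(2, 4, 180_000))  # 120000.0
    print(build_cost(3, 8, 200_000))  # 400000.0
    print(annual_maintenance(0.5, 150_000))  # 75000.0
    print(annual_maintenance(1.0, 200_000))  # 200000.0
    # Hypothetical volume of 100,000 voice minutes per month.
    print(round(monthly_stt_cost(100_000, 0.003), 2))  # 300.0
    print(round(monthly_stt_cost(100_000, 0.036), 2))  # 3600.0
```

Even at the top of the published speech-to-text range, the hypothetical monthly API bill is a small fraction of a single month of one engineer's loaded cost, which is why the engineering lines dominate.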

Line item	Build in-house	Buy a voice-to-actions SDK
Time to first working integration	4–8 months (1–3 engineers)	Minutes to a few days
STT selection + dialect tuning	1–3 months of evaluation + ongoing	Handled; tuned for the target dialects
Intent / NLU layer	1–2 months + long-tail fixes	Built in
Action orchestration + confirmation	1–2 months	Built in (tiered confirm, render spec)
UI for every state + RTL/i18n	1–2 months	Drop-in themed UI, RTL included
Latency engineering	Ongoing, weeks of tuning	Pooling + prewarm shipped
Security (PoP, session, biometrics)	Weeks + audit risk	Built in
Upfront engineering cost (illustrative)	~$120k–$400k+	Integration time only (hours–days)
Ongoing cost	Maintenance + on-call (≈0.5–1 FTE) + usage APIs	Subscription/usage + your existing API keys
Opportunity cost	Quarters not spent on your core product	Ship now, iterate on product

The asymmetry is the point. Building converts product-roadmap quarters into infrastructure quarters. Buying converts a multi-quarter project into an afternoon, and lets you spend those engineers on the thing that actually differentiates you. For the revenue side of that trade, see the business case for voice ROI in mobile apps.

When building is the right call (honestly)

Buying is not always correct. Build in-house when at least one of these is genuinely true:

Voice is your product, not a feature. If your company is the voice layer — you sell ASR, you're a voice-agent platform, the assistant is the whole value proposition — then this stack is your moat and you must own it.
You have a dedicated, funded voice/ML team that wants control of every layer and the appetite to run it for years. Not "we'll have an engineer look at it" — a real, staffed, ongoing commitment.
You have a hard requirement no vendor can meet — fully on-device/offline inference for a regulated environment, a proprietary acoustic model on data you can't share, or a latency floor below what any hosted pipeline allows.
Extreme scale economics. At very high, sustained voice volume, owning the pipeline can eventually beat per-minute pricing — but only after you've already paid the build-and-maintain cost, so this is a late-stage optimization, not a starting point.

Notice how narrow that list is. If you're nodding at one of these, build with eyes open. If you're reaching to justify one, that's your signal to buy.
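The scale-economics case in particular can be sanity-checked with a break-even calculation. This is a rough sketch under stated assumptions: the own-pipeline marginal cost of $0.006 per minute is invented for illustration, the fixed cost is 0.5 FTE at $150k per year, and the comparison ignores LLM tokens, TTS, and any SDK subscription.

```python
import math


def breakeven_minutes(
    monthly_fixed_cost: float, vendor_rate: float, own_rate: float
) -> float:
    """Monthly voice minutes at which owning the pipeline covers its fixed cost."""
    saving_per_minute = vendor_rate - own_rate
    if saving_per_minute <= 0:
        return math.inf  # owning never wins on a per-minute basis
    return monthly_fixed_cost / saving_per_minute


def payback_months(
    upfront_cost: float,
    monthly_minutes: float,
    monthly_fixed_cost: float,
    vendor_rate: float,
    own_rate: float,
) -> float:
    """Months of net savings needed to recover the upfront build cost."""
    monthly_saving = monthly_minutes * (vendor_rate - own_rate) - monthly_fixed_cost
    if monthly_saving <= 0:
        return math.inf  # below break-even volume, the build never pays back
    return upfront_cost / monthly_saving


if __name__ == "__main__":
    fixed = 0.5 * 150_000 / 12  # 0.5 FTE at $150k per year, per month
    print(round(breakeven_minutes(fixed, 0.036, 0.006)))  # 208333
    print(round(payback_months(120_000, 1_000_000, fixed, 0.036, 0.006), 1))  # 5.1
```

Under these assumptions, owning the pipeline only starts to cover its own maintenance above roughly 200,000 voice minutes a month, and only against the most expensive per-minute rate. Against the cheapest published rate the per-minute saving disappears and the break-even volume is infinite.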

Self-qualify in 30 seconds

Answer honestly:

1. Is voice the core of what we sell, or a feature on top of our real product?
2. Do we have a staffed team ready to own STT, NLU, latency, security, and on-call for years?
3. Do we have a requirement (offline, proprietary model, regulatory) that no SDK can satisfy?

If you answered "feature," "no," and "no" — buy. You'll ship this quarter instead of next year, and your engineers stay on the work only you can do. Read the integration docs to see how fast a drop-in SDK actually is, or join the waitlist.

Frequently asked questions

How long does it really take to build an in-house voice assistant?

The demo takes an afternoon; production takes 4–8 months with 1–3 engineers, and that's before the maintenance tail. The gap between "it works in the demo" and "it works for real users in a noisy room speaking a dialect" is where the time goes. A drop-in SDK ships in minutes to days.

Isn't building cheaper since we just pay for APIs?

No. API costs (STT, LLM tokens, TTS) are usage-based and small per turn, but they are the smallest part of total cost of ownership. The dominant costs are engineering build time and ongoing maintenance plus on-call — roughly 0.5–1 FTE indefinitely. You pay for the same usage APIs whether you build or buy; building just adds the engineering bill on top.

Why is Arabic voice specifically so hard to build?

Arabic combines several hard problems: many distinct dialects, near-universal absence of diacritics in everyday text, frequent code-switching into English, and scarce annotated training data (research overview). A generic STT model mis-hears dialectal Arabic constantly, and fixing that is a research effort, not a config change. Plus you inherit full RTL UI work.

Can't we start with a build and switch to an SDK later?

You can, but you'll have paid twice — once for the abandoned build, once for the SDK — and lost the quarters in between. If you're not certain voice is core to your product, start by buying; you can always invest in a custom stack later once volume and requirements justify it.

What does "voice-to-actions" mean versus a plain voice assistant?

A plain voice assistant transcribes and replies. Voice-to-actions means the spoken request safely executes something in your backend — check a balance, send a payment link, file a request — with parameter validation and a confirmation step before anything irreversible. That action and confirmation layer is one of the most expensive parts to build correctly, and it's built into a good SDK.

How does an SDK keep latency low if it's calling external services?

Latency is dominated by external services (STT, the LLM, your tool servers), so the wins come from connection pooling, prompt-cache warming, prewarming the pipeline at app launch, and caching read-heavy calls. That engineering is already shipped in the SDK; building it yourself is weeks of tuning that you would have to keep revisiting. See the render-spec architecture for how the response path stays fast.