The short answer
MENA users mix Arabic and English in the same sentence because it is the natural register of bilingual life in the region, not a mistake to be corrected. Most automatic speech recognition (ASR) fails on this because it is built and benchmarked on monolingual audio: studies report a 30-50% relative increase in Word Error Rate (WER) when models meet code-switched speech versus clean single-language input. To handle it well, you need a recognizer trained on real code-switched corpora, a transliteration layer for Arabizi in front of your intent parser, and evaluation metrics that go beyond raw WER. This post explains the linguistics, the failure modes, and a concrete handling and evaluation playbook.
If you are building voice features for an Arabic-speaking audience, start with our complete Arabic voice SDK guide and the Arabic dialects voice recognition guide for the dialect dimension that sits underneath code-switching.
Why MENA users mix languages mid-sentence
Code-switching, the alternation between two languages within a single utterance, is one of the [most challenging and under-studied conditions for ASR](https://arxiv.org/abs/2605.19069). In the Arab world it is also one of the most common conditions. Sociolinguistic work across Egypt, Saudi Arabia, Jordan, Kuwait, Oman, and the UAE finds Arabic-English switching is a dominant everyday pattern, not a fringe behavior.
The reasons are structural and social, not laziness:
- Lexical gaps. A speaker reaches for the English equivalent when the Arabic expression is hard to retrieve, or when a term is simply more idiomatic in English.
- Technical and academic vocabulary. Banking, software, and medical terms are routinely English even inside Arabic sentences, a direct legacy of English and French entering daily speech through education and international business.
- Social signaling. Switching is used to decrease social distance, emphasize a point, grab attention, and signal topic expertise.
- Generational layering. Research notes older speakers code-mix in professional settings while younger speakers switch more selectively and contextually - so your user base contains multiple switching styles at once.
The practical upshot: "check my balance bukra" or "yacomplications kteer fel transaction di" is not an edge case. It is the median utterance. A voice product that only handles clean Modern Standard Arabic or clean English will mis-hear most of its real traffic. For the dialect-specific side of this, see our deep dives on Egyptian Arabic voice recognition and Gulf and Khaleeji Arabic voice recognition.
Arabizi: the written cousin of spoken code-switching
When MENA users *type* the same mixed speech, you get Arabizi (also called Arabish): Arabic written in Latin characters, numerals, and punctuation, where digits stand in for sounds with no Latin equivalent - 3 for ع, 7 for ح, 2 for the glottal stop. So "7abibi, e3mel transfer" is one phrase, two scripts, one speaker.
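As a rough illustration of the digit convention, the sketch below flags tokens that look like Arabizi and expands the three digits mentioned above. The names `ARABIZI_DIGITS`, `is_arabizi_token`, `flag_arabizi`, and `expand_digits` are hypothetical, the mapping covers only those three digits, and this is a detection heuristic rather than a transliterator.

```python
import re

# Only the three correspondences named in the text; real Arabizi uses more,
# and conventions vary by dialect.
ARABIZI_DIGITS = {"3": "ع", "7": "ح", "2": "ء"}

_TOKEN = re.compile(r"[A-Za-z0-9']+")


def is_arabizi_token(token: str) -> bool:
    """Heuristic: the token mixes Latin letters with one of the mapped digits."""
    has_letter = any(ch.isascii() and ch.isalpha() for ch in token)
    has_digit = any(ch in ARABIZI_DIGITS for ch in token)
    return has_letter and has_digit


def flag_arabizi(text: str) -> list[tuple[str, bool]]:
    """Return each Latin-script token with a flag saying whether it looks like Arabizi."""
    return [(tok, is_arabizi_token(tok)) for tok in _TOKEN.findall(text)]


def expand_digits(token: str) -> str:
    """Swap mapped digits for Arabic letters; Latin letters are left untouched."""
    return "".join(ARABIZI_DIGITS.get(ch, ch) for ch in token)


print(flag_arabizi("7abibi, e3mel transfer"))
```

On the example phrase this flags "7abibi" and "e3mel" and leaves "transfer" alone. A plain number such as "500" is not flagged because it contains no letters.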
Arabizi matters for voice apps for two reasons. First, your text fallback (the keyboard the user reaches for when voice fails) is full of it. Second, transliteration research is the closest thing the field has to a code-switch normalization layer:
- The CAMeL Lab, with Columbia and George Washington University, built automatic Arabizi-to-Arabic-script transliteration precisely because downstream NLP tools cannot tag or translate raw Arabizi.
- A Moroccan Darija system reached 92% word-level accuracy and 87 BLEU using phonetic rules plus round-trip consistency checking.
- A Tunisian-dialect Arabizi transliterator achieved a 10.47% character error rate.
The lesson is that script normalization is solvable but dialect-specific, which is exactly why [your users do not want to type](/resources/blog/your-users-dont-want-to-type) Arabizi at all. Friction with the Arabic keyboard is a measurable revenue drain - see [the hidden conversion tax of Arabic keyboard friction](/resources/blog/the-hidden-conversion-tax-how-arabic-keyboard-friction-costs-mena-apps-30-40-in-checkout-completion). Voice is the way out, if the voice layer handles the mix.
Why most ASR fails on code-switching
Four compounding problems break conventional recognizers:
1. Monolingual training and decoding. Standard ASR cannot [deal effectively with words from an unseen language](https://www.researchgate.net/publication/352993302_Arabic_Code-Switching_Speech_Recognition_Using_Monolingual_Data) mid-utterance. The language model expects one vocabulary; it gets two.
2. Data scarcity. Code-switched speech [remains scarce because it is hard to collect](https://arxiv.org/pdf/2506.22143), and Egyptian Arabic is itself under-resourced - a [low-resource, morphologically rich, orthographically unstandardized pair](https://arxiv.org/pdf/2108.12881).
3. No standard orthography. Dialectal Arabic has no fixed spelling, so even the *reference transcripts* disagree, inflating error rates artificially.
4. Spontaneity. Real switching is spontaneous and high-variance; switch points are unpredictable within a single phrase.
The numbers make it concrete. On the ArzEn Egyptian Arabic-English corpus - 12 hours from 38 speakers - a tuned ASR system reached a 30.6% WER and 18.7% CER. And critically, typical WER-improving tricks like in-domain fine-tuning do not close the code-switching gap even with models at Whisper's caliber. Throwing more monolingual data at the problem does not fix it.
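For readers who have not computed these metrics by hand, here is a minimal sketch of WER and CER as edit distance divided by reference length. The function names are illustrative, whitespace tokenization is an assumption, and published benchmarks usually apply text normalization that this sketch omits.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (substitute, insert, delete = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate, ignoring spaces."""
    ref_chars = reference.replace(" ", "")
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hypothesis.replace(" ", "")) / len(ref_chars)


# One spelling variant in a three-word utterance.
print(wer("e3mel transfer bukra", "e3mil transfer bukra"))
print(cer("e3mel transfer bukra", "e3mil transfer bukra"))
```

In this toy pair a single-character spelling difference counts as one wrong word out of three for WER but one wrong character out of eighteen for CER, which is why the two numbers can diverge sharply on unstandardized spellings.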
How to evaluate code-switching speech (don't trust WER alone)
WER is necessary but misleading on mixed speech. A recent commercial ASR benchmark across Egyptian Arabic, Najdi/Hijazi Saudi Arabic, Persian, and German found that WER inflated quality gaps by roughly 3x because it penalizes semantically correct transliteration choices: the model heard the word right but spelled it in a defensible alternative form that WER scored as wrong.
Use a layered evaluation instead:
| Metric | What it measures | Why it matters for code-switching |
|---|---|---|
| WER / CER | Raw token / character error | Baseline, but over-penalizes spelling variance |
| Code-Mixing Index (CMI) | Degree of language mixing per utterance | Lets you stratify accuracy by how mixed the input is |
| BERTScore | Semantic similarity to reference | Forgives valid transliteration; agreed on system ranking |
| PIER | Errors split by intra- vs inter-word switching | Pinpoints where the model breaks |
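To make the CMI row concrete, the sketch below uses one common utterance-level definition on a 0-1 scale: one minus the dominant language's share of the language-tagged tokens. Tagging by Unicode script is an assumption made here for brevity. It only works when Arabic words are written in Arabic script, so Arabizi tokens would be miscounted as English, and corpus-level figures may be computed differently.

```python
from collections import Counter
from typing import Optional


def token_language(token: str) -> Optional[str]:
    """Script-based guess: 'ar' for Arabic script, 'en' for Latin letters, else None."""
    if any("\u0600" <= ch <= "\u06ff" for ch in token):
        return "ar"
    if any(ch.isascii() and ch.isalpha() for ch in token):
        return "en"
    return None  # numbers, punctuation: language-independent


def code_mixing_index(transcript: str) -> float:
    """CMI = 1 - max_language_count / tagged_token_count (0.0 when monolingual)."""
    tags = [token_language(tok) for tok in transcript.split()]
    counts = Counter(tag for tag in tags if tag is not None)
    tagged = sum(counts.values())
    if tagged == 0:
        return 0.0
    return 1.0 - max(counts.values()) / tagged


print(code_mixing_index("اعمل transfer بكرة"))  # two Arabic tokens, one English
print(code_mixing_index("check my balance"))    # monolingual
```

A production pipeline would replace `token_language` with a proper word-level language identifier; the formula itself stays the same.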
A practical evaluation recipe:
1. Build a switch-heavy test set. The benchmark above selected samples via a [heuristic filter on structural code-switching signals plus a GPT-4o and Gemini ensemble scoring six linguistic dimensions](https://arxiv.org/abs/2605.19069). You do not need that exact pipeline, but you do need a set that is *actually* mixed.
2. Compute CMI per utterance so you can report accuracy at low, medium, and high mixing levels separately - the [ArzEn corpus itself sits around a 0.12-0.17 CMI](https://www.researchgate.net/publication/342715577_ArzEn_A_Speech_Corpus_for_Code-switched_Egyptian_Arabic-English), and your real traffic may be higher.
3. Pair WER with BERTScore to separate true errors from spelling noise.
4. Test the action, not just the transcript. In a voice-to-actions SDK, the question is whether "e3mel transfer 500 ginē" triggered the right transfer - not whether every character matched.
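The reporting side of this recipe can be sketched as a small aggregation over per-utterance results. The record fields (`cmi`, `wer`, `action_ok`) and the bucket thresholds of 0.1 and 0.3 are hypothetical choices for illustration, not values taken from any benchmark; pick cut-offs that fit your own traffic.

```python
from collections import defaultdict
from statistics import mean


def cmi_bucket(cmi: float, low: float = 0.1, high: float = 0.3) -> str:
    """Assign an utterance to a mixing level. Thresholds are illustrative only."""
    if cmi < low:
        return "low"
    if cmi < high:
        return "medium"
    return "high"


def stratified_report(records: list[dict]) -> dict:
    """Mean WER and action accuracy per mixing level.

    Each record is assumed to carry 'cmi' (float), 'wer' (float) and
    'action_ok' (bool: did the utterance trigger the intended action?).
    """
    groups = defaultdict(list)
    for rec in records:
        groups[cmi_bucket(rec["cmi"])].append(rec)
    return {
        level: {
            "n": len(rows),
            "mean_wer": mean(r["wer"] for r in rows),
            "action_accuracy": mean(1.0 if r["action_ok"] else 0.0 for r in rows),
        }
        for level, rows in groups.items()
    }


results = [
    {"cmi": 0.05, "wer": 0.10, "action_ok": True},
    {"cmi": 0.05, "wer": 0.30, "action_ok": True},
    {"cmi": 0.40, "wer": 0.50, "action_ok": False},
]
print(stratified_report(results))
```

Reporting `action_accuracy` next to `mean_wer` keeps step 4 visible: a bucket can have a high WER from spelling variance while still triggering the right action most of the time, or the reverse.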
How to actually handle it in production
- Pick a recognizer trained on code-switched audio. Commercial systems vary enormously - the same benchmark saw a top performer at 13.2% overall WER while others trailed badly. Do not assume your default English model degrades gracefully; it does not.
- Augment with synthetic switches. Spliced-audio generation that stitches monolingual clips into synthetic code-switched utterances is a proven way to compensate for scarce real data.
- Front a transliteration / normalization layer so Arabizi text fallbacks and mixed transcripts collapse to a canonical form your intent parser understands.
- Treat intent resolution as bilingual by default. The downstream agent must accept an Arabic verb with an English object (and vice versa) without a language flag.
- Voice-in, voice-out, same language register. Reply in the user's mixed register rather than forcing them into one language.
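As one way to picture "bilingual by default", the sketch below resolves an intent from a single keyword table that holds Arabic and English entries side by side, so no language flag is ever consulted. The table, the `Intent` type, and the intent names are hypothetical and deliberately tiny. This is not how Voqal or any particular SDK works, and a real system would use a trained parser plus the normalization layer described above.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical keyword table: Arabic and English keys map to the same intent.
INTENT_KEYWORDS = {
    "transfer": "transfer_money",
    "تحويل": "transfer_money",
    "balance": "check_balance",
    "رصيد": "check_balance",
}


@dataclass
class Intent:
    name: str
    amount: Optional[int] = None


def resolve_intent(utterance: str) -> Optional[Intent]:
    """Pick the first known keyword in either language, plus the first number if any."""
    tokens = re.findall(r"\w+", utterance.lower())
    name = next((INTENT_KEYWORDS[t] for t in tokens if t in INTENT_KEYWORDS), None)
    if name is None:
        return None
    amount = next((int(t) for t in tokens if t.isdecimal()), None)
    return Intent(name, amount)


print(resolve_intent("e3mel transfer 500 ginē"))
print(resolve_intent("اعمل تحويل 500"))
print(resolve_intent("check my balance bukra"))
```

The Arabizi verb "e3mel" and the Arabic-script verb "اعمل" are both simply ignored here because the object word carries the intent; the point is only that mixed and single-language phrasings land on the same `Intent`.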
This is the design philosophy behind a voice-to-actions layer: the recognizer and the action engine are co-designed for the mix. If you are building on mobile, our guide to adding Arabic voice control in React Native walks through the integration, and the broader case for this architecture is in voice-first: the next platform shift. For a full landscape of options, compare the best Arabic voice speech-to-text APIs for 2026.
Frequently asked questions
Is Arabic-English code-switching really that common, or an edge case?
It is the norm, not the edge. Sociolinguistic studies find it dominant across Egypt, the Gulf, and the Levant, driven by lexical gaps, technical vocabulary, and social signaling. A product that ignores it mis-handles the median utterance.
Why does my English ASR get so much worse on mixed speech?
Because it was trained and decoded as monolingual. Models show a 30-50% relative WER increase on code-switched input, and they cannot handle words from a language they were not trained to expect.
What is Arabizi and do I need to support it?
Arabizi is Arabic written in Latin letters and numerals (3 for ع, 7 for ح). If you have any text fallback, yes - it is what users type. A transliteration layer normalizes it; mature systems hit 90%+ word accuracy.
Can I just fine-tune Whisper on my own data to fix this?
Not reliably. Research shows in-domain fine-tuning does not close the code-switching gap even for Whisper-caliber models. You need code-switch-specific data, synthetic augmentation, and the right evaluation - not just more hours.
How should I measure code-switching ASR quality?
Don't rely on WER alone - in one commercial benchmark it inflated quality gaps by roughly 3x by penalizing valid transliteration. Pair it with BERTScore for semantics, stratify by Code-Mixing Index, and use PIER to localize switch-point errors.
Does code-switching support remove the need for the Arabic keyboard?
Largely, yes - that is the point. Arabic keyboard friction is a documented 30-40% conversion drain in checkout, and your users do not want to type. A bilingual voice layer lets them speak the way they already talk.
Build voice that speaks the way MENA actually speaks
Code-switching is not noise to filter out - it is the signal. Voqal is a voice-to-actions SDK designed for Arabic-English mixing from the ground up. See the docs to integrate, or join the waitlist to get early access.