
Fine-tuning Nepali ASR on 157 hours for $7

I fine-tuned Qwen3-ASR-1.7B on a single public Nepali speech dataset, evaluated it against 8 open-source models across 3 held-out benchmarks, and found a dtype bug that invalidated prior Whisper evaluations for Nepali.

8 models benchmarked · 3 eval datasets · 157h training data · ~$7 compute cost

The problem with single-dataset ASR benchmarks

Most Nepali ASR models report WER on a single dataset. That number can look good while hiding severe weaknesses on other speech styles. A model that scores 5% on one dataset but 70% on another is not a 5% WER model.

I evaluated every model on three datasets with different recording conditions, speaker counts, and speech styles. None of these datasets were used during training. Cross-dataset consistency is the metric that matters.
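To make the scoring concrete, here is a minimal sketch of the macro-average metric, assuming corpus-level WER per dataset and using the jiwer library for illustration (the exact scoring code lives in the open-source benchmark script):

import jiwer

# Corpus-level WER for one evaluation set, in percent (lower is better).
def dataset_wer(references, hypotheses):
    return 100 * jiwer.wer(references, hypotheses)

# Unweighted mean across datasets, so no single benchmark dominates the score.
def macro_average_wer(per_dataset_pairs):
    scores = [dataset_wer(refs, hyps) for refs, hyps in per_dataset_pairs]
    return sum(scores) / len(scores)

# e.g. macro_average_wer([(fleurs_refs, fleurs_hyps),
#                         (indicvoices_refs, indicvoices_hyps),
#                         (openslr43_refs, openslr43_hyps)])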

Macro-average WER across all 3 datasets

Lower is better. See the contamination analysis below for the models whose OpenSLR-43 scores are anomalous.

Qwen3-ASR-Nepali (ours): 41.4%
wav2vec2-nepali (anish): 44.2%
Meta MMS-1B: 45.5%
wav2vec2-xlsr-300m: 45.5%
Whisper-small-Nepali: 48.2%
wav2vec2-xlsr (gagan): 54.0%
Whisper large-v3: 98.5%
Qwen3-ASR-0.6B Base: 109.6%
#1 on spontaneous speech
55.8% vs 62.4% MMS

IndicVoices-R: 2,060 speakers, natural conversational audio. The most realistic test of ASR quality. We beat MMS-1B by 6.6 WER points.

#1 on synthetic speech
31.4% vs 40.5% MMS

OpenSLR-43: TTS-generated speech. We beat MMS-1B by 9.1 WER points. Models with anomalously low scores here are excluded from this comparison.

#2 on clean read speech
37.0% vs 33.6% MMS

FLEURS: Studio-quality recordings. MMS-1B wins by 3.4 WER points. This is MMS's strongest domain given its massive multilingual pretraining.

Best macro-average WER
41.4% vs 44.2% next best

Averaging across all 3 datasets, our model achieves the lowest WER among all tested models. Cross-dataset consistency matters more than any single benchmark.

Cross-dataset evaluation

100 samples per dataset, WER % (lower is better)

Model                               FLEURS     IndicVoices-R    OpenSLR-43
Qwen3-ASR-Nepali (ours)             37.0%      55.8%            31.4%
Meta MMS-1B (npi)                   33.6%      62.4%            40.5%
Whisper large-v3                    94.0%      96.7%            105.8%
Whisper-small-Nepali (amitpant7)    64.5%      77.7%            2.3%*
wav2vec2-xlsr-300m (shniranjan)     43.3%      59.5%            33.9%*
wav2vec2-nepali (anish)             54.3%      73.7%            4.6%*
wav2vec2-xlsr (gagan)               70.8%      86.1%            5.0%*
Qwen3-ASR-0.6B Base                 116.0%     112.5%           100.4%

*Models marked with * show anomalously low OpenSLR-43 WER despite high WER on FLEURS and IndicVoices-R, suggesting dataset-specific overfitting or training-set overlap. See contamination analysis below.
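The protocol behind the table is sketched below under stated assumptions: eval_sets and transcribe() are hypothetical stand-ins for the actual dataset loaders and per-model inference wrappers, and every model is decoded greedily on the same 100 samples per dataset.

import jiwer

N_SAMPLES = 100

# eval_sets: {"FLEURS": [...], "IndicVoices-R": [...], "OpenSLR-43": [...]},
# each a list of {"audio": ..., "text": ...} samples from the held-out sets.
def evaluate(model, eval_sets):
    results = {}
    for name, samples in eval_sets.items():
        refs, hyps = [], []
        for sample in samples[:N_SAMPLES]:
            refs.append(sample["text"])
            hyps.append(transcribe(model, sample["audio"]))  # greedy decode, default settings
        results[name] = 100 * jiwer.wer(refs, hyps)
    return results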

Evaluation datasets

Each dataset tests a different failure mode. A model that only works on clean read speech will fail on real conversational audio. A model that only works on synthetic speech may have memorized its training data.

FLEURS

Clean read speech
Source: Google | Speakers: Multiple

Studio-quality recordings with clear pronunciation. The easiest benchmark and the most commonly reported in ASR papers.

IndicVoices-R

Spontaneous conversational
Source: AI4Bharat (NeurIPS 2024) | Speakers: 2,060

Natural speech with hesitations, interruptions, background noise, and diverse accents. The hardest and most realistic benchmark.

OpenSLR-43

TTS-generated speech
Source: OpenSLR | Speakers: Synthetic

Machine-generated speech. Tests whether models generalize beyond human recordings. Several models show anomalously low WER here.

OpenSLR-43 contamination analysis

Three models show WER below 5% on OpenSLR-43 while scoring 54-86% on FLEURS and IndicVoices-R. A model cannot legitimately generalize at 2% on one speech dataset while failing at 65-78% on others of comparable difficulty.

OpenSLR-43 is TTS-generated speech, making it more likely these models were trained on the same synthetic data or near-identical distributions. The cross-dataset performance gap itself is the evidence. Including these scores in a generalization comparison would be more misleading than excluding them.

Model                               FLEURS     IndicVoices-R    OpenSLR-43    Gap
amitpant7 whisper-small             64.5%      77.7%            2.3%          28x
wav2vec2-nepali (anish)             54.3%      73.7%            4.6%          12x
wav2vec2-xlsr (gagan)               70.8%      86.1%            5.0%          14x

A 12-28x gap between OpenSLR-43 and other datasets indicates these scores reflect dataset memorization, not general Nepali ASR capability.
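As a quick check, the Gap column can be reproduced from the table itself, assuming it is the ratio of FLEURS WER to OpenSLR-43 WER (an assumption, but it matches the values above):

# Gap ratio: how much lower a model's OpenSLR-43 WER is than its FLEURS WER.
def contamination_gap(fleurs_wer, slr43_wer):
    return fleurs_wer / slr43_wer

print(round(contamination_gap(64.5, 2.3)))  # 28  amitpant7 whisper-small
print(round(contamination_gap(54.3, 4.6)))  # 12  wav2vec2-nepali (anish)
print(round(contamination_gap(70.8, 5.0)))  # 14  wav2vec2-xlsr (gagan)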

Finding: Whisper float16 dtype bug in Nepali evaluation

Prior benchmarks reported Whisper large-v3 scoring 100% WER on Nepali, suggesting the model could not produce Nepali output at all. This turned out to be a bug, not a real result.

The standard HuggingFace Whisper inference pattern loads models in float16 for GPU efficiency. But the WhisperProcessor returns float32 input features. This causes a dtype mismatch RuntimeError in the encoder's conv1d layer.

If the evaluation script catches exceptions silently (a common pattern: except: text = ""), every sample produces an empty prediction. Empty predictions against real references give exactly 100% WER and 100% CER.
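A minimal reproduction of that failure mode (illustrative only; eval_samples and transcribe_fp16 are hypothetical stand-ins for the eval split and a float16-loaded Whisper wrapper):

import jiwer

hypotheses = []
for sample in eval_samples:
    try:
        text = transcribe_fp16(sample["audio"])  # raises RuntimeError: conv1d dtype mismatch
    except Exception:
        text = ""                                # silently swallowed -> empty prediction
    hypotheses.append(text)

references = [s["text"] for s in eval_samples]
print(jiwer.wer(references, hypotheses))         # 1.0, reported as 100% WER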

The fix

Loading the model in float32 instead of float16 resolves the dtype mismatch. After fixing this, Whisper large-v3 produces Nepali output but still has high WER (94% on FLEURS) due to word boundary and Devanagari spelling issues.

import torch
from transformers import WhisperForConditionalGeneration

# Before (broken): float16 weights, but WhisperProcessor emits float32 features,
# so the encoder's conv1d layer raises a dtype-mismatch RuntimeError
model = WhisperForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16)

# After (fixed): float32 weights match the processor's float32 features
model = WhisperForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float32)
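For completeness, a sketch of the full corrected inference path (the benchmark script may differ in details; audio_array is assumed to be a 16 kHz mono waveform and a single GPU is assumed):

import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_id = "openai/whisper-large-v3"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float32).to("cuda").eval()

# WhisperProcessor returns float32 features, now matching the float32 weights.
inputs = processor(audio_array, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    ids = model.generate(input_features=inputs.input_features.to("cuda"),
                         language="ne", task="transcribe")  # greedy, Nepali forced
text = processor.batch_decode(ids, skip_special_tokens=True)[0]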

Any Nepali ASR benchmark that reported Whisper results using float16 loading likely has invalid numbers for that model.

Data efficiency

Meta MMS-1B was pretrained on 500K+ hours of speech across 1,100+ languages. By contrast, our model was fine-tuned on 157 hours of a single language on one A100 GPU for approximately $7.

On spontaneous speech (IndicVoices-R), the focused single-language fine-tune outperformed the multilingual model by 6.6 WER points. Domain-matched training data and language-specific fine-tuning can compensate for orders of magnitude less compute.

The training data (OpenSLR-54) is entirely read speech. Despite never seeing spontaneous conversational audio during training, the model generalized well enough to beat MMS on IndicVoices-R. This suggests the Qwen3-ASR architecture handles domain shift better than expected.

Result

2 of 3 benchmarks where we beat Meta MMS-1B

IndicVoices-R: 55.8% vs 62.4% (ours wins by 6.6 pts)

OpenSLR-43: 31.4% vs 40.5% (ours wins by 9.1 pts)

FLEURS: 37.0% vs 33.6% (MMS wins by 3.4 pts)

Training data comparison: 157h (our training data) vs 500K+ hours (MMS pretraining data)

Training setup

The training data is OpenSLR-54: 157 hours of Nepali read speech with transcripts, approximately 37,000 utterances. A 95/5 train/validation split was used with seed 42. Common Voice Nepali was originally planned but was removed from HuggingFace by Mozilla in October 2025, so the final training used OpenSLR-54 only.

IndicVoices-R, OpenSLR-43, and FLEURS were used exclusively for evaluation. No samples from any evaluation dataset appeared in training.

Base model: Qwen3-ASR-1.7B
Training data: OpenSLR-54 (157h Nepali read speech, ~37K utterances)
Hardware: A100 80GB, single GPU
Best checkpoint: Step 2000
Batch size: 16 (effective 128 with gradient accumulation)
Learning rate: 2e-5, 3 epochs
Data format: JSONL, one record per line: {"audio": "path.wav", "text": "language Nepali<asr_text>..."}
Val split: 5% holdout (2000 samples), seed 42
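A sketch of how a manifest in that format can be produced (the actual preprocessing script may differ; utterances is a hypothetical list of (wav_path, transcript) pairs from OpenSLR-54):

import json
import random

random.seed(42)                    # same seed as the split above
random.shuffle(utterances)

records = [{"audio": wav_path,
            "text": f"language Nepali<asr_text>{transcript}"}
           for wav_path, transcript in utterances]

n_val = int(0.05 * len(records))   # 5% held-out validation
splits = {"val.jsonl": records[:n_val], "train.jsonl": records[n_val:]}
for name, rows in splits.items():
    with open(name, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")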

Limitations

Sample size: 100 samples per dataset is directionally valid but small. Full test-set evaluation would strengthen the results.

Text normalization: WER can vary depending on punctuation, numeral formatting, Unicode normalization, and whitespace handling. No cross-model normalization was applied; a minimal normalizer of the kind that would help is sketched after this list.

Decoding config: All models used greedy decoding with default settings. Beam search or temperature tuning could change individual results.

Missing baselines: IndicConformer (AI4Bharat) errored during evaluation. Other Nepali-specific models may exist that were not tested.

Base model contamination: The Qwen3-ASR-1.7B base model was pretrained on large-scale data that may include some evaluation data. This applies equally to MMS and Whisper.

Error rate: 37-56% WER is not production-ready. This model needs more diverse training data, code-switching coverage, and noise robustness before real-world deployment.
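Below is a minimal sketch of the shared normalizer the text-normalization point above calls for (illustrative choices, not the pipeline used in this benchmark):

import unicodedata

_DIGITS = str.maketrans("०१२३४५६७८९", "0123456789")   # one numeral convention
_PUNCT = {ord(c): " " for c in "।॥,.?!;:'\"()-"}       # incl. Devanagari danda

def normalize_nepali(text: str) -> str:
    text = unicodedata.normalize("NFC", text)          # canonical Unicode form
    text = text.translate(_DIGITS).translate(_PUNCT)
    return " ".join(text.split())                      # collapse whitespace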

Published model

The fine-tuned model is published on HuggingFace. The benchmark script, training code, and evaluation pipeline are open-source.

What would improve this

Training on all three datasets (OpenSLR-54 + OpenSLR-43 + IndicVoices-R = ~267 hours) instead of just one.

Full test-set evaluation instead of 100-sample subsets.

Text normalization pipeline standardized across all models.

Additional baselines: IndicConformer, faster-whisper, WhisperX.

Noise augmentation and code-switching data for robustness.