
Fine-tuning Nepali ASR on 157 hours for $7

I fine-tuned Qwen3-ASR-1.7B on a single public Nepali speech dataset, evaluated it against 8 open-source models across 3 held-out benchmarks, and found a dtype bug that invalidated prior Whisper evaluations for Nepali.

8 models benchmarked · 3 eval datasets · 157h training data · ~$7 compute cost

The problem with single-dataset ASR benchmarks

Most Nepali ASR models report WER on a single dataset. That number can look good while hiding severe weaknesses on other speech styles. A model that scores 5% on one dataset but 70% on another is not a 5% WER model.

I evaluated every model on three datasets with different recording conditions, speaker counts, and speech styles. None of these datasets were used during training. Cross-dataset consistency is the metric that matters.
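To make the scoring concrete, here is a minimal sketch of the macro-average metric, assuming corpus-level WER per dataset and using the jiwer library for illustration (the exact scoring code lives in the open-source benchmark script):

import jiwer

# Corpus-level WER for one evaluation set, in percent (lower is better).
def dataset_wer(references, hypotheses):
    return 100 * jiwer.wer(references, hypotheses)

# Unweighted mean across datasets, so no single benchmark dominates the score.
def macro_average_wer(per_dataset_pairs):
    scores = [dataset_wer(refs, hyps) for refs, hyps in per_dataset_pairs]
    return sum(scores) / len(scores)

# e.g. macro_average_wer([(fleurs_refs, fleurs_hyps),
#                         (indicvoices_refs, indicvoices_hyps),
#                         (openslr43_refs, openslr43_hyps)])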

Macro-average WER across all 3 datasets

Lower is better. See the contamination analysis below for the models whose OpenSLR-43 scores are anomalous.

Qwen3-ASR-Nepali (ours): 41.4%
wav2vec2-nepali (anish): 44.2%
Meta MMS-1B: 45.5%
wav2vec2-xlsr-300m: 45.5%
Whisper-small-Nepali: 48.2%
wav2vec2-xlsr (gagan): 54.0%
Whisper large-v3: 98.5%
Qwen3-ASR-0.6B Base: 109.6%
#1 on spontaneous speech
55.8% vs 62.4% MMS

IndicVoices-R: 2,060 speakers, natural conversational audio. The most realistic test of ASR quality. We beat MMS-1B by 6.6 WER points.

#1 on synthetic speech
31.4% vs 40.5% MMS

OpenSLR-43: TTS-generated speech. We beat MMS-1B by 9.1 WER points. Models with anomalously low scores here are excluded from this comparison.

#2 on clean read speech
37.0% vs 33.6% MMS

FLEURS: Studio-quality recordings. MMS-1B wins by 3.4 WER points. This is MMS's strongest domain given its massive multilingual pretraining.

Best macro-average WER
41.4% vs 44.2% next best

Averaging across all 3 datasets, our model achieves the lowest WER among all tested models. Cross-dataset consistency matters more than any single benchmark.

Cross-dataset evaluation

100 samples per dataset, WER % (lower is better)

Model                               FLEURS     IndicVoices-R    OpenSLR-43
Qwen3-ASR-Nepali (ours)             37.0%      55.8%            31.4%
Meta MMS-1B (npi)                   33.6%      62.4%            40.5%
Whisper large-v3                    94.0%      96.7%            105.8%
Whisper-small-Nepali (amitpant7)    64.5%      77.7%            2.3%*
wav2vec2-xlsr-300m (shniranjan)     43.3%      59.5%            33.9%*
wav2vec2-nepali (anish)             54.3%      73.7%            4.6%*
wav2vec2-xlsr (gagan)               70.8%      86.1%            5.0%*
Qwen3-ASR-0.6B Base                 116.0%     112.5%           100.4%

*Models marked with * show anomalously low OpenSLR-43 WER despite high WER on FLEURS and IndicVoices-R, suggesting dataset-specific overfitting or training-set overlap. See contamination analysis below.
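The protocol behind the table is sketched below under stated assumptions: eval_sets and transcribe() are hypothetical stand-ins for the actual dataset loaders and per-model inference wrappers, and every model is decoded greedily on the same 100 samples per dataset.

import jiwer

N_SAMPLES = 100

# eval_sets: {"FLEURS": [...], "IndicVoices-R": [...], "OpenSLR-43": [...]},
# each a list of {"audio": ..., "text": ...} samples from the held-out sets.
def evaluate(model, eval_sets):
    results = {}
    for name, samples in eval_sets.items():
        refs, hyps = [], []
        for sample in samples[:N_SAMPLES]:
            refs.append(sample["text"])
            hyps.append(transcribe(model, sample["audio"]))  # greedy decode, default settings
        results[name] = 100 * jiwer.wer(refs, hyps)
    return results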

Evaluation datasets

Each dataset tests a different failure mode. A model that only works on clean read speech will fail on real conversational audio. A model that only works on synthetic speech may have memorized its training data.

FLEURS

Clean read speech
Source: Google | Speakers: Multiple

Studio-quality recordings with clear pronunciation. The easiest benchmark and the most commonly reported in ASR papers.

IndicVoices-R

Spontaneous conversational
Source: AI4Bharat (NeurIPS 2024) | Speakers: 2,060

Natural speech with hesitations, interruptions, background noise, and diverse accents. The hardest and most realistic benchmark.

OpenSLR-43

TTS-generated speech
Source: OpenSLR | Speakers: Synthetic

Machine-generated speech. Tests whether models generalize beyond human recordings. Several models show anomalously low WER here.

OpenSLR-43 contamination analysis

Three models show WER below 5% on OpenSLR-43 while scoring 54-86% on FLEURS and IndicVoices-R. A model cannot legitimately generalize at 2% on one speech dataset while failing at 65-78% on others of comparable difficulty.

OpenSLR-43 is TTS-generated speech, making it more likely these models were trained on the same synthetic data or near-identical distributions. The cross-dataset performance gap itself is the evidence. Including these scores in a generalization comparison would be more misleading than excluding them.

Model                               FLEURS     IndicVoices-R    OpenSLR-43    Gap
amitpant7 whisper-small             64.5%      77.7%            2.3%          28x
wav2vec2-nepali (anish)             54.3%      73.7%            4.6%          12x
wav2vec2-xlsr (gagan)               70.8%      86.1%            5.0%          14x

A 12-28x gap between OpenSLR-43 and other datasets indicates these scores reflect dataset memorization, not general Nepali ASR capability.
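As a quick check, the Gap column can be reproduced from the table itself, assuming it is the ratio of FLEURS WER to OpenSLR-43 WER (an assumption, but it matches the values above):

# Gap ratio: how much lower a model's OpenSLR-43 WER is than its FLEURS WER.
def contamination_gap(fleurs_wer, slr43_wer):
    return fleurs_wer / slr43_wer

print(round(contamination_gap(64.5, 2.3)))  # 28  amitpant7 whisper-small
print(round(contamination_gap(54.3, 4.6)))  # 12  wav2vec2-nepali (anish)
print(round(contamination_gap(70.8, 5.0)))  # 14  wav2vec2-xlsr (gagan)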

Finding: Whisper float16 dtype bug in Nepali evaluation

Prior benchmarks reported Whisper large-v3 scoring 100% WER on Nepali, suggesting the model could not produce Nepali output at all. This turned out to be a bug, not a real result.

The standard HuggingFace Whisper inference pattern loads models in float16 for GPU efficiency. But the WhisperProcessor returns float32 input features. This causes a dtype mismatch RuntimeError in the encoder's conv1d layer.

If the evaluation script catches exceptions silently (a common pattern: except: text = ""), every sample produces an empty prediction. Empty predictions against real references give exactly 100% WER and 100% CER.
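A minimal reproduction of that failure mode (illustrative only; eval_samples and transcribe_fp16 are hypothetical stand-ins for the eval split and a float16-loaded Whisper wrapper):

import jiwer

hypotheses = []
for sample in eval_samples:
    try:
        text = transcribe_fp16(sample["audio"])  # raises RuntimeError: conv1d dtype mismatch
    except Exception:
        text = ""                                # silently swallowed -> empty prediction
    hypotheses.append(text)

references = [s["text"] for s in eval_samples]
print(jiwer.wer(references, hypotheses))         # 1.0, reported as 100% WER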

The fix

Loading the model in float32 instead of float16 resolves the dtype mismatch. After fixing this, Whisper large-v3 produces Nepali output but still has high WER (94% on FLEURS) due to word boundary and Devanagari spelling issues.

import torch
from transformers import WhisperForConditionalGeneration

# Before (broken): float16 weights, but WhisperProcessor emits float32 features,
# so the encoder's conv1d layer raises a dtype-mismatch RuntimeError
model = WhisperForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16)

# After (fixed): float32 weights match the processor's float32 features
model = WhisperForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float32)
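For completeness, a sketch of the full corrected inference path (the benchmark script may differ in details; audio_array is assumed to be a 16 kHz mono waveform and a single GPU is assumed):

import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_id = "openai/whisper-large-v3"
processor = WhisperProcessor.from_pretrained(model_id)
model = WhisperForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float32).to("cuda").eval()

# WhisperProcessor returns float32 features, now matching the float32 weights.
inputs = processor(audio_array, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    ids = model.generate(input_features=inputs.input_features.to("cuda"),
                         language="ne", task="transcribe")  # greedy, Nepali forced
text = processor.batch_decode(ids, skip_special_tokens=True)[0]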

Any Nepali ASR benchmark that reported Whisper results using float16 loading likely has invalid numbers for that model.

Data efficiency

Meta MMS-1B was pretrained on 500K+ hours of speech across 1,100+ languages. By contrast, our model was fine-tuned on 157 hours of a single language on one A100 GPU for approximately $7.

On spontaneous speech (IndicVoices-R), the focused single-language fine-tune outperformed the multilingual model by 6.6 WER points. Domain-matched training data and language-specific fine-tuning can compensate for orders of magnitude less compute.

The training data (OpenSLR-54) is entirely read speech. Despite never seeing spontaneous conversational audio during training, the model generalized well enough to beat MMS on IndicVoices-R. This suggests the Qwen3-ASR architecture handles domain shift better than expected.

Result

2 of 3 benchmarks where we beat Meta MMS-1B

IndicVoices-R: 55.8% vs 62.4% (ours wins by 6.6 pts)

OpenSLR-43: 31.4% vs 40.5% (ours wins by 9.1 pts)

FLEURS: 37.0% vs 33.6% (MMS wins by 3.4 pts)

Training data comparison: 157h (our training data) vs 500K+ hours (MMS pretraining data)

Training setup

The training data is OpenSLR-54: 157 hours of Nepali read speech with transcripts, approximately 37,000 utterances. A 95/5 train/validation split was used with seed 42. Common Voice Nepali was originally planned but was removed from HuggingFace by Mozilla in October 2025, so the final training used OpenSLR-54 only.

IndicVoices-R, OpenSLR-43, and FLEURS were used exclusively for evaluation. No samples from any evaluation dataset appeared in training.

Base model: Qwen3-ASR-1.7B
Training data: OpenSLR-54 (157h Nepali read speech, ~37K utterances)
Hardware: A100 80GB, single GPU
Best checkpoint: Step 2000
Batch size: 16 (effective 128 with gradient accumulation)
Learning rate: 2e-5, 3 epochs
Data format: JSONL, one record per line: {"audio": "path.wav", "text": "language Nepali<asr_text>..."}
Val split: 5% holdout (2000 samples), seed 42
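A sketch of how a manifest in that format can be produced (the actual preprocessing script may differ; utterances is a hypothetical list of (wav_path, transcript) pairs from OpenSLR-54):

import json
import random

random.seed(42)                    # same seed as the split above
random.shuffle(utterances)

records = [{"audio": wav_path,
            "text": f"language Nepali<asr_text>{transcript}"}
           for wav_path, transcript in utterances]

n_val = int(0.05 * len(records))   # 5% held-out validation
splits = {"val.jsonl": records[:n_val], "train.jsonl": records[n_val:]}
for name, rows in splits.items():
    with open(name, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")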

Limitations

Sample size: 100 samples per dataset is directionally valid but small. Full test-set evaluation would strengthen the results.

Text normalization: WER can vary depending on punctuation, numeral formatting, Unicode normalization, and whitespace handling. No cross-model normalization was applied; a minimal normalizer of the kind that would help is sketched after this list.

Decoding config: All models used greedy decoding with default settings. Beam search or temperature tuning could change individual results.

Missing baselines: IndicConformer (AI4Bharat) errored during evaluation. Other Nepali-specific models may exist that were not tested.

Base model contamination: The Qwen3-ASR-1.7B base model was pretrained on large-scale data that may include some evaluation data. This applies equally to MMS and Whisper.

Error rate: 37-56% WER is not production-ready. This model needs more diverse training data, code-switching coverage, and noise robustness before real-world deployment.
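Below is a minimal sketch of the shared normalizer the text-normalization point above calls for (illustrative choices, not the pipeline used in this benchmark):

import unicodedata

_DIGITS = str.maketrans("०१२३४५६७८९", "0123456789")   # one numeral convention
_PUNCT = {ord(c): " " for c in "।॥,.?!;:'\"()-"}       # incl. Devanagari danda

def normalize_nepali(text: str) -> str:
    text = unicodedata.normalize("NFC", text)          # canonical Unicode form
    text = text.translate(_DIGITS).translate(_PUNCT)
    return " ".join(text.split())                      # collapse whitespace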

Published model

The fine-tuned model is published on HuggingFace. The benchmark script, training code, and evaluation pipeline are open-source.

What would improve this

Training on all three datasets (OpenSLR-54 + OpenSLR-43 + IndicVoices-R = ~267 hours) instead of just one.

Full test-set evaluation instead of 100-sample subsets.

Text normalization pipeline standardized across all models.

Additional baselines: IndicConformer, faster-whisper, WhisperX.

Noise augmentation and code-switching data for robustness.