Evaluating ASR Systems, Part 2: Accuracy and Robustness
Understanding what accuracy really means for your use case
Author: Ognjen Todic | February 26, 2026

In Part 1, we introduced the "Advertisement Only" problem: claims like "best accuracy" or "trained on diverse data" that sound impressive but lack the transparency to mean anything.
Here's the thing: what's in a training dataset shouldn't matter to you. What matters is how the system performs on your data, in your conditions, for your use case. Understanding how accuracy is measured helps you ask the right questions, interpret vendor claims, and validate that a system actually works for your scenario.
Now let’s dig into accuracy and robustness: what they mean, how to measure them, and why a single number like “98% accuracy” rarely tells the full story.
The Building Blocks of Evaluation
Before diving into metrics, let’s establish some terminology:
| Term | Description |
|---|---|
| Audio File | A recording of spoken input, ideally captured under real-world conditions (device microphones, background noise, natural speaking style). |
| Transcript | A human-produced text version of the speech. Serves as ground truth. May include annotations for non-speech events. |
| Dataset | A collection of audio files paired with transcripts, possibly with additional metadata. |
| Metadata | Optional but valuable: speaker ID, age, gender, accent, native language, recording conditions. Enables analysis across many dimensions. |
The quality and representativeness of your dataset determine whether your accuracy measurements mean anything at all. We’ll return to this point.
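The terminology above maps naturally onto a small data structure. Here is a minimal sketch in Python; the field names and metadata keys are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass, field

# One dataset entry: audio paired with a ground-truth transcript and
# optional metadata. Field names here are illustrative only.

@dataclass
class Utterance:
    audio_path: str          # recording, ideally from real-world conditions
    transcript: str          # human-produced ground truth
    metadata: dict = field(default_factory=dict)  # e.g. speaker_id, age_group, mic

dataset = [
    Utterance("audio/spk01_001.wav", "I LIKE TO EAT APPLES",
              {"speaker_id": "spk01", "age_group": "child", "mic": "tablet"}),
]
```

Keeping metadata attached to each utterance is what later makes per-segment analysis possible.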
Word Error Rate: The Standard Metric
In ASR research, the standard metric for accuracy is Word Error Rate (WER). It quantifies how the ASR output differs from the reference transcript:

WER = (Substitutions + Deletions + Insertions) / Number of words in the reference
Let’s break this down with an example:
| | | |
|---|---|---|
| Reference | I LIKE TO EAT APPLES | 5 words |
| ASR Output | I LIKE EAT APPLE | |
| Errors | 1 deletion (TO), 1 substitution (APPLES → APPLE) | 2 errors |
| WER | 2 / 5 = 40% | |
A 40% WER means that nearly half the words are wrong or missing. That sounds terrible for dictation, but might be perfectly acceptable if your application only needs to recognize the word EAT.
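The arithmetic above can be sketched in a few lines of Python as a word-level edit distance. This is a minimal illustration; real evaluation tools also normalize text (casing, punctuation, number formats) before scoring, which materially affects the result.

```python
# Minimal WER computation via word-level edit distance (Levenshtein).

def wer(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("I LIKE TO EAT APPLES", "I LIKE EAT APPLE"))  # 2 errors / 5 words = 0.4
```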
This brings us to a critical point.
The Problem with Single-Number Accuracy
When someone says their ASR system has “95% accuracy,” ask yourself: On what data? For what task? Averaged how? With what system configuration? Clean studio recordings or noisy real-world audio? Children or adults? Full transcription or command recognition? Open vocabulary or constrained to expected phrases?
A system that achieves 95% accuracy (a 5% WER) on clean adult speech might drop to 70% on children’s voices or in noisy environments. The single number hides this variability.
Even more fundamentally: WER might not be the right metric for your use case at all.
Functional Accuracy: The Metric That Actually Matters
Your downstream application defines what accuracy means:
- Dictation or transcription: Exact word-for-word output matters. WER is appropriate.
- Voice commands: Recognizing the command keyword matters more than filler words. PLAY THE NEXT SONG and PLAY NEXT SONG are functionally identical.
- Form filling: Extracting specific slot values (names, numbers, dates) is what counts.
- Educational applications: You might need pronunciation scores, timing information, or detection of specific reading errors.
We call this Functional Accuracy: how well the ASR performs the actual task your product depends on.
Consider a warehouse voice-picking application where workers say things like PICK FIVE ITEMS FROM BIN A-23. What matters is correctly recognizing the quantity (FIVE) and the bin location (A-23).
If the ASR outputs PICK FIVE FROM BIN A-23 (deleting ITEMS), the WER increases, but the functional accuracy is 100%: the system correctly extracted the quantity and location.
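One way to score this is to extract only the slots the application cares about and compare those, ignoring everything else. Here is a sketch in Python, assuming a hypothetical regex grammar for the voice-picking utterances above:

```python
import re

# Functional accuracy for the voice-picking example: score only whether the
# quantity and bin location were extracted correctly, ignoring filler words.
# The pattern is a hypothetical grammar for utterances like
# "PICK FIVE ITEMS FROM BIN A-23".

SLOT_PATTERN = re.compile(r"PICK\s+(?P<qty>\w+)\b.*?\bBIN\s+(?P<bin>[A-Z]-\d+)")

def extract_slots(text: str):
    m = SLOT_PATTERN.search(text)
    return (m.group("qty"), m.group("bin")) if m else None

reference = "PICK FIVE ITEMS FROM BIN A-23"
asr_output = "PICK FIVE FROM BIN A-23"   # WER > 0: "ITEMS" was deleted

# Word-for-word the outputs differ, but the slots match: functionally correct.
print(extract_slots(reference) == extract_slots(asr_output))  # True
```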
When defining your evaluation criteria, start with your application’s requirements and work backward to the appropriate metric.
Robustness: Accuracy Under Real-World Variability
Accuracy tells you how well the system performs under specific conditions. Robustness tells you how that accuracy holds up when conditions change.
Key dimensions of variability:
Speaker Variability
- Age: Children’s speech patterns differ significantly from adults’. Pitch, pronunciation, and vocabulary all vary.
- Accents and dialects: Regional and non-native accents can dramatically affect recognition.
- Speaking style: Read speech vs. spontaneous speech. Hesitations, false starts, and filler words.
Acoustic Variability
- Background noise: Factory floors, busy stores, outdoor environments.
- Microphone type and quality: Built-in device mics vs. headsets vs. professional microphones.
- Distance and positioning: Close-talk vs. far-field recognition.
- Reverberation: Room acoustics can blur speech signals.
Content Variability
- Domain vocabulary: Medical terms, product codes, specialized jargon.
- Speaking rate: Fast talkers vs. slow, deliberate speech.
- Utterance length: Single words vs. full sentences vs. extended passages.
A robust system maintains acceptable accuracy across these dimensions. An accurate-but-fragile system might excel in controlled conditions but fail when any variable shifts.
Dataset Authenticity: Garbage In, Garbage Out
Accuracy metrics are only meaningful when your evaluation data represents real-world conditions.
An ASR system that performs well on clean, adult, native-speaker recordings might fail dramatically on children’s voices, non-native speakers, noisy environments, or low-quality microphones.
This is why dataset authenticity matters. Test data should reflect the actual speakers who will use the product, the actual acoustic conditions they’ll encounter, and the actual vocabulary and speaking patterns they’ll use.
Here’s an example from one of our domains. For oral reading applications, when ASR is configured to recognize only the expected text, recognition is straightforward as long as the reader reads correctly. But here’s the catch: users who need reading instruction or assessment apps are precisely those who don’t read correctly. They hesitate, mispronounce, skip words, or substitute. An evaluation dataset of fluent readers would show excellent accuracy, while real-world performance with struggling readers could be dramatically worse. The benchmark would be “Advertisement Only.”
If you’re collecting your own evaluation dataset, it doesn’t need to be perfect from day one. What matters is understanding what your current data covers and what it doesn’t, and iterating from there.
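A simple way to act on this is to audit what your evaluation set actually covers before trusting its numbers. Here is a sketch assuming per-utterance metadata dicts; the keys and values are illustrative:

```python
from collections import Counter

# Count how many utterances fall into each bucket along each metadata
# dimension, to spot what the evaluation set covers and what it doesn't.

metadata = [
    {"age_group": "adult", "accent": "US", "mic": "headset"},
    {"age_group": "adult", "accent": "UK", "mic": "tablet"},
    {"age_group": "child", "accent": "US", "mic": "tablet"},
]

for dim in ("age_group", "accent", "mic"):
    print(dim, dict(Counter(m[dim] for m in metadata)))
```

A skewed count (say, one child speaker out of hundreds) tells you which accuracy claims your dataset can and cannot support.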
Beyond the Top-Line Number: Error Analysis
A single WER number tells you how much is wrong, but not what is wrong or why.
Useful breakdowns include:
- By error type: Substitutions often indicate acoustic confusion (similar-sounding words). Deletions may suggest the model is missing quiet or fast speech. Insertions might point to noise being interpreted as speech.
- By system behavior: Is recognition being cut off prematurely? This often points to VAD or endpointing misconfiguration rather than a recognition problem.
- By speaker segment: Does accuracy differ across age groups? Accents? Genders? Are certain speakers consistently problematic?
- By content: Do certain words or phrases cause repeated errors? Are domain-specific terms being recognized correctly?
- By acoustic condition: How does accuracy change with noise level? Does microphone type affect results?
This analysis turns a single number into actionable insights. You might discover that your system struggles with young children, or with a particular accent, or with certain number ranges. Each finding points toward a specific improvement path. You might also identify edge cases where recognition is unreliable but detectable at runtime; for example, if quiet speech correlates with higher error rates, you can use audio quality metrics from the ASR result to prompt the user to speak louder.
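Breakdowns like these are straightforward once per-utterance error counts and metadata are available. Here is a sketch that aggregates WER per metadata segment; the counts are invented for illustration and would in practice come from aligning ASR output against each reference transcript:

```python
from collections import defaultdict

# Each entry: (metadata, error count, reference word count) for one utterance.
# Numbers below are invented for illustration.
results = [
    ({"age_group": "adult", "noise": "quiet"}, 2, 50),
    ({"age_group": "adult", "noise": "noisy"}, 6, 48),
    ({"age_group": "child", "noise": "quiet"}, 9, 45),
    ({"age_group": "child", "noise": "noisy"}, 15, 44),
]

def wer_by(dimension: str) -> dict:
    # Pool errors and words within each bucket, then divide: this weights
    # utterances by length rather than averaging per-utterance WERs.
    errors, words = defaultdict(int), defaultdict(int)
    for meta, err, n in results:
        errors[meta[dimension]] += err
        words[meta[dimension]] += n
    return {k: errors[k] / words[k] for k in errors}

print(wer_by("age_group"))  # child WER is roughly 3x the adult WER here
```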
A Note on Reference Transcripts
From our own experience: human transcripts are not always accurate. Transcribers mishear words, make typos, or apply inconsistent conventions. When investigating ASR errors, we regularly find cases where the ASR output was actually correct and the reference transcript was wrong.
This matters for two reasons. First, your error measurements might include “errors” that aren’t real errors. Second, and more importantly, if you’re using transcripts to identify systematic problems, verify a sample of flagged errors before drawing conclusions. What looks like an ASR failure might be a transcription mistake.
Iterating Toward Optimal Performance
At Keen Research, we use this same error analysis process internally to improve our acoustic and language models. We analyze recognition results across diverse datasets, identify systematic patterns, and refine model training accordingly.
If you have an evaluation dataset, you can follow a similar approach to optimize SDK configuration for your use case. The KeenASR SDK offers several options that can significantly impact accuracy:
- Decoding graph strategies: Constrained vocabularies, phrase lists, or more open grammars depending on your use case
- VAD and endpointing thresholds: Tuning when the recognizer starts and stops listening
- Spoken noise probability: Tuning how the recognizer handles out-of-vocabulary speech and background noise
- Alternative pronunciations: Adding pronunciation variants for domain-specific terms, regional accents, or common mispronunciations
The cycle looks like this: run your evaluation dataset, analyze the errors, identify whether the issue is acoustic (speaker/environment), linguistic (vocabulary/pronunciation), or configuration-related (VAD thresholds, endpointing), then adjust accordingly and re-test. If needed, refine your evaluation dataset as well, adding samples that better represent edge cases or real-world conditions. And as noted earlier, when you find transcript errors during analysis, fix them; your accuracy measurements become more meaningful with each correction.
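One iteration of that cycle can be sketched as a small configuration sweep. The option names below are hypothetical placeholders, not actual KeenASR SDK parameters, and `toy_wer` stands in for a real evaluation run over your dataset:

```python
from itertools import product

# Sweep candidate configurations and keep the one with the lowest WER on
# the evaluation set.

def sweep(options: dict, score) -> tuple:
    best_cfg, best_wer = None, float("inf")
    for values in product(*options.values()):
        cfg = dict(zip(options.keys(), values))
        w = score(cfg)
        if w < best_wer:
            best_cfg, best_wer = cfg, w
    return best_cfg, best_wer

def toy_wer(cfg):
    # Stand-in for a real evaluation run; returns invented WERs per config.
    return {("short", 0.5): 0.12, ("short", 0.9): 0.10,
            ("long", 0.5): 0.15, ("long", 0.9): 0.14}[
                (cfg["endpointing"], cfg["noise_prob"])]

options = {"endpointing": ["short", "long"], "noise_prob": [0.5, 0.9]}
print(sweep(options, toy_wer))  # best config and its WER
```

In practice each `score` call is a full pass over the evaluation dataset, so error analysis between sweeps matters more than brute-force search.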
Each iteration deepens your understanding of where the system works well and where it needs tuning.
Practical Takeaways
- WER is a starting point, not the destination. Understand what it measures and what it misses.
- Define functional accuracy for your use case. What does “correct” mean for your application?
- Test on representative data. Clean benchmarks are useful baselines, but real-world data reveals real-world performance.
- Measure robustness, not just accuracy. How does performance change across speakers, environments, and content?
- Analyze errors, not just error rates. Breakdowns by type, speaker, and condition turn numbers into insights.
- Iterate. Evaluation is an iterative process, not a one-time event.
Coming Up
In future posts, we’ll explore other dimensions of ASR evaluation: latency, efficiency, and how to set up a practical evaluation workflow.
If you’re evaluating ASR solutions for your project, we’re happy to help you think through what matters most for your use case. Get in touch.
