Choosing ASR for Literacy Apps: What Whisper and Other General-Purpose Engines Miss
Why general-purpose speech recognition falls short for reading assessment and instruction
Author: Ognjen Todic | April 24, 2026

Reading proficiency in the United States is in crisis. The 2022 National Assessment of Educational Progress (NAEP) results showed the largest decline in reading scores in over three decades. Only about one-third of fourth graders read at a proficient level, and the trends are moving in the wrong direction.
Teachers and parents play the most important role in helping children learn to read. But the reality is that providing individualized reading practice and assessment at scale is nearly impossible with human resources alone. A teacher with 25 students simply cannot listen to each child read aloud for 15 minutes a day and provide detailed feedback on their progress.
This is where technology can help. Apps that listen to children read, track their progress, and provide feedback can extend the reach of teachers and give every child more practice time. But here’s the catch: these apps need speech recognition that actually understands how children read, not just what they say.
Many development teams reach for familiar tools when building literacy features. OpenAI’s Whisper, Google Cloud Speech-to-Text, and AWS Transcribe are powerful, well-documented, and easy to integrate. But these engines are designed for a fundamentally different purpose, and that mismatch creates real problems for reading and literacy applications.
What general-purpose ASR is designed to do
General-purpose automatic speech recognition engines are built to produce accurate transcriptions of spoken language. They excel at converting audio into text, handling diverse accents, filtering out background noise, and producing clean, readable output.
These engines are trained on massive datasets of adult speech. They use sophisticated language models to correct errors and produce fluent text. When someone says “I’m gonna go to the store,” a general-purpose ASR might output “I’m going to go to the store” because that’s the more standard form.
This behavior is exactly what you want for dictation software, voice assistants, meeting transcription, and accessibility features. The goal is to capture the speaker’s intent and produce useful text.
But literacy apps have a completely different goal.
What reading and literacy apps actually need
When a child reads aloud from a book, the app needs to understand much more than just the words that were spoken. It needs to know:
How the words were read, not just what was said. If a child mispronounces a word, that’s critical information. A general-purpose ASR will often “correct” the mispronunciation to produce clean text. But for a reading app, the mispronunciation is the whole point. You need to capture it, flag it, and potentially provide feedback.
Which word is being read and when. Reading fluency depends on pace, timing, and smoothness. An app that measures fluency needs word-level timestamps, not just a final transcript. It needs to know that the child paused for three seconds after “the,” repeated “was” twice, and sped through the last sentence.
How the reading compares to the expected text. The app knows what the child is supposed to read. It needs to align the spoken audio to that expected text and identify insertions, deletions, and substitutions. This is fundamentally different from open-ended transcription.
The nuances of children’s voices. Children’s speech differs from adult speech in pitch, pronunciation patterns, and vocabulary. Models trained primarily on adult speech often struggle with younger voices, especially when those voices are still developing reading skills.
Privacy and offline operation. Apps serving children must comply with COPPA and similar regulations. Sending children’s voice recordings to cloud servers creates compliance complexity. And classrooms often have unreliable internet, making cloud-dependent solutions impractical.
Where general-purpose ASR struggles
Let’s look at specific examples that illustrate the mismatch.
Example 1: Mispronunciation
The text says: “The water was rough.”
The child reads: “The water was ruff.”
This is a common struggle. The “-ough” pattern in English can be pronounced multiple ways (rough, through, though, cough), and early readers often guess wrong.
A general-purpose ASR, using its language model, will likely output “rough” because that word makes more sense in context. The mispronunciation disappears, and the app has no way to provide corrective feedback.
A literacy-focused ASR preserves the mispronunciation, aligns it to the expected word, and flags the error. The app can now show the child (or teacher) exactly what happened.
Example 2: Substitution
The text says: “She went home after school.”
The child reads: “She went house after school.”
The child substituted “house” for “home.” Both are valid English words, so a general-purpose ASR will accurately transcribe “house.” But without knowledge of the expected text, it has no way to flag this as an error.
A literacy-focused ASR compares the spoken words to the expected text and identifies the substitution. This is essential for reading assessment, where tracking error patterns helps diagnose reading difficulties.
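The core of this comparison is a word-level alignment between the expected text and what was recognized. Here is a minimal sketch using Levenshtein-style dynamic programming; a production system would typically align at the phoneme or lattice level rather than on plain word strings, so treat this as illustrative only.

```python
def align(expected, spoken):
    """Align two word lists. Returns (op, expected_word, spoken_word) tuples,
    where op is 'match', 'substitute', 'delete', or 'insert'."""
    n, m = len(expected), len(spoken)
    # dp[i][j] = edit distance between expected[:i] and spoken[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if expected[i - 1] == spoken[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,  # match or substitute
                           dp[i - 1][j] + 1,         # delete: word skipped
                           dp[i][j - 1] + 1)         # insert: extra word
    # Backtrace to recover the operation sequence
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (
                0 if expected[i - 1] == spoken[j - 1] else 1):
            op = 'match' if expected[i - 1] == spoken[j - 1] else 'substitute'
            ops.append((op, expected[i - 1], spoken[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(('delete', expected[i - 1], None))
            i -= 1
        else:
            ops.append(('insert', None, spoken[j - 1]))
            j -= 1
    return list(reversed(ops))

expected = "she went home after school".split()
spoken = "she went house after school".split()
errors = [op for op in align(expected, spoken) if op[0] != 'match']
print(errors)  # [('substitute', 'home', 'house')]
```

The same alignment surfaces omissions (deletions) and extra words (insertions), which is exactly the error inventory reading assessments track.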
Example 3: Hesitation and self-correction
The child reads: “The boy was… was walking to the park.”
The hesitation and repetition tell us something important. The child struggled with this sentence, paused, and then continued. This affects fluency scores and may indicate difficulty with certain words or sentence structures.
A general-purpose ASR typically cleans this up, producing “The boy was walking to the park.” The struggle disappears from the transcript.
A literacy-focused ASR captures the hesitation, logs the timing, and preserves the repetition. This data feeds into fluency analysis and helps identify where readers need support.
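With word-level timestamps available, pause and repetition detection becomes a simple pass over the result sequence. The data shape and threshold below are assumptions for illustration, not any vendor's API.

```python
def find_disfluencies(words, pause_threshold=1.0):
    """words: list of (word, start_sec, end_sec) tuples in reading order.
    Flags long pauses between words and immediate word repetitions."""
    events = []
    for prev, cur in zip(words, words[1:]):
        gap = cur[1] - prev[2]  # silence between end of prev and start of cur
        if gap >= pause_threshold:
            events.append(('pause', prev[0], round(gap, 2)))
        if cur[0] == prev[0]:
            events.append(('repeat', cur[0], cur[1]))
    return events

# "The boy was... was walking" with a long pause before the repeated "was"
words = [("the", 0.0, 0.2), ("boy", 0.3, 0.6), ("was", 0.7, 1.0),
         ("was", 2.8, 3.1), ("walking", 3.2, 3.7)]
print(find_disfluencies(words))
# [('pause', 'was', 1.8), ('repeat', 'was', 2.8)]
```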
Example 4: Fluency and pacing
Reading fluency is measured in words correct per minute (WCPM), but that metric requires knowing exactly when each word was spoken. It also requires distinguishing between words read correctly and words read incorrectly.
General-purpose ASR typically provides a final transcript, sometimes with utterance-level timestamps. That’s not granular enough for fluency measurement.
Literacy-focused ASR provides word-by-word timestamps, confidence scores, and alignment to expected text. This enables precise WCPM calculation, pause detection, and prosody analysis.
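Given word-level timestamps and per-word correctness flags from alignment, the WCPM calculation itself is straightforward. The tuple layout here is a hypothetical result shape:

```python
def wcpm(words):
    """words: list of (word, start_sec, end_sec, correct) tuples.
    Returns words correct per minute over the elapsed reading time."""
    if not words:
        return 0.0
    elapsed_min = (words[-1][2] - words[0][1]) / 60.0
    correct = sum(1 for w in words if w[3])
    return correct / elapsed_min if elapsed_min > 0 else 0.0

words = [("the", 0.0, 0.3, True), ("water", 0.4, 0.9, True),
         ("was", 1.0, 1.3, True), ("rough", 1.5, 2.0, False)]
print(round(wcpm(words), 1))  # 3 correct words in 2 seconds -> 90.0
```

Note that neither input, the timing nor the correctness flag, is recoverable from a plain final transcript, which is why utterance-level output can't support this metric.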
Key capabilities to look for
If you’re building a reading or literacy app, here’s what to look for in a speech recognition solution:
| Capability | Why it matters |
|---|---|
| Mispronunciation detection | Core to reading assessment and feedback |
| Alignment to expected text | Know which word should have been read |
| Word-level timestamps | Fluency tracking, read-along sync |
| Confidence scores per word | Identify uncertain readings |
| Child voice optimization | Accuracy with young speakers |
| On-device processing | Privacy, COPPA compliance, offline use |
| Custom vocabulary | Book-specific words, phonics patterns |
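Taken together, these capabilities imply a per-word result record rather than a flat transcript. A hypothetical shape might look like this; the field names are illustrative, not any particular SDK's API:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WordResult:
    expected: str             # word from the reference text
    heard: Optional[str]      # what was recognized, None if omitted
    start_sec: float          # word-level timestamps for fluency and sync
    end_sec: float
    confidence: float         # per-word score to flag uncertain readings
    status: str               # 'correct', 'mispronounced', 'substituted', 'omitted'


r = WordResult("rough", "ruff", 1.5, 2.0, 0.62, "mispronounced")
print(r.status)  # mispronounced
```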
When general-purpose ASR is fine
To be clear, general-purpose ASR has its place in educational apps. It works well for:
- Voice commands and navigation (“Go to page five”)
- Simple yes/no or keyword detection
- Adult-facing features like teacher dictation
- Voice search in content libraries
The key distinction is whether you need to understand how something was read or just what was said. For the former, you need a purpose-built solution.
Making the right choice
General-purpose ASR engines like Whisper represent remarkable engineering achievements. They’re trained on hundreds of thousands of hours of audio and can transcribe speech with impressive accuracy across many languages and accents.
But they’re solving a different problem. They’re optimized to produce clean, accurate transcriptions of what someone said. Literacy apps need to capture exactly how a child read, including the mistakes, hesitations, and struggles that reveal where they need help.
If your app involves reading assessment, pronunciation feedback, or fluency tracking, look for ASR that’s purpose-built for these use cases. The difference in outcomes is substantial.
At Keen Research, this is exactly what we’ve built. KeenASR SDK is designed from the ground up for reading and literacy applications. It captures mispronunciations, provides word-level timing and alignment, works with children’s voices, and runs entirely on-device for privacy and reliability.
If you’re building in this space, we’d love to talk. You can explore our EdTech solutions, review the developer documentation, or schedule a call to discuss your specific needs.
This post is part of our series on speech recognition for education. See also: Evaluating ASR Systems, Part 1: The Big Picture.
