KeenASR SDK provides ASR Bundles with acoustic models specifically optimized for children’s voices (currently available for English). Combined with SDK features designed for educational use cases, this makes KeenASR well-suited for apps targeting young users.
Common use cases include:
- Oral reading: both reading instruction (real-time feedback as children read aloud) and assessment (fluency scoring, accuracy tracking)
- Speech and language therapy: articulation practice, phoneme-level feedback, progress tracking
- Language learning: pronunciation practice with GoP scoring
- Interactive stories and games: voice-driven navigation and character interaction
For young children who cannot yet read or write, voice is often the only practical input method. Traditional UI elements like text buttons or typed responses are not viable for this audience, making speech recognition essential for meaningful interaction.
Oral Reading Instruction
Oral reading instruction apps guide children through reading words, phrases, or passages, providing real-time feedback and tracking progress over time.
Setup
The recognizer will automatically stop listening based on VAD (Voice Activity Detection) thresholds and deliver the final response via the onFinalResponse callback. See VAD Thresholds for details on configuring when the recognizer stops.
When creating decoding graphs for oral reading, use the OralReadingTask option to optimize recognition for this use case:
recognizer.createDecodingGraph(phrases, "page1", task=OralReadingTask)
This option:
- Optimizes the decoding graph structure for oral reading scenarios
- Explicitly models common mistakes made when learning to read
- Models incomplete words (false starts, hesitations)
See Decoding Graphs below for more details on building decoding graphs.
Incomplete words: When a child starts a word but doesn’t finish it (e.g., saying “STRAWBE…” instead of “STRAWBERRY”), the result will include the word with a #INC suffix indicating an incomplete pronunciation:
{
"words": [
{ "text": "STRAWBERRY#INC", "confidence": 0.85 }
]
}
Enabling GoP: To compute GoP scores, pass true as the second parameter when preparing the recognizer:
recognizer.prepareForListeningWithDecodingGraph("page1", true)
Partial Results
Partial results can be used for real-time highlighting, dynamic VAD threshold adjustments, detecting user struggles, and more.
Partial results provide low-latency feedback but do not include GoP scores. For most reading instruction scenarios, it’s better not to interrupt the child’s reading flow with pronunciation feedback. Instead, let them complete the passage and analyze GoP scores from the final result afterward.
Real-time highlighting: Use partial results to highlight words as the child reads, showing either what has been read or what word comes next. For early readers, highlighting serves as a helpful guide to keep them on track. For more advanced readers, this may be more of a distraction; consider making it optional or adjusting based on reading level.
onPartialResult(result, recognizer) {
// alignTexts: compute word-level alignment between recognized and expected text
// Typically implemented using dynamic programming
alignment = alignTexts(result.text, expectedResponse)
highlightWordsRead(alignment, expectedResponse)
}
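The alignTexts helper above is app-side logic, not part of the SDK. One way to sketch it (hypothetical names, simplified normalization) is the classic edit-distance dynamic program with a backtrace that marks which expected words were actually read:

```python
def align_texts(recognized, expected):
    """Word-level alignment via edit-distance dynamic programming.
    Returns a list of (expected_word, matched) pairs."""
    rec = recognized.upper().split()
    exp = expected.upper().split()
    n, m = len(rec), len(exp)
    # dp[i][j] = minimum edits to align rec[:i] with exp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if rec[i - 1] == exp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # extra recognized word
                           dp[i][j - 1] + 1,        # expected word skipped
                           dp[i - 1][j - 1] + cost)  # match / substitution
    # Backtrace to mark which expected words were matched exactly
    matched = [False] * m
    i, j = n, m
    while i > 0 and j > 0:
        if rec[i - 1] == exp[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            matched[j - 1] = True
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j - 1] + 1:
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i][j - 1] + 1:
            j -= 1
        else:
            i -= 1
    return list(zip(exp, matched))
```

A production implementation would also normalize punctuation and treat #INC variants specially, but the core alignment step is the same.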
Dynamic VAD adjustment: Analyze partial results to determine if the child has reached the end of the expected text. If so, you can reduce the endSilence threshold to stop listening sooner rather than waiting for the full timeout. See VAD Thresholds for more details.
Detecting struggles: Implement logic to detect when a child might be struggling. Indicators include:
- Long delays between partial results (e.g., no new words recognized for several seconds)
- Incomplete words (words with the #INC suffix indicating false starts)
- <SPOKEN_NOISE> tokens indicating out-of-vocabulary words
When struggles are detected, you could stop the recognizer, provide an intervention (such as a hint or pronouncing the next word), and then restart the recognizer to continue.
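One possible shape for such detection logic (app-side, hypothetical names and threshold) is to track the timestamp at which each partial result last added a new word, and flag a stall when no progress has been made for some number of seconds:

```python
# Hypothetical app-side struggle detector: flags a stall when the number of
# recognized words has not grown for `stall_seconds`.
class StruggleDetector:
    def __init__(self, stall_seconds=4.0):
        self.stall_seconds = stall_seconds
        self.last_word_count = 0
        self.last_progress_time = None

    def on_partial_result(self, timestamp, partial_text):
        """Feed each partial result; returns True if the child seems stuck."""
        word_count = len(partial_text.split())
        if self.last_progress_time is None or word_count > self.last_word_count:
            self.last_word_count = word_count
            self.last_progress_time = timestamp
            return False
        return (timestamp - self.last_progress_time) >= self.stall_seconds
```

You would call on_partial_result from the SDK's partial-result callback; incomplete-word and <SPOKEN_NOISE> checks can be layered on top of the same class.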
Pronunciation Feedback Strategy
We recommend using GoP scores and defining your own logic for determining if a word is pronounced correctly and/or tracking mispronunciations for adaptive content and task selection. Note that word.confidence is not a reliable measure of mispronunciation; it may return high scores even for mispronounced words. See Goodness of Pronunciation (GoP) Scoring for details on using GoP scores effectively.
If modeling mispronunciations explicitly: When using custom WordPronunciation entries to model common mispronunciations, factor those results into your pronunciation feedback logic. If a mispronunciation variant was recognized, you know the child used that incorrect pronunciation regardless of GoP scores. See Alternative Word Pronunciations below for implementation details.
Adaptive Task Difficulty
Match reading content to the child’s ability level. The SDK can help estimate and monitor reading ability:
- Initial assessment: Have the child read a calibration passage and analyze accuracy, fluency (words per minute), and pronunciation scores
- Continuous monitoring: Track these metrics across sessions to detect improvement or struggles
- Content and task selection: Use the ability estimate to select appropriately challenging stories and relevant intervention tasks
// Simple fluency tracking
onFinalResponse(response) {
// all of this could also be done asynchronously if response/result is persisted
// alignTexts: compute word-level alignment between recognized and expected text
// Typically implemented using dynamic programming
alignment = alignTexts(response.result.text, expectedResponse)
wordsReadCorrect = computeReadingAccuracy(alignment, expectedResponse)
accuracy = wordsReadCorrect / expectedResponse.split(" ").length
// Calculate duration based on actual speech (excluding leading/trailing silence)
firstWord = response.result.words.firstElement
lastWord = response.result.words.lastElement
duration = (lastWord.startTime + lastWord.duration) - firstWord.startTime
wordsCorrectPerMinute = (wordsReadCorrect / duration) * 60
updateAbilityEstimate(accuracy, wordsCorrectPerMinute)
// or choose different activity based on more detailed analysis of the profile
// e.g. practice words with specific sounds or do listening activity for specific sounds
selectNextStory(currentAbilityLevel)
}
Language Learning
Language learning apps use GoP scoring to provide pronunciation feedback. See the GoP Scoring section below for implementation details.
The easiest tasks to implement are constrained interactions where the expected response can be defined in advance (e.g., word or phrase repetition, reading exercises). For single-word responses, VAD thresholds can be much shorter than for longer passages. See VAD Thresholds for details.
Interactive Stories and Games
Voice-driven navigation and character interaction enable young children to engage with apps independently. Build decoding graphs containing the expected voice commands, answers, or spoken choices. See Decoding Graphs below.
As with language learning, constrained tasks where the expected responses can be defined (e.g., specific commands, character names, yes/no answers) are the easiest to implement. For short commands, VAD thresholds can be set much shorter for more responsive interaction. See VAD Thresholds for details.
Technical Implementation
VAD Thresholds
Voice Activity Detection (VAD) thresholds control when the recognizer considers speech to have ended. When these thresholds are reached, the recognizer automatically stops listening and calls the onFinalResponse callback. Optimal values for the endSilence parameter are typically driven by the length of the expected response. For longer responses, VAD threshold settings need adjustment because:
- Children read more slowly than adults speak
- There are natural pauses between words and phrases
- Struggling readers may pause mid-word or hesitate before difficult words
If the child is asked to read or repeat a single word, the end silence parameter can be much shorter (e.g., 0.5 or 0.8 seconds). Note that you can also adjust VAD parameters dynamically (for example, in the onPartialResult callback). If expecting a single word, you can start with a longer timeout (e.g., 1.5 seconds) for endSilence, and reduce it to 0.5 seconds once the expected response is recognized. Likewise, for longer text, the initial value can be larger; you can then reduce it in the onPartialResult callback once the child reaches the end of the text (the exact logic depends on your implementation).
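The adjustment itself is a setVADParameters call made from within onPartialResult; the policy that picks the new value is app code. A minimal sketch of such a policy (hypothetical function name; the threshold values come from the examples above):

```python
def choose_end_silence(partial_text, expected_text, initial=1.5, final=0.5):
    """Hypothetical policy: keep a generous endSilence while the child is
    mid-passage, and tighten it once every expected word has been spoken."""
    if len(partial_text.split()) >= len(expected_text.split()):
        return final
    return initial
```

The returned value would then be applied via the recognizer's VAD parameter API, as shown in the parameter examples below.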
Key parameters:
| Parameter | Description | Recommendation for Longer Responses |
|---|---|---|
| endSilence | Silence duration (seconds) before stopping | Increase to 3-5 seconds |
| noSpeech | Maximum initial silence | 5 seconds (but use-case specific) |
| maxDuration | Maximum listening duration | Keep around 30 seconds (see below) |
recognizer.setVADParameters(VADParameter.endSilence, 5.0) // Allow longer pauses
recognizer.setVADParameters(VADParameter.noSpeech, 5.0) // Allow longer pauses
recognizer.setVADParameters(VADParameter.maxDuration, 30) // Balance between passage length and processing time
maxDuration tradeoff: While longer maxDuration values allow children to read longer passages, computing the final response (including GoP scores) takes additional processing time. We recommend keeping maxDuration around 30 seconds. If more speech is expected (e.g., a long passage), restart listening in the onFinalResponse callback after processing the result.
onFinalResponse(response) {
processResult(response.result)
if (moreTextExpected) {
recognizer.startListening() // Continue with next segment
}
}
See Continuous Listening for more details on restarting the recognizer. Note that VAD gating does not need to be enabled for this use case, since the child is expected to continue reading immediately.
Stopping the Recognizer
We recommend using VAD thresholds to stop the recognizer gracefully. You can set endSilence to a very low value, or set maxDuration to 0, at any point (even while the recognizer is listening). Note that the latter stops the recognizer as soon as the current audio buffer is processed, which may cut off the last word if the user is still talking; this can affect the final word’s recognition accuracy and GoP scores.
Alternatively, calling recognizer.stopListening() will stop the recognizer almost immediately, but you will not receive a final response for that interaction. Use this method when you need to abort recognition entirely, for example when the user navigates away from the page.
Decoding Graphs
Decoding graphs should be built from the expected text (words, phrases, or commands). This constrains recognition to the target vocabulary and improves accuracy significantly compared to open-vocabulary recognition. If you have control over the content, avoid using homophones or acoustically similar words in the same graph/context; children are likely to mispronounce words and it is difficult to distinguish between acoustically close words.
Building a Decoding Graph
For oral reading, create a decoding graph containing the phrases or passages the child is expected to read:
// Build a decoding graph for a reading passage
phrases = [
"The cat sat on the mat.",
"She saw a big red ball."
]
recognizer.createDecodingGraph(phrases, "page1", task=OralReadingTask)
For most EdTech use cases, one graph per interaction (e.g., a page presented to the user in an oral-reading app) provides the best accuracy.
| Approach | Pros | Cons |
|---|---|---|
| One graph per page | Higher accuracy (constrained vocabulary) | Longer time to create, more graphs to manage |
| One large graph for entire book | Simpler management, single graph load | Lower accuracy (larger search space), may misrecognize acoustically similar words from wrong page |
You can either build all decoding graphs ahead of time (e.g. when user opens a book in a reading app), or build them right before the relevant task starts.
| Approach | Pros | Cons |
|---|---|---|
| Prebuild decoding graphs | Faster page transitions (graphs ready to use) | Initial delay when creating all the graphs |
| Build as needed | Lower initial delay | Brief delay when switching pages |
Since decoding graphs are persisted in the file system, you only need to create a specific graph once (assuming input parameters have not changed). The SDK provides methods to check for graph existence:
// Check existence of the decoding graph
if (DecodingGraph.graphWithNameExists("page1")) {
// No need to recreate graph
} else {
// Create graph
}
Using Contextual Graphs
Multiple decoding graphs can be merged into a single contextual graph, where each context represents a distinct interaction. This is useful when you have many related graphs and want to manage them as a single unit. For example, in a reading app, each page of a book can be a separate context within one contextual graph:
// Build contextual decoding graph for a story
contextualPhrases = [
[ // page/context 1 (index 0)
"The cat sat on the mat.",
"She saw a big red ball."
],
[ // page/context 2 (index 1)
"The dog ran fast",
"He found a bone"
]
]
recognizer.createContextualDecodingGraph(contextualPhrases, "story1", task=OralReadingTask)
currentPage = 0
recognizer.prepareForListeningWithContextualDecodingGraph("story1", currentPage, /* compute GoP */ true)
// ...
onFinalResponse(response) {
// Check if the current page is completed
alignment = alignTexts(response.result.text, expectedText[currentPage])
if (isPageComplete(alignment, expectedText[currentPage])) {
currentPage = currentPage + 1
if (currentPage < contextualPhrases.length) {
// Switch context to the next page
recognizer.prepareForListeningWithContextualDecodingGraph("story1", currentPage, /* compute GoP */ true)
} else {
// Story complete
}
}
}
| Approach | Pros | Cons |
|---|---|---|
| Separate graphs | No runtime penalty | Slower to create (e.g., creating 10 graphs takes much longer than 1 contextual graph with 10 contexts) |
| Contextual graph | Faster to create, easier to manage | Small runtime penalty when switching contexts (few hundred ms at most) |
Goodness of Pronunciation (GoP) Scoring
GoP scores provide per-phoneme pronunciation quality metrics, useful for pronunciation feedback in language learning and reading apps.
GoP scores are computed in reference to the recognized word and its corresponding phoneme sequence. When a word has multiple pronunciations (either in the lexicon or added explicitly via Alternative Word Pronunciations), the recognizer chooses the most acoustically likely pronunciation and computes GoP scores for that sequence.
Enabling GoP
GoP scoring is optional and is enabled via the prepareForListening methods:
// For regular decoding graphs
recognizer.prepareForListeningWithDecodingGraph("page1", /* compute GoP */ true)
// For contextual decoding graphs
recognizer.prepareForListeningWithContextualDecodingGraph("story1", /* context */ 0, /* compute GoP */ true)
Using GoP Scores
After recognition, access GoP scores from the result:
onFinalResponse(response) {
for (word in response.result.words) {
for (phone in word.phones) {
if (phone.pronunciationScore < 0.5) {
highlightMispronunciation(word, phone)
}
}
}
}
GoP scores range from 0 to 1, where higher values indicate closer match to expected pronunciation. Typical thresholds:
- > 0.7: Good pronunciation
- 0.4-0.7: Acceptable, may need practice
- < 0.4: Likely mispronounced
GoP Score Reliability
GoP scores are inherently noisy because phonemes are very short (some as brief as 30-40ms), providing limited acoustic information for scoring. A single low GoP score does not reliably indicate a pronunciation problem.
Best practice: Use a sliding window average over the last N occurrences of a specific phoneme (e.g., last 10 instances of the “CH” phoneme) rather than acting on individual scores or averaging across all phonemes. This smooths out noise and provides a more reliable signal for feedback on each phoneme:
// Track phoneme scores over time
phonemeHistory = {} // phoneme -> list of recent scores
onFinalResponse(response) {
// This may be done asynchronously, assuming you persist the response/result
for (word in response.result.words) {
for (phone in word.phones) {
phoneme = phone.text.split("_")[0] // strip positional suffix
if (!(phoneme in phonemeHistory)) { phonemeHistory[phoneme] = [] }
phonemeHistory[phoneme].append(phone.pronunciationScore)
phonemeHistory[phoneme] = phonemeHistory[phoneme].slice(-10) // keep last 10
avgScore = average(phonemeHistory[phoneme])
if (avgScore < 0.5 && phonemeHistory[phoneme].length >= 5) {
suggestPractice(phoneme)
}
}
}
}
Additional considerations when interpreting GoP scores:
- A severely mispronounced phoneme can affect adjacent phoneme scores
- When acting on a single response (as opposed to averaging over time), low scores for multiple phonemes more reliably indicate pronunciation issues than a single low score
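To illustrate that last point, a sketch of per-response word flagging (hypothetical names; the 0.4 threshold comes from the ranges above) that requires at least two low-scoring phonemes before flagging a word:

```python
def flag_mispronounced_words(words, threshold=0.4, min_low_phones=2):
    """Return words whose pronunciation likely has issues, requiring
    multiple low-GoP phonemes rather than reacting to a single outlier."""
    flagged = []
    for word in words:
        low = [p for p in word["phones"] if p["pronunciationScore"] < threshold]
        if len(low) >= min_low_phones:
            flagged.append(word["text"])
    return flagged
```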
Alternative Word Pronunciations
The WordPronunciation class allows you to augment the lexicon with custom pronunciations. This is useful for:
- Made-up or fictional words (character names, fantasy terms)
- Domain-specific vocabulary not in the standard lexicon
- Common mispronunciations you want to explicitly recognize (and handle accordingly)
Adding Custom Pronunciations
// Add fictional character names
altWordPronunciations = [
WordPronunciation("zorblax", "Z AO R B L AE K S"),
WordPronunciation("meeka", "M IY K AH")
]
// You can explicitly validate pronunciations before creating decoding graph
for (wordPron in altWordPronunciations) {
if (!wordPron.isValid(recognizer)) {
// Handle invalid pronunciation
}
}
recognizer.createDecodingGraph(phrases, "page1", altWordPronunciations)
Use isValid() to explicitly validate pronunciations beforehand.
Defining Common Mispronunciations
For young readers, you may want to explicitly recognize common mispronunciations so you can handle them appropriately:
// Recognize "liberry" as a mispronunciation of "library"
altWordPronunciations = [
WordPronunciation("library", "L AY B EH R IY", tag="mispronounced")
]
// Validate as shown above
recognizer.createDecodingGraph(phrases, "page1", altWordPronunciations)
The optional tag parameter lets you identify when an alternate pronunciation was recognized, so you can provide appropriate feedback. For new words added to augment the lexicon (e.g., fictional names), a tag is not necessary since the word itself will appear in word.text. Tags are primarily useful when adding alternate pronunciations of existing words, where you need to quickly distinguish which pronunciation variant was recognized without analyzing word.phones[] in the result.
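Checking for the tag at result time is app-side logic. A sketch (hypothetical result shape, assuming each recognized word exposes the tag of the pronunciation variant that matched):

```python
def tagged_words(words, tag="mispronounced"):
    """Return the text of recognized words whose matched pronunciation
    variant carries the given tag (e.g., a known mispronunciation)."""
    return [w["text"] for w in words if w.get("tag") == tag]
```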
Pronunciation Format
Pronunciations are defined using phonemes specified in the lang/phones.txt file in the ASR Bundle. For English, ARPAbet notation is used. Common phonemes:
| Phoneme | Example |
|---|---|
| AH | but |
| AE | cat |
| IY | see |
| OW | go |
| K | cat |
| S | sun |
The lang/lexicon.txt file in the ASR Bundle contains a lookup table for word pronunciations, which can be used as inspiration when creating custom pronunciations. Note that phonemes in lexicon.txt include positional tags (e.g., AH_B, AH_I, AH_E, AH_S), which should not be included when defining WordPronunciation entries.
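If you start from a lexicon.txt entry when crafting a custom pronunciation, the positional tags need to be stripped first; a small sketch (hypothetical helper name):

```python
def strip_positional_tags(pronunciation):
    """Convert a lexicon-style pronunciation with positional tags
    (e.g. 'K_B AE_I T_E') into the plain form WordPronunciation expects."""
    return " ".join(phone.split("_")[0] for phone in pronunciation.split())

print(strip_positional_tags("K_B AE_I T_E"))  # K AE T
```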
Demos
We provide a few EdTech demos on our website, including oral reading and pronunciation scoring examples. While those use KeenASR for Web SDK, you can achieve the same functionality on any platform we support.
Summary
When building EdTech apps with KeenASR SDK:
- Constrain recognition using decoding graphs built from expected content
- Use contextual graphs if you are dealing with many graphs
- Enable GoP scoring for pronunciation feedback (tune thresholds for your audience)
- Add custom pronunciations for made-up words or to detect common mispronunciations
