KeenASR SDK provides ASR Bundles with acoustic models specifically optimized for children’s voices (currently available for English). Combined with SDK features designed for educational use cases, this makes KeenASR well-suited for apps targeting young users.
Common use cases include:
- Oral reading: both reading instruction (real-time feedback as children read aloud) and assessment (fluency scoring, accuracy tracking)
- Speech and language therapy: articulation practice, phoneme-level feedback, progress tracking
- Language learning: pronunciation practice with GoP scoring
- Interactive stories and games: voice-driven navigation and character interaction
For young children who cannot yet read or write, voice is often the only practical input method. Traditional UI elements like text buttons or typed responses are not viable for this audience, making speech recognition essential for meaningful interaction.
Oral Reading Instruction
Oral reading instruction apps guide children through reading words, phrases, or passages, providing real-time feedback and tracking progress over time.
Setup
The recognizer will automatically stop listening based on VAD (Voice Activity Detection) thresholds and deliver the final response via the onFinalResponse callback. See VAD Thresholds for details on configuring when the recognizer stops.
When creating decoding graphs for oral reading, use the OralReadingTask option to optimize recognition for this use case:
recognizer.createDecodingGraph(phrases, "page1", task=OralReadingTask)
This option:
- Optimizes the decoding graph structure for oral reading scenarios
- Explicitly models common mistakes made when learning to read
- Models incomplete words (false starts, hesitations)
See Decoding Graphs below for more details on building decoding graphs.
Incomplete words: When a child starts a word but doesn’t finish it (e.g., saying “STRAWBE…” instead of “STRAWBERRY”), the result will include the word with a #INC suffix indicating an incomplete pronunciation:
{
"words": [
{ "text": "STRAWBERRY#INC", "confidence": 0.85 }
]
}
Enabling GoP: To compute GoP scores, pass true as the second parameter when preparing the recognizer:
recognizer.prepareForListeningWithDecodingGraph("page1", true)
Partial Results
Partial results can be used for real-time highlighting, dynamic VAD threshold adjustments, detecting user struggles, and more.
Partial results provide low-latency feedback but do not include GoP scores. For most reading instruction scenarios, it’s better not to interrupt the child’s reading flow with pronunciation feedback. Instead, let them complete the passage and analyze GoP scores from the final result afterward.
Real-time highlighting: Use partial results to highlight words as the child reads, showing either what has been read or what word comes next. For early readers, highlighting serves as a helpful guide to keep them on track. For more advanced readers, this may be more of a distraction; consider making it optional or adjusting based on reading level.
onPartialResult(result, recognizer) {
// alignTexts: compute word-level alignment between recognized and expected text
// Typically implemented using dynamic programming
alignment = alignTexts(result.text, expectedResponse)
highlightWordsRead(alignment, expectedResponse)
}
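The alignTexts helper above is app-side logic, not part of the SDK. One way to sketch it (hypothetical names, simplified normalization) is the classic edit-distance dynamic program with a backtrace that marks which expected words were actually read:

```python
def align_texts(recognized, expected):
    """Word-level alignment via edit-distance dynamic programming.
    Returns a list of (expected_word, matched) pairs."""
    rec = recognized.upper().split()
    exp = expected.upper().split()
    n, m = len(rec), len(exp)
    # dp[i][j] = minimum edits to align rec[:i] with exp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if rec[i - 1] == exp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # extra recognized word
                           dp[i][j - 1] + 1,        # expected word skipped
                           dp[i - 1][j - 1] + cost)  # match / substitution
    # Backtrace to mark which expected words were matched exactly
    matched = [False] * m
    i, j = n, m
    while i > 0 and j > 0:
        if rec[i - 1] == exp[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            matched[j - 1] = True
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j - 1] + 1:
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i][j - 1] + 1:
            j -= 1
        else:
            i -= 1
    return list(zip(exp, matched))
```

A production implementation would also normalize punctuation and treat #INC variants specially, but the core alignment step is the same.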
Dynamic VAD adjustment: Analyze partial results to determine if the child has reached the end of the expected text. If so, you can reduce the endSilence threshold to stop listening sooner rather than waiting for the full timeout. See VAD Thresholds for more details.
Detecting struggles: Implement logic to detect when a child might be struggling. Indicators include:
- Long delays between partial results (e.g., no new words recognized for several seconds)
- Incomplete words (words with the #INC suffix indicating false starts)
- <SPOKEN_NOISE> tokens indicating out-of-vocabulary words
When struggles are detected, you could stop the recognizer, provide an intervention (such as a hint or pronouncing the next word), and then restart the recognizer to continue.
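One possible shape for such detection logic (app-side, hypothetical names and threshold) is to track the timestamp at which each partial result last added a new word, and flag a stall when no progress has been made for some number of seconds:

```python
# Hypothetical app-side struggle detector: flags a stall when the number of
# recognized words has not grown for `stall_seconds`.
class StruggleDetector:
    def __init__(self, stall_seconds=4.0):
        self.stall_seconds = stall_seconds
        self.last_word_count = 0
        self.last_progress_time = None

    def on_partial_result(self, timestamp, partial_text):
        """Feed each partial result; returns True if the child seems stuck."""
        word_count = len(partial_text.split())
        if self.last_progress_time is None or word_count > self.last_word_count:
            self.last_word_count = word_count
            self.last_progress_time = timestamp
            return False
        return (timestamp - self.last_progress_time) >= self.stall_seconds
```

You would call on_partial_result from the SDK's partial-result callback; incomplete-word and <SPOKEN_NOISE> checks can be layered on top of the same class.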
Pronunciation Feedback Strategy
We recommend using GoP scores and defining your own logic for determining if a word is pronounced correctly and/or tracking mispronunciations for adaptive content and task selection. Note that word.confidence is not a reliable measure of mispronunciation; it may return high scores even for mispronounced words. See Goodness of Pronunciation (GoP) Scoring for details on using GoP scores effectively.
If modeling mispronunciations explicitly: When using custom WordPronunciation entries to model common mispronunciations, factor those results into your pronunciation feedback logic. If a mispronunciation variant was recognized, you know the child used that incorrect pronunciation regardless of GoP scores. See Alternative Word Pronunciations below for implementation details.
Adaptive Task Difficulty
Match reading content to the child’s ability level. The SDK can help estimate and monitor reading ability:
- Initial assessment: Have the child read a calibration passage and analyze accuracy, fluency (words per minute), and pronunciation scores
- Continuous monitoring: Track these metrics across sessions to detect improvement or struggles
- Content and task selection: Use the ability estimate to select appropriately challenging stories and relevant intervention tasks
// Simple fluency tracking
onFinalResponse(response) {
// all of this could also be done asynchronously if response/result is persisted
// alignTexts: compute word-level alignment between recognized and expected text
// Typically implemented using dynamic programming
alignment = alignTexts(response.result.text, expectedResponse)
wordsReadCorrect = computeReadingAccuracy(alignment, expectedResponse)
accuracy = wordsReadCorrect / expectedResponse.split(" ").length
// Calculate duration based on actual speech (excluding leading/trailing silence)
firstWord = response.result.words.firstElement
lastWord = response.result.words.lastElement
duration = (lastWord.startTime + lastWord.duration) - firstWord.startTime
wordsCorrectPerMinute = (wordsReadCorrect / duration) * 60
updateAbilityEstimate(accuracy, wordsCorrectPerMinute)
// or choose different activity based on more detailed analysis of the profile
// e.g. practice words with specific sounds or do listening activity for specific sounds
selectNextStory(currentAbilityLevel)
}
Language Learning
Language learning apps use GoP scoring to provide pronunciation feedback. See the GoP Scoring section below for implementation details.
The easiest tasks to implement are constrained interactions where the expected response can be defined in advance (e.g., word or phrase repetition, reading exercises). For single-word responses, VAD thresholds can be much shorter than for longer passages. See VAD Thresholds for details.
Interactive Stories and Games
Voice-driven navigation and character interaction enable young children to engage with apps independently. Build decoding graphs containing the expected voice commands, answers, or spoken choices. See Decoding Graphs below.
As with language learning, constrained tasks where the expected responses can be defined (e.g., specific commands, character names, yes/no answers) are the easiest to implement. For short commands, VAD thresholds can be set much shorter for more responsive interaction. See VAD Thresholds for details.
Technical Implementation
VAD Thresholds
Voice Activity Detection (VAD) thresholds control when the recognizer considers speech to have ended. When these thresholds are reached, the recognizer automatically stops listening and calls the onFinalResponse callback. Optimal values for the endSilence parameter are typically driven by the length of the expected response. For longer responses, VAD threshold settings need adjustment because:
- Children read more slowly than adults speak
- There are natural pauses between words and phrases
- Struggling readers may pause mid-word or hesitate before difficult words
If the child is asked to read or repeat a single word, the end silence parameter can be much shorter (e.g., 0.5 or 0.8 seconds). Note that you can also adjust VAD parameters dynamically (for example, in the onPartialResult callback). If expecting a single word, you can start with a longer timeout (e.g., 1.5 seconds) for endSilence, and reduce it to 0.5 seconds once the expected response is recognized. Likewise, for longer text, the initial value can be larger; you can then reduce it in the onPartialResult callback once the child reaches the end of the text (the exact logic depends on your implementation).
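The adjustment itself is a setVADParameters call made from within onPartialResult; the policy that picks the new value is app code. A minimal sketch of such a policy (hypothetical function name; the threshold values come from the examples above):

```python
def choose_end_silence(partial_text, expected_text, initial=1.5, final=0.5):
    """Hypothetical policy: keep a generous endSilence while the child is
    mid-passage, and tighten it once every expected word has been spoken."""
    if len(partial_text.split()) >= len(expected_text.split()):
        return final
    return initial
```

The returned value would then be applied via the recognizer's VAD parameter API, as shown in the parameter examples below.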
Key parameters:
| Parameter | Description | Recommendation for Longer Responses |
|---|---|---|
| endSilence | Silence duration (seconds) before stopping | Increase to 3-5 seconds |
| noSpeech | Maximum initial silence | 5 seconds (but use-case specific) |
| maxDuration | Maximum listening duration | Keep around 30 seconds (see below) |
recognizer.setVADParameters(VADParameter.endSilence, 5.0) // Allow longer pauses
recognizer.setVADParameters(VADParameter.noSpeech, 5.0) // Allow longer pauses
recognizer.setVADParameters(VADParameter.maxDuration, 30) // Balance between passage length and processing time
maxDuration tradeoff: While longer maxDuration values allow children to read longer passages, computing the final response (including GoP scores) takes additional processing time. We recommend keeping maxDuration around 30 seconds. If more speech is expected (e.g., a long passage), restart listening in the onFinalResponse callback after processing the result.
onFinalResponse(response) {
processResult(response.result)
if (moreTextExpected) {
recognizer.startListening() // Continue with next segment
}
}
See Continuous Listening for more details on restarting the recognizer. Note that VAD gating does not need to be enabled for this use case, since the child is expected to continue reading immediately.
Stopping the Recognizer
We recommend using VAD thresholds to stop the recognizer gracefully. You can set endSilence to a very low value, or set maxDuration to 0, at any point (even while the recognizer is listening). Note that the latter stops the recognizer as soon as the current audio buffer is processed, which may cut off the last word if the user is still talking; this can affect the final word’s recognition accuracy and GoP scores.
Alternatively, calling recognizer.stopListening() will stop the recognizer almost immediately, but you will not receive a final response for that interaction. Use this method when you need to abort recognition entirely, for example when the user navigates away from the page.
Decoding Graphs
Decoding graphs should be built from the expected text (words, phrases, or commands). This constrains recognition to the target vocabulary and improves accuracy significantly compared to open-vocabulary recognition. If you have control over the content, avoid using homophones or acoustically similar words in the same graph/context; children are likely to mispronounce words and it is difficult to distinguish between acoustically close words.
Building a Decoding Graph
For oral reading, create a decoding graph containing the phrases or passages the child is expected to read:
// Build a decoding graph for a reading passage
phrases = [
"The cat sat on the mat.",
"She saw a big red ball."
]
recognizer.createDecodingGraph(phrases, "page1", task=OralReadingTask)
For most EdTech use cases, one graph per interaction (e.g., a page presented to the user in an oral-reading app) provides the best accuracy.
| Approach | Pros | Cons |
|---|---|---|
| One graph per page | Higher accuracy (constrained vocabulary) | Longer time to create, more graphs to manage |
| One large graph for entire book | Simpler management, single graph load | Lower accuracy (larger search space), may misrecognize acoustically similar words from wrong page |
You can either build all decoding graphs ahead of time (e.g. when user opens a book in a reading app), or build them right before the relevant task starts.
| Approach | Pros | Cons |
|---|---|---|
| Prebuild decoding graphs | Faster page transitions (graphs ready to use) | Initial delay when creating all the graphs |
| Build as needed | Lower initial delay | Brief delay when switching pages |
Since decoding graphs are persisted in the file system, you only need to create a specific graph once (assuming input parameters have not changed). The SDK provides methods to check for graph existence:
// Check existence of the decoding graph
if (DecodingGraph.graphWithNameExists("page1")) {
// No need to recreate graph
} else {
// Create graph
}
Using Contextual Graphs
Multiple decoding graphs can be merged into a single contextual graph, where each context represents a distinct interaction. This is useful when you have many related graphs and want to manage them as a single unit. For example, in a reading app, each page of a book can be a separate context within one contextual graph:
// Build contextual decoding graph for a story
contextualPhrases = [
[ // page/context 1 (index 0)
"The cat sat on the mat.",
"She saw a big red ball."
],
[ // page/context 2 (index 1)
"The dog ran fast",
"He found a bone"
]
]
recognizer.createContextualDecodingGraph(contextualPhrases, "story1", task=OralReadingTask)
currentPage = 0
recognizer.prepareForListeningWithContextualDecodingGraph("story1", currentPage, /* compute GoP */ true)
// ...
onFinalResponse(response) {
// Check if the current page is completed
alignment = alignTexts(response.result.text, expectedText[currentPage])
if (isPageComplete(alignment, expectedText[currentPage])) {
currentPage = currentPage + 1
if (currentPage < contextualPhrases.length) {
// Switch context to the next page
recognizer.prepareForListeningWithContextualDecodingGraph("story1", currentPage, /* compute GoP */ true)
} else {
// Story complete
}
}
}
| Approach | Pros | Cons |
|---|---|---|
| Separate graphs | No runtime penalty | Slower to create (e.g., creating 10 graphs takes much longer than 1 contextual graph with 10 contexts) |
| Contextual graph | Faster to create, easier to manage | Small runtime penalty when switching contexts (few hundred ms at most) |
Goodness of Pronunciation (GoP) Scoring
GoP scores provide per-phoneme pronunciation quality metrics, useful for pronunciation feedback in language learning and reading apps.
GoP scores are computed in reference to the recognized word and its corresponding phoneme sequence. When a word has multiple pronunciations (either in the lexicon or added explicitly via Alternative Word Pronunciations), the recognizer chooses the most acoustically likely pronunciation and computes GoP scores for that sequence.
Enabling GoP
GoP scoring is optional and is enabled via the prepareForListening methods:
// For regular decoding graphs
recognizer.prepareForListeningWithDecodingGraph("page1", /* compute GoP */ true)
// For contextual decoding graphs
recognizer.prepareForListeningWithContextualDecodingGraph("story1", /* context */ 0, /* compute GoP */ true)
Using GoP Scores
After recognition, access GoP scores from the result:
onFinalResponse(response) {
for (word in response.result.words) {
for (phone in word.phones) {
if (phone.pronunciationScore < 0.5) {
highlightMispronunciation(word, phone)
}
}
}
}
GoP scores range from 0 to 1, where higher values indicate closer match to expected pronunciation. Typical thresholds:
- > 0.7: Good pronunciation
- 0.4-0.7: Acceptable, may need practice
- < 0.4: Likely mispronounced
GoP Score Reliability
GoP scores are inherently noisy because phonemes are very short (some as brief as 30-40ms), providing limited acoustic information for scoring. A single low GoP score does not reliably indicate a pronunciation problem.
Best practice: Use a sliding window average over the last N occurrences of a specific phoneme (e.g., last 10 instances of the “CH” phoneme) rather than acting on individual scores or averaging across all phonemes. This smooths out noise and provides a more reliable signal for feedback on each phoneme:
// Track phoneme scores over time
phonemeHistory = {} // phoneme -> list of recent scores
onFinalResponse(response) {
// This may be done asynchronously, assuming you persist the response/result
for (word in response.result.words) {
for (phone in word.phones) {
phoneme = phone.text.split("_")[0] // strip positional suffix
if (!(phoneme in phonemeHistory)) { phonemeHistory[phoneme] = [] }
phonemeHistory[phoneme].append(phone.pronunciationScore)
phonemeHistory[phoneme] = phonemeHistory[phoneme].slice(-10) // keep last 10
avgScore = average(phonemeHistory[phoneme])
if (avgScore < 0.5 && phonemeHistory[phoneme].length >= 5) {
suggestPractice(phoneme)
}
}
}
}
Additional considerations when interpreting GoP scores:
- A severely mispronounced phoneme can affect adjacent phoneme scores
- When acting on a single response (as opposed to averaging over time), low scores for multiple phonemes more reliably indicate pronunciation issues than a single low score
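To illustrate that last point, a sketch of per-response word flagging (hypothetical names; the 0.4 threshold comes from the ranges above) that requires at least two low-scoring phonemes before flagging a word:

```python
def flag_mispronounced_words(words, threshold=0.4, min_low_phones=2):
    """Return words whose pronunciation likely has issues, requiring
    multiple low-GoP phonemes rather than reacting to a single outlier."""
    flagged = []
    for word in words:
        low = [p for p in word["phones"] if p["pronunciationScore"] < threshold]
        if len(low) >= min_low_phones:
            flagged.append(word["text"])
    return flagged
```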
Alternative Word Pronunciations
The WordPronunciation class allows you to augment the lexicon with custom pronunciations. This is useful for:
- Made-up or fictional words (character names, fantasy terms)
- Domain-specific vocabulary not in the standard lexicon
- Common mispronunciations you want to explicitly recognize (and handle accordingly)
Adding Custom Pronunciations
// Add fictional character names
altWordPronunciations = [
WordPronunciation("zorblax", "Z AO R B L AE K S"),
WordPronunciation("meeka", "M IY K AH")
]
// You can explicitly validate pronunciations before creating decoding graph
for (wordPron in altWordPronunciations) {
if (!wordPron.isValid(recognizer)) {
// Handle invalid pronunciation
}
}
recognizer.createDecodingGraph(phrases, "page1", altWordPronunciations)
Use isValid() to explicitly validate pronunciations beforehand.
Defining Common Mispronunciations
For young readers, you may want to explicitly recognize common mispronunciations so you can handle them appropriately:
// Recognize "liberry" as a mispronunciation of "library"
altWordPronunciations = [
WordPronunciation("library", "L AY B EH R IY", tag="mispronounced")
]
// Validate as shown above
recognizer.createDecodingGraph(phrases, "page1", altWordPronunciations)
The optional tag parameter lets you identify when an alternate pronunciation was recognized, so you can provide appropriate feedback. For new words added to augment the lexicon (e.g., fictional names), a tag is not necessary since the word itself will appear in word.text. Tags are primarily useful when adding alternate pronunciations of existing words, where you need to quickly distinguish which pronunciation variant was recognized without analyzing word.phones[] in the result.
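Checking for the tag at result time is app-side logic. A sketch (hypothetical result shape, assuming each recognized word exposes the tag of the pronunciation variant that matched):

```python
def tagged_words(words, tag="mispronounced"):
    """Return the text of recognized words whose matched pronunciation
    variant carries the given tag (e.g., a known mispronunciation)."""
    return [w["text"] for w in words if w.get("tag") == tag]
```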
Pronunciation Format
Pronunciations are defined using phonemes specified in the lang/phones.txt file in the ASR Bundle. For English, ARPAbet notation is used. Common phonemes:
| Phoneme | Example |
|---|---|
| AH | but |
| AE | cat |
| IY | see |
| OW | go |
| K | cat |
| S | sun |
The lang/lexicon.txt file in the ASR Bundle contains a lookup table for word pronunciations, which can be used as inspiration when creating custom pronunciations. Note that phonemes in lexicon.txt include positional tags (e.g., AH_B, AH_I, AH_E, AH_S), which should not be included when defining WordPronunciation entries.
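If you start from a lexicon.txt entry when crafting a custom pronunciation, the positional tags need to be stripped first; a small sketch (hypothetical helper name):

```python
def strip_positional_tags(pronunciation):
    """Convert a lexicon-style pronunciation with positional tags
    (e.g. 'K_B AE_I T_E') into the plain form WordPronunciation expects."""
    return " ".join(phone.split("_")[0] for phone in pronunciation.split())

print(strip_positional_tags("K_B AE_I T_E"))  # K AE T
```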
Demos
We provide a few EdTech demos on our website, including oral reading and pronunciation scoring examples. While those use KeenASR for Web SDK, you can achieve the same functionality on any platform we support.
Summary
When building EdTech apps with KeenASR SDK:
- Constrain recognition using decoding graphs built from expected content
- Use contextual graphs if you are dealing with many graphs
- Enable GoP scoring for pronunciation feedback (tune thresholds for your audience)
- Add custom pronunciations for made-up words or to detect common mispronunciations
