TextAligner compares text (typically output from the recognizer) against a known reference text and returns a word-level alignment. This is the foundation for oral reading instruction and assessment, language-learning tasks where the expected utterance is known, and any analysis that needs accuracy or word-error-rate metrics.
TextAligner was introduced in release 2.2. Starting with that release, text normalization is unified between DecodingGraph methods and TextAligner, so the same reference text used to build a decoding graph can be passed to TextAligner and aligned directly against result.text without ad-hoc preprocessing.
Overview
A TextAligner is created with a reference text and a language (either explicitly via a language code, or implicitly by passing an initialized recognizer). The reference is tokenized and normalized at construction time, so the resulting reference token list is what every alignment call compares against.
There are two alignment methods on the aligner, intended for different points in the recognition lifecycle:
| Method | Use when | Keeps state across calls |
|---|---|---|
| incrementalAlign(result.text) | Streaming recognition; call on every partial result and also on the final recognized text | Yes; builds on the previous incremental result for performance |
| align(result.text) | One-shot scenarios where partials are not used, for example post-hoc analysis of a stored response or a batch pipeline | No; each call is independent |
Both methods return an AlignmentResult describing the alignment, including a trace of operations (match, substitution, insertion, deletion) and aggregate statistics such as word-error-rate.
Pick one of the two methods per recognition session and stick with it:
- If you are using incrementalAlign on partials, also use incrementalAlign on the final recognized text (result.text in the finalResponse callback). The final call extends the same incremental state and returns the canonical alignment for the whole session.
- Use align only when partials are not in play, for example post-hoc analysis of a stored response or batch processing of recorded audio.
When the session is over, release the aligner (and create a new one) if you are moving to a different reference. If you used incrementalAlign and want another attempt against the same reference, call reset() first to clear the incremental state. After align, there is no incremental state to clear, so reset() is a no-op.
Typical lifecycle for an oral-reading interaction, assuming incrementalAlign is used:
- Create a TextAligner for the current passage (the reference text).
- On each partial result, call incrementalAlign(result.text) to track reading progress in real time. This is typically relevant when you are updating the UI (for example, highlighting the next word or the words that have already been read) or adjusting recognizer behavior (such as tightening endSilence as the reader approaches the end of the passage).
- When recognition finalizes, call incrementalAlign(result.text) on the final text. This extends the same state and produces the canonical alignment for accuracy and fluency.
- Call reset() before another attempt against the same reference, or release the aligner (close it on platforms where that applies) and create a new one if you are moving to a different reference.
aligner = TextAligner(referenceText, "en-us")
onPartialResult(result, recognizer):
    partial = aligner.incrementalAlign(result.text)
    updateHighlighting(partial.matchedRefIndices)
onFinalResponse(response):
    final = aligner.incrementalAlign(response.result.text)
    recordAccuracy(final.matches, final.refLength)
    recordWER(final.wordErrorRate)
    aligner.reset() // ready for the next attempt at the same passage
Align vs Incremental Align
Align
Aligns the full result.text against the reference and returns an AlignmentResult snapshot. Use this only when no partials are being processed, for example post-hoc analysis of a stored response, batch processing of recorded audio, or any single-shot comparison.
alignment = aligner.align(response.result.text)
print("WER:", alignment.wordErrorRate)
print("Matched:", alignment.matches, "/", alignment.refLength)
If you used incrementalAlign during the session, do not switch to align for the final text; continue with incrementalAlign (see below).
Incremental Align
Keeps state between calls. Designed for streaming partial results: each call assumes the new result.text is an extension of (or a small revision to) the previous one, and builds on cached state for performance.
Call incrementalAlign on every partial, and again on the final result.text in finalResponse. The final call extends the same state and returns the canonical alignment for the whole session, so switching to align at the end brings no benefit and risks a stale, stateless answer.
onPartialResult(result, recognizer):
    alignment = aligner.incrementalAlign(result.text)
    updateHighlighting(alignment.matchedRefIndices)
    if alignment.furthestMatchedIndex >= referenceTokenCount - 2:
        // reader is at (or near) the end of the passage; finalize sooner
        recognizer.setVADParameters(VADParameter.endSilence, 0.5)
onFinalResponse(response):
    final = aligner.incrementalAlign(response.result.text)
    // accuracy, fluency, etc. from `final`
    aligner.reset()
Call reset() between recognition sessions so the next session starts with a clean state. If you are moving to a different reference text, release the current aligner (close it on platforms where that applies) and create a new one.
Do not mix incrementalAlign and align within a single recognition session. If you used incrementalAlign on partials, also use it on the final text. Use align only when no partials were processed at all.
Reference Normalization
The reference text is normalized once when the TextAligner is constructed, using the same normalization pipeline as the DecodingGraph methods. This means:
- Punctuation, casing, and other surface differences are handled consistently across recognition and alignment.
- The reference does not need to be pre-cleaned in your app. Pass the natural form of the passage.
- The normalizer produces UPPER-cased tokens across all ASR Bundles, matching the recognizer’s output.
You can inspect the normalized reference via the referenceTokens accessor, which is useful when mapping alignment indices back to UI elements:
tokens = aligner.referenceTokens
// tokens[i] is the i-th normalized reference word; matches the refIndex values
// returned in AlignmentItem.refIndex and the indices in AlignmentResult.matchedRefIndices
Alignment Result
An immutable snapshot of an alignment. Both align and incrementalAlign return this type.
Edit-distance counts
| Field | Description |
|---|---|
| matches | Number of matching tokens |
| substitutions | Number of substitutions |
| insertions | Tokens in recognized but not in reference |
| deletions | Tokens in reference but not in recognized |
| refLength | Number of tokens in the reference |
| recLength | Number of tokens in the recognized text |
| wordErrorRate | Standard WER, (substitutions + insertions + deletions) / refLength |
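As a worked example of the WER formula, the following Python sketch computes WER and a simple accuracy ratio from the edit-distance counts. The function names and the sample counts are illustrative assumptions rather than part of the TextAligner API, and returning 0.0 for an empty reference is likewise an assumption.

```python
def word_error_rate(substitutions: int, insertions: int, deletions: int, ref_length: int) -> float:
    """Standard WER: (S + I + D) / N, where N is the reference length."""
    if ref_length <= 0:
        # Assumption: report 0.0 for an empty reference instead of dividing by zero.
        return 0.0
    return (substitutions + insertions + deletions) / ref_length


def accuracy(matches: int, ref_length: int) -> float:
    """Fraction of reference tokens that were matched."""
    return matches / ref_length if ref_length > 0 else 0.0


# Hypothetical counts for a 10-word reference: 7 matches, 2 substitutions,
# 1 deletion, plus 1 extra (inserted) word in the recognized text.
wer = word_error_rate(substitutions=2, insertions=1, deletions=1, ref_length=10)
acc = accuracy(matches=7, ref_length=10)
print(f"WER: {wer:.2f}, accuracy: {acc:.2f}")  # WER: 0.40, accuracy: 0.70
```

Because insertions are counted against the reference length, WER can exceed 1.0 when the recognized text is much longer than the reference.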
Trace
| Field | Description |
|---|---|
| trace | Ordered list of AlignmentItem describing every edit operation |
Each AlignmentItem has:
| Field | Description |
|---|---|
| op | AlignOp value: MATCH, SUBSTITUTION, INSERTION, or DELETION |
| refIndex | Index in the reference token list (-1 for insertions) |
| recIndex | Index in the recognized token list (-1 for deletions) |
| refToken | Reference word (empty for insertions) |
| recToken | Recognized word (empty for deletions) |
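To make the trace fields concrete, here is a minimal standalone Python sketch of a word-level edit-distance alignment that emits items with the same five fields. It is not the SDK's implementation: it uses a textbook Levenshtein table with unit costs, the `Item` class and `align_tokens` function are hypothetical names, and the tie-breaking order (diagonal, then deletion, then insertion) is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    op: str          # "MATCH", "SUBSTITUTION", "INSERTION", or "DELETION"
    ref_index: int   # -1 for insertions
    rec_index: int   # -1 for deletions
    ref_token: str   # "" for insertions
    rec_token: str   # "" for deletions


def align_tokens(ref: List[str], rec: List[str]) -> List[Item]:
    n, m = len(ref), len(rec)
    # cost[i][j] = minimum number of edits to turn ref[:i] into rec[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (0 if ref[i - 1] == rec[j - 1] else 1)
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back from the bottom-right corner to recover the operations.
    trace: List[Item] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            same = ref[i - 1] == rec[j - 1]
            if cost[i][j] == cost[i - 1][j - 1] + (0 if same else 1):
                op = "MATCH" if same else "SUBSTITUTION"
                trace.append(Item(op, i - 1, j - 1, ref[i - 1], rec[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            trace.append(Item("DELETION", i - 1, -1, ref[i - 1], ""))
            i -= 1
        else:
            trace.append(Item("INSERTION", -1, j - 1, "", rec[j - 1]))
            j -= 1
    trace.reverse()
    return trace


trace = align_tokens(["THE", "QUICK", "BROWN", "FOX"],
                     ["THE", "QUICK", "RED", "FOX", "JUMPS"])
print([item.op for item in trace])
# ['MATCH', 'MATCH', 'SUBSTITUTION', 'MATCH', 'INSERTION']
```

In this example the final item is an insertion, so its reference index is -1 and its reference token is empty, mirroring the conventions in the table.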
Oral-reading views
These accessors are convenient for reading-instruction UIs without having to walk the full trace:
| Field | Description |
|---|---|
| matchedRefMask | Boolean array, one entry per reference token, true where matched |
| matchedRefIndices | Indices of reference tokens that were matched |
| skippedRefIndices | Indices of reference tokens that were not produced (deletions) |
| furthestMatchedIndex | Highest reference index reached so far; useful for tracking progress through a passage |
| repetitionRefIndices | Reference indices the reader appears to have repeated (only populated when detectRepetitions is enabled) |
// Highlight every word read so far
for i in result.matchedRefIndices:
    ui.markWordAsRead(i)
// Detect that the reader is near the end of the passage
if result.furthestMatchedIndex >= aligner.referenceTokens.length - 2:
    shortenEndSilenceTimeout()
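One way to think about these views is as projections of the trace onto the reference tokens. The Python sketch below derives them from a trace given as (op, refIndex) pairs. The `ReadingViews` class, the `oral_reading_views` helper, and the tuple representation are hypothetical; returning -1 when nothing has matched is an assumption, and repetition detection is left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ReadingViews:
    matched_ref_mask: List[bool]
    matched_ref_indices: List[int]
    skipped_ref_indices: List[int]
    furthest_matched_index: int


def oral_reading_views(trace: List[Tuple[str, int]], ref_length: int) -> ReadingViews:
    """Project a trace of (op, refIndex) pairs onto per-reference-token views."""
    mask = [False] * ref_length
    skipped: List[int] = []
    for op, ref_index in trace:
        if op == "MATCH":
            mask[ref_index] = True
        elif op == "DELETION":
            skipped.append(ref_index)
        # Substitutions and insertions mark nothing as matched or skipped here.
    matched = [i for i, hit in enumerate(mask) if hit]
    # Assumption: report -1 when no reference token has been matched yet.
    furthest = matched[-1] if matched else -1
    return ReadingViews(mask, matched, skipped, furthest)


# Hypothetical trace for a 5-token reference.
trace = [("MATCH", 0), ("DELETION", 1), ("MATCH", 2),
         ("INSERTION", -1), ("SUBSTITUTION", 3), ("DELETION", 4)]
views = oral_reading_views(trace, ref_length=5)
near_end = views.furthest_matched_index >= 5 - 2
print(views.matched_ref_indices, views.skipped_ref_indices, near_end)
```

In practice the accessors on AlignmentResult provide these values directly; the sketch only illustrates how they relate to the trace.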
Align Op
| Value | Meaning |
|---|---|
| MATCH | Recognized token matches the reference token |
| SUBSTITUTION | A different token was recognized in place of the reference token |
| INSERTION | The recognized text contains an extra token not in the reference |
| DELETION | A reference token is missing from the recognized text |
Alignment Config
Optional per-call configuration passed to align or incrementalAlign. All fields are optional and have sensible defaults.
| Field | Default | Description |
|---|---|---|
| insertCost | 1 | Cost of inserting a recognized token |
| deleteCost | 1 | Cost of deleting a reference token |
| substituteCost | 1 | Cost of substituting one token for another |
| detectRepetitions | false | When enabled, the result populates repetitionRefIndices for words the reader appears to have stuttered or repeated |
| filterNoiseTokens | true | Drops recognized tokens whose text begins with < (for example <SPOKEN_NOISE>, <UNK>) before alignment |
config = AlignmentConfig(detectRepetitions: true, filterNoiseTokens: true)
alignment = aligner.incrementalAlign(result.text, config)
Adjusting costs is useful when you want the alignment to prefer one operation over another. For example, raising substituteCost above insertCost + deleteCost makes the aligner prefer to model a hard mismatch as an insertion next to a deletion, rather than as a single substitution.
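The arithmetic behind that trade-off can be sketched in a few lines of Python. The `mismatch_model` helper is hypothetical, and resolving ties in favor of substitution is an assumption; the aligner's actual tie-breaking is not described here.

```python
def mismatch_model(insert_cost: float, delete_cost: float, substitute_cost: float) -> str:
    """Return the cheaper way to model one mismatched reference/recognized word pair."""
    if substitute_cost <= insert_cost + delete_cost:
        # Assumption: ties go to substitution.
        return "substitution"
    return "insertion+deletion"


print(mismatch_model(1, 1, 1))  # substitution (1 <= 1 + 1)
print(mismatch_model(1, 1, 3))  # insertion+deletion (3 > 1 + 1)
```

With the default costs of 1 each, a single substitution (cost 1) is always cheaper than an insertion plus a deletion (cost 2).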
Noise Tokens and WER
filterNoiseTokens defaults to true, which drops recognized tokens like <SPOKEN_NOISE> and <UNK> from the recognized stream before alignment. With the default behavior, these tokens do not appear in the trace, do not count as insertions, and do not inflate wordErrorRate. The resulting WER reflects only the words the recognizer actually decoded against the reference, which is usually what you want when measuring reading accuracy.
Keep in mind that <SPOKEN_NOISE> is not a 1:1 marker for out-of-vocabulary words. A single <SPOKEN_NOISE> token may span several spoken OOV words, or may be emitted for non-speech audio, so the count of these tokens is not a reliable count of OOV occurrences. If you specifically want to see where these tokens occurred (for diagnostics, or to feed them into your own OOV/struggle heuristics), set filterNoiseTokens to false; they will then show up as insertions in the trace and contribute to WER.
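The effect on WER can be illustrated with a small Python sketch. The `drop_noise_tokens` helper is a hypothetical re-implementation of the documented rule (drop tokens beginning with <), not the SDK's code, and the token lists are made up for the example.

```python
from typing import List


def drop_noise_tokens(tokens: List[str]) -> List[str]:
    """Drop tokens whose text begins with '<', e.g. <SPOKEN_NOISE> or <UNK>."""
    return [t for t in tokens if not t.startswith("<")]


reference = ["THE", "CAT", "SAT"]
recognized = ["THE", "<SPOKEN_NOISE>", "CAT", "SAT", "<UNK>"]

filtered = drop_noise_tokens(recognized)

# In this example the filtered stream equals the reference, so there are no errors.
wer_filtered = 0 / len(reference)
# Left in place, each noise token would be an insertion: 2 insertions over 3 words.
wer_unfiltered = (len(recognized) - len(filtered)) / len(reference)

print(filtered, wer_filtered, round(wer_unfiltered, 2))
```

Here two noise tokens alone would raise WER from 0 to roughly 0.67, even though every reference word was read correctly.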
Resource Management
Lifecycle expectations differ slightly by platform; see the platform-specific reference docs for exact details:
- iOS: managed by ARC. Release the aligner like any other Objective-C / Swift object.
- Web: TextAligner wraps native (WASM) state. Call close() (or use a using declaration) when the aligner is no longer needed to release native memory.
- Android: the aligner holds native state via JNI. Call the platform’s release / close method when done.
- React Native: the JS object holds a handle to a native aligner. Call close() when done; subsequent calls on the instance throw.
A common pattern is to create a TextAligner per page (or per reference text), reuse it across multiple recognition attempts on that same reference by calling reset(), and release it when the user navigates away.
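The per-page pattern can be sketched in Python with a stand-in class. `StubAligner` and `read_page` are hypothetical: the stub models only the reset() and close() lifecycle described in this section, performs no alignment, and is not the SDK's TextAligner. Raising after close() mirrors the React Native behavior noted above and is an assumption for other platforms.

```python
from typing import List


class StubAligner:
    """Hypothetical stand-in that models only reset() and close()."""

    def __init__(self, reference_text: str) -> None:
        self.reference_text = reference_text
        self.calls_since_reset = 0  # stands in for cached incremental state
        self.closed = False

    def incremental_align(self, text: str) -> None:
        if self.closed:
            raise RuntimeError("aligner has been closed")
        self.calls_since_reset += 1

    def reset(self) -> None:
        self.calls_since_reset = 0

    def close(self) -> None:
        self.closed = True


def read_page(reference_text: str, attempts: List[List[str]]) -> StubAligner:
    aligner = StubAligner(reference_text)   # one aligner per page / reference
    try:
        for partials in attempts:           # several attempts at the same reference
            for text in partials:
                aligner.incremental_align(text)
            aligner.reset()                 # clean state before the next attempt
    finally:
        aligner.close()                     # release when the user leaves the page
    return aligner


done = read_page("THE CAT SAT", [["THE", "THE CAT"], ["THE CAT SAT"]])
print(done.closed, done.calls_since_reset)
```

Wrapping the attempts in try/finally ensures the native resources are released even if an attempt fails partway through.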
Putting It Together
For a concrete oral-reading scenario including highlighting, dynamic VAD adjustment, accuracy/fluency computation, and contextual graph page advancement, see the Text Alignment section of the EdTech use case page.
