Text Alignment

TextAligner compares text (typically output from the recognizer) against a known reference text and returns a word-level alignment. This is the foundation for oral reading instruction and assessment, language-learning tasks where the expected utterance is known, and any analysis that needs accuracy or word-error-rate metrics.

TextAligner was introduced in release 2.2. Starting with that release, text normalization is unified between DecodingGraph methods and TextAligner, so the same reference text used to build a decoding graph can be passed to TextAligner and aligned directly against result.text without ad-hoc preprocessing.

Note: Code examples use pseudo-code for clarity. See platform-specific API references for exact method signatures.

Overview

A TextAligner is created with a reference text and a language (either explicitly via a language code, or implicitly by passing an initialized recognizer). The reference is tokenized and normalized at construction time, so the resulting reference token list is what every alignment call compares against.

There are two alignment methods on the aligner, intended for different points in the recognition lifecycle:

Method	Use when	Keeps state across calls
`incrementalAlign(result.text)`	Streaming recognition; call on every partial result and also on the final recognized text	Yes; builds on the previous incremental result for performance
`align(result.text)`	One-shot scenarios where partials are not used, for example post-hoc analysis of a stored response or a batch pipeline	No; each call is independent

Both methods return an AlignmentResult describing the alignment, including a trace of operations (match, substitution, insertion, deletion) and aggregate statistics such as word-error-rate.

Pick one of the two methods per recognition session and stick with it:

If you are using incrementalAlign on partials, also use incrementalAlign on the final recognized text (result.text in the finalResponse callback). The final call extends the same incremental state and returns the canonical alignment for the whole session.
Use align only when partials are not in play, for example post-hoc analysis of a stored response or batch processing of recorded audio.

When the session is over and you are moving to a different reference, release the aligner and create a new one. If you used incrementalAlign and want another attempt against the same reference, call reset() first to clear the incremental state. After align there is no incremental state to clear, so reset() is a no-op.

Typical lifecycle for an oral-reading interaction, assuming incrementalAlign is used:

Create a TextAligner for the current passage (the reference text).
On each partial result, call incrementalAlign(result.text) to track reading progress in real time. This is typically relevant when you are updating the UI (for example, highlighting the next word or the words that have already been read) or adjusting recognizer behavior (such as tightening endSilence as the reader approaches the end of the passage).
When recognition finalizes, call incrementalAlign(result.text) on the final text. This extends the same state and produces the canonical alignment for accuracy and fluency.
Call reset() before another attempt against the same reference, or release the aligner (close it on platforms where that applies) and create a new one if you are moving to a different reference.

aligner = TextAligner(referenceText, "en-us")

onPartialResult(result, recognizer):
    partial = aligner.incrementalAlign(result.text)
    updateHighlighting(partial.matchedRefIndices)

onFinalResponse(response):
    final = aligner.incrementalAlign(response.result.text)
    recordAccuracy(final.matches, final.refLength)
    recordWER(final.wordErrorRate)

    aligner.reset()    // ready for the next attempt at the same passage

Align vs Incremental Align

Align

Aligns the full result.text against the reference and returns an AlignmentResult snapshot. Use this only when no partials are being processed, for example post-hoc analysis of a stored response, batch processing of recorded audio, or any single-shot comparison.

alignment = aligner.align(response.result.text)
print("WER:", alignment.wordErrorRate)
print("Matched:", alignment.matches, "/", alignment.refLength)

If you used incrementalAlign during the session, do not switch to align for the final text; continue with incrementalAlign (see below).

Incremental Align

Keeps state between calls. Designed for streaming partial results: each call assumes the new result.text is an extension of (or a small revision to) the previous one, and builds on cached state for performance.

Call incrementalAlign on every partial, and again on the final result.text in finalResponse. The final call extends the same state and returns the canonical alignment for the whole session. Switching to align at the end brings no benefit, and it risks a stateless result that disagrees with the alignment your partials already produced.

onPartialResult(result, recognizer):
    alignment = aligner.incrementalAlign(result.text)
    updateHighlighting(alignment.matchedRefIndices)
    if alignment.furthestMatchedIndex >= aligner.referenceTokens.length - 2:
        // reader is at (or near) the end of the passage; finalize sooner
        recognizer.setVADParameters(VADParameter.endSilence, 0.5)

onFinalResponse(response):
    final = aligner.incrementalAlign(response.result.text)
    // accuracy, fluency, etc. from `final`
    aligner.reset()

Call reset() between recognition sessions so the next session starts with a clean state. If you are moving to a different reference text, release the current aligner (close it on platforms where that applies) and create a new one.

Note: Do not mix incrementalAlign and align within a single recognition session. If you used incrementalAlign on partials, also use it on the final text. Use align only when no partials were processed at all.
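To make the idea of "building on cached state" concrete, here is a minimal sketch of one way an incremental aligner could reuse work between partials. It is not a description of TextAligner's internals: the class name IncrementalEditDistance, its update method, and the rows_computed counter are all hypothetical, and the sketch tracks only the edit distance (unit costs), not a full trace. It keeps one dynamic-programming row per recognized token and recomputes only the rows after the longest shared prefix.

```python
class IncrementalEditDistance:
    """Hypothetical illustration; not the SDK's implementation."""

    def __init__(self, ref_tokens):
        self.ref = list(ref_tokens)
        self.reset()

    def reset(self):
        # Clear all cached state so the next attempt starts fresh.
        self.rec = []
        # rows[j][i] = edit distance between rec[:j] and ref[:i]
        self.rows = [list(range(len(self.ref) + 1))]
        self.rows_computed = 0

    def update(self, rec_tokens):
        rec_tokens = list(rec_tokens)
        # Length of the prefix shared with the previous recognized text.
        keep = 0
        limit = min(len(self.rec), len(rec_tokens))
        while keep < limit and self.rec[keep] == rec_tokens[keep]:
            keep += 1
        # Rows for the shared prefix are still valid; discard the rest.
        del self.rows[keep + 1:]
        for j in range(keep, len(rec_tokens)):
            prev = self.rows[-1]
            row = [prev[0] + 1]
            for i in range(1, len(self.ref) + 1):
                diag = prev[i - 1] + (0 if self.ref[i - 1] == rec_tokens[j] else 1)
                row.append(min(diag, prev[i] + 1, row[i - 1] + 1))
            self.rows.append(row)
            self.rows_computed += 1
        self.rec = rec_tokens
        return self.rows[-1][-1]


tracker = IncrementalEditDistance("THE CAT SAT ON THE MAT".split())
d1 = tracker.update("THE CAT".split())                 # 2 rows computed
d2 = tracker.update("THE CAT SAT ON".split())          # only 2 new rows
d3 = tracker.update("THE CAT SAT IN THE MAT".split())  # revision: 3 rows redone
print(d1, d2, d3, tracker.rows_computed)  # prints: 4 2 1 7
```

In this sketch a stateless computation of the same three texts would compute 2 + 4 + 6 = 12 rows instead of 7, which is the kind of saving that motivates keeping state across partials.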

Reference Normalization

The reference text is normalized once when the TextAligner is constructed, using the same normalization pipeline as the DecodingGraph methods. This means:

Punctuation, casing, and other surface differences are handled consistently across recognition and alignment.
The reference does not need to be pre-cleaned in your app. Pass the natural form of the passage.
The normalizer produces uppercase tokens across all ASR Bundles, matching the recognizer’s output.

You can inspect the normalized reference via the referenceTokens accessor, which is useful when mapping alignment indices back to UI elements:

tokens = aligner.referenceTokens
// tokens[i] is the i-th normalized reference word; matches the refIndex values
// returned in AlignmentItem.refIndex and the indices in AlignmentResult.matchedRefIndices

Alignment Result

An immutable snapshot of an alignment. Both align and incrementalAlign return this type.

Edit-distance counts

Field	Description
`matches`	Number of matching tokens
`substitutions`	Number of substitutions
`insertions`	Tokens in recognized but not in reference
`deletions`	Tokens in reference but not in recognized
`refLength`	Number of tokens in the reference
`recLength`	Number of tokens in the recognized text
`wordErrorRate`	Standard WER, `(substitutions + insertions + deletions) / refLength`
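The relationship between these counts can be sketched in a few lines. The Counts class below is a hypothetical stand-in for the count fields of an AlignmentResult, not an SDK type. It relies on the standard edit-distance identities (every reference token is matched, substituted, or deleted; every recognized token is matched, substituted, or inserted). The value returned for an empty reference is an assumption, since this page does not document that case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Hypothetical stand-in for the count fields of an AlignmentResult."""
    matches: int
    substitutions: int
    insertions: int
    deletions: int

    @property
    def ref_length(self) -> int:
        # Each reference token is matched, substituted, or deleted.
        return self.matches + self.substitutions + self.deletions

    @property
    def rec_length(self) -> int:
        # Each recognized token is matched, substituted, or inserted.
        return self.matches + self.substitutions + self.insertions

    @property
    def word_error_rate(self) -> float:
        if self.ref_length == 0:
            return 0.0  # assumption: empty-reference behavior is not documented
        errors = self.substitutions + self.insertions + self.deletions
        return errors / self.ref_length


counts = Counts(matches=7, substitutions=1, insertions=1, deletions=2)
print(counts.ref_length, counts.rec_length, counts.word_error_rate)
# prints: 10 9 0.4
```

Because insertions appear in the numerator but not in the denominator, WER computed this way can exceed 1.0 when the recognized text contains many extra words.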

Trace

Field	Description
`trace`	Ordered list of `AlignmentItem` describing every edit operation

Each AlignmentItem has:

Field	Description
`op`	`AlignOp` value: `MATCH`, `SUBSTITUTION`, `INSERTION`, or `DELETION`
`refIndex`	Index in the reference token list (`-1` for insertions)
`recIndex`	Index in the recognized token list (`-1` for deletions)
`refToken`	Reference word (empty for insertions)
`recToken`	Recognized word (empty for deletions)
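As an illustration of how a trace might be consumed, the sketch below renders each item as a diff-style line. The AlignOp and AlignmentItem definitions here are hypothetical Python mirrors of the fields in the tables above (with snake_case names), and the trace is hand-written for the reference THE CAT SAT DOWN and the recognized text THE BIG CAT SIT; it is not output from the SDK.

```python
from dataclasses import dataclass
from enum import Enum


class AlignOp(Enum):
    MATCH = "MATCH"
    SUBSTITUTION = "SUBSTITUTION"
    INSERTION = "INSERTION"
    DELETION = "DELETION"


@dataclass(frozen=True)
class AlignmentItem:
    op: AlignOp
    ref_index: int   # -1 for insertions
    rec_index: int   # -1 for deletions
    ref_token: str   # empty for insertions
    rec_token: str   # empty for deletions


def describe(trace):
    """Return one human-readable line per edit operation."""
    lines = []
    for item in trace:
        if item.op is AlignOp.MATCH:
            lines.append(f"  {item.ref_token}")
        elif item.op is AlignOp.SUBSTITUTION:
            lines.append(f"~ {item.ref_token} -> {item.rec_token}")
        elif item.op is AlignOp.INSERTION:
            lines.append(f"+ {item.rec_token}")
        else:  # DELETION
            lines.append(f"- {item.ref_token}")
    return lines


# Hand-written example trace (illustrative only).
trace = [
    AlignmentItem(AlignOp.MATCH, 0, 0, "THE", "THE"),
    AlignmentItem(AlignOp.INSERTION, -1, 1, "", "BIG"),
    AlignmentItem(AlignOp.MATCH, 1, 2, "CAT", "CAT"),
    AlignmentItem(AlignOp.SUBSTITUTION, 2, 3, "SAT", "SIT"),
    AlignmentItem(AlignOp.DELETION, 3, -1, "DOWN", ""),
]

for line in describe(trace):
    print(line)
```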

Oral-reading views

These accessors are convenient for reading-instruction UIs without having to walk the full trace:

Field	Description
`matchedRefMask`	Boolean array, one entry per reference token, `true` where matched
`matchedRefIndices`	Indices of reference tokens that were matched
`skippedRefIndices`	Indices of reference tokens that were not produced (deletions)
`furthestMatchedIndex`	Highest reference index reached so far; useful for tracking progress through a passage
`repetitionRefIndices`	Reference indices the reader appears to have repeated (only populated when `detectRepetitions` is enabled)

// `alignment` is the AlignmentResult returned by align or incrementalAlign

// Highlight every word read so far
for i in alignment.matchedRefIndices:
    ui.markWordAsRead(i)

// Detect that the reader is near the end of the passage
if alignment.furthestMatchedIndex >= aligner.referenceTokens.length - 2:
    shortenEndSilenceTimeout()
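For intuition about what these accessors contain, here is a minimal sketch that derives comparable views from a trace. The function reading_views and its dictionary keys are hypothetical, and the trace is simplified to (op, refIndex) pairs. It assumes that furthestMatchedIndex is the highest matched reference index and that it is -1 when nothing has matched; neither detail is confirmed on this page. Repetition detection is not modeled.

```python
def reading_views(trace, ref_length):
    """Derive oral-reading views from a list of (op, ref_index) pairs."""
    matched_mask = [False] * ref_length
    skipped = []
    for op, ref_index in trace:
        if op == "MATCH":
            matched_mask[ref_index] = True
        elif op == "DELETION":
            skipped.append(ref_index)
        # SUBSTITUTION and INSERTION are neither matched nor skipped here.
    matched = [i for i, hit in enumerate(matched_mask) if hit]
    furthest = matched[-1] if matched else -1  # assumption: -1 when no match
    return {
        "matched_ref_mask": matched_mask,
        "matched_ref_indices": matched,
        "skipped_ref_indices": skipped,
        "furthest_matched_index": furthest,
    }


# Hand-written trace for a four-word reference (illustrative only).
trace = [
    ("MATCH", 0),
    ("INSERTION", -1),
    ("MATCH", 1),
    ("SUBSTITUTION", 2),
    ("DELETION", 3),
]
views = reading_views(trace, ref_length=4)
print(views["matched_ref_indices"], views["skipped_ref_indices"])
# prints: [0, 1] [3]
```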

Align Op

Value	Meaning
`MATCH`	Recognized token matches the reference token
`SUBSTITUTION`	A different token was recognized in place of the reference token
`INSERTION`	The recognized text contains an extra token not in the reference
`DELETION`	A reference token is missing from the recognized text

Alignment Config

Optional per-call configuration passed to align or incrementalAlign. All fields are optional and have sensible defaults.

Field	Default	Description
`insertCost`	`1`	Cost of inserting a recognized token
`deleteCost`	`1`	Cost of deleting a reference token
`substituteCost`	`1`	Cost of substituting one token for another
`detectRepetitions`	`false`	When enabled, the result populates `repetitionRefIndices` for words the reader appears to have stuttered or repeated
`filterNoiseTokens`	`true`	Drops recognized tokens whose text begins with `<` (for example `<SPOKEN_NOISE>`, `<UNK>`) before alignment

config = AlignmentConfig(detectRepetitions: true, filterNoiseTokens: true)
alignment = aligner.incrementalAlign(result.text, config)

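Adjusting costs is useful when you want the alignment to prefer one operation over another. A substitution competes with an insertion placed next to a deletion, so the deciding quantity is substituteCost compared with insertCost + deleteCost. With the defaults (1 versus 2), a mismatched word is reported as a single substitution; raising substituteCost above insertCost + deleteCost makes the aligner model the same mismatch as an insertion next to a deletion instead.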
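The effect of the cost fields can be demonstrated with a textbook weighted edit distance. The function align_ops below is a hypothetical sketch, not TextAligner's implementation; in particular, its tie-breaking order (diagonal first, then deletion, then insertion) is an assumption, so the SDK may order adjacent insertions and deletions differently.

```python
def align_ops(ref, rec, insert_cost=1, delete_cost=1, substitute_cost=1):
    """Return the list of edit operations on a cheapest alignment path."""
    n, m = len(ref), len(rec)
    # cost[i][j] = cheapest way to align ref[:i] with rec[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * delete_cost
    for j in range(1, m + 1):
        cost[0][j] = j * insert_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = 0 if ref[i - 1] == rec[j - 1] else substitute_cost
            cost[i][j] = min(
                cost[i - 1][j - 1] + step,
                cost[i - 1][j] + delete_cost,
                cost[i][j - 1] + insert_cost,
            )
    # Walk back from the bottom-right corner to recover the operations.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            same = ref[i - 1] == rec[j - 1]
            step = 0 if same else substitute_cost
            if cost[i][j] == cost[i - 1][j - 1] + step:
                ops.append("MATCH" if same else "SUBSTITUTION")
                i -= 1
                j -= 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + delete_cost:
            ops.append("DELETION")
            i -= 1
        else:
            ops.append("INSERTION")
            j -= 1
    ops.reverse()
    return ops


ref = ["THE", "CAT", "SAT"]
rec = ["THE", "DOG", "SAT"]
default_ops = align_ops(ref, rec)
costly_sub_ops = align_ops(ref, rec, substitute_cost=3)
print(default_ops)     # ['MATCH', 'SUBSTITUTION', 'MATCH']
print(costly_sub_ops)  # ['MATCH', 'INSERTION', 'DELETION', 'MATCH']
```

With unit costs the mismatch CAT/DOG is a single substitution (cost 1). Once substitute_cost is 3, the insertion plus deletion (cost 2) is cheaper, so the same mismatch is split into two operations.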

Noise Tokens and WER

filterNoiseTokens defaults to true, which drops recognized tokens like <SPOKEN_NOISE> and <UNK> from the recognized stream before alignment. With the default behavior, these tokens do not appear in the trace, do not count as insertions, and do not inflate wordErrorRate. The resulting WER reflects only the words the recognizer actually decoded against the reference, which is usually what you want when measuring reading accuracy.

Keep in mind that <SPOKEN_NOISE> is not a 1:1 marker for out-of-vocabulary words. A single <SPOKEN_NOISE> token may span several spoken OOV words, or may be emitted for non-speech audio, so the count of these tokens is not a reliable count of OOV occurrences. If you specifically want to see where these tokens occurred (for diagnostics, or to feed them into your own OOV/struggle heuristics), set filterNoiseTokens to false; they will then show up as insertions in the trace and contribute to WER.
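The filtering rule described above (drop recognized tokens whose text begins with `<`) is simple enough to sketch directly. The function drop_noise_tokens is a hypothetical illustration of that rule, not the SDK's code, and the token strings are made up for the example.

```python
def drop_noise_tokens(rec_tokens):
    """Remove tokens such as <SPOKEN_NOISE> or <UNK> before alignment."""
    return [token for token in rec_tokens if not token.startswith("<")]


reference = "THE CAT SAT".split()
recognized = "THE <SPOKEN_NOISE> CAT <UNK> SAT".split()

filtered = drop_noise_tokens(recognized)
print(filtered)  # ['THE', 'CAT', 'SAT']

# With filtering, the recognized text equals the reference: no errors.
# Without filtering, the two noise tokens would be extra recognized
# tokens, i.e. two insertions against a three-word reference.
extra_tokens_without_filter = len(recognized) - len(filtered)
print(extra_tokens_without_filter)  # 2
```

In this example the unfiltered alignment would have a WER of 2/3 even though every reference word was read correctly, which is why filtering is the default for reading-accuracy metrics.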

Resource Management

Lifecycle expectations differ slightly by platform; see the platform-specific reference docs for exact details:

iOS: managed by ARC. Release the aligner like any other Objective-C / Swift object.
Web: TextAligner wraps native (WASM) state. Call close() (or use a using declaration) when the aligner is no longer needed to release native memory.
Android: the aligner holds native state via JNI. Call the platform’s release / close method when done.
React Native: the JS object holds a handle to a native aligner. Call close() when done; subsequent calls on the instance throw.

A common pattern is to create a TextAligner per page (or per reference text), reuse it across multiple recognition attempts on that same reference by calling reset(), and release it when the user navigates away.

Putting It Together

For a concrete oral-reading scenario including highlighting, dynamic VAD adjustment, accuracy/fluency computation, and contextual graph page advancement, see the Text Alignment section of the EdTech use case page.