Acoustic Model
A statistical model that represents the relationship between the audio signal and phones, the perceptually distinct units of sound in a specific language. Acoustic models are usually language-specific and trained on thousands of hours of human-transcribed speech recordings. Keen Research provides acoustic models in ASR Bundles as part of the KeenASR SDK.
ASR Bundle
A KeenASR specific asset – a directory with several files – that defines the acoustic model, the lexicon, and various configuration parameters. The ASR bundle can be included with your app, or it can be downloaded after the user has installed and launched the app.
Decoding Graph
A weighted finite-state transducer that combines the lexicon, language model, and acoustic model into a structure used internally by the recognizer.
Final Recognition Result
The best hypothesis, provided after the recognizer has stopped listening. In addition to the text transcription, the final result contains start/end times and confidence values for each word. It may also contain similar information about the individual phones that make up each word. Beyond the single best hypothesis, the final recognition result may also include a list of N-best results.
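As an illustration, a final result with word timings, confidences, and N-best alternatives might be modeled with a structure like the following. This is a hypothetical sketch for exposition, not the SDK's actual result type; all class and field names are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class WordResult:
    text: str
    start_time: float   # seconds from the start of the listening session
    end_time: float
    confidence: float   # typically in [0, 1]

@dataclass
class RecognitionResult:
    text: str                                        # best hypothesis
    words: List[WordResult] = field(default_factory=list)
    nbest: List[str] = field(default_factory=list)   # alternative hypotheses

# A final result carries per-word detail alongside the transcription.
result = RecognitionResult(
    text="open the door",
    words=[
        WordResult("open", 0.12, 0.45, 0.97),
        WordResult("the", 0.45, 0.58, 0.91),
        WordResult("door", 0.58, 1.02, 0.99),
    ],
    nbest=["open the door", "open that door"],
)
print(result.text)
```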
Language Model
A statistical model of words (text) in context; n-gram probabilities (bigrams, trigrams, etc.) are typically used to capture context. Language models can also be based on recurrent neural networks. Language models are trained on large amounts of domain-specific text data.
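To make the n-gram idea concrete, here is a minimal sketch that estimates bigram probabilities from a toy corpus. Real language models are trained on far larger text collections and use smoothing; this maximum-likelihood version is for illustration only.

```python
from collections import Counter

# Toy corpus; real LM training data is orders of magnitude larger.
corpus = [
    "turn the light on",
    "turn the light off",
    "turn the volume up",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]   # sentence boundary markers
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev); real models add smoothing."""
    return bigrams[(prev, word)] / unigrams[prev]

# "the" is followed by "light" twice and by "volume" once, so P = 2/3.
print(bigram_prob("the", "light"))
```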
Lexicon
A lookup table that defines mappings between words (used in the language model) and their phonetic pronunciations. For some languages this mapping is completely deterministic, i.e., given a word, its phonetic transcription can be obtained by following a set of rules.
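In its simplest form a lexicon is a dictionary from words to one or more phone sequences. The entries below are an illustrative sketch (ARPAbet-style symbols), not an excerpt from a real lexicon file.

```python
# Toy lexicon: each word maps to a list of pronunciations, where a
# pronunciation is a sequence of phone symbols. Real lexicons contain
# many thousands of entries.
lexicon = {
    "cat":  [["K", "AE", "T"]],
    "read": [["R", "IY", "D"], ["R", "EH", "D"]],  # heteronym: two pronunciations
}

def pronunciations(word):
    """Return all known pronunciations of a word (empty list if unknown)."""
    return lexicon.get(word.lower(), [])

print(pronunciations("read"))
```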
Partial Recognition Result
The current best hypothesis, provided in real time while recognition is still running. A partial result is typically delivered every 200 ms through a callback method. Unlike the final result, a partial result may not contain all of the relevant information (for example, word timings and confidences).
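The callback pattern for partial results can be sketched as below. This is a generic illustration of the pattern, not the SDK's actual API; the class and method names are assumptions, and in a real recognizer the callback fires from the decoding loop roughly every 200 ms.

```python
class Recognizer:
    """Minimal sketch of a recognizer that delivers partial results
    through a user-registered callback."""

    def __init__(self):
        self._on_partial = None

    def set_partial_result_callback(self, callback):
        self._on_partial = callback

    def _decode_step(self, hypothesis):
        # Called internally as audio is processed; the hypothesis
        # grows as more speech arrives.
        if self._on_partial:
            self._on_partial(hypothesis)

partials = []
recognizer = Recognizer()
recognizer.set_partial_result_callback(partials.append)
for hyp in ["open", "open the", "open the door"]:
    recognizer._decode_step(hyp)
print(partials[-1])
```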
Rescoring
A second pass through the recognition lattice using a more complex language model. Rescoring is often used in large-vocabulary recognition tasks: a smaller, less complex language model is used for real-time recognition, and its output lattice is rescored with a more complex language model after the recognizer has stopped listening.
Trigger Phrase
A fixed phrase (for example, ‘hey computer’) that can be used to support always-on listening. Unlike a wake-word, which typically uses a small model trained on a specific phrase, trigger phrases use a general speech recognition engine. The trigger-phrase approach allows an arbitrary phrase of reasonable length to be used to start a listening session without any additional training. However, the approach does not lend itself to long always-on listening sessions due to battery use and device heating.
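Conceptually, trigger-phrase detection amounts to checking each partial hypothesis from the general recognizer for the chosen phrase. A minimal sketch, with an assumed helper name and matching only at the end of the hypothesis so earlier speech does not fire the trigger:

```python
TRIGGER_PHRASE = "hey computer"

def contains_trigger(partial_text, trigger=TRIGGER_PHRASE):
    """Return True when the partial hypothesis ends with the trigger phrase.
    Matching the end of the hypothesis means the trigger fires as soon as
    the phrase is spoken, regardless of what preceded it."""
    return partial_text.lower().strip().endswith(trigger)

print(contains_trigger("um hey computer"))      # phrase at the end
print(contains_trigger("hey computer please"))  # phrase not at the end
```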
Voice Activity Detection (VAD)
Voice Activity Detection is a processing step that classifies an input audio signal (including background noise) as speech or silence. A speech recognition system can be used for this purpose, since it models silence and background noise in its acoustic model. Sometimes a simpler model, trained only to discriminate between speech and non-speech, is used instead.
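The simplest stand-in for a trained VAD model is an energy gate: a frame is labeled speech when its energy exceeds a threshold. The sketch below illustrates the idea only; the threshold value is an assumption, and real VADs use trained models plus smoothing across frames.

```python
import math

def frame_energy_db(samples):
    """Log energy of one audio frame (samples as floats in [-1, 1])."""
    energy = sum(s * s for s in samples) / len(samples)
    return 10 * math.log10(energy + 1e-10)  # epsilon avoids log10(0)

def is_speech(samples, threshold_db=-40.0):
    """Classify a frame as speech when its energy exceeds a fixed threshold.
    A crude energy gate, not a trained speech/non-speech model."""
    return frame_energy_db(samples) > threshold_db

silence = [0.001] * 160               # 10 ms of near-silence at 16 kHz
speech = [0.3, -0.2, 0.25, -0.3] * 40
print(is_speech(silence), is_speech(speech))
```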
Wake Word
A fixed phrase (‘hey computer’) that can be used to provide always-on listening with minimal impact on battery life. When the user says the phrase, it wakes up the recognizer, which then starts listening with a full decoding graph. Wake-word processing usually has much lower processing requirements than general speech recognition, which helps reduce battery use in always-on listening scenarios.