Acoustic Model
A statistical model that represents the relationship between an audio signal and phonemes: perceptually distinct units of sound in a specific language. Acoustic models are usually language-specific and trained on thousands of hours of human-transcribed speech recordings. The KeenASR SDK works with both Deep Neural Network acoustic models and Gaussian Mixture Models. Keen Research provides acoustic models in the ASR Bundle, as part of the KeenASR SDK.
ASR Bundle
A KeenASR-specific asset – a directory with several files – that defines the acoustic model, lexicon, and various configuration parameters. It can be bundled with your app, or downloaded after the app has been installed.
Decoding Graph
A weighted finite state transducer that combines the lexicon, language model, and acoustic model into a structure used internally by the recognizer.
Final Recognition Result
The best hypothesis, provided after the recognizer has stopped listening. In addition to the text ‘transcription’, it typically contains word start/end times and confidences. It may also contain similar information about the individual phonemes that comprise each word in the result. In addition to the best result, the final recognition result may also contain a list of N-best results.
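As an illustration, a final result might carry information shaped roughly like this. The field names below are hypothetical, chosen only to illustrate the structure; they are not the actual KeenASR SDK API.

```python
# Hypothetical shape of a final recognition result (illustrative field names,
# not the actual KeenASR SDK API): text, per-word timing/confidence, N-best list.
final_result = {
    "text": "open the door",
    "words": [
        {"word": "open", "start": 0.12, "end": 0.45, "confidence": 0.97},
        {"word": "the",  "start": 0.45, "end": 0.58, "confidence": 0.92},
        {"word": "door", "start": 0.58, "end": 1.02, "confidence": 0.95},
    ],
    # Alternative hypotheses, best first.
    "nbest": ["open the door", "open that door"],
}
```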
Language Model
A statistical model that models words (text) in context; n-gram (e.g. bigram, trigram, etc.) probabilities are typically used to capture context. Language models can also be based on Recurrent Neural Networks. Language models are trained on large amounts of domain-specific text data.
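As a toy illustration of the n-gram idea, a bigram probability can be estimated directly from counts. Real language models are trained on far larger corpora and use smoothing to handle unseen word pairs.

```python
from collections import Counter

# Toy corpus; real language models are trained on large amounts
# of domain-specific text.
corpus = "the cat sat on the mat the cat ran".split()

# Count single words and adjacent word pairs.
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1, w2):
    """P(w2 | w1) estimated by maximum likelihood (no smoothing)."""
    return bigrams[(w1, w2)] / unigrams[w1]

# Two of the three occurrences of "the" are followed by "cat".
print(bigram_prob("the", "cat"))  # 0.666...
```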
Lexicon
A lookup table that defines the mapping between words (used in the language model) and their phonetic pronunciations. For some languages this mapping is completely deterministic, i.e. given a word, the phonetic transcription can be obtained by following a set of rules.
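A minimal sketch of such a lookup table. The words and ARPAbet-style phone symbols below are illustrative, not entries from an actual KeenASR lexicon.

```python
# Illustrative lexicon: each word maps to one or more pronunciations,
# each a list of ARPAbet-style phone symbols (made-up entries).
lexicon = {
    "speech": [["S", "P", "IY", "CH"]],
    # A word may have multiple pronunciations (e.g. present vs. past tense).
    "read":   [["R", "IY", "D"], ["R", "EH", "D"]],
}

def pronunciations(word):
    """Look up phonetic pronunciations; empty list for out-of-vocabulary words."""
    return lexicon.get(word.lower(), [])
```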
Partial Recognition Result
The current best hypothesis, provided in real-time while recognition is still running. It is typically delivered via a callback method every 100-200ms. Unlike the final result, a partial result may not contain all the relevant information (e.g. word timings and confidences).
Rescoring
A second pass through the recognition lattice that uses a more complex language model. It is typically used for large-vocabulary tasks: a smaller/less complex language model is used for real-time recognition, and its output lattice is rescored with a more complex language model after the recognizer has stopped listening.
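The re-ranking idea behind rescoring can be sketched on an N-best list instead of a full lattice. All hypotheses and scores below are made up for illustration; a real second pass rescores the lattice itself.

```python
def rescore(hypotheses, lm_score, lm_weight=1.0):
    """Re-rank hypotheses by adding a second-pass LM log-score
    to each first-pass log-score; best (highest) score first."""
    return sorted(
        ((text, score + lm_weight * lm_score(text)) for text, score in hypotheses),
        key=lambda pair: pair[1],
        reverse=True,
    )

# Toy first-pass output: (hypothesis, log-score); values are illustrative.
first_pass = [("wreck a nice beach", -10.0), ("recognize speech", -10.5)]

# Stand-in for a larger LM that assigns higher log-probability to fluent text.
toy_lm = {"recognize speech": -2.0, "wreck a nice beach": -6.0}

best = rescore(first_pass, lambda t: toy_lm.get(t, -20.0))[0][0]
print(best)  # the more complex LM flips the ranking: "recognize speech"
```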
Trigger Phrase
A fixed phrase (e.g. ‘hey computer’) that can be used to provide always-on listening. Unlike a wake-word, which typically uses a small model trained on the specific phrase, a trigger phrase uses the general speech recognition engine; this approach allows any arbitrary phrase (of reasonable length) to be used as a trigger phrase without any additional training, but it doesn’t lend itself to long always-on listening sessions due to battery use and device heating.
Voice Activity Detection (VAD)
Voice Activity Detection is a processing step that classifies the input audio signal as speech or silence (including background noise). A speech recognition system can be used for this purpose, since it models silence/background noise via the acoustic model. Sometimes a simpler model, trained only to classify between speech and non-speech, is used instead.
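A very simple VAD can be sketched as frame-energy thresholding. This is a crude stand-in for the model-based approaches described above; the frame size and threshold values are arbitrary.

```python
import math

def frame_energy(samples):
    """Mean squared amplitude of one audio frame."""
    return sum(s * s for s in samples) / len(samples)

def simple_vad(frames, threshold):
    """Classify each frame as speech (True) or silence/noise (False)
    by comparing its energy against a fixed threshold."""
    return [frame_energy(f) > threshold for f in frames]

# Synthetic 10ms frames at 16kHz: near-silence vs. a 440Hz tone.
silence = [0.001] * 160
speech = [0.5 * math.sin(2 * math.pi * 440 * i / 16000) for i in range(160)]
print(simple_vad([silence, speech], threshold=0.01))  # [False, True]
```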
Wake-Word
A fixed phrase (e.g. ‘hey computer’) that can be used to provide always-on listening with minimal impact on battery life. When the user says the phrase, it wakes up the recognizer, which starts listening with a full decoding graph. Wake-word processing usually has much lower processing requirements than general speech recognition, which helps reduce battery use in always-on listening scenarios.