Getting Started

The following describes the high-level steps required to integrate the KeenASR SDK into your app. The Quick Start pages for each platform provide more detailed instructions with code samples.

Overview

To add speech recognition functionality to your app using the KeenASR SDK, perform the following steps:

Initialize the SDK with the ASR Bundle.
Set up recognizer parameters (e.g. voice activity detection, or VAD, settings).
Set up callback methods for partial result and final response, for handling of audio interrupts, and for the app going to the background or foreground.
Create one or more decoding graphs using API calls, unless they are already bundled with your app.
Prepare the recognizer to listen by using a specific decoding graph.
Call startListening when you are ready to listen, for example, when the user taps on the “Start Listening” button.
Act upon callbacks and consume the result. You will typically let the recognizer stop listening automatically, using one of the VAD rules.
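
The order of these steps can be sketched in a platform-neutral way. The snippet below is only an illustration in Python: `SketchRecognizer` and all of its methods are hypothetical stand-ins written for this example, not the KeenASR API, and `simulate_speech` replaces real audio capture. See the platform Quick Start pages for the actual calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SketchRecognizer:
    """Hypothetical stand-in for a recognizer; NOT the KeenASR API."""
    bundle: str
    vad_end_silence_sec: float = 1.2
    on_partial: Optional[Callable[[str], None]] = None
    on_final: Optional[Callable[[str], None]] = None
    graphs: dict = field(default_factory=dict)
    active_graph: Optional[str] = None
    listening: bool = False

    def create_graph(self, name: str, phrases: List[str]) -> None:
        self.graphs[name] = list(phrases)

    def prepare(self, name: str) -> None:
        if name not in self.graphs:
            raise KeyError(f"unknown decoding graph: {name}")
        self.active_graph = name

    def start_listening(self) -> None:
        if self.active_graph is None:
            raise RuntimeError("prepare() must be called before start_listening()")
        self.listening = True

    def simulate_speech(self, words: List[str]) -> None:
        """Pretend the user spoke `words`: fire partial callbacks, then a final one."""
        if not self.listening:
            return
        heard: List[str] = []
        for word in words:
            heard.append(word)
            if self.on_partial:
                self.on_partial(" ".join(heard))
        self.listening = False  # stands in for a VAD rule ending the session
        if self.on_final:
            self.on_final(" ".join(heard))


def run_session(words: List[str]):
    events = []
    rec = SketchRecognizer(bundle="asr-bundle")                # initialize with a bundle
    rec.vad_end_silence_sec = 1.2                              # set recognizer parameters
    rec.on_partial = lambda t: events.append(("partial", t))   # set up callbacks
    rec.on_final = lambda t: events.append(("final", t))
    rec.create_graph("commands", ["yes", "no", "go back"])     # create a decoding graph
    rec.prepare("commands")                                    # prepare to listen
    rec.start_listening()                                      # start listening
    rec.simulate_speech(words)                                 # act upon callbacks
    return events, rec.listening
```

Calling `run_session(["go", "back"])` collects two partial events followed by one final event, and the recognizer is no longer listening afterwards.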

Creating Language Models and Decoding Graphs

The language model defines which words the recognizer listens for and is able to recognize.

For small to medium-sized vocabulary tasks (fewer than several thousand words), you can use the SDK to create the decoding graph directly on the device. The SDK provides methods to create decoding graphs from a list of phrases or words. Once you have created a decoding graph, you can refer to it by name, check whether a decoding graph with a specific name exists, and check the date when it was created. Your app can include logic that creates the decoding graph only once, or only when the inputs used to create the graph have changed.
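
One way to implement the "only when the inputs have changed" logic is to store a fingerprint of the phrase list next to the graph name. The following is a minimal Python sketch of that idea, not SDK code: `ensure_graph`, the `store` dictionary, and the `build` callback are hypothetical names, and `build` stands in for whatever SDK call actually creates the graph. It also assumes that phrase order and duplicate phrases do not affect the resulting graph.

```python
import hashlib
import json
from typing import Callable, Dict, Iterable, List


def phrases_fingerprint(phrases: Iterable[str]) -> str:
    """Stable hash of a phrase list, ignoring order, duplicates and blank entries."""
    normalized = sorted({p.strip() for p in phrases if p.strip()})
    payload = json.dumps(normalized, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def ensure_graph(
    name: str,
    phrases: Iterable[str],
    store: Dict[str, str],
    build: Callable[[str, List[str]], None],
) -> bool:
    """Call build(name, phrases) only if the phrases changed since the last build.

    `store` maps graph names to fingerprints; in a real app it would be persisted
    (for example in app preferences). Returns True if the graph was (re)built.
    """
    phrase_list = list(phrases)
    fingerprint = phrases_fingerprint(phrase_list)
    if store.get(name) == fingerprint:
        return False  # inputs unchanged, reuse the existing graph
    build(name, phrase_list)
    store[name] = fingerprint
    return True
```

With this approach the relatively expensive graph creation runs on first launch and after a content update, while ordinary launches skip it.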

For large-vocabulary, dictation-type tasks, decoding graphs need to be created ahead of time and either bundled with the app or downloaded after the app has been installed. In this case, a rescoring approach is used to minimize the memory footprint and computational requirements.

The KeenASR SDK API allows developers to create decoding graphs optimized for a specific task (e.g. oral reading). The SDK can optionally compute goodness of pronunciation scores for some languages.

Audio Handling

The KeenASR SDK handles the audio stack setup and capture of audio from the microphone on most platforms we support. Because audio needs to be captured by the microphone in the same way as the audio used to train the acoustic model, the KeenASR SDK needs control of the audio stack.

If you are using other audio frameworks (system or 3rd party) to play audio or to handle video, you will typically need to initialize the KeenASR SDK after these modules have been initialized.

To allow for interoperability with other SDKs that use audio (for example, playing audio from the app, Unity integration, audio/video communication, etc.) the SDK provides callbacks that allow developers to unwind other audio modules when the app goes to the background and set them up when the app becomes active.
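
The unwind and set-up pattern described above can be sketched as a small coordinator that the app's background and foreground callbacks delegate to. This Python snippet is illustrative only: `AudioCoordinator` and its methods are hypothetical, and tearing modules down in reverse registration order is an assumption made for the example, not a documented SDK requirement.

```python
from typing import Callable, List, Tuple


class AudioCoordinator:
    """Hypothetical helper that tracks other audio modules used by the app."""

    def __init__(self) -> None:
        self._modules: List[Tuple[str, Callable[[], None], Callable[[], None]]] = []

    def register(self, name: str, setup: Callable[[], None], teardown: Callable[[], None]) -> None:
        self._modules.append((name, setup, teardown))

    def on_background(self) -> None:
        # Unwind in reverse order so modules started last are stopped first.
        for _name, _setup, teardown in reversed(self._modules):
            teardown()

    def on_foreground(self) -> None:
        # Set modules up again in their original registration order.
        for _name, setup, _teardown in self._modules:
            setup()


log: List[str] = []
coordinator = AudioCoordinator()
coordinator.register("player", lambda: log.append("player up"), lambda: log.append("player down"))
coordinator.register("video", lambda: log.append("video up"), lambda: log.append("video down"))
coordinator.on_background()
coordinator.on_foreground()
```

After the two calls, `log` lists the teardown of `video` and `player` followed by their set-up in registration order.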

Start and Stop Listening

Once the KeenASR SDK is initialized and prepared to listen with a specific decoding graph, you can call the recognizer’s startListening method to start capturing and processing live audio. While you can explicitly call the stopListening method, we recommend you let the recognizer automatically stop listening based on the VAD rules; these rules have a few configurable parameters that allow you to define how soon after the user stops talking recognition should stop.

You can tune the following parameters to control the VAD end-silence timeout: KIOSVadTimeoutEndSilenceForGoodMatch and KIOSVadTimeoutEndSilenceForAnyMatch in the iOS SDK, or KASRVadTimeoutEndSilenceForGoodMatch and KASRVadTimeoutEndSilenceForAnyMatch in the Android SDK. For practical purposes, you can treat these two settings the same way, that is, always set them to the same value. These parameters define how much silence the recognizer needs to observe, after it has detected some speech, before it automatically stops listening. They are typically set in the range of 1 to 1.5 seconds.

There are a number of tradeoffs when setting these parameters, including the following:

If the value is too short, it may trigger the VAD rule and cause the recognizer to stop listening whenever the user pauses briefly (to think about what to say next, for example).
If the value is too long, the system may appear unresponsive. From the user’s perspective, they have finished speaking and expect a response from the system; if they have to wait too long, they will perceive the system as slow. Furthermore, the user may start speaking again, so that the VAD rule is never triggered.
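
These tradeoffs can be made concrete with a simplified model of an end-silence rule. The Python sketch below is not how the SDK implements VAD; it assumes speech has already been segmented into (start, end) intervals in seconds, and `auto_stop_time` is a hypothetical helper that reports when a rule with a given timeout would fire.

```python
from typing import List, Optional, Tuple


def auto_stop_time(
    speech_segments: List[Tuple[float, float]],
    end_silence_sec: float,
) -> Optional[float]:
    """Return the time at which listening would stop, or None if there was no speech.

    The rule fires once `end_silence_sec` of silence follows some speech.
    """
    last_end: Optional[float] = None
    for start, end in sorted(speech_segments):
        if last_end is not None and start - last_end >= end_silence_sec:
            # The pause before this segment was long enough to end the session.
            return last_end + end_silence_sec
        last_end = end if last_end is None else max(last_end, end)
    return None if last_end is None else last_end + end_silence_sec


# The user speaks for 2 s, pauses for 0.8 s to think, then speaks until 5 s.
utterance = [(0.0, 2.0), (2.8, 5.0)]
too_short = auto_stop_time(utterance, 0.6)   # fires during the thinking pause
typical = auto_stop_time(utterance, 1.2)     # fires 1.2 s after the user finishes
too_long = auto_stop_time(utterance, 3.0)    # user waits 3 s for a response
```

With the 0.6 second timeout the second half of the utterance is lost, while the 3 second timeout keeps the user waiting well after they have finished.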

For certain use cases, you may be able to perform a simple semantic analysis of the recognition result in the partial result callback. When the result appears to be a complete response, you can shorten the VAD end-silence timeouts (for example, from 1.2 to 0.7 seconds), which makes your app and voice user interface feel more responsive.

Because of the way the different layers of audio buffering and processing are performed within the SDK, setting these parameters to values shorter than 0.5 seconds will not be effective.
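
The partial-result adjustment and the 0.5 second floor can be combined in one small decision function. This is a Python sketch under stated assumptions: `choose_end_silence` is a hypothetical helper, "complete response" is approximated by an exact match against a set of known phrases, and the returned value would still have to be applied to the platform's VAD timeout parameters by your own code.

```python
from typing import Set

MIN_EFFECTIVE_TIMEOUT_SEC = 0.5  # shorter values are not effective, per the text above


def choose_end_silence(
    partial_text: str,
    complete_phrases: Set[str],
    default_sec: float = 1.2,
    short_sec: float = 0.7,
) -> float:
    """Pick an end-silence timeout based on the latest partial result.

    `complete_phrases` must contain lowercase phrases with single spaces.
    """
    normalized = " ".join(partial_text.lower().split())
    timeout = short_sec if normalized in complete_phrases else default_sec
    # Never go below the smallest value that has any effect.
    return max(timeout, MIN_EFFECTIVE_TIMEOUT_SEC)
```

For example, with `complete_phrases = {"turn on the lights"}`, the partial result "turn on" keeps the 1.2 second default, while "Turn on the lights" switches to 0.7 seconds.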

The SDK supports always-on listening using a trigger phrase. This approach is currently not viable for listening sessions longer than 1-2 hours. When using decoding graphs built with trigger phrase support, the SDK will listen continuously until it recognizes the trigger phrase (which will be reported via callback); after the trigger phrase has been recognized the SDK will start reporting partial results via callbacks and use VAD rules to stop listening.
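
On the app side, the trigger-phrase flow amounts to a three-state session. The following Python sketch models only that app-side bookkeeping; `TriggerSession` and its method names are hypothetical and do not correspond to SDK callbacks by name.

```python
from enum import Enum, auto
from typing import List


class Phase(Enum):
    WAITING_FOR_TRIGGER = auto()  # listening continuously, nothing reported yet
    RECOGNIZING = auto()          # trigger heard, partial results are flowing
    STOPPED = auto()              # a VAD rule ended the session


class TriggerSession:
    """Hypothetical app-side state tracking for a trigger-phrase session."""

    def __init__(self) -> None:
        self.phase = Phase.WAITING_FOR_TRIGGER
        self.partials: List[str] = []

    def on_trigger_phrase(self) -> None:
        if self.phase is Phase.WAITING_FOR_TRIGGER:
            self.phase = Phase.RECOGNIZING

    def on_partial(self, text: str) -> None:
        # Partial results are only expected after the trigger phrase.
        if self.phase is Phase.RECOGNIZING:
            self.partials.append(text)

    def on_vad_stop(self) -> None:
        if self.phase is Phase.RECOGNIZING:
            self.phase = Phase.STOPPED
```

Keeping the phase explicit makes it straightforward to update the user interface, for example by showing a listening indicator only while the session is in `RECOGNIZING`.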

Recognition Result and Callbacks

Partial and final results provide, via callbacks, the most likely sequence of words spoken by the user. Partial results provide only the recognized words, whereas the final result also includes timing and, optionally, goodness of pronunciation scores for each phoneme in a recognized word.
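
To show how an app might consume such a final result, the sketch below models it with plain Python data classes. The class and field names (`WordResult`, `PhoneScore`, `start_sec`, and so on) are hypothetical and are not the SDK's result types; the score scale is likewise assumed, with lower values meaning weaker pronunciation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PhoneScore:
    phone: str
    score: Optional[float] = None  # None when pronunciation scoring is not enabled


@dataclass
class WordResult:
    text: str
    start_sec: float
    duration_sec: float
    phones: List[PhoneScore] = field(default_factory=list)

    @property
    def end_sec(self) -> float:
        return self.start_sec + self.duration_sec

    def weakest_phone(self) -> Optional[PhoneScore]:
        """Return the lowest-scoring phoneme, or None if no scores are available."""
        scored = [p for p in self.phones if p.score is not None]
        return min(scored, key=lambda p: p.score) if scored else None


def transcript(words: List[WordResult]) -> str:
    """Join the recognized words into a single string."""
    return " ".join(w.text for w in words)
```

An oral reading app could, for instance, use the word timings to highlight text as it is read and use `weakest_phone` to decide which word to give feedback on.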

Tip: The recognition result will contain only words that were used to set up the language model and decoding graph. If users say anything else (so-called out-of-vocabulary, or OOV, words), those words will be mapped either to the acoustically closest sequence of words in the language model or to the filler model (<SPOKEN_NOISE>). The probability of <SPOKEN_NOISE> can be controlled via the API, which lets you make the recognizer more or less strict with accents and mispronunciations.
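
Since <SPOKEN_NOISE> may show up in results, app logic usually needs to separate it from real words. The Python sketch below assumes the result text is a space-separated string in which the filler appears as the literal token `<SPOKEN_NOISE>`; `split_result` is a hypothetical helper, not an SDK function.

```python
from typing import List, Tuple

SPOKEN_NOISE = "<SPOKEN_NOISE>"


def split_result(text: str) -> Tuple[List[str], int]:
    """Return the in-vocabulary words and the number of filler tokens in `text`."""
    tokens = text.split()
    words = [t for t in tokens if t != SPOKEN_NOISE]
    return words, len(tokens) - len(words)


words, noise_count = split_result("<SPOKEN_NOISE> go back <SPOKEN_NOISE>")
```

A result with many filler tokens and few words is a hint that the user said something outside the expected vocabulary, which the app can use to reprompt rather than act.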