The following describes the high-level steps required to integrate the KeenASR SDK into your app. The Quick Start pages for each platform provide more detailed instructions with code samples.
To add speech recognition functionality to your app using the KeenASR SDK, perform the following steps:
- Initialize the SDK with the ASR Bundle.
- Set up parameters (VAD, saving of audio recordings and ASR metadata on the device).
- Create one or more decoding graphs using API calls, unless they are already bundled with your app.
- Prepare the recognizer to listen by using a specific decoding graph.
- Set up callback methods for partial and final results, for handling of audio interrupts, and for the app going to the background or foreground.
- Call startListening when you are ready to listen, for example, when the user taps on the “Start Listening” button.
- Act upon partial and final results in their corresponding callbacks. You will typically let the recognizer stop listening automatically, using one of the VAD rules.
Creating Language Models and Decoding Graphs
The Language Model defines what words your app will be listening to and will be able to recognize.
For small to medium-sized vocabulary tasks (fewer than several thousand words), you can use the SDK to create the language model and the decoding graph directly on the device. The SDK provides methods that create decoding graphs from a list of phrases or words. These methods first create a language model and then a decoding graph. Once you have created a decoding graph, you can refer to it by name, check whether a decoding graph with a specific name exists, and query the date when it was created. Your app can include logic that creates the decoding graph only once, or only when the input list has changed.
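The "only when the input list has changed" logic can be sketched as follows. This is a minimal illustration that assumes your app stores a fingerprint of the phrase list alongside the decoding graph name; `phrase_fingerprint` and `needs_rebuild` are hypothetical helper names and are not part of the KeenASR SDK. The actual graph creation call is left to the platform-specific SDK methods.

```python
import hashlib
import json


def phrase_fingerprint(phrases):
    """Return a stable SHA-256 fingerprint of the phrase list.

    Whitespace inside each phrase is normalized; order and duplicates are
    preserved, so any change to the list changes the fingerprint.
    """
    normalized = [" ".join(p.split()) for p in phrases]
    payload = json.dumps(normalized, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def needs_rebuild(phrases, stored_fingerprint):
    """True if no graph was built yet or the phrase list has changed."""
    return stored_fingerprint != phrase_fingerprint(phrases)


# Hypothetical usage: persist the fingerprint after a successful build.
phrases = ["turn on the lights", "turn off the lights"]
stored = None  # nothing built yet
if needs_rebuild(phrases, stored):
    # ... call the SDK method that creates the decoding graph here ...
    stored = phrase_fingerprint(phrases)
```

Storing the fingerprint in the app's persistent settings would let the check survive app restarts; the decoding graph's creation date, which the SDK exposes, could serve as an additional sanity check.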
For large vocabulary, dictation-type tasks, decoding graphs cannot be created on the device; they need to be created ahead of time and bundled with the app or downloaded after the app has been installed. In this case a rescoring approach is used to minimize the memory footprint and computational requirements.
Keen Research is planning to provide higher-level methods for creating domain-specific language models and decoding graphs (for example, for search, pronunciation scoring, oral reading, etc.). In addition, we are working on methods for creating decoding graphs for large vocabulary dictation. While all of these tasks can be handled with the current SDK, the goal is to abstract some of the logic that currently needs to be implemented by the developer, in order to simplify the development of decoding graphs for these use cases. Contact us if these or similar domains would be beneficial for your apps.
The KeenASR SDK handles the audio stack setup and capture of audio from the microphone on most platforms we support. Because audio needs to be captured by the microphone in the same way as the audio used to train the acoustic model, the KeenASR SDK needs control of the audio stack.
If you are using other audio frameworks (system or 3rd party) to play audio or to handle video, you will typically need to initialize the KeenASR SDK after these modules have been initialized.
To allow for interoperability with other SDKs that use audio (for example, playing audio from the app, Unity integration, audio/video communication, etc.) the SDK provides callbacks that allow developers to unwind other audio modules when the app goes to the background and set them up when the app becomes active.
Start and Stop Listening
Once the KeenASR SDK is initialized and prepared to listen with a specific decoding graph, you can call the recognizer’s startListening method to start capturing and processing live audio. While you can explicitly call the stopListening method, we recommend you let the recognizer automatically stop listening based on the VAD rules; these rules have a few configurable parameters that allow you to define how soon after the user stops talking recognition should stop.
You can tune the following parameters to control the VAD end-silence timeout behavior:
- KIOSVadTimeoutEndSilenceForAnyMatch in the iOS SDK, or
- KASRVadTimeoutEndSilenceForAnyMatch in the Android SDK.
For practical purposes you can treat these two settings the same way, that is, always set them to the same value. These parameters define how much silence the recognizer needs to observe, after it has detected some speech, before it automatically stops listening. They are typically set in the range of 1 to 1.5 seconds.
There are a number of tradeoffs when setting these parameters, including the following:
- If the value is too short, it may trigger the VAD rule and cause the recognizer to stop listening whenever the user pauses briefly (to think about what to say next, for example).
- If the value is too long, the system may appear to be unresponsive. From the user’s perspective they finished speaking and are expecting a response from the system; if they have to wait for too long, the perception will be that the system is slow and unresponsive. Furthermore, the user may start speaking again, thus never allowing for the VAD rule to be triggered.
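The tradeoffs above can be illustrated with a small simulation. This is a conceptual model only, not the SDK's VAD implementation: it assumes audio has already been labeled as speech or silence in fixed-size frames, and `stop_time_ms` is a hypothetical function name.

```python
def stop_time_ms(frames, frame_ms, end_silence_ms):
    """Return the time (ms) at which listening would stop, or None.

    frames: sequence of booleans, True for a speech frame.
    Silence is only counted after some speech has been seen, and the
    silence counter resets whenever speech resumes.
    """
    seen_speech = False
    silence_ms = 0
    for i, is_speech in enumerate(frames):
        if is_speech:
            seen_speech = True
            silence_ms = 0
        elif seen_speech:
            silence_ms += frame_ms
            if silence_ms >= end_silence_ms:
                return (i + 1) * frame_ms
    return None


# 100 ms frames: 1.0 s speech, 0.8 s thinking pause, 1.0 s speech, 2.0 s silence.
frames = [True] * 10 + [False] * 8 + [True] * 10 + [False] * 20

too_short = stop_time_ms(frames, 100, 500)    # stops during the thinking pause
reasonable = stop_time_ms(frames, 100, 1200)  # stops 1.2 s after the user finishes
too_long = stop_time_ms(frames, 100, 2500)    # never triggers within this audio
```

With the 500 ms setting the simulated recognizer stops at 1500 ms, in the middle of the user's pause; with 1200 ms it stops at 4000 ms, after the full utterance; with 2500 ms the rule is not triggered at all in this example.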
For certain types of use cases, you may be able to perform a simple semantic analysis of the recognition result in the partial result callback. When the result appears to be a complete response, you could shorten the VAD end-silence timeout (for example, from 1.2 to 0.7 seconds); this will make your app and voice user interface appear more responsive.
Because of the manner in which the different layers of audio buffering and processing are performed within the SDK, setting these parameters to values shorter than 0.5 seconds will not be effective.
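A minimal sketch of this idea is shown below. It assumes the "semantic analysis" is a simple lookup against a set of responses your app considers complete; `choose_end_silence_timeout` and `complete_responses` are hypothetical names, and the returned value would still need to be applied to the recognizer through the platform-specific VAD parameter. The 1.2 and 0.7 second values come from the example above, and the 0.5 second floor reflects the lower bound just described.

```python
# Values below this are not effective, per the SDK documentation text.
MIN_EFFECTIVE_TIMEOUT_SEC = 0.5


def choose_end_silence_timeout(partial_text, complete_responses,
                               default_sec=1.2, shortened_sec=0.7):
    """Pick an end-silence timeout based on the current partial result.

    If the partial result already looks like a complete response, use the
    shorter timeout; otherwise keep the default. Never go below the floor.
    """
    text = " ".join(partial_text.lower().split())
    timeout = shortened_sec if text in complete_responses else default_sec
    return max(timeout, MIN_EFFECTIVE_TIMEOUT_SEC)


# Hypothetical usage inside a partial result callback:
complete_responses = {"yes", "no", "forty two"}
timeout = choose_end_silence_timeout("Forty  two", complete_responses)
```

A set lookup is only one possible heuristic; apps with richer grammars might instead check whether the partial result parses as a full command.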
The SDK supports always-on listening using a trigger phrase. This approach is currently not viable for listening sessions longer than 1-2 hours. When using decoding graphs built with trigger phrase support, the SDK will listen continuously until it recognizes the trigger phrase (which will be reported via callback); after the trigger phrase has been recognized the SDK will start reporting partial results via callbacks and use VAD rules to stop listening.
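The behavior described above can be modeled as a two-state process. The sketch below is a conceptual illustration of what the app observes through callbacks, not the SDK's implementation; `events_after_trigger` and the event labels are hypothetical.

```python
def events_after_trigger(hypotheses, trigger_phrase):
    """Model which events an app would see with trigger phrase support.

    Hypotheses before the trigger phrase are ignored. The trigger phrase
    produces one 'trigger' event, and every later hypothesis is reported
    as a 'partial' event.
    """
    events = []
    triggered = False
    trigger = trigger_phrase.lower()
    for hyp in hypotheses:
        if not triggered:
            if trigger in hyp.lower():
                triggered = True
                events.append(("trigger", trigger_phrase))
            continue
        events.append(("partial", hyp))
    return events


# Hypothetical stream of recognizer hypotheses over time.
stream = ["background chatter", "hey computer", "turn on", "turn on the lights"]
events = events_after_trigger(stream, "hey computer")
```

In this model the listening session would then end through the usual VAD rules, after which the app could prepare the recognizer to wait for the trigger phrase again.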
Partial and final results report, via callbacks, the most likely sequence of words spoken by the user. Partial results provide only word transcripts, whereas final results also include timing and confidence scores for each recognized word.
Confidence scores are in the range of 0 to 1, with 1 being high confidence that the word that was recognized was indeed said.
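One common way to act on confidence scores is to separate words the app can trust from words it should confirm with the user. The sketch below assumes a simple `Word` record with text, timing, and confidence fields; this type, the `split_by_confidence` helper, and the 0.8 threshold are illustrative assumptions and do not mirror the SDK's result classes. A suitable threshold depends on your task and should be tuned on real recordings.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start_sec: float
    duration_sec: float
    confidence: float  # 0 to 1, higher means more likely to be correct


def split_by_confidence(words, threshold=0.8):
    """Split recognized words into (accepted, uncertain) by confidence."""
    accepted = [w for w in words if w.confidence >= threshold]
    uncertain = [w for w in words if w.confidence < threshold]
    return accepted, uncertain


# Hypothetical final result for the utterance "call doctor smith".
final_words = [
    Word("call", 0.10, 0.30, 0.97),
    Word("doctor", 0.45, 0.40, 0.91),
    Word("smith", 0.90, 0.45, 0.55),
]
accepted, uncertain = split_by_confidence(final_words)
```

Here the app might proceed with "call doctor" and ask the user to confirm the low-confidence name.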