This page describes the generic steps required to integrate the KeenASR SDK into your app. The Quick Start pages for each platform provide more detailed descriptions and code samples.


To add speech recognition functionality via KeenASR SDK you will need to:

  • initialize the SDK with the ASR Bundle
  • set up parameters (VAD, saving of audio recordings and ASR metadata on the device)
  • create one or more decoding graphs using API calls, unless they are already bundled with your app
  • prepare the recognizer to listen using a specific decoding graph
  • set up callback methods for partial and final results, and for handling audio interrupts and the app going to the background/foreground
  • call start listening when you are ready to listen (for example, when the user taps a “Start Listening” button)
  • act upon partial and final results in their corresponding callbacks. You will typically let the recognizer stop listening automatically, using one of the VAD rules.

Creating Language Models and Decoding Graphs

A language model defines which words your app will listen for and be able to recognize.

For small to medium vocabulary tasks (fewer than several thousand words), you can use the SDK to create the language model and the decoding graph directly on the device. The SDK provides methods that create decoding graphs from a list of phrases or words; these methods first create a language model and then a decoding graph. Once you have created a decoding graph, you can refer to it by name, check whether a decoding graph with a specific name exists, and check the date when it was created. Your app can include logic that creates the decoding graph only once, or only when the input list has changed.
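The "create only when the input list has changed" logic can be sketched independently of the SDK. The example below is a minimal sketch in Python, not KeenASR API: the function names are hypothetical, and it assumes that the order of phrases and duplicate entries do not affect the resulting graph. The app would store the fingerprint next to the graph name and call the SDK's graph creation method only when `needs_rebuild` returns true.

```python
import hashlib
import json


def phrase_list_fingerprint(phrases):
    """Return a stable hash of the phrase list.

    Assumption: order and duplicates do not matter, so the list is
    stripped, de-duplicated and sorted before hashing.
    """
    normalized = sorted({p.strip() for p in phrases if p.strip()})
    payload = json.dumps(normalized, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def needs_rebuild(phrases, stored_fingerprint, graph_exists):
    """Decide whether the decoding graph has to be (re)created.

    graph_exists would come from the SDK's "does a graph with this name
    exist" check; stored_fingerprint is whatever the app persisted the
    last time it built the graph.
    """
    if not graph_exists:
        return True
    return phrase_list_fingerprint(phrases) != stored_fingerprint
```

Hashing the input list avoids having to keep a copy of the previous list around; only the short fingerprint needs to be persisted.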

For large vocabulary, dictation-type tasks, decoding graphs cannot be created on the device; they need to be created ahead of time and either bundled with the app or downloaded after the app has been installed. In this case a rescoring approach is used to minimize the memory footprint and computational requirements.

We are planning to provide higher-level methods that create domain-specific language models and decoding graphs (for example, search, pronunciation scoring, oral reading, etc.), as well as decoding graphs for large vocabulary dictation. While these tasks can be handled with the current SDK, the future enhancements will abstract some of the logic that the developer currently needs to implement to handle these use cases. Contact us if these or similar domains would be beneficial for your apps.

Audio Handling

KeenASR SDK handles audio stack setup and capture of audio from the microphone on most platforms we support. Because audio needs to be captured from the microphone in the same way as the audio used to train the acoustic model, KeenASR SDK needs control of the audio stack.

If you are using other audio frameworks (OS or 3rd party) to play audio or to deal with video, you will typically need to initialize KeenASR SDK after these modules have been initialized.

To allow interoperability with other SDKs that use audio (e.g. playing audio from the app, Unity integration, audio/video communication, etc.) the SDK provides callbacks that allow developers to unwind other audio modules when the app goes to the background and set them up when the app becomes active.

Start and Stop Listening

Once KeenASR SDK is initialized and prepared to listen with a specific decoding graph, you can call recognizer’s startListening method to start capturing and processing live audio. While you can explicitly call stopListening method, we recommend you let the recognizer automatically stop listening based on the VAD rules; these rules have a few configurable parameters that allow you to define how soon after the user stops talking recognition should stop.

The parameters you can tune to control VAD behaviour are KIOSVadTimeoutEndSilenceForGoodMatch and KIOSVadTimeoutEndSilenceForAnyMatch on iOS, or KASRVadTimeoutEndSilenceForGoodMatch and KASRVadTimeoutEndSilenceForAnyMatch on Android. For practical purposes you can treat these two settings the same way, i.e. always set them to the same value. These parameters define the threshold the recognizer uses to determine how much silence it needs to see, after it has seen some speech, in order to automatically stop listening. They will typically be set in the range of 1 to 1.5 seconds. The tradeoffs when setting these parameters are:

  • if the value is too short, the VAD rule may trigger and force the recognizer to stop listening when the user makes a short pause (thinking about what to say next, for example)
  • if the value is too long, the system may appear to be unresponsive. From the user’s perspective they have finished speaking and are expecting a response from the system; if they have to wait too long, the perception will be that the system is slow and non-responsive. Furthermore, the user may start speaking again, thus never allowing the VAD rule to trigger.
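The tradeoff above can be illustrated with a small simulation of an end-silence rule. This is a simplified model written for illustration, not the SDK's actual VAD implementation: it assumes audio arrives as fixed-length frames already labelled as speech or non-speech, and the function name and parameters are hypothetical.

```python
def stop_time_ms(frames, frame_ms, end_silence_ms):
    """Return the time (in ms) at which listening would stop, or None.

    frames: iterable of booleans, True when a frame contains speech.
    Silence is only counted after some speech has been seen, and any
    new speech resets the silence counter.
    """
    seen_speech = False
    silence_ms = 0
    for i, is_speech in enumerate(frames):
        if is_speech:
            seen_speech = True
            silence_ms = 0
        elif seen_speech:
            silence_ms += frame_ms
            if silence_ms >= end_silence_ms:
                return (i + 1) * frame_ms
    return None


# 0.5 s of speech, a 0.8 s thinking pause, 0.5 s of speech, then silence
# (100 ms frames).
utterance = [True] * 5 + [False] * 8 + [True] * 5 + [False] * 15

cut_off_early = stop_time_ms(utterance, 100, 700)    # stops inside the pause
waits_for_end = stop_time_ms(utterance, 100, 1200)   # stops after the utterance
```

With a 0.7 second timeout the rule fires in the middle of the user's pause; with 1.2 seconds it waits until the user has finished, at the cost of a longer delay before the final result.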

For certain types of use cases, you may be able to perform a simple semantic analysis of the recognition result in the partial result callback. When the result appears to be a complete response, you could shorten these VAD parameters (for example, from 1.2 to 0.7 seconds); this will make your app and its voice user interface appear more responsive.

Due to the way different layers of audio buffering and processing are implemented within the SDK, setting these parameters to values shorter than 0.5 seconds will not be effective.
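A minimal sketch of this idea in Python follows. It is not SDK code: the set of "complete" responses, the constant names and both functions are hypothetical, and the completeness check is deliberately naive. The value returned would be applied to both end-silence parameters through the SDK's own setter, which is not shown here.

```python
MIN_EFFECTIVE_TIMEOUT_SEC = 0.5   # shorter values are not effective (see text)
DEFAULT_TIMEOUT_SEC = 1.2         # example value from the text
SHORT_TIMEOUT_SEC = 0.7           # example value from the text

# Hypothetical: responses this particular app treats as complete.
COMPLETE_RESPONSES = {"yes", "no", "stop", "go back"}


def looks_complete(partial_text):
    """Very simple 'semantic analysis': exact match against known responses."""
    return partial_text.strip().lower() in COMPLETE_RESPONSES


def choose_end_silence_timeout(partial_text):
    """Pick the end-silence timeout to use after seeing this partial result."""
    if looks_complete(partial_text):
        timeout = SHORT_TIMEOUT_SEC
    else:
        timeout = DEFAULT_TIMEOUT_SEC
    # Never go below the smallest value that has any effect.
    return max(timeout, MIN_EFFECTIVE_TIMEOUT_SEC)
```

Called from the partial result callback, `choose_end_silence_timeout("Yes")` would select the shorter timeout, while an unfinished phrase such as "I would like" keeps the default.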

The SDK supports always-on listening via a trigger phrase. This approach is currently not viable for listening sessions longer than 1 to 2 hours. When using decoding graphs built with trigger phrase support, the SDK listens continuously until it recognizes the trigger phrase (which is reported via a callback); after the trigger phrase has been recognized, the SDK starts reporting partial results via callbacks and uses the VAD rules to stop listening.
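On the app side, this flow amounts to a two-state session. The Python sketch below shows one way an app might track it, for example to update the UI when the trigger phrase is heard. The class and method names are hypothetical stand-ins for the platform-specific callbacks, and returning to the waiting state after a final result is an assumption of this sketch rather than documented SDK behaviour.

```python
from enum import Enum


class State(Enum):
    WAITING_FOR_TRIGGER = 1
    LISTENING_FOR_COMMAND = 2


class TriggerSession:
    """Tracks where the app is in the trigger phrase flow."""

    def __init__(self):
        self.state = State.WAITING_FOR_TRIGGER
        self.latest_partial = ""
        self.commands = []

    def on_trigger_phrase(self):
        # Trigger phrase recognized: partial results are expected from now on.
        self.state = State.LISTENING_FOR_COMMAND
        self.latest_partial = ""

    def on_partial_result(self, text):
        if self.state is State.LISTENING_FOR_COMMAND:
            self.latest_partial = text

    def on_final_result(self, text):
        # A VAD rule stopped listening; keep the command and wait again.
        if self.state is State.LISTENING_FOR_COMMAND:
            self.commands.append(text)
        self.state = State.WAITING_FOR_TRIGGER
        self.latest_partial = ""
```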

Recognition Result

Partial and final results, delivered via callbacks, provide the most likely sequence of words that the user said. A partial result provides only the text of the words, whereas a final result also provides word timings and a confidence score for each word.

Confidence scores are in the range of 0 to 1, with 1 being high confidence that the word that was recognized was indeed said.
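As an illustration of how per-word confidence scores might be used, the Python sketch below flags words whose confidence falls below a threshold, for example to ask the user to repeat them. The `Word` structure, its field names and the 0.6 threshold are assumptions made for this example; the actual result objects are described in the platform API references.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical stand-in for one word of a final result."""
    text: str
    start_sec: float
    duration_sec: float
    confidence: float  # 0 to 1, higher means more confident


def low_confidence_words(words, threshold=0.6):
    """Return the words whose confidence is below the threshold."""
    return [w for w in words if w.confidence < threshold]


final_result = [
    Word("turn", 0.00, 0.30, 0.95),
    Word("left", 0.30, 0.40, 0.42),
]
uncertain = low_confidence_words(final_result)
```

A suitable threshold depends on the task and is best chosen by looking at real recognition results from your app.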