Decoding graphs for large-vocabulary tasks (more than ~10k words) need to be built ahead of time and either bundled with the app or downloaded after the app has been installed. It is currently not feasible to create large decoding graphs on the device due to memory and CPU constraints. Contact us for more details and help with creating these decoding graphs.
In the future we plan to provide tools that will allow you to create such decoding graphs yourself, as well as a set of prebuilt decoding graphs (generic and domain-specific) for large-vocabulary tasks. We can also help create language models and decoding graphs for domain-specific tasks (e.g. medical, construction, industrial, etc.) using existing data from your enterprise systems.
If this is still a concern, instead of including the ASR Bundle in the app, you can download it when the app is run for the first time.
If you have a specific use case that doesn't match current capabilities, drop us a line.
It's also useful to have a visual and audio indication of when the app is listening; this is good general practice, not just during debugging.
If you are doing user testing, you can connect your app to Dashboard and automatically send audio data and recognition results to the cloud for further analysis.
For assessing speech recognition performance, the best approach is to collect a small amount of test data. You can then run the SDK against the recorded files to assess, in a controlled manner, how well the recognition works.
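One common way to turn such a test set into a single number is word error rate (WER): the word-level edit distance between the reference transcript and the recognized text, divided by the number of reference words. The sketch below is a minimal illustration, not part of the SDK; the function names and the idea of collecting (reference, hypothesis) text pairs from your test runs are assumptions made for the example.

```python
def word_edit_distance(ref_words, hyp_words):
    """Minimum number of word substitutions, deletions and insertions
    needed to turn ref_words into hyp_words (Levenshtein distance)."""
    # prev[j] holds the distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1]


def corpus_wer(pairs):
    """WER over (reference, hypothesis) transcript pairs.

    Errors and reference words are summed over the whole test set, so
    long utterances count proportionally more than short ones.
    """
    total_errors = 0
    total_words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.lower().split()
        hyp_words = hypothesis.lower().split()
        total_errors += word_edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("test set contains no reference words")
    return total_errors / total_words


if __name__ == "__main__":
    # Hypothetical results: what was said vs. what was recognized.
    results = [
        ("play the godfather", "play the godfather"),
        ("play star wars", "play star was"),
    ]
    print(f"WER: {corpus_wer(results):.1%}")
```

Running the same test files after each model or configuration change gives a like-for-like comparison, which is the point of keeping the evaluation controlled.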
If you can run this process in a background thread, in a controlled manner, and on more recent devices (iPhone 7 or iPhone X, and equivalent Android devices), you may be able to create larger decoding graphs on the device.
Note that decoding graphs created on the device do not use the rescoring approach in decoding. Also note that this answer relates to the process of creating decoding graphs, which is typically done once.
This answer provides a baseline, but ultimately you will want to test this on the devices that will be used in a production setting.
CPU utilization will also depend on the size of the model: there is a fixed CPU cost related to the size of the acoustic model, because for each frame of audio, audio features are pushed through the deep neural network. The other factor in CPU utilization is graph search, which will depend on the size of the graph (that is, the size of the language model) as well as various configuration parameters.
For a medium-vocabulary task (e.g. searching a movie library with ~7000 titles), the memory footprint with some of our in-house models will be around 100 MB, and CPU utilization will be around 40% of a single core on an iPhone 6s.
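A figure such as 40% of a single core can be read as a real-time factor: during live recognition, roughly 0.4 seconds of CPU time are spent per second of audio. The sketch below shows one way to estimate that ratio in an offline test harness. It is an illustration under stated assumptions: `recognize` stands for whatever hypothetical function runs recognition on a file in your setup, and on an actual device you would rely on the platform's profiling tools instead.

```python
import time


def measure_real_time_factor(recognize, audio_path, audio_seconds):
    """CPU seconds spent by this process per second of audio.

    recognize     -- hypothetical callable that runs recognition on a file
    audio_path    -- path passed through to recognize
    audio_seconds -- duration of the audio file in seconds
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    # process_time() counts CPU time used by the process, not wall-clock
    # time, so time spent sleeping or waiting is excluded.
    start = time.process_time()
    recognize(audio_path)
    cpu_seconds = time.process_time() - start
    return cpu_seconds / audio_seconds


if __name__ == "__main__":
    # Stand-in workload so the example runs without a recognizer.
    def fake_recognize(path):
        sum(i * i for i in range(200_000))

    rtf = measure_real_time_factor(fake_recognize, "utterance.wav", 5.0)
    print(f"real-time factor: {rtf:.3f}")
```

A value below 1.0 means recognition keeps up with incoming audio on a single core; the closer it gets to 1.0, the less headroom remains for the rest of the app.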
We are working on a number of optimizations for mobile devices that will significantly reduce memory footprint and CPU utilization.