KeenASR SDK enables hands-free voice interfaces for frontline workers in warehousing, logistics, manufacturing, field service, and other enterprise environments. On-device processing ensures low latency, offline operation, and data privacy, all critical requirements for enterprise deployments.
Common use cases include:
- Voice picking: hands-free item picking in warehouses with voice confirmation of locations, quantities, and item codes
- Checklists and procedures: voice-driven step-by-step workflows for inspections, maintenance, and safety procedures
- Data entry: capturing readings, codes, and notes without touching a device
- Equipment control: voice commands for machinery, vehicles, or wearable devices
In these environments, workers often have their hands occupied (carrying items, operating equipment, wearing gloves) and need their eyes focused on the task. Voice becomes the most practical input method.
Voice Picking
Voice picking workflows guide workers through item retrieval tasks using voice prompts and confirmations. The worker hears instructions (via text-to-speech or pre-recorded audio) and responds verbally to confirm actions.
Setup
Build a decoding graph containing all valid responses for the current step:
voicePickingCommands = [
"one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten", "hundred",
"pause", "cancel", "next", "short one", "short two", "short three", "damaged", "skip", "print"]
recognizer.createDecodingGraph(voicePickingCommands, "commands")
You can set up multiple decoding graphs and switch among them based on the app context.
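Once a quantity response comes back, your app needs to turn the recognized word into a number. A minimal sketch of such a mapping, assuming the quantity vocabulary from the graph above (the helper name `parseQuantity` is illustrative, and recognized text is assumed to be uppercase as in later examples):

```javascript
// Map recognized quantity words (from the "commands" graph above) to integers.
// Returns null for words that are not quantities (e.g. "damaged", "skip").
const QUANTITY_WORDS = {
  one: 1, two: 2, three: 3, four: 4, five: 5,
  six: 6, seven: 7, eight: 8, nine: 9, ten: 10, hundred: 100,
};

function parseQuantity(text) {
  // Recognizers often return uppercase text, so normalize first
  const word = text.trim().toLowerCase();
  return Object.prototype.hasOwnProperty.call(QUANTITY_WORDS, word)
    ? QUANTITY_WORDS[word]
    : null;
}
```

A `null` return signals a non-quantity response, which your app can then route to exception handling or reprompting.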
For short responses (one to two words), you can use shorter endSilence VAD values for faster interaction:
// onFinalResponse will trigger after this much silence after the command
recognizer.setVADParameters(VADParameter.endSilence, 0.6)
Build your vocabulary to handle common exceptions workers may need to report:
| Exception | Example Phrases |
|---|---|
| Short pick | “short one”, “short two”, “missing” |
| Damaged item | “damaged”, “broken” |
| Wrong location | “wrong slot”, “empty” |
| Skip/defer | “skip”, “later” |
Including these phrases in your decoding graph allows the system to recognize exceptions and route them appropriately rather than forcing workers to repeat valid quantities.
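Routing can then be a simple lookup from recognized phrase to exception category. A minimal sketch, assuming the phrases from the table above (the `routeException` name and category labels are illustrative):

```javascript
// Map recognized exception phrases (normalized to lowercase) to categories
const EXCEPTION_ROUTES = {
  "short one": "SHORT_PICK", "short two": "SHORT_PICK", "missing": "SHORT_PICK",
  "damaged": "DAMAGED", "broken": "DAMAGED",
  "wrong slot": "WRONG_LOCATION", "empty": "WRONG_LOCATION",
  "skip": "SKIP", "later": "SKIP",
};

function routeException(text) {
  const phrase = text.trim().toLowerCase();
  // Returns null for non-exception responses (e.g. plain quantities)
  return EXCEPTION_ROUTES[phrase] ?? null;
}
```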
Then, you can prepare the recognizer with the specific decoding graph:
recognizer.prepareForListening("commands", /* gop */ false)
The recognizer will automatically stop listening based on VAD (Voice Activity Detection) thresholds and deliver the final response via the onFinalResponse callback. See VAD Thresholds for details on configuring when the recognizer stops.
For voice picking, you will most likely want to set up your app for continuous listening. To achieve that, analyze the result in onFinalResponse: if the result is empty or not actionable, restart listening via recognizer.startListening(). If the result has actionable content, perform the relevant action and then start listening again. See Continuous Listening for more details.
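One simple way to decide whether a result is actionable is to check that it is non-empty and that every recognized word belongs to the expected command set. A minimal sketch (the command list here is illustrative, and multi-word commands would need phrase-level matching instead):

```javascript
// Illustrative command set for the current app context
const COMMANDS = new Set(["next", "pause", "cancel", "skip", "damaged"]);

function isActionable(text) {
  const words = text
    .trim()
    .toLowerCase()
    .split(/\s+/)
    .filter((w) => w.length > 0);
  // Empty results (e.g. from noise) are not actionable
  return words.length > 0 && words.every((w) => COMMANDS.has(w));
}
```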
If using always-on listening, we recommend turning on VAD gating as well; this will minimize the use of battery when the user is not speaking:
recognizer.setVADGating(true)
Partial Results
Partial results are provided via a callback that is called periodically (approximately every 200ms) as the recognized text changes. They can be used to show streaming speech recognition results (e.g. in a debug mode) and for dynamic VAD threshold adjustments.
Partial results provide low-latency, real-time feedback but do not include word timings, confidences, etc.
Dynamic VAD adjustment: Analyze partial results to determine if the expected command has been recognized. If so, you can reduce the endSilence threshold to stop listening sooner rather than waiting for the full timeout. See VAD Thresholds for more details.
onPartialResult(result, recognizer) {
if (isActionable(result)) {
// shorten endpointing threshold since we have what appears to
// be an actionable command
recognizer.setVADParameters(VADParameter.endSilence, 0.4)
}
if (debug) {
show(result.text)
}
}
Final Response Callback
The final response callback is called when the recognizer stops listening because one of the VAD thresholds has been reached. The final response provided through the callback contains the result (with words, timings, word confidences, etc.) as well as some other metadata and artifacts (audio, JSON).
In this callback you analyze the recognized text and either act upon it or restart listening. Note that if you changed VAD parameters dynamically (e.g. in the onPartialResult callback), you should reset them to the original values before starting to listen again.
onFinalResponse(response, recognizer) {
if (isActionable(response.result)) {
// perform action
} else {
// set VAD parameters if necessary and then start listening
recognizer.startListening()
}
// if Dashboard is set up, you could queue up the response for upload
response.queueForUpload()
// or save audio/json locally and do something with it
response.saveAudioFile(dirname)
response.saveJsonFile(dirname)
}
Checklists and Procedures
Voice-driven checklists allow workers to progress through inspection or maintenance steps hands-free. The system reads each step, and the worker responds with completion status.
Setup
For checklist workflows, build a decoding graph with the expected responses:
// Common checklist responses
checklistResponses = [
"yes", "no",
"done", "complete", "completed",
"ok", "okay",
"pass", "fail",
"skip", "not applicable"
]
recognizer.createDecodingGraph(checklistResponses, "checklist")
Contextual Graphs for Multi-Step Procedures
When procedures have step-specific expected responses, you can use a separate decoding graph for each step or set up a contextual decoding graph:
// Each step has different valid responses
procedureSteps = [
[ // Step 1: Safety check
"clear", "not clear", "blocked"
],
[ // Step 2: Pressure reading
"normal", "low", "high", "critical"
],
[ // Step 3: Visual inspection
"pass", "fail", "needs attention"
]
]
recognizer.createContextualDecodingGraph(procedureSteps, "safety-procedure")
currentStep = 0
recognizer.prepareForListeningWithContextualDecodingGraph("safety-procedure", currentStep)
// ...
onFinalResponse(response) {
recordStepResult(currentStep, response.result.text)
currentStep = currentStep + 1
if (currentStep < procedureSteps.length) {
playNextStepPrompt(currentStep)
recognizer.prepareForListeningWithContextualDecodingGraph("safety-procedure", currentStep)
recognizer.startListening()
} else {
completeProcedure()
}
}
Data Entry
Voice data entry captures codes, serial numbers, readings, and notes without manual input. This is useful for field inspections, inventory counts, and equipment logging.
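For codes and numeric readings, a common pattern is to have the worker speak digit by digit and assemble the digits in the app. A minimal sketch, assuming a digit-only decoding graph plus a "point" token for decimal readings (the `assembleDigits` helper name is illustrative):

```javascript
// Map spoken digit words to characters; "point" marks a decimal separator
const DIGITS = {
  zero: "0", one: "1", two: "2", three: "3", four: "4",
  five: "5", six: "6", seven: "7", eight: "8", nine: "9",
  point: ".",
};

function assembleDigits(text) {
  const parts = text.trim().toLowerCase().split(/\s+/);
  let out = "";
  for (const p of parts) {
    if (!(p in DIGITS)) return null; // unexpected token: reject and reprompt
    out += DIGITS[p];
  }
  return out.length > 0 ? out : null;
}
```

Returning `null` on any unexpected token lets the app reprompt rather than record a partially assembled code.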
Technical Implementation
Continuous Listening
Continuous listening keeps the recognizer active across multiple interactions without requiring a push-to-talk trigger. This is essential for truly hands-free operation in frontline environments.
How Continuous Listening Works
To achieve continuous listening, the recognizer needs to be restarted after each final response:
- Recognizer is listening
- Worker speaks a command
- VAD detects end of speech
- onFinalResponse is called with the result
- If the result is actionable, you perform the action and then restart listening
- If the result is not actionable, you restart listening immediately
This creates a seamless loop where the worker can issue commands continuously.
See Continuous Listening for detailed documentation.
VAD Gating
When continuous listening is enabled, VAD gating becomes important. Without VAD gating, the recognizer runs continuously, consuming CPU and battery even during silence. With VAD gating enabled, a simple low-power voice detection step runs until speech is detected, at which point the recognizer activates:
// Enable VAD gating to reduce battery consumption
recognizer.setVADGating(true)
The VAD errs on the side of detecting non-speech as speech to avoid missing actual commands, so you may still receive final results with empty text (e.g., from loud background noise). Your app should handle empty results by simply restarting listening.
Using Trigger Phrase
For environments where continuous listening may pick up unintended speech (nearby workers, radios), consider implementing a trigger phrase:
// Require the trigger phrase before any command
triggerPhrase = "hey computer"
recognizer.createDecodingGraphWithTriggerPhrase(commands, triggerPhrase, "commands")
This ensures the recognizer only acts on commands explicitly directed at the system. The trigger phrase (“hey computer” in this example) should be acoustically distinct and unlikely to occur in normal conversation. We recommend a multi-syllable word or phrase to avoid spurious matches with other speech.
Tradeoff: Adding a trigger phrase increases reliability (fewer false triggers from ambient speech) but also increases interaction time since the worker must say extra words with each command.
Processing the result: When using a trigger phrase, strip it from the recognized text before processing the command. For example, if the result is “HEY COMPUTER NEXT”, extract “NEXT” as the actual command.
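Stripping the trigger phrase can be done with a simple prefix check. A minimal sketch, assuming recognized text is uppercase as in the example above (the `stripTriggerPhrase` helper name is illustrative):

```javascript
const TRIGGER_PHRASE = "HEY COMPUTER";

// Returns the command portion of the recognized text,
// or null if the text did not start with the trigger phrase
function stripTriggerPhrase(text) {
  const t = text.trim();
  if (!t.startsWith(TRIGGER_PHRASE)) return null; // no trigger: ignore
  return t.slice(TRIGGER_PHRASE.length).trim();
}
```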
Choosing an Interaction Mode
| Mode | Hands-free | False triggers | Interaction speed | Best for |
|---|---|---|---|---|
| Continuous listening | Yes | Higher (mitigated with VAD gating) | Fastest | Headset users in controlled environments |
| Trigger phrase | Yes | Low | Moderate (extra words per command, can be tedious for frequent interactions) | Multi-worker or noisy environments |
| Tap-to-talk | No | None | Varies (requires button press) | Intermittent voice use, shared devices |
For most voice picking applications, continuous listening with VAD gating provides the best balance of speed and reliability. Add a trigger phrase if false triggers from ambient speech become problematic.
Minimizing Cross-talk Interference
In environments where multiple workers are nearby and interacting with their devices simultaneously, cross-talk can cause one worker’s speech to be picked up by another worker’s device.
Use directional microphones: Boom microphones with good directional characteristics help reject speech from other workers. Position the microphone close to the mouth to maximize the signal-to-noise ratio for the intended user.
Filter by audio level: Use the AudioQualityResult metrics from the response to identify and ignore speech that likely came from another user. Speech from nearby workers will typically have lower RMS levels than speech from the device’s own user:
onFinalResponse(response, recognizer) {
// Ignore speech that appears to come from farther away
if (response.audioQualityResult.peakSpeechRMS < -35) {
// Likely cross-talk from another worker, ignore
recognizer.startListening()
return
}
processCommand(response.result)
}
The appropriate threshold depends on your microphone and environment. Test with actual hardware to determine the right value.
Playing Audio While Listening
When using headsets, you can play audio prompts (TTS or pre-recorded) while the recognizer is listening. The close-talk microphone isolates the worker’s voice from the headset audio, so you typically do not need to stop the recognizer before playing prompts.
If using device speakers instead of headsets, stop the recognizer before playing audio to avoid the prompt being recognized as speech.
VAD Thresholds
Voice Activity Detection thresholds control when the recognizer stops listening. Optimal values depend on the type of response expected.
Short Commands
For short commands (single words, check digits, yes/no responses), use shorter thresholds for faster interaction:
recognizer.setVADParameters(VADParameter.endSilence, 0.8) // Stop quickly after speech ends
// leave defaults for noSpeech and maxDuration
Longer Responses
For data entry or descriptions where the worker may pause to read or think:
recognizer.setVADParameters(VADParameter.endSilence, 2.0) // Allow pauses
Dynamic Adjustment
You can adjust VAD parameters dynamically based on workflow state. A common pattern is to start with a longer endSilence and reduce it in the onPartialResult callback once the expected command is recognized:
// Start with longer timeout
recognizer.setVADParameters(VADParameter.endSilence, 2.0)
onPartialResult(result, recognizer) {
if (isActionable(result.text)) {
// Reduce timeout to stop listening sooner
recognizer.setVADParameters(VADParameter.endSilence, 0.4)
}
}
Decoding Graphs
Build decoding graphs from the expected vocabulary for each interaction context. Constraining recognition to valid responses dramatically improves accuracy compared to open-vocabulary recognition.
Command-Based Graphs
For command-driven interfaces, create graphs containing all valid commands:
commands = [
"next", "back", "repeat",
"confirm", "cancel",
"help", "pause"
]
recognizer.createDecodingGraph(commands, "navigation")
Domain-Specific Vocabulary
Enterprise environments often have domain-specific terminology. Add custom pronunciations for terms not in the standard lexicon:
// Add internal product codes and jargon
altWordPronunciations = [
WordPronunciation("SKU", "S K Y UW"),
WordPronunciation("RFID", "AA R EH F AY D IY"),
WordPronunciation("WMS", "D AH B AH L Y UW EH M EH S")
]
phrases = ["scan SKU", "check RFID", "update WMS"]
recognizer.createDecodingGraph(phrases, "warehouse-commands", altWordPronunciations)
Decoding graphs are persisted to disk and are specific to the ASR Bundle. You can check for their existence and skip the creation step if the graph is already available:
// Check if graph already exists before creating
if (!DecodingGraph.graphWithNameExists("zone-a-checkdigits")) {
recognizer.createDecodingGraph(zoneACheckDigits, "zone-a-checkdigits")
}
Dealing With Accented Speech
When dealing with heavily accented speech, you may get <SPOKEN_NOISE> tokens instead of the expected command because the pronunciation differs significantly from the standard. There are two ways to handle this:
- Reduce the spoken_noise probability when creating decoding graphs (default is 0.5; you can set it to 0.2 or 0.3 to make the recognizer more lenient)
- Specify alternative pronunciations for commands that are commonly affected by accents
// Add alternative pronunciations for accented speech
// e.g., "three" pronounced as "tree" or "free" by some speakers
altWordPronunciations = [
WordPronunciation("three", "T R IY"), // "tree"
WordPronunciation("three", "F R IY") // "free"
]
commands = ["one", "two", "three", "four", "five"]
recognizer.createDecodingGraph(commands, "numbers", altWordPronunciations)
Alphanumeric Codes
For alphanumeric codes like part numbers or serial numbers, you may need to support letter spelling:
// NATO phonetic alphabet for reliable letter recognition
phoneticAlphabet = [
"alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf",
"hotel", "india", "juliet", "kilo", "lima", "mike", "november",
"oscar", "papa", "quebec", "romeo", "sierra", "tango", "uniform",
"victor", "whiskey", "x-ray", "yankee", "zulu",
"zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"
]
recognizer.createDecodingGraph(phoneticAlphabet, "alphanumeric")
Using the phonetic alphabet rather than raw letters (A, B, C) significantly improves recognition accuracy, especially in noisy environments where many letters sound similar.
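Decoding a phonetic-alphabet utterance back into the alphanumeric code is then a straightforward lookup. A minimal sketch (the `decodePhonetic` helper name is illustrative):

```javascript
// Map NATO phonetic words and digit words to code characters
const NATO = {
  alpha: "A", bravo: "B", charlie: "C", delta: "D", echo: "E",
  foxtrot: "F", golf: "G", hotel: "H", india: "I", juliet: "J",
  kilo: "K", lima: "L", mike: "M", november: "N", oscar: "O",
  papa: "P", quebec: "Q", romeo: "R", sierra: "S", tango: "T",
  uniform: "U", victor: "V", whiskey: "W", "x-ray": "X",
  yankee: "Y", zulu: "Z",
  zero: "0", one: "1", two: "2", three: "3", four: "4",
  five: "5", six: "6", seven: "7", eight: "8", nine: "9",
};

function decodePhonetic(text) {
  const parts = text.trim().toLowerCase().split(/\s+/);
  let code = "";
  for (const p of parts) {
    if (!(p in NATO)) return null; // out-of-vocabulary token: reject
    code += NATO[p];
  }
  return code.length > 0 ? code : null;
}
```

For example, “bravo seven alpha” decodes to “B7A”.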
Handling Unrecognized Input
When the recognizer produces a result that doesn’t match expected values (e.g., worker said something not in the grammar), prompt for retry:
onFinalResponse(response) {
if (!isValidResponse(response.result.text)) {
retryCount = retryCount + 1
if (retryCount < 3) {
playPrompt("I didn't understand. Please say the check digit.")
recognizer.startListening()
} else {
// Fall back to manual entry
showManualEntryUI()
}
}
}
Noise and False Triggers
In noisy environments, the recognizer may trigger on non-speech sounds even with VAD gating. Implement checks for out-of-vocabulary speech and low confidence:
onFinalResponse(response) {
// Check if this looks like actual speech
if (response.result.words.length == 0) {
// No words recognized - likely transient noise triggered VAD gating
recognizer.startListening()
return
}
// Check for out-of-vocabulary words (indicated by SPOKEN_NOISE token)
for (word in response.result.words) {
if (word.text == "<SPOKEN_NOISE>") {
// User said something not in the decoding graph
playPrompt("I didn't get that. Please repeat")
// assumes above call is synchronous, i.e. we start listening only
// after audio is done playing
recognizer.startListening()
return
}
}
// Check minimum confidence across all words
minConfidence = 1.0
for (word in response.result.words) {
if (word.confidence < minConfidence) {
minConfidence = word.confidence
}
}
if (minConfidence < 0.5) {
playPrompt("I didn't get that. Please repeat")
// assumes above call is synchronous, i.e. we start listening only
// after audio is done playing
recognizer.startListening()
return
}
processValidResponse(response)
}
Best Practices
- Use headsets: A headset with close-talk microphone dramatically improves signal-to-noise ratio compared to device microphones
- Constrain vocabulary: Smaller, well-defined vocabularies are more robust to noise than large grammars
- Design acoustically distinct commands: Avoid commands that sound similar (e.g., “fifteen” vs “fifty”)
- Enable VAD gating: Reduces processing overhead and preserves battery by only running recognition when speech is detected
Summary
When building frontline worker apps with KeenASR SDK:
- Constrain recognition using decoding graphs built from valid responses for each step
- Use continuous listening with VAD gating for hands-free operation
- Adjust VAD thresholds based on expected response length
- Add custom pronunciations for domain-specific terminology
- Design for noise: use headsets, distinct vocabulary, and confidence checks
