KeenASR SDK enables hands-free voice interfaces for frontline workers in warehousing, logistics, manufacturing, field service, and other enterprise environments. On-device processing ensures low latency, offline operation, and data privacy, all critical requirements for enterprise deployments.
Common use cases include:
- Voice picking: hands-free item picking in warehouses with voice confirmation of locations, quantities, and item codes
- Checklists and procedures: voice-driven step-by-step workflows for inspections, maintenance, and safety procedures
- Data entry: capturing readings, codes, and notes without touching a device
- Equipment control: voice commands for machinery, vehicles, or wearable devices
In these environments, workers often have their hands occupied (carrying items, operating equipment, wearing gloves) and need their eyes focused on the task. Voice becomes the most practical input method.
Voice Picking
Voice picking workflows guide workers through item retrieval tasks using voice prompts and confirmations. The worker hears instructions (via text-to-speech or pre-recorded audio) and responds verbally to confirm actions.
Setup
Build a decoding graph containing all valid responses for the current step:
voicePickingCommands = [
"one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten", "hundred",
"pause", "cancel", "next", "short one", "short two", "short three", "damaged", "skip", "print"]
recognizer.createDecodingGraph(voicePickingCommands, "commands")
You can set up multiple decoding graphs and switch among them based on the app context.
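Once a quantity response comes back, your app needs to turn the recognized word into a number. A minimal sketch of such a mapping, assuming the quantity vocabulary from the graph above (the helper name `parseQuantity` is illustrative, and recognized text is assumed to be uppercase as in later examples):

```javascript
// Map recognized quantity words (from the "commands" graph above) to integers.
// Returns null for words that are not quantities (e.g. "damaged", "skip").
const QUANTITY_WORDS = {
  one: 1, two: 2, three: 3, four: 4, five: 5,
  six: 6, seven: 7, eight: 8, nine: 9, ten: 10, hundred: 100,
};

function parseQuantity(text) {
  // Recognizers often return uppercase text, so normalize first
  const word = text.trim().toLowerCase();
  return Object.prototype.hasOwnProperty.call(QUANTITY_WORDS, word)
    ? QUANTITY_WORDS[word]
    : null;
}
```

A `null` return signals a non-quantity response, which your app can then route to exception handling or reprompting.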
For short responses (one to two words), you can use shorter endSilence VAD values for faster interaction:
// onFinalResponse will trigger after this much silence after the command
recognizer.setVADParameters(VADParameter.endSilence, 0.6)
Build your vocabulary to handle common exceptions workers may need to report:
| Exception | Example Phrases |
|---|---|
| Short pick | “short one”, “short two”, “missing” |
| Damaged item | “damaged”, “broken” |
| Wrong location | “wrong slot”, “empty” |
| Skip/defer | “skip”, “later” |
Including these phrases in your decoding graph allows the system to recognize exceptions and route them appropriately rather than forcing workers to repeat valid quantities.
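Routing can then be a simple lookup from recognized phrase to exception category. A minimal sketch, assuming the phrases from the table above (the `routeException` name and category labels are illustrative):

```javascript
// Map recognized exception phrases (normalized to lowercase) to categories
const EXCEPTION_ROUTES = {
  "short one": "SHORT_PICK", "short two": "SHORT_PICK", "missing": "SHORT_PICK",
  "damaged": "DAMAGED", "broken": "DAMAGED",
  "wrong slot": "WRONG_LOCATION", "empty": "WRONG_LOCATION",
  "skip": "SKIP", "later": "SKIP",
};

function routeException(text) {
  const phrase = text.trim().toLowerCase();
  // Returns null for non-exception responses (e.g. plain quantities)
  return EXCEPTION_ROUTES[phrase] ?? null;
}
```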
Then, you can prepare the recognizer with the specific decoding graph:
recognizer.prepareForListening("commands", /* gop */ false)
The recognizer will automatically stop listening based on VAD (Voice Activity Detection) thresholds and deliver the final response via the onFinalResponse callback. See VAD Thresholds for details on configuring when the recognizer stops.
For voice picking, you will most likely want to set up your app for continuous listening. To achieve that, analyze the result in onFinalResponse: if the result is empty or not actionable, restart listening via recognizer.startListening(). If the result has actionable content, perform the relevant action and then start listening again. See Continuous Listening for more details.
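One simple way to decide whether a result is actionable is to check that it is non-empty and that every recognized word belongs to the expected command set. A minimal sketch (the command list here is illustrative, and multi-word commands would need phrase-level matching instead):

```javascript
// Illustrative command set for the current app context
const COMMANDS = new Set(["next", "pause", "cancel", "skip", "damaged"]);

function isActionable(text) {
  const words = text
    .trim()
    .toLowerCase()
    .split(/\s+/)
    .filter((w) => w.length > 0);
  // Empty results (e.g. from noise) are not actionable
  return words.length > 0 && words.every((w) => COMMANDS.has(w));
}
```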
If using always-on listening, we recommend turning on VAD gating as well; this will minimize the use of battery when the user is not speaking:
recognizer.setVADGating(true)
Partial Results
Partial results are provided via a callback that is called periodically (approximately every 200ms) as the recognized text changes. They can be used to show streaming speech recognition results (e.g. in a debug mode) and for dynamic VAD threshold adjustments.
Partial results provide low-latency, real-time feedback but do not include word timings, confidences, etc.
Dynamic VAD adjustment: Analyze partial results to determine if the expected command has been recognized. If so, you can reduce the endSilence threshold to stop listening sooner rather than waiting for the full timeout. See VAD Thresholds for more details.
onPartialResult(result, recognizer) {
if (isActionable(result)) {
// shorten endpointing threshold since we have what appears to
// be an actionable command
recognizer.setVADParameters(VADParameter.endSilence, 0.4)
}
if (debug) {
show(result.text)
}
}
Final Response Callback
The final response callback is called when the recognizer stops listening because one of the VAD thresholds has been reached. The final response provided through the callback contains the result (with words, timings, word confidences, etc.) as well as some other metadata and artifacts (audio, JSON).
In this callback you analyze the recognized text and either act upon it or restart listening. Note that if you changed VAD parameters dynamically (e.g. in the onPartialResult callback), you should reset them to the original values before starting to listen again.
onFinalResponse(response, recognizer) {
if (isActionable(response.result)) {
// perform action
} else {
// set VAD parameters if necessary and then start listening
recognizer.startListening()
}
// if Dashboard is set up, you could queue up the response for upload
response.queueForUpload()
// or save audio/json locally and do something with it
response.saveAudioFile(dirname)
response.saveJsonFile(dirname)
}
Checklists and Procedures
Voice-driven checklists allow workers to progress through inspection or maintenance steps hands-free. The system reads each step, and the worker responds with completion status.
Setup
For checklist workflows, build a decoding graph with the expected responses:
// Common checklist responses
checklistResponses = [
"yes", "no",
"done", "complete", "completed",
"ok", "okay",
"pass", "fail",
"skip", "not applicable"
]
recognizer.createDecodingGraph(checklistResponses, "checklist")
Contextual Graphs for Multi-Step Procedures
When procedures have step-specific expected responses, you can use a separate decoding graph for each step or set up a contextual decoding graph:
// Each step has different valid responses
procedureSteps = [
[ // Step 1: Safety check
"clear", "not clear", "blocked"
],
[ // Step 2: Pressure reading
"normal", "low", "high", "critical"
],
[ // Step 3: Visual inspection
"pass", "fail", "needs attention"
]
]
recognizer.createContextualDecodingGraph(procedureSteps, "safety-procedure")
currentStep = 0
recognizer.prepareForListeningWithContextualDecodingGraph("safety-procedure", currentStep)
// ...
onFinalResponse(response) {
recordStepResult(currentStep, response.result.text)
currentStep = currentStep + 1
if (currentStep < procedureSteps.length) {
playNextStepPrompt(currentStep)
recognizer.prepareForListeningWithContextualDecodingGraph("safety-procedure", currentStep)
recognizer.startListening()
} else {
completeProcedure()
}
}
Data Entry
Voice data entry captures codes, serial numbers, readings, and notes without manual input. This is useful for field inspections, inventory counts, and equipment logging.
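For codes and numeric readings, a common pattern is to have the worker speak digit by digit and assemble the digits in the app. A minimal sketch, assuming a digit-only decoding graph plus a "point" token for decimal readings (the `assembleDigits` helper name is illustrative):

```javascript
// Map spoken digit words to characters; "point" marks a decimal separator
const DIGITS = {
  zero: "0", one: "1", two: "2", three: "3", four: "4",
  five: "5", six: "6", seven: "7", eight: "8", nine: "9",
  point: ".",
};

function assembleDigits(text) {
  const parts = text.trim().toLowerCase().split(/\s+/);
  let out = "";
  for (const p of parts) {
    if (!(p in DIGITS)) return null; // unexpected token: reject and reprompt
    out += DIGITS[p];
  }
  return out.length > 0 ? out : null;
}
```

Returning `null` on any unexpected token lets the app reprompt rather than record a partially assembled code.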
Technical Implementation
Continuous Listening
Continuous listening keeps the recognizer active across multiple interactions without requiring a push-to-talk trigger. This is essential for truly hands-free operation in frontline environments.
How Continuous Listening Works
To achieve continuous listening, the recognizer needs to be restarted after each final response:
- Recognizer is listening
- Worker speaks a command
- VAD detects end of speech
- onFinalResponse is called with the result
- If the result is actionable, you perform the action and then restart listening
- If the result is not actionable, you restart listening immediately
This creates a seamless loop where the worker can issue commands continuously.
See Continuous Listening for detailed documentation.
VAD Gating
When continuous listening is enabled, VAD gating becomes important. Without VAD gating, the recognizer runs continuously, consuming CPU and battery even during silence. With VAD gating enabled, a simple low-power voice detection step runs until speech is detected, at which point the recognizer activates:
// Enable VAD gating to reduce battery consumption
recognizer.setVADGating(true)
The VAD errs on the side of detecting non-speech as speech to avoid missing actual commands, so you may still receive final results with empty text (e.g., from loud background noise). Your app should handle empty results by simply restarting listening.
Using Trigger Phrase
For environments where continuous listening may pick up unintended speech (nearby workers, radios), consider implementing a trigger phrase:
// Require the trigger phrase before any command
triggerPhrase = "hey computer"
recognizer.createDecodingGraphWithTriggerPhrase(commands, triggerPhrase, "commands")
This ensures the recognizer only acts on commands explicitly directed at the system. The trigger phrase (“hey computer” in this example) should be acoustically distinct and unlikely to occur in normal conversation. We recommend a multi-syllable word or phrase to avoid spurious matches with other speech.
Tradeoff: Adding a trigger phrase increases reliability (fewer false triggers from ambient speech) but also increases interaction time since the worker must say extra words with each command.
Processing the result: When using a trigger phrase, strip it from the recognized text before processing the command. For example, if the result is “HEY COMPUTER NEXT”, extract “NEXT” as the actual command.
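Stripping the trigger phrase can be done with a simple prefix check. A minimal sketch, assuming recognized text is uppercase as in the example above (the `stripTriggerPhrase` helper name is illustrative):

```javascript
const TRIGGER_PHRASE = "HEY COMPUTER";

// Returns the command portion of the recognized text,
// or null if the text did not start with the trigger phrase
function stripTriggerPhrase(text) {
  const t = text.trim();
  if (!t.startsWith(TRIGGER_PHRASE)) return null; // no trigger: ignore
  return t.slice(TRIGGER_PHRASE.length).trim();
}
```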
Choosing an Interaction Mode
| Mode | Hands-free | False triggers | Interaction speed | Best for |
|---|---|---|---|---|
| Continuous listening | Yes | Higher (mitigated with VAD gating) | Fastest | Headset users in controlled environments |
| Trigger phrase | Yes | Low | Moderate (extra words per command, can be tedious for frequent interactions) | Multi-worker or noisy environments |
| Tap-to-talk | No | None | Varies (requires button press) | Intermittent voice use, shared devices |
For most voice picking applications, continuous listening with VAD gating provides the best balance of speed and reliability. Add a trigger phrase if false triggers from ambient speech become problematic.
Minimizing Cross-talk Interference
In environments where multiple workers are nearby and interacting with their devices simultaneously, cross-talk can cause one worker’s speech to be picked up by another worker’s device.
Use directional microphones: Boom microphones with good directional characteristics help reject speech from other workers. Position the microphone close to the mouth to maximize the signal-to-noise ratio for the intended user.
Filter by audio level: Use the AudioQualityResult metrics from the response to identify and ignore speech that likely came from another user. Speech from nearby workers will typically have lower RMS levels than speech from the device’s own user:
onFinalResponse(response, recognizer) {
// Ignore speech that appears to come from farther away
if (response.audioQualityResult.peakSpeechRMS < -35) {
// Likely cross-talk from another worker, ignore
recognizer.startListening()
return
}
processCommand(response.result)
}
The appropriate threshold depends on your microphone and environment. Test with actual hardware to determine the right value.
Playing Audio While Listening
When using headsets, you can play audio prompts (TTS or pre-recorded) while the recognizer is listening. The close-talk microphone isolates the worker’s voice from the headset audio, so you typically do not need to stop the recognizer before playing prompts.
If using device speakers instead of headsets, stop the recognizer before playing audio to avoid the prompt being recognized as speech.
VAD Thresholds
Voice Activity Detection thresholds control when the recognizer stops listening. Optimal values depend on the type of response expected.
Short Commands
For short commands (single words, check digits, yes/no responses), use shorter thresholds for faster interaction:
recognizer.setVADParameters(VADParameter.endSilence, 0.8) // Stop quickly after speech ends
// leave defaults for noSpeech and maxDuration
Longer Responses
For data entry or descriptions where the worker may pause to read or think:
recognizer.setVADParameters(VADParameter.endSilence, 2.0) // Allow pauses
Dynamic Adjustment
You can adjust VAD parameters dynamically based on workflow state. A common pattern is to start with a longer endSilence and reduce it in the onPartialResult callback once the expected command is recognized:
// Start with longer timeout
recognizer.setVADParameters(VADParameter.endSilence, 2.0)
onPartialResult(result, recognizer) {
if (isActionable(result.text)) {
// Reduce timeout to stop listening sooner
recognizer.setVADParameters(VADParameter.endSilence, 0.4)
}
}
Decoding Graphs
Build decoding graphs from the expected vocabulary for each interaction context. Constraining recognition to valid responses dramatically improves accuracy compared to open-vocabulary recognition.
Command-Based Graphs
For command-driven interfaces, create graphs containing all valid commands:
commands = [
"next", "back", "repeat",
"confirm", "cancel",
"help", "pause"
]
recognizer.createDecodingGraph(commands, "navigation")
Domain-Specific Vocabulary
Enterprise environments often have domain-specific terminology. Add custom pronunciations for terms not in the standard lexicon:
// Add internal product codes and jargon
altWordPronunciations = [
WordPronunciation("SKU", "S K Y UW"),
WordPronunciation("RFID", "AA R EH F AY D IY"),
WordPronunciation("WMS", "D AH B AH L Y UW EH M EH S")
]
phrases = ["scan SKU", "check RFID", "update WMS"]
recognizer.createDecodingGraph(phrases, "warehouse-commands", altWordPronunciations)
Decoding graphs are persisted to disk and are specific to the ASR Bundle. You can check for their existence and skip the creation step if the graph is already available:
// Check if graph already exists before creating
if (!DecodingGraph.graphWithNameExists("zone-a-checkdigits")) {
recognizer.createDecodingGraph(zoneACheckDigits, "zone-a-checkdigits")
}
Dealing With Accented Speech
When dealing with heavily accented speech, you may get <SPOKEN_NOISE> tokens instead of the expected command because the pronunciation differs significantly from the standard. There are two ways to handle this:
- Reduce the spoken_noise probability when creating decoding graphs (default is 0.5; you can set it to 0.2 or 0.3 to make the recognizer more lenient)
- Specify alternative pronunciations for commands that are commonly affected by accents
// Add alternative pronunciations for accented speech
// e.g., "three" pronounced as "tree" or "free" by some speakers
altWordPronunciations = [
WordPronunciation("three", "T R IY"), // "tree"
WordPronunciation("three", "F R IY") // "free"
]
commands = ["one", "two", "three", "four", "five"]
recognizer.createDecodingGraph(commands, "numbers", altWordPronunciations)
Alphanumeric Codes
For alphanumeric codes like part numbers or serial numbers, you may need to support letter spelling:
// NATO phonetic alphabet for reliable letter recognition
phoneticAlphabet = [
"alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf",
"hotel", "india", "juliet", "kilo", "lima", "mike", "november",
"oscar", "papa", "quebec", "romeo", "sierra", "tango", "uniform",
"victor", "whiskey", "x-ray", "yankee", "zulu",
"zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"
]
recognizer.createDecodingGraph(phoneticAlphabet, "alphanumeric")
Using the phonetic alphabet rather than raw letters (A, B, C) significantly improves recognition accuracy, especially in noisy environments where many letters sound similar.
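Decoding a phonetic-alphabet utterance back into the alphanumeric code is then a straightforward lookup. A minimal sketch (the `decodePhonetic` helper name is illustrative):

```javascript
// Map NATO phonetic words and digit words to code characters
const NATO = {
  alpha: "A", bravo: "B", charlie: "C", delta: "D", echo: "E",
  foxtrot: "F", golf: "G", hotel: "H", india: "I", juliet: "J",
  kilo: "K", lima: "L", mike: "M", november: "N", oscar: "O",
  papa: "P", quebec: "Q", romeo: "R", sierra: "S", tango: "T",
  uniform: "U", victor: "V", whiskey: "W", "x-ray": "X",
  yankee: "Y", zulu: "Z",
  zero: "0", one: "1", two: "2", three: "3", four: "4",
  five: "5", six: "6", seven: "7", eight: "8", nine: "9",
};

function decodePhonetic(text) {
  const parts = text.trim().toLowerCase().split(/\s+/);
  let code = "";
  for (const p of parts) {
    if (!(p in NATO)) return null; // out-of-vocabulary token: reject
    code += NATO[p];
  }
  return code.length > 0 ? code : null;
}
```

For example, “bravo seven alpha” decodes to “B7A”.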
Handling Unrecognized Input
When the recognizer produces a result that doesn’t match expected values (e.g., worker said something not in the grammar), prompt for retry:
onFinalResponse(response) {
if (!isValidResponse(response.result.text)) {
retryCount = retryCount + 1
if (retryCount < 3) {
playPrompt("I didn't understand. Please say the check digit.")
recognizer.startListening()
} else {
// Fall back to manual entry
showManualEntryUI()
}
}
}
Noise and False Triggers
In noisy environments, the recognizer may trigger on non-speech sounds even with VAD gating. Implement checks for out-of-vocabulary speech and low confidence:
onFinalResponse(response) {
// Check if this looks like actual speech
if (response.result.words.length == 0) {
// No words recognized - likely transient noise triggered VAD gating
recognizer.startListening()
return
}
// Check for out-of-vocabulary words (indicated by SPOKEN_NOISE token)
for (word in response.result.words) {
if (word.text == "<SPOKEN_NOISE>") {
// User said something not in the decoding graph
playPrompt("I didn't get that. Please repeat")
// assumes above call is synchronous, i.e. we start listening only
// after audio is done playing
recognizer.startListening()
return
}
}
// Check minimum confidence across all words
minConfidence = 1.0
for (word in response.result.words) {
if (word.confidence < minConfidence) {
minConfidence = word.confidence
}
}
if (minConfidence < 0.5) {
playPrompt("I didn't get that. Please repeat")
// assumes above call is synchronous, i.e. we start listening only
// after audio is done playing
recognizer.startListening()
return
}
processValidResponse(response)
}
Best Practices
- Use headsets: A headset with close-talk microphone dramatically improves signal-to-noise ratio compared to device microphones
- Constrain vocabulary: Smaller, well-defined vocabularies are more robust to noise than large grammars
- Design acoustically distinct commands: Avoid commands that sound similar (e.g., “fifteen” vs “fifty”)
- Enable VAD gating: Reduces processing overhead and preserves battery by only running recognition when speech is detected
Summary
When building frontline worker apps with KeenASR SDK:
- Constrain recognition using decoding graphs built from valid responses for each step
- Use continuous listening with VAD gating for hands-free operation
- Adjust VAD thresholds based on expected response length
- Add custom pronunciations for domain-specific terminology
- Design for noise: use headsets, distinct vocabulary, and confidence checks
