Wake word detection on the Raspberry Pi Zero
Why do we need wake word detection?
Wake-word or hot-word detection is an important part of a digital voice interface for two reasons.
First, transcribing speech to text is a hard task, so voice data is usually sent to a powerful server which does the transcription. Nobody wants their end device to stream every captured sample nonstop. It wastes bandwidth and data volume and raises serious privacy concerns. Also, cloud speech recognition usually isn’t free, and running it 24/7 can create a significant bill. Using a wake-word is a convenient way of limiting the amount of audio which has to be processed for speech.
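To put a rough number on the bandwidth argument, here is a back-of-the-envelope sketch. The capture format (16 kHz, 16-bit, mono PCM) and the usage pattern (30 requests of 5 seconds each per day) are assumptions for illustration, not measurements from any particular assistant.

```python
# Assumed capture format (not stated in the article): 16 kHz, 16-bit, mono PCM.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2
CHANNELS = 1
SECONDS_PER_DAY = 24 * 60 * 60


def bytes_streamed(seconds: float) -> int:
    """Uncompressed PCM bytes produced in the given number of seconds."""
    return int(seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


# Streaming everything the microphone hears, all day long.
always_on = bytes_streamed(SECONDS_PER_DAY)

# Hypothetical usage pattern: 30 voice requests per day, 5 seconds each.
gated = bytes_streamed(30 * 5)

print(f"always-on:       {always_on / 1e9:.2f} GB/day")
print(f"wake-word gated: {gated / 1e6:.2f} MB/day")
```

Under these assumptions the always-on device uploads about 2.76 GB per day, while the wake-word gated one uploads about 4.80 MB.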
Second, using a hot-word explicitly addresses the assistant. Imagine having a conversation with another person and asking “Can you buy some milk?” You don’t want the assistant to react to this question.
Wake-words are not here to stay forever
Using a hot-word feels unnatural because it is just a single part of normal human communication. Humans use much more information to infer whether they are being addressed. People look each other in the eyes. We know if someone is physically close, and we turn our bodies in the other person’s direction.
We are aware of whether we are alone in a room or with somebody. A smarter assistant will probably know that it’s being addressed when a user who is alone in the kitchen, looking at the fridge, says “Order some milk”.
I expect future versions of digital assistants will incorporate more aspects of human behavior and eventually lead the way out of the uncanny valley. But for the time being, we are stuck with simple hot-words.
Options for the Raspberry Pi Zero
The Raspberry Pi Zero is a very low-cost device, which makes it useful for so-called satellites: small physical voice interfaces which hand off complex tasks to a suitable back-end (this can be cloud-based or a home server). Currently, there are only three wake-word engines lightweight enough to work with the Pi Zero’s constrained CPU.
The well-known Snowboy engine from Kitt.ai was the first wake-word engine to show up for the Raspberry Pi. It was often used in combination with Google’s AIY voice kit, which couldn’t run its native hot-word activation on the Pi Zero.
Snowboy allows you to create your own wake-words by providing three voice samples, but these are so-called personal keywords and will only work reliably with your own voice.
The Snowboy engine performs voice activity detection, which is very lightweight: it takes about 5% to 12% CPU when no voice is detected. During voice activity, it uses about 42% to 67% CPU.
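The gap between the idle and the active CPU figures comes from this gating idea: a cheap check decides whether the expensive model has to run at all. The following is a minimal sketch of that pattern using a plain energy threshold. It is not Snowboy’s actual implementation; `gated_detection`, the `detector` callable and the threshold value are hypothetical names and numbers chosen for illustration.

```python
import math
from typing import Callable, Iterable, List, Sequence


def rms(frame: Sequence[int]) -> float:
    """Root-mean-square amplitude of one frame of PCM samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def gated_detection(
    frames: Iterable[Sequence[int]],
    detector: Callable[[Sequence[int]], bool],
    threshold: float = 500.0,  # assumed value, would need tuning per microphone
) -> List[int]:
    """Run `detector` only on frames loud enough to contain voice.

    Returns the indices of the frames on which the detector fired.
    """
    hits = []
    for index, frame in enumerate(frames):
        if rms(frame) < threshold:
            continue  # quiet frame: skip the expensive model entirely
        if detector(frame):
            hits.append(index)
    return hits
```

Here `detector` stands in for whatever wake-word model you use. A real voice activity detector is usually smarter than a fixed energy threshold, but the control flow stays the same.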
The newcomer Porcupine from Picovoice comes with a benchmark tool and shows some impressive results.
What’s special about Porcupine is its wake-word generator. You can generate a custom wake-word without any voice samples, just by specifying it. However, you might need a commercial license to do this.
Hot-word detection is one part of the Nyumaya audio recognition software. Currently, you can only use two fixed hot-words: “Sheila” and “Marvin”. For each keyword, two models are available. The smaller one uses 20% CPU on the Pi Zero. The recognition rate is similar to Porcupine, but comparing them is hard since different words are used.
In the presence of noise, the performance of all engines degrades significantly. Don’t expect them to work reliably while playing loud music. One way to counter this would be acoustic echo cancellation, which in turn is only possible if enough CPU power is available.
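To give an idea of what echo cancellation involves and why it costs CPU, here is a minimal sketch of a textbook NLMS (normalized least mean squares) adaptive filter. It assumes the music is played by the device itself, so the playback signal is available as a reference. The function name, tap count and step size are assumptions for illustration; none of the engines above is claimed to work this way.

```python
from typing import List, Sequence


def nlms_echo_cancel(
    mic: Sequence[float],
    ref: Sequence[float],
    taps: int = 8,      # assumed filter length; real rooms need far more
    mu: float = 0.5,    # step size, between 0 and 2 for stability
    eps: float = 1e-8,  # avoids division by zero on silent input
) -> List[float]:
    """Subtract an adaptive estimate of the played-back signal from the mic.

    `ref` is what the loudspeaker plays, `mic` is what the microphone hears.
    Returns the residual, which ideally contains only the user's voice.
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    residual = []
    for heard, played in zip(mic, ref):
        history = [played] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = heard - echo_estimate
        power = sum(h * h for h in history) + eps
        # Nudge the weights toward whatever would have reduced the error.
        weights = [w + mu * error * h / power for w, h in zip(weights, history)]
        residual.append(error)
    return residual
```

Every sample costs a number of multiplications proportional to the filter length, and a realistic room echo needs hundreds or thousands of taps, which is why this competes with the wake-word engine for the Pi Zero’s CPU.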
Other Use Cases
Hot-words can be used not only to address assistants but also to provide a simple voice command interface. If you can limit your application to a few keywords, this might be an option for you. Imagine a voice-controlled door opener for disabled people or a “bark” wake-word that dispenses some dog food when your dog barks.
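Such a command interface can be as small as a lookup table from detected keyword to action. The sketch below assumes your engine reports each detection as a plain string. The keywords, the action functions and the `log` list are hypothetical stand-ins for real hardware control.

```python
from typing import Callable, Dict, List

# Stand-in for real side effects such as driving a relay or a servo.
log: List[str] = []


def open_door() -> None:
    log.append("door opened")


def dispense_food() -> None:
    log.append("food dispensed")


# One entry per keyword the wake-word engine has a model for (assumed names).
COMMANDS: Dict[str, Callable[[], None]] = {
    "open": open_door,
    "bark": dispense_food,
}


def handle_keyword(keyword: str) -> bool:
    """Run the action for a detected keyword; return False if it is unknown."""
    action = COMMANDS.get(keyword)
    if action is None:
        return False
    action()
    return True
```

Calling `handle_keyword("bark")` runs `dispense_food`, while an unknown keyword is simply ignored. No speech-to-text back-end is involved, which keeps everything on the device.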
Of course, you can control your lights or blinds via voice command, but these use cases are often better handled by simple manual switches.
I’m sure that if you ditch all the obvious ideas and think a bit further, you can come up with some great use cases.