Wake word detection on the Raspberry Pi Zero
Why do we need wake word detection?
Wake-word (or hot-word) detection is an important part of a digital voice interface for two reasons. First, transcribing speech to text is hard, so voice data is usually sent to a powerful server that does the transcription. Nobody wants their end device to stream every captured sample nonstop: it wastes bandwidth and data volume and raises serious privacy concerns. Cloud speech recognition also usually isn't free, and running it 24/7 can produce a significant bill.
Second, a hot-word lets you explicitly address the assistant. Imagine having a conversation with another person and asking "Can you buy some milk?" You don't want the assistant to react to that question.
Still, using a hot-word feels unnatural, because it is just one part of normal human communication. Humans use much more information to infer whether they are being addressed: we look each other in the eye, we notice when someone is physically close, and we turn our bodies toward them. We also use context. I expect future versions of digital assistants will incorporate more of these aspects of human behavior and finally lead a way out of the uncanny valley, but for the time being we are stuck with simple hot-words.
Options for the Raspberry Pi Zero
The Raspberry Pi Zero is a very low-cost device, which makes it useful for so-called satellites: small physical voice interfaces that delegate complex tasks to a suitable back-end (cloud-based or a home server). We assembled a comprehensive list of wake-word engines, but only two of them are currently lightweight enough for the Pi Zero's constrained CPU.
The well-known Snowboy engine from Kitt.ai was the first wake-word engine to show up for the Raspberry Pi. It was often used in combination with Google's AIY Voice Kit, whose native hot-word activation couldn't run on the Pi Zero.
The newcomer Porcupine from Picovoice comes with a benchmark tool and shows some impressive results.
The Snowboy engine performs voice activity detection first, which is very lightweight: it takes about 5% to 12% CPU when no voice is detected, and about 42% to 67% CPU during voice activity. Both engines can detect multiple keywords with little overhead. In the presence of noise, the performance of both engines degrades significantly, so don't expect them to work reliably while loud music is playing. One way to counter this would be acoustic echo cancellation, which in turn is only possible if enough CPU power is available.
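To make the CPU numbers plausible, it helps to see why voice activity detection is so cheap. The sketch below (plain Python, no audio library; the function names and the threshold value are our own illustration, not Snowboy's API) shows the simplest form of the idea: compare a frame's RMS energy against a noise floor, and only wake the heavier keyword model when the frame looks like speech.

```python
import math

FRAME_LEN = 160          # 10 ms at 16 kHz, a common frame size
ENERGY_THRESHOLD = 0.02  # noise floor; would be tuned per mic/room (assumption)

def rms(frame):
    """Root-mean-square energy of one frame of float samples in [-1, 1]."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def is_voice(frame, threshold=ENERGY_THRESHOLD):
    """Very naive VAD: energy above a fixed noise floor counts as voice."""
    return rms(frame) > threshold

# A near-silent frame vs. a louder "speech-like" frame (440 Hz tone)
silence = [0.001] * FRAME_LEN
speech = [0.1 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(FRAME_LEN)]

print(is_voice(silence))  # False
print(is_voice(speech))   # True
```

Real engines use far more robust statistics than raw energy, but the principle is the same: a few arithmetic operations per sample gate a much more expensive neural model, which is why idle CPU usage stays in the single digits.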
What's special about Porcupine is its wake-word generator: you can generate a custom wake-word without any voice samples, simply by specifying the word. Snowboy also lets you create your own wake-words by providing three voice samples, but these are so-called personal keywords and will only work reliably with your own voice.
Both engines are free for hackers and have paid options for commercial use. If you're running hot-word detection on a Raspberry Pi 3 and you use a keyword other than Alexa, Porcupine might do a better job than Snowboy. For the Pi Zero we recommend the Snowboy engine. The Picovoice benchmark currently does not apply the audio frontend to Snowboy, which has an exceptionally good model of the Alexa hot-word; our tests show that in this configuration Snowboy is slightly better than the big Porcupine model.
Open source solutions?
Both Snowboy and Porcupine are great, but they are somewhat limited and cannot be modified. Fortunately, Raphael Tang and Jimmy Lin released two papers (paper1 paper2) that describe state-of-the-art wake-word detection and also provide the source code for it. In addition, Google released a Speech Commands dataset containing short samples of spoken words from different speakers. The resulting project, Honk, already looks very promising but definitely needs more work to be production-ready: the source code must be optimized for the Raspberry Pi (dependencies, installation, performance), and the project needs more data, especially for common wake-words like Alexa. We also started our own audio classification project here.
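Keyword models like Honk classify fixed-length audio windows, so the first preprocessing step for any such project is slicing the microphone stream into overlapping frames. A minimal sketch in plain Python (the frame and hop sizes are typical values for speech features, not Honk's exact configuration):

```python
def frame_signal(samples, frame_len=400, hop=160):
    """Split a 1-D list of samples into overlapping frames.

    With 16 kHz audio, frame_len=400 and hop=160 correspond to 25 ms
    windows with a 10 ms hop, a common choice for speech features
    (an assumption here, not a specific engine's settings).
    """
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames

one_second = [0.0] * 16000   # one second of silence at 16 kHz
frames = frame_signal(one_second)
print(len(frames))     # 98 frames
print(len(frames[0]))  # 400 samples per frame
```

Each frame would then be turned into spectral features (e.g. MFCCs) and fed to the classifier; the framing itself is cheap enough to run comfortably on a Pi Zero.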
Other Use Cases
Hot-words are not only useful for addressing assistants; they can also provide a simple voice-command interface. If you can limit your application to a few keywords, this might be an option for you. Imagine a voice-controlled door opener for disabled people, or a "bark" wake-word that dispenses some dog food when your dog barks. Of course you can control your lights or blinds via voice command, but those use cases are often better handled by simple manual switches. I'm sure that if you ditch the obvious ideas and think a bit further, you can come up with some great use cases.