What makes a good dataset?
A good dataset is key to a well-performing keyword-spotting model. Make it as close to the production use case and as diverse as possible.
For audio recordings this means:
- If you want to recognize audio from a headset, record your training examples using a headset.
- If you want to recognize audio from both a headset and far field, record samples from both.
- Diversity is important. The dataset should contain a wide variety of voices. If the dataset does not contain, for example, children's voices, the accuracy for children will be degraded. The same goes for accents and so on.
- Try to record from as many voices as possible. As a rule of thumb, record between 5 and 10 samples per person.
- If you can only record with a single microphone, use the same microphone as in the production case.
- If you can record from multiple microphones, the resulting model will adapt better to different production environments.
Special cases for keyword spotting:
For keyword spotting models, one important metric is false activations per hour (FA).
A model's measured FA can vary widely depending on the test data.
For example, FA can be close to 0.0 if you only feed in silence or low-volume microphone noise.
It can be very high if your test data is continuous speech containing words that sound similar to your keywords.
An example:
You are trying to recognize the phrases “Sport Mode” and “Economy Mode” for a voice control system embedded in the rider's helmet. A good test set for FA would be real recordings of different riders in traffic. A bad test set would be recordings from a far-field speaker array in someone's living room.
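As a rough illustration, here is a minimal sketch of how FA per hour can be computed from the detections your keyword spotter produces on a negative test set (audio that contains no real keyword). The function name and the example numbers are made up for illustration.

```python
# Minimal sketch: false activations per hour on a negative test set
# (audio that contains no real keyword). The detection counts and the
# duration below are made-up example values.

def false_activations_per_hour(detections_per_file, total_audio_seconds):
    """detections_per_file: number of (false) keyword hits in each negative file."""
    total_false_activations = sum(detections_per_file)
    hours = total_audio_seconds / 3600.0
    return total_false_activations / hours

# Example: 7 false hits over 3.5 hours of rider recordings -> 2.0 FA per hour
print(false_activations_per_hour([2, 0, 4, 1], total_audio_seconds=3.5 * 3600))
```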
Dataset Splitting
When creating datasets for machine learning you have to be careful, and one of the most important aspects is splitting your dataset. This is necessary whether you train the model yourself or let someone else train it.
When training a neural network you might encounter overfitting: the network recognizes examples it has seen during training better than new, unseen ones.
If you let someone train a network on all your available data and do not keep back a portion for evaluation, all you can do is test the model on the training data. The measured accuracy might be 95% in this case, but because all data in production is unseen, the network may perform much worse there.
Good splits between training and eval/test are:
- 80% training data, 20% test data (use this when you are letting someone else train the model; they will split the training data themselves)
- 80% training data, 10% eval data, 10% test data (use this when training the model yourself; see the sketch below)
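Below is a minimal splitting sketch in Python, assuming one audio file per sample; the file names are placeholders. If you know which speaker recorded each file, it is better to split by speaker, so the same voice never ends up in both training and test data.

```python
# Minimal sketch: 80% / 10% / 10% split with a fixed seed so the split is
# reproducible. The file names are placeholders.
import random

def split_dataset(samples, seed=42):
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n = len(samples)
    train = samples[: int(0.8 * n)]
    eval_set = samples[int(0.8 * n): int(0.9 * n)]
    test = samples[int(0.9 * n):]
    return train, eval_set, test

files = [f"keyword_{i:04d}.wav" for i in range(2000)]
train, eval_set, test = split_dataset(files)
print(len(train), len(eval_set), len(test))  # 1600 200 200
```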
Why should I never give out my test set?
By never giving out your test set, you can guarantee that the model is not trained (by accident or on purpose) on examples from the test set.
Does it help to artificially augment the training data?
Yes, if you train the model yourself. Adding noise, time stretching, and room simulation can help improve your model. This is called data augmentation.
No, if you send the training data to us. We already apply a wide range of augmentations, and therefore need the samples without any augmentation applied.
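If you do train the model yourself, a minimal augmentation sketch could look like the following; it assumes numpy and librosa are installed, adds low-level noise and a slight time stretch, and leaves room simulation (convolving with a room impulse response) out for brevity. The file name and parameter values are only examples.

```python
# Minimal sketch: noise + time stretching (data augmentation), assuming
# numpy and librosa are installed. Room simulation is left out for brevity.
# The file name and parameters are only examples.
import numpy as np
import librosa

def augment(samples, noise_level=0.005, stretch_rate=1.05):
    # Add low-amplitude gaussian noise.
    noisy = samples + noise_level * np.random.randn(len(samples)).astype(samples.dtype)
    # Stretch in time without changing the pitch.
    return librosa.effects.time_stretch(noisy, rate=stretch_rate)

audio, sr = librosa.load("sport_mode_0001.wav", sr=16000)  # placeholder file
augmented = augment(audio)
```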
How many samples do I need?
About 200 diverse samples per keyword is the absolute minimum to get any reasonable results.
The demo models (for the keywords “marvin” and “sheila”) were trained on 2000 samples.
More samples give better accuracy, especially for bigger models, but expect diminishing returns once the sample count gets large.
Being future-proof
Currently many audio recognition systems use a relatively low sample rate (16 kHz) and a single channel. Getting recordings can be an expensive and time-consuming process. If you can, make the recordings at a higher sample rate (48 kHz) and with multiple channels (2, or even more from a microphone array).
This way you can reuse the same dataset once a more advanced model can make use of the increased data resolution.
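Deriving the 16 kHz mono data that current models expect from such a future-proof recording is straightforward; here is a minimal sketch assuming librosa and soundfile are installed, with placeholder file names.

```python
# Minimal sketch: derive the 16 kHz mono file that today's models expect
# from a 48 kHz multi-channel recording, assuming librosa and soundfile
# are installed. The file names are placeholders.
import librosa
import soundfile as sf

# mono=True averages the channels, sr=16000 resamples from 48 kHz.
audio_16k, sr = librosa.load("recording_48k_multichannel.wav", sr=16000, mono=True)
sf.write("recording_16k_mono.wav", audio_16k, sr)
```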