Cleaning the SpeechCommand dataset
Simple Audio Recognition
The TensorFlow Simple Audio Recognition tutorial is a great way to get started with machine learning. It uses the Speech Commands dataset to detect spoken occurrences of different keywords. With an accuracy of 85 to 90% it's pretty good, but I wanted to take it a step further. The first step was to use a different model: the Honk project had already benchmarked several, and the res8 model was both small and performant. The second step was to use the updated version of the Speech Commands dataset, which boosted the recognition rate to 93%. After picking this low-hanging fruit, I hit a wall: the model just wouldn't improve.
Finding bad data fast
Time to take a closer look at the dataset. With over 100k files, I couldn't review everything, so at first I just picked some samples to inspect manually. At first glance, everything looked fine. Then I had the model write out the audio files it misclassified, and the picture changed. The yield was only a couple hundred files, and more than half of them were silence, wrong words, or cut-off samples. I removed the bad files and retrained the model. The manually verified correct files went back into the dataset with a filename prefix, so I wouldn't have to review them again.
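The triage loop above can be sketched in a few lines. This is a minimal illustration, not the tutorial's actual code: the tuple format of `predictions` and the `ok_` prefix are my assumptions.

```python
from pathlib import Path

# Hypothetical prefix marking files already confirmed correct by hand,
# so later review passes can skip them.
REVIEWED_PREFIX = "ok_"

def find_misclassified(predictions):
    """predictions: iterable of (filename, true_label, predicted_label).

    Returns the filenames the model got wrong, which are the only
    ones worth reviewing by hand."""
    return [name for name, truth, pred in predictions if pred != truth]

def mark_reviewed(path: Path) -> Path:
    """Rename a hand-verified file with the prefix so it is skipped next time."""
    target = path.with_name(REVIEWED_PREFIX + path.name)
    path.rename(target)
    return target
```

The key point is that only the misclassified files ever reach a human, which is why a couple hundred files stand in for the full 100k.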
The new model still produced new false classifications. On review, the silent and wrong-word files were mostly, but not completely, gone. What remained were more cut-off files. I suspect these had been classified correctly by chance: a clip of "on" or "off" where only the "o" is audible has roughly a 50% chance of landing on the right label. The other possibility is that the old model had learned to recognize word fragments as valid labels; for the keyword "Sheila", "She" was also classified as a valid sample. This is particularly bad because it doesn't show up in the dev error, yet leads to unexpected detections in production.
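Cut-off clips can also be flagged automatically with a simple heuristic: a recording that is still loud in its final milliseconds was probably truncated mid-word. The following sketch is my own assumption, not part of the original workflow, and the threshold values are guesses that would need tuning.

```python
import numpy as np

def looks_cut_off(samples: np.ndarray,
                  tail_fraction: float = 0.05,
                  ratio: float = 0.5) -> bool:
    """Flag clips whose tail is nearly as loud as the clip overall.

    A spoken word that fits inside the clip trails off into silence;
    a truncated word keeps its energy right up to the last sample."""
    tail = samples[-max(1, int(len(samples) * tail_fraction)):]
    rms = lambda x: float(np.sqrt(np.mean(np.square(x, dtype=np.float64))))
    total = rms(samples)
    return total > 0 and rms(tail) / total > ratio
```

A filter like this only pre-sorts candidates; the flagged files would still go through the same manual review pass as before.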
After a few iterations, the errors became rarer and rarer, so I could be confident most of the bad files were gone. This procedure took significantly less time than manually reviewing all the data, and it's a starting point for a method that eliminates manual review entirely.
Be careful when cleaning data
When cleaning your dataset you have to be very careful. Every time you change your training and evaluation sets, you change the distribution of your data, and that can have a big impact on your production model. In my case, the model will be used to recognize words streamed from a microphone. The Speech Commands dataset was recorded with an app or browser and trimmed by a program. Because the production model will run on a continuous stream, cut-off examples won't occur in production, so removing them from the training examples moves the training set closer to the production distribution. If we instead wanted the model to recognize captured voice snippets, where cut-off examples are to be expected, this cleanup step would be counterproductive.
Here is the clean dataset, with a separate folder containing the bad files. Of the 110k files, about 3,500 were removed, roughly 3% of the total. The new dataset gave me a recognition rate of 95.5%, and with some tuning of the learning rate I finally reached 97.3%. One last trick was to increase the background noise during training, which got me to 97.6%.
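Background-noise augmentation of the kind mentioned above can be sketched as mixing a random slice of a noise recording into each training clip. This is a minimal illustration under my own assumptions (samples normalized to [-1, 1], a single noise buffer, a fixed volume); the tutorial's actual augmentation pipeline may differ.

```python
import numpy as np

def add_background(clip: np.ndarray, noise: np.ndarray,
                   volume: float, rng: np.random.Generator) -> np.ndarray:
    """Overlay a random slice of `noise`, scaled by `volume`, onto `clip`.

    Picking a random offset each time means the same keyword clip is
    seen against different noise on every epoch."""
    start = rng.integers(0, len(noise) - len(clip) + 1)
    mixed = clip + volume * noise[start:start + len(clip)]
    return np.clip(mixed, -1.0, 1.0)  # keep samples in the valid range
```

Raising `volume` during training is one way to realize the "more background noise" trick: the model has to rely on the word itself rather than on a quiet background.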
At this point, the remaining misclassified keywords are hard to interpret even for a human listening closely with headphones.