Use machine learning to uncover the hidden value of audio data
A hidden treasure:
Hearing is one of the five human senses and carries a surprising amount of information. Close your eyes and listen to the sounds nearby. Chances are you can get a pretty good overview of what's going on around you. You can hear cars passing by and count them. You can estimate how fast they are going, and you can distinguish a truck from a sports car. Maybe a bird's nest is nearby and you suddenly hear baby birds begging for food. Of course, you know that one of their parents has just arrived with some tasty food in its beak. A seasoned machinist can put his ear to a marine diesel engine and tell whether all pistons are running smoothly and the bearings are OK.
Microphones can capture this information from a distance, at a spot where they can be easily placed and powered. Often the captured information is incredibly useful and valuable. A researcher might place a microphone in a bird sanctuary and use the recorded bird sounds to gather information about species and population size. It would be pretty tedious to have a bird expert listen through weeks of audio recordings, and outright impossible to scale this to hundreds of areas.
Mining the data:
This is where machine learning steps in. One application where this is already done is hot-word detection and speech recognition in smart assistants. If you can gather enough data, it's possible to build a neural network that automatically classifies sound in real time. But how can we get that much data?
Google has released AudioSet, which uses labeled YouTube clips across 635 categories and already covers a lot of useful applications. DeepMind also provides the Kinetics dataset. Another great source is https://www.kaggle.com/datasets.
An easy way to gather more data is to build a crawler and search for license-free videos, which then need to be manually segmented and tagged. Services like Mechanical Turk can help with the tagging. It's also possible to start with little data and then use the neural net to classify further data, or at least to reduce the amount that has to be reviewed manually.
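The last idea, letting a model pre-filter what humans review, can be sketched roughly like this. The probabilities, labels, and threshold below are hypothetical stand-ins for whatever your classifier produces:

```python
import numpy as np

def select_for_review(probabilities, threshold=0.9):
    """Return indices of clips a (hypothetical) model is unsure about.

    probabilities: array of shape (n_clips, n_classes) with softmax outputs.
    Clips whose top class score falls below `threshold` go to human
    reviewers; confident predictions are accepted automatically.
    """
    top_scores = probabilities.max(axis=1)
    return np.where(top_scores < threshold)[0]

# Example: 4 clips, 3 sound classes
probs = np.array([
    [0.97, 0.02, 0.01],  # confident -> auto-label
    [0.50, 0.30, 0.20],  # unsure    -> human review
    [0.10, 0.85, 0.05],  # unsure    -> human review
    [0.01, 0.01, 0.98],  # confident -> auto-label
])
print(select_for_review(probs))  # -> [1 2]
```

Raising the threshold sends more clips to reviewers; lowering it trades review effort for label noise.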
Additionally, you can reduce the amount of data needed by applying transfer learning and data augmentation. In the first case, data from a related application is used to pre-train a neural network, and the real data is then used to fine-tune it. This is most commonly seen in image recognition, where a network pre-trained on ImageNet acts as a starting point. In the second case, existing data is modified to multiply it. For voice recognition this usually includes shifting the voice pitch, adding background noise, applying reverb, time-shifting, and even simulating room echo and other acoustic effects.
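Two of these augmentations, time shifting and adding background noise, can be sketched with plain NumPy. The shift range and noise level here are arbitrary example values; pitch shifting and reverb would need an audio library such as librosa and are omitted:

```python
import numpy as np

def augment(waveform, rng, max_shift=1600, noise_level=0.005):
    """Create a new training sample from an existing audio clip.

    Rolls the signal by a random number of samples (time shift) and
    mixes in Gaussian background noise. Each call yields a slightly
    different version of the same clip.
    """
    shift = rng.integers(-max_shift, max_shift + 1)
    shifted = np.roll(waveform, shift)
    noise = rng.normal(0.0, noise_level, size=waveform.shape)
    return shifted + noise

rng = np.random.default_rng(0)
clip = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of 440 Hz tone
augmented = augment(clip, rng)
assert augmented.shape == clip.shape  # same length, different content
```

Applying this a handful of times per clip can multiply a small dataset several-fold before training.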
Building a model
With the data in place, you're ready to build your neural network. TensorFlow offers a good tutorial on simple audio recognition, and the open-source project Honk provides a PyTorch implementation. Training can be done on a reasonably powerful GPU at home or using cloud services.
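Audio networks like these typically do not consume raw waveforms but a spectrogram. A minimal log-spectrogram computation in plain NumPy, assuming 16 kHz mono input and example frame sizes, looks roughly like this:

```python
import numpy as np

def log_spectrogram(waveform, frame_len=256, hop=128):
    """Compute a log-magnitude spectrogram with a short-time FFT.

    Splits the signal into overlapping Hann-windowed frames and takes
    the FFT of each; the resulting (frames x frequency bins) array is
    the kind of input a CNN-style audio classifier would train on.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(waveform) - frame_len) // hop
    frames = np.stack([
        waveform[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    magnitudes = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(magnitudes + 1e-6)  # small offset avoids log(0)

clip = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s test tone
spec = log_spectrogram(clip)
print(spec.shape)  # -> (124, 129)
```

From here the spectrogram is treated much like an image, which is why image-recognition architectures transfer so well to audio.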
To get an idea of which applications are feasible, researchers usually aim for human performance. Humans are pretty good at interpreting data, and surpassing them is hard, so listen to the gathered audio yourself and try to interpret it. This should give you an idea of the upper bound of the achievable performance. In some expert areas machine learning can surpass human performance, but especially when little data is available, humans will perform much better. If somebody tells you their name, you only need one sample to remember it. Machine learning has some tricks up its sleeve, but general high-performance one-shot learning has yet to be invented.
Not a silver bullet
Using audio as a sensor is certainly not a silver bullet. Often simpler approaches are more reliable and easier to implement. But for many industries and applications, audio could make a big impact. Researchers have little insight into how certain industries work. Maybe you can find a hidden treasure in your industry.