Creating an audio dataset for machine learning can be a complex and time-consuming task. Unfortunately, many people make mistakes when creating these datasets, which can lead to poor performance of machine learning models. Here, we will discuss some of the common mistakes made when creating audio datasets for machine learning and how to avoid them.
No split between training, validation, and test data
First, a common mistake is not having a clear split between training, validation, and test data. A typical split is 80/20 (train/test) or 80/10/10, where 80% of the data is used for training, 10% for validation, and 10% for testing. When training neural networks, it can happen that they memorize individual examples rather than generalizing; this is called overfitting. In that case, the network performs much better on the training data than on the data it sees in production, and without held-out validation and test sets you have no reliable way to detect it.
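As a rough illustration, an 80/10/10 split could be produced with scikit-learn's train_test_split applied twice; the file list and labels below are placeholders standing in for your own data:

```python
from sklearn.model_selection import train_test_split

# Placeholder data: paths to audio clips and their labels.
audio_files = [f"clip_{i:03d}.wav" for i in range(100)]
labels = [i % 2 for i in range(100)]

# First carve off 20% of the data, then split that holdout in half
# to get 10% validation and 10% test.
train_files, holdout_files, train_labels, holdout_labels = train_test_split(
    audio_files, labels, test_size=0.2, random_state=42, stratify=labels
)
val_files, test_files, val_labels, test_labels = train_test_split(
    holdout_files, holdout_labels, test_size=0.5, random_state=42, stratify=holdout_labels
)

print(len(train_files), len(val_files), len(test_files))  # 80 10 10
```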
Data distribution mismatch
Another mistake is building a dataset whose distribution does not match, or is narrower than, the conditions the model will face in production. For example, if you are creating a voice assistant device, your training data may be recorded in a near-field environment while the real-world usage of the device is mostly far-field. In this case, the distribution of the training data does not match the real-world usage, which can lead to poor performance of the machine learning model. Similarly, if the training data is recorded only in a near-field environment but real-world usage may be either near-field or far-field, the distribution is too narrow and does not represent the full range of conditions the model will encounter.
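One practical sanity check, sketched below under the assumption that each clip carries metadata with an "environment" tag (a hypothetical field name), is to compare how recording conditions are distributed in the training set versus a sample of real-world usage:

```python
from collections import Counter

# Hypothetical metadata: mostly near-field in training, mostly far-field in the field.
train_metadata = [{"environment": "near_field"}] * 950 + [{"environment": "far_field"}] * 50
field_metadata = [{"environment": "near_field"}] * 400 + [{"environment": "far_field"}] * 600

def environment_share(metadata):
    # Fraction of clips per recording environment.
    counts = Counter(item["environment"] for item in metadata)
    total = sum(counts.values())
    return {env: count / total for env, count in counts.items()}

print("training distribution:", environment_share(train_metadata))
print("field distribution:   ", environment_share(field_metadata))
```

A large gap between the two (here, 95% near-field in training versus 40% in the field) is a strong hint that the dataset needs more far-field recordings or augmentation before the model will hold up in production.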
Taking validation data from the same source as training data
Lastly, another mistake is drawing the test data from the same source as the training data. For example, if you are recording training data in-house, you may create a single corpus and split it into train, dev, and test sets. However, because all three sets then share the same recording conditions, the test score will not accurately predict performance in the real world: the distribution in the field will likely differ from the in-house recording environment.
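At a minimum, splits should be made group-aware so that the same speaker or recording session never appears on both sides; ideally the test set comes from a separate, more realistic collection altogether. A minimal sketch with scikit-learn's GroupShuffleSplit, assuming hypothetical speaker IDs as the grouping key:

```python
from sklearn.model_selection import GroupShuffleSplit

# Placeholder clips, each tagged with the speaker it came from.
clips = [f"clip_{i:03d}.wav" for i in range(100)]
speakers = [f"speaker_{i % 10}" for i in range(100)]

# Hold out 20% of the speakers entirely, rather than 20% of the clips.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(clips, groups=speakers))

train_clips = [clips[i] for i in train_idx]
test_clips = [clips[i] for i in test_idx]
print(f"{len(train_clips)} training clips, {len(test_clips)} test clips")
```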
To avoid these mistakes, carefully consider the distribution of the training data and ensure it matches the real-world usage of the machine learning model. Additionally, maintain a clear split between training, validation, and test data, and do not draw test data from the same source as the training data. By avoiding these mistakes, you can create a high-quality audio dataset that will lead to better performance of your machine learning model.