How Noisy Labels Impact Machine Learning Models

March 29, 2021

Supervised machine learning (ML) requires labeled training data, and large ML systems need large amounts of it. Labeling training data is resource-intensive, and while techniques such as crowdsourcing and web scraping can help, they are error-prone and add ‘label noise’ to training sets. Can ML systems trained with noisy labels operate effectively?

Studies have shown that under certain conditions, ML systems trained with mislabeled data can function well. For example, a 2018 MIT/Cornell University study tested the accuracy of ML image classification systems trained with various levels of label noise. The authors found that these systems could maintain good performance with high levels of label noise under the following conditions:

  • The ML system must have a large enough parameter set to manage the complexity of the image classification task. For example, a four-layer convolutional neural network (CNN) was sufficient for a handwritten character benchmark, but an 18-layer residual network was needed to perform well on a general object recognition benchmark.
  • The training dataset must be very large: large enough to include many properly labeled examples, even if most of the training data is mislabeled. With enough good training samples, the ML system can learn accurate classification by finding the ‘signal’ buried in the label noise.

Another factor important to this study’s results was the nature of the labeling noise. The mislabeled samples were added to the training sets in a way that was random enough to not create strong patterns that would override the ‘signal’ represented by the properly labeled samples.
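To make the idea of ‘random’ label noise concrete, here is a minimal sketch of one common way to simulate it: each label is replaced, with a fixed probability, by a different class chosen uniformly at random. This is an illustration only and not necessarily the protocol used in the study; the function name `add_random_label_noise` and its parameters are assumptions made for this example.

```python
import random

def add_random_label_noise(labels, num_classes, noise_rate, seed=0):
    """Return a copy of `labels` in which each label is, with probability
    `noise_rate`, replaced by a different class chosen uniformly at random."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            # Draw from the other (num_classes - 1) classes so that the
            # errors are spread evenly and carry no consistent pattern.
            wrong = rng.randrange(num_classes - 1)
            noisy.append(wrong if wrong < y else wrong + 1)
        else:
            noisy.append(y)
    return noisy

# Hypothetical 10-class label set, used only to exercise the function.
clean = [i % 10 for i in range(10_000)]
noisy = add_random_label_noise(clean, num_classes=10, noise_rate=0.4)
flipped = sum(c != n for c, n in zip(clean, noisy))
print(f"fraction of labels changed: {flipped / len(clean):.2f}")
```

Because every wrong label is equally likely, no single incorrect class is consistently reinforced, which is the property the paragraph above describes.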


While this study demonstrates that ML systems have a basic ability to handle mislabeling, many practical applications of ML are faced with complications that make label noise more of a problem. These complications include the inability to create very large training sets and systematic labeling errors that confuse machine learning. Let’s look at an example: using remote sensing and ML to assess earthquake damage.

Assessing damage after a major earthquake is critical to recovery, and this can be a huge task. For example, the 2010 earthquake in Haiti required assessment of over 750,000 buildings, and a 2011 earthquake in New Zealand required identification of over 360,000 tons of silt and rubble for removal. Analysis of aerial and satellite imagery with the help of ML systems proved to be essential to these recovery efforts.

ML systems for damage assessment are trained using data specific to each earthquake, because there are many factors that make each locale and recovery effort unique. To handle the scale of training in a timely manner, multiple recovery organizations and even the public are often enlisted. After the 2010 Haiti earthquake, for example, remotely sensed data was analyzed by representatives from 60 universities, 20 government agencies, and over 50 private companies. This approach inevitably creates label noise. In Haiti, the crowd-sourced labeling was shown to be only 61% accurate compared to ground surveys. 


In 2017, researchers analyzed the effect of mislabeled training data on ML systems used to classify rubble from the 2011 New Zealand earthquake. They had noticed that in this type of remote sensing application, label noise does not follow the sort of random patterns that ML systems were able to tolerate in the MIT/Cornell study. The labeling mistakes they observed were mainly due to inaccurate geospatial delineation, caused by lack of training (e.g., misunderstanding what to include as rubble) or inadequate tools (e.g., a coarsely drawn polygon including undamaged sidewalks as part of rubble).

Below is an example of labeling noise typical in this application – the image on the left shows an intact roof improperly identified as rubble; the image on the right is correctly labeled.


The researchers simulated training datasets with the sort of geospatial labeling noise they had observed, and also with random labeling noise. They compared the performance of ML classification on these two datasets and found that geospatial mislabeling degraded classification performance about five times more than random mislabeling.
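The kind of comparison described above can be illustrated with a toy experiment. The sketch below is not the researchers’ method or data; it is a one-dimensional stand-in in which the true class boundary sits at x = 0, ‘structured’ noise mimics an annotator who consistently draws the boundary too generously (at x = -0.5), and ‘random’ noise flips the same number of labels at arbitrary positions. All names and numbers are assumptions chosen for illustration, and the size of the gap it shows should not be read as reproducing the study’s result.

```python
import random

def true_label(x):
    # Ground truth for the toy task: "rubble" (1) when x > 0, else "intact" (0).
    return 1 if x > 0 else 0

def fit_threshold(xs, ys):
    """Fit a one-parameter classifier (predict 1 when x > t) by picking the
    grid threshold t that makes the fewest errors on the training labels."""
    candidates = [i / 100 for i in range(-100, 101)]

    def errors(t):
        return sum((1 if x > t else 0) != y for x, y in zip(xs, ys))

    return min(candidates, key=errors)

def accuracy(t, xs, ys):
    return sum((1 if x > t else 0) == y for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(42)
train_x = [rng.uniform(-1, 1) for _ in range(2000)]
test_x = [rng.uniform(-1, 1) for _ in range(2000)]
test_y = [true_label(x) for x in test_x]  # the test labels stay clean

# Structured noise: every point between -0.5 and 0 is mislabeled as "rubble",
# like a coarsely drawn polygon that swallows the undamaged area next to it.
structured_y = [1 if x > -0.5 else 0 for x in train_x]
num_wrong = sum(s != true_label(x) for s, x in zip(structured_y, train_x))

# Random noise: flip the same number of labels, at uniformly chosen positions.
random_y = [true_label(x) for x in train_x]
for i in rng.sample(range(len(train_x)), num_wrong):
    random_y[i] = 1 - random_y[i]

acc_structured = accuracy(fit_threshold(train_x, structured_y), test_x, test_y)
acc_random = accuracy(fit_threshold(train_x, random_y), test_x, test_y)
print(f"mislabeled training samples in each set: {num_wrong}")
print(f"clean-test accuracy, random noise:     {acc_random:.3f}")
print(f"clean-test accuracy, structured noise: {acc_structured:.3f}")
```

Both training sets contain the same number of wrong labels. The classifier trained on structured noise learns the annotator’s shifted boundary, while the one trained on random noise still finds a boundary close to the true one, because the random flips largely cancel out.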


What can we take away from these studies?

  • Not all training data labeling errors have the same impact on ML system performance.
    • If your labeling errors are mostly random in nature, they will be less harmful to your ML system. The errors will not create a large enough ‘signal’ to send training in the wrong direction.
    • If your labeling errors are structured, for example because of repeated misapplication of labeling rules, they can be very harmful to your ML system. The system will learn to recognize the patterns created by this erroneous data as if it were correctly labeled.
  • To reduce the impact of labeling errors:
    • Make sure your training data presents a strong learning ‘signal’ to your ML system with a high enough volume of accurately labeled samples.
    • Clearly define labeling requirements upfront. This is absolutely critical: training with labels that don’t adequately reflect what you are looking for in your application will sabotage your ML system.
    • Choose a highly skilled annotation partner. The expertise to deliver data that meets your requirements is as critical as the requirements themselves.
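
One practical way to tell which kind of labeling error you have is to audit a sample of annotations against verified labels (for example, from a ground survey) and look at how the mistakes are distributed. The sketch below is a minimal illustration of that check; the function names and the audit data are hypothetical and are not taken from either study.

```python
from collections import Counter

def confusion_counts(verified, annotated):
    """Count (verified_label, annotated_label) pairs on an audited sample."""
    return Counter(zip(verified, annotated))

def error_breakdown(verified, annotated):
    """Return overall label accuracy and, for each kind of mistake,
    its share of all mistakes in the audited sample."""
    counts = confusion_counts(verified, annotated)
    total = sum(counts.values())
    correct = sum(n for (v, a), n in counts.items() if v == a)
    num_mistakes = total - correct
    shares = {
        pair: n / num_mistakes
        for pair, n in counts.items()
        if pair[0] != pair[1]
    }
    return correct / total, shares

# Hypothetical audit of 100 samples: 40 verified "rubble", 60 verified "intact".
verified = ["rubble"] * 40 + ["intact"] * 60
annotated = (["rubble"] * 38 + ["intact"] * 2      # annotations for the rubble samples
             + ["rubble"] * 18 + ["intact"] * 42)  # annotations for the intact samples

acc, shares = error_breakdown(verified, annotated)
print(f"label accuracy on audited sample: {acc:.2f}")
for (v, a), share in sorted(shares.items()):
    print(f"verified {v!r} labeled as {a!r}: {share:.0%} of all mistakes")
```

In this made-up audit, most of the mistakes run in one direction (intact areas labeled as rubble). A lopsided breakdown like that suggests a systematic problem with the labeling rules or tools rather than random slips.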

If you wish to learn more about creating the training data you need to succeed in your machine learning application, please contact us to talk to an expert.