What Data Quality Means to the Success of Your ML Models – 6 Rules You Need to Follow

June 03, 2021

“Quality is not an act, it is a habit,” runs a maxim commonly attributed to the Greek philosopher Aristotle. The idea is as relevant today as it was over two millennia ago. Quality, however, is not an easy thing to attain, especially when it comes to data and technologies like Artificial Intelligence (AI) and machine learning.

Some applications see little harm in using data with high error rates, while others grind to a halt at the slightest flaw in a vast dataset. “Junk in, junk out” is a warning not to be taken lightly. The smallest errors in a dataset can reverberate throughout a model and render its results worthless. Data cleanliness and consistency are key for complex ML models.


The heavy cost of poor data quality

It is far more cost-efficient to prevent data issues than to resolve them. If 30% of a company's 500,000 records are inaccurate, that is 150,000 bad records; correcting them after the fact at roughly $100 per record costs $15 million, versus about $1 per record, or $150,000, to prevent the errors in the first place.
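Those figures follow from simple per-record arithmetic. A quick sketch, where the roughly $100-to-correct and $1-to-prevent unit costs are assumptions inferred from the stated totals:

```python
# Sketch of the prevention-vs-correction comparison from the figures above.
# The per-record unit costs are assumptions inferred from the stated totals.
TOTAL_RECORDS = 500_000
ERROR_RATE = 0.30
COST_TO_CORRECT = 100  # dollars per bad record, assumed
COST_TO_PREVENT = 1    # dollars per bad record, assumed

bad_records = int(TOTAL_RECORDS * ERROR_RATE)
correction_cost = bad_records * COST_TO_CORRECT
prevention_cost = bad_records * COST_TO_PREVENT

print(f"Records to fix: {bad_records:,}")        # 150,000
print(f"Cost to correct: ${correction_cost:,}")  # $15,000,000
print(f"Cost to prevent: ${prevention_cost:,}")  # $150,000
```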

‘Big Data’ was a buzzword several years ago, and many businesses convinced themselves that the more data they had, the more value they could generate from it. They ignored the fact that data must be wrangled, labeled, and curated properly before it produces any valuable ROI.

Quality depends on several factors

Quality requirements depend on which stage of production you are in. When you are just starting out and trying to make your mark with an impressive proof of concept (POC), data collection is imperative and a compromise on quality might be necessary. However, once the product moves beyond the POC stage and safety becomes critical, quality takes precedence over speed.

Quality also depends on the use case. When annotating a car with a bounding box, a 3-pixel threshold may be acceptable. When marking key points on a face, however, no pixel shift is allowed; a 3-pixel error would make the annotation useless.
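A quality check along these lines can be sketched as a per-task pixel budget. The task names are hypothetical; the thresholds follow the examples above:

```python
import math

# Illustrative per-task pixel tolerances: a 3-pixel budget for vehicle
# bounding boxes, zero tolerance for facial key points.
PIXEL_TOLERANCE = {
    "vehicle_bounding_box": 3.0,
    "face_keypoint": 0.0,
}

def within_tolerance(task: str, annotated: tuple, ground_truth: tuple) -> bool:
    """True if the annotated point is within the task's pixel budget."""
    return math.dist(annotated, ground_truth) <= PIXEL_TOLERANCE[task]

print(within_tolerance("vehicle_bounding_box", (102, 200), (100, 200)))  # True: 2 px off
print(within_tolerance("face_keypoint", (101, 200), (100, 200)))         # False: any shift fails
```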

The objects in a scene also shape the quality requirement. In an autonomous vehicle use case, for example, objects close to the ego vehicle or in the ego lane must be annotated with high precision, while objects at a distance or outside the ego lane allow some compromise.
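One way to encode such tiered requirements is as a function of distance and lane membership. The 30 m cutoff and IoU thresholds below are hypothetical, purely for illustration:

```python
# Hypothetical tiered annotation-quality requirement for a driving scene:
# objects near the ego vehicle or in the ego lane need a strict IoU,
# while distant objects outside the lane are allowed a looser one.
def required_iou(distance_m: float, in_ego_lane: bool) -> float:
    if in_ego_lane or distance_m < 30:
        return 0.9  # strict threshold for safety-critical objects (assumed)
    return 0.7      # relaxed threshold for distant objects (assumed)

print(required_iou(10, in_ego_lane=False))  # 0.9: close to the ego vehicle
print(required_iou(80, in_ego_lane=True))   # 0.9: in the ego lane
print(required_iou(80, in_ego_lane=False))  # 0.7: far away and out of lane
```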

80% of the work is data preparation

Andrew Ng, the founder and former head of Google Brain, says, “for a lot of problems, it’d be useful to shift our mindset toward not just improving the code but in a more systematic way improving the data.”

Ng believes machine learning development can be accelerated if the process becomes data-centric rather than model-centric. While traditional software is powered by code, AI systems are built on both code, in the form of models and algorithms, and data. “When a system isn’t performing well, many teams instinctually try to improve the code. But for many practical applications, it’s more effective instead to focus on improving the data,” says Ng. It is commonly estimated that 80 percent of machine learning work is data preparation. “If 80 percent of our work is data preparation,” asks Ng, “then why are we not ensuring data quality is of the utmost importance for a machine learning team?”


Consistency is key in data quality

In data labeling, consistency is key, and it is especially crucial in the development of AI. When datasets are labeled, labeling must be consistent across labelers and across batches.

Errors creep in easily because two people can interpret the same labeling guidelines in two different ways, making the dataset noisy and inconsistent.
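A standard way to quantify that inconsistency is an inter-annotator agreement score such as Cohen's kappa, sketched here from scratch with hypothetical labels:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two labelers, corrected for chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two labelers annotating the same ten images (hypothetical labels)
a = ["car", "car", "truck", "car", "bus", "car", "truck", "bus", "car", "car"]
b = ["car", "truck", "truck", "car", "bus", "car", "car", "bus", "car", "car"]
print(round(cohen_kappa(a, b), 2))  # 0.64: moderate agreement, guidelines need tightening
```

A kappa near 1.0 means the labelers agree almost perfectly; values well below that are a signal to clarify the guidelines before labeling more data.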

Be systematic and use proven tools and processes throughout the ML lifecycle. Structured error analysis is crucial during the model training stage.

Labeling data for quality

In one example, background car noise was degrading the training of a speech recognition model. A data-centric approach first identifies the error source, the car noise, and then trains the model with more data that includes car noise. It also improves the consistency of the labels, in this case with something like ‘speech with a noisy background that contains car noise.’

Although it might seem counterintuitive, even data with an annoying background car noise becomes quality labeled data, which, in turn, becomes quality training data. 

Once ‘background car noise’ is labeled correctly in the machine learning model’s training set, the speech recognition algorithm should be able to understand both what the car noise is as well as differentiate it from the speech it was initially trying to recognize. 
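Finding an error source like this starts with slicing evaluation results by metadata. A minimal sketch, assuming each clip carries a hypothetical background tag and a correctness flag:

```python
from collections import defaultdict

# Hypothetical evaluation results: each clip is tagged with its background
# condition and whether the model transcribed it correctly.
results = [
    {"background": "quiet", "correct": True},
    {"background": "quiet", "correct": True},
    {"background": "quiet", "correct": True},
    {"background": "car_noise", "correct": False},
    {"background": "car_noise", "correct": False},
    {"background": "car_noise", "correct": True},
]

def error_rate_by_slice(results, key="background"):
    """Group results by a metadata field and compute per-slice error rates."""
    totals, errors = defaultdict(int), defaultdict(int)
    for r in results:
        totals[r[key]] += 1
        errors[r[key]] += not r["correct"]
    return {k: errors[k] / totals[k] for k in totals}

rates = error_rate_by_slice(results)
print(rates)  # car_noise clips fail 2 out of 3 times: collect and label more of them
```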

Data labeling tools don’t guarantee quality

The importance of correct data labeling cannot be overstated. Naturally, it raises the question, “Which data tool is right for my application?”

iMerit’s data labeling services are used in advanced machine learning algorithms, computer vision, Natural Language Processing, augmented reality, and data analytics. The company, which is funded by the British International Investment, Omidyar Network and the Michael and Susan Dell Foundation, uses its transformative technologies in cancer research, driverless car training, and crop yield optimization. 

Besides iMerit, there are a wide variety of data labeling tools which specialize in different use cases, such as Lionbridge AI, Amazon Mechanical Turk (MTurk), the Computer Vision Annotation Tool, SuperAnnotate, LightTag, DataTurks, Playment, and Tagtog. Some are free, most are paid. Some work better for video, or images, or LiDAR. iMerit works with every data labeling tool on the market.

MTurk is a crowdsourcing marketplace used for data labeling. One of the cheapest solutions on the market, MTurk has several drawbacks, including lacking key quality control features. It offers very little in the way of quality assurance, worker testing, or detailed reporting. Furthermore, heavy project management duties are placed on the user, requiring them to recruit workers as well as monitor all design request tasks.

The Computer Vision Annotation Tool (CVAT) boasts a wide range of features for labeling computer vision data, supporting tasks like object detection, image segmentation, and image classification. However, its user interface is complex and takes considerable time to learn.

SuperAnnotate works for image, video, LiDAR, text, and audio data classification. Advanced features include automatic predictions, transfer learning, and data and quality management tools. The company claims their tool performs three times faster than their competitors.

LightTag only works on text data, but it has a free starter package. DataTurks, an open-source platform, provides services for labeling text, image, and video data. Playment works on images and is useful for building training datasets for computer vision models. Tagtog, another text labeling tool, can annotate data either automatically or manually.

AI is set to revolutionize a multitude of industries, but data labeling is key to getting there. If chest CT scans are labeled properly, for example, AI models can detect pneumonia caused by COVID-19.

Other examples include human head detection, density mapping, and crowd behavior detection in video surveillance for safety monitoring, disaster management, and traffic monitoring. Natural Language Processing can be used to recognize entities and attributes, and to understand the relationships between them, helping improve drug development.

The construction, railway, and energy industries will benefit from the annotation of LiDAR data captured by drones. Robotic Process Automation (RPA) can speed up the accounting process, while keeping it error free.

6 rules for quality data

Follow these 6 basic data quality rules for deploying ML efficiently:

  • Ensuring high-quality data is available is imperative when dealing with MLOps.
  • Labeling consistency is essential.
  • A methodical improvement in basic model data quality is often better than a state-of-the-art model implementation on low-quality data.
  • A data-centric approach should always be followed.
  • With a data-centric view, there is plenty of room for improvement on problems with smaller datasets (<10,000 examples).
  • When utilizing small datasets, the tools and services that promote data quality are critical.

Good data is defined consistently, covers all edge cases, incorporates timely feedback from production data, and is sized appropriately. Rather than counting on engineers to luck into a model fix, the most important objective of any MLOps team is to ensure a high-quality, consistent flow of data through all stages of an ML project.

“Data quality requires a certain level of sophistication within a company to even understand that it’s a problem,” says Colleen Graham, the founder and CEO of NextGen compliance. 

AI can be a difficult technology to implement and having bad or improperly labeled data running through it almost guarantees failure. To get to a point where data quality is monitored as part of a standard operation is not just a goal but a necessity. 

Data cleanliness and consistency are key, and iMerit can help you achieve the quality results you’re striving for.