Quality Assurance Techniques in Data Annotation

September 10, 2023

One of the most prevalent sayings in Data Science, Artificial Intelligence, and Machine Learning is the well-known phrase, “Garbage in, garbage out.” Although this expression may seem simple at first glance, it effectively encapsulates one of the most critical challenges within these domains. It suggests that low-quality data input into any system will inevitably generate predictions and outcomes of equally poor quality. 

Across all spheres of Artificial Intelligence applications, data holds paramount importance and serves as the foundation for training models and frameworks designed to support humans in many ways. These models, however, heavily rely on high-quality annotated data that faithfully represents the ground truth. No matter how good the model is, if the data provided is low quality, the time and resources poured into making a quality AI system will be wasted.

Since data quality is so critical, the initial stages of building a machine learning system hold immense importance. The excellence of the data is influenced not solely by its origin but also significantly by the methodology used for labeling the data and the caliber of the labeling procedure itself. The quality of data annotations is a pivotal element within the data pipeline of machine learning, upon which subsequent stages rely extensively.

Machine Learning Cycle


Data Quality

Data collected for any task is prone to error, most often due to human factors and biases. Assigning labels to different data types, including text, video, or images, can yield divergent interpretations from different labelers, introducing errors into the process.

Data drift

Data drift occurs when the distribution of annotation labels, or of data features, changes gradually over time. It can increase error rates for machine learning models or rule-based systems. Because data is rarely static, ongoing annotation review is necessary to adapt downstream models and solutions as drift occurs. Drift is a slow, steady process throughout annotation that can skew the data.
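One simple way to monitor drift in the label mix is to compare the label distributions of two annotation windows; the sketch below uses total variation distance, with illustrative window names and an assumed alert threshold:

```python
from collections import Counter

def label_distribution(labels):
    """Normalize a list of labels into a probability distribution."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}

def total_variation(p, q):
    """Total variation distance between two label distributions (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Compare last month's label mix against this month's
last_month = label_distribution(["spam"] * 20 + ["ham"] * 80)
this_month = label_distribution(["spam"] * 40 + ["ham"] * 60)
drift = total_variation(last_month, this_month)
drift_detected = drift > 0.1  # the alert threshold is an assumption, tune per project
```

When the distance exceeds the threshold, the recent annotation batch can be queued for review before it reaches downstream models.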

Anomalies

While data drift refers to slow changes in data, anomalies are step functions: sudden (and typically temporary) changes in data due to exogenous events. For example, in 2019-20, the COVID-19 pandemic introduced anomalies into many naturally occurring data sets. It is crucial to have procedures that detect anomalies with human-in-the-loop workflows rather than purely automated solutions.

Quality Assurance Techniques

Quality assurance techniques help in detecting and reducing data annotation errors. These techniques ensure the final deliverable data is of the highest possible quality, consistency, and integrity. The following are some of those techniques:

Random Sampling

This common statistical technique involves randomly selecting and carefully inspecting a subset of the annotated data to check for possible errors. If the sample is random and representative of the data, it can help predict where issues are likely across the entire dataset.
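As a sketch, random spot-checking can be as simple as drawing a reproducible sample of annotated items for a reviewer; the function name, sampling rate, and record fields below are illustrative:

```python
import random

def sample_for_review(annotations, rate=0.05, seed=42):
    """Draw a reproducible random subset of annotated items for manual QA review."""
    rng = random.Random(seed)  # fixed seed so the review batch can be reproduced
    k = max(1, int(len(annotations) * rate))
    return rng.sample(annotations, k)

# Example: pull 5% of 1,000 labeled items for a spot check
labels = [{"id": i, "label": "cat" if i % 2 else "dog"} for i in range(1000)]
review_batch = sample_for_review(labels, rate=0.05)
```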

Setting a Gold Standard

A selection of well-labeled images that accurately represents the ground truth is called the gold set. These image sets act as mini-tests for human annotators, used either as part of an initial tutorial or scattered across labeling tasks to verify that an annotator's performance is not deteriorating, whether through declining attention or changing instructions. The gold set also provides a general benchmark for annotator effectiveness.
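A minimal sketch of scoring an annotator against a gold set might look like the following; the item ids and helper name are illustrative:

```python
def gold_set_accuracy(annotator_labels, gold_labels):
    """Score an annotator's labels against a gold-standard answer key.

    Both arguments map item id -> label; only items present in the
    gold set are scored.
    """
    scored = [item for item in gold_labels if item in annotator_labels]
    if not scored:
        return 0.0
    correct = sum(annotator_labels[i] == gold_labels[i] for i in scored)
    return correct / len(scored)

gold = {"img_1": "car", "img_2": "truck", "img_3": "bus"}
submitted = {"img_1": "car", "img_2": "truck", "img_3": "van"}
accuracy = gold_set_accuracy(submitted, gold)  # 2 of 3 gold items correct
```

Tracking this score over time, rather than once, is what reveals deteriorating performance.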

Annotator Consensus

This means assigning a ground truth value to each item by collecting inputs from all the annotators and choosing the most likely annotation. The technique relies on the well-documented observation that collective decision-making tends to outperform individual decision-making.
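In its simplest form, consensus is a majority vote over the labels each annotator submitted, as in this sketch:

```python
from collections import Counter

def consensus_label(labels):
    """Return the majority-vote label and the fraction of annotators supporting it."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)

# Five annotators labeled the same image; three said "cat"
label, support = consensus_label(["cat", "cat", "dog", "cat", "dog"])
```

The support fraction is useful on its own: a low value signals a contested item worth a second look.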

Using scientific methods to determine label consistency

Again inspired by statistical approaches, these methods use established formulas to measure how consistently different annotators perform. Human label consistency can be quantified with measures such as Cronbach's Alpha, Pairwise F1, Fleiss' Kappa, and Krippendorff's Alpha. Each provides a holistic, generalizable measure of the quality, consistency, and reliability of the labeled data.

Fleiss' Kappa

κ = (p_o - p_e) / (1 - p_e)

(where p_o is the relative observed agreement among raters and p_e is the hypothetical probability of chance agreement)
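Agreement statistics like these are available in libraries such as statsmodels, but Fleiss' Kappa is compact enough to sketch directly. Here the assumed input format is a matrix where ratings[i][j] counts the raters who assigned item i to category j, with the same number of raters per item:

```python
def fleiss_kappa(ratings):
    """Fleiss' Kappa for a ratings matrix.

    ratings[i][j] = number of raters assigning item i to category j;
    every row must sum to the same number of raters n.
    """
    N = len(ratings)            # number of items
    n = sum(ratings[0])         # raters per item
    k = len(ratings[0])         # number of categories
    # Category proportions over all rater assignments
    p = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    # Per-item agreement, then observed agreement p_o
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    p_o = sum(P_i) / N
    # Chance agreement p_e
    p_e = sum(pj * pj for pj in p)
    return (p_o - p_e) / (1 - p_e)
```

Perfect agreement yields κ = 1, while agreement no better than chance yields κ ≤ 0.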

Annotator Levels

This approach ranks annotators and assigns them to levels based on their labeling accuracy (tested via the gold standard discussed above), giving higher weight to annotations from proven annotators. The technique is helpful for tasks with high variance in annotations or those requiring a certain level of expertise: annotators who lack that expertise receive a lower weight, while those who have it exert more influence on the final label.
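Weighting votes by annotator level can be sketched as below; the weights here stand in for gold-set accuracy scores, and the ids and labels are illustrative:

```python
from collections import defaultdict

def weighted_consensus(votes, weights):
    """Pick the label with the highest total annotator weight.

    votes maps annotator id -> label; weights maps annotator id -> a
    reliability score (e.g. gold-set accuracy), so expert votes count more.
    Unknown annotators default to a weight of 1.0.
    """
    totals = defaultdict(float)
    for annotator, label in votes.items():
        totals[label] += weights.get(annotator, 1.0)
    return max(totals, key=totals.get)

votes = {"a1": "tumor", "a2": "cyst", "a3": "cyst"}
weights = {"a1": 0.95, "a2": 0.40, "a3": 0.40}
# The expert's single vote (0.95) outweighs two novices (0.80 combined)
```

Note that a plain majority vote on the same data would flip the outcome, which is exactly the effect annotator levels are meant to have on expertise-heavy tasks.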

Edge case management and review

Mark edge cases for review by experts. Edge cases can be identified by thresholding the inter-rater metrics listed above or through flagging by individual annotators or reviewers. This makes correcting the most problematic data straightforward.
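Thresholding agreement to surface edge cases can be sketched like this, with an assumed 80% agreement cutoff and illustrative item ids:

```python
from collections import Counter

def flag_edge_cases(items, min_agreement=0.8):
    """Return ids of items whose annotator agreement falls below a threshold.

    items maps item id -> list of labels from different annotators.
    """
    flagged = []
    for item_id, labels in items.items():
        top_votes = Counter(labels).most_common(1)[0][1]
        if top_votes / len(labels) < min_agreement:
            flagged.append(item_id)
    return flagged

batch = {
    "frame_01": ["car", "car", "car", "car", "car"],  # unanimous: passes
    "frame_02": ["car", "van", "car", "van", "car"],  # contested: expert review
}
```

The flagged ids then feed the expert-review queue rather than the final deliverable.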

Automated (Deep learning-based) Quality Assurance

Researchers and organizations are often looking for ways to include human input in data annotation to improve the quality of the labels. There are certain approaches, however, that exploit the principles of deep learning to make this process easier, primarily by identifying data that may be prone to errors, thus picking out data that should be reviewed by humans, ultimately ensuring higher quality.

Without delving too deep, this approach relies on actively training a deep learning model and then using it to predict the labels/annotations on upcoming unlabeled data.

If an adequate model is selected and trained on data with high-quality labels, it will have little to no difficulty classifying or labeling common cases. Where labeling is challenging, as with an edge case, the model will show high uncertainty (or low confidence) in its prediction.
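Given per-item probability distributions from such a model, routing low-confidence items to humans can be sketched as follows (the threshold and item ids are illustrative):

```python
def route_by_confidence(predictions, threshold=0.7):
    """Split model predictions into auto-accepted items and items for human review.

    predictions maps item id -> a probability distribution over labels;
    an item goes to review when the model's top probability is below the threshold.
    """
    review, accepted = [], []
    for item_id, probs in predictions.items():
        (review if max(probs.values()) < threshold else accepted).append(item_id)
    return review, accepted

preds = {
    "img_1": {"car": 0.97, "truck": 0.03},  # confident: skip review
    "img_2": {"car": 0.55, "truck": 0.45},  # uncertain: likely edge case
}
review, accepted = route_by_confidence(preds)
```

Human reviewers then spend their time only on the uncertain slice, which is how this approach raises quality without reviewing every item.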


Whether you are in the tech industry or working on cutting-edge research, having high-quality data is of utmost importance. Regardless of whether your task is statistical or related to AI, having an early focus on the quality of data will pay off in the long run.

At iMerit, we use a combination of the techniques mentioned above to ensure we ship only the highest-quality labels to our customers. Whether employing rigorous statistical methods to keep quality high or cutting-edge deep learning models to keep speed high and assist human annotators in review, we hold quality to the highest standards, subjecting every deliverable to numerous checks before final delivery.

Are you looking for data annotation to advance your project? Contact us today.