Data Labeling for Machine Learning

July 26, 2022

For supervised learning applications such as image recognition, sentiment analysis and time series forecasting, the quality of labeled data dictates the performance of the machine learning application.

The challenge we currently have with data labeling is twofold. First, data annotation is a largely manual process, with emerging technologies only beginning to automate parts of it. Second, most applications require specialist input, as in healthcare diagnosis or natural language processing for the legal industry. Even more commonplace areas, such as labeling street data for autonomous vehicles, require meticulous and attentive work from labelers.

All labeling efforts must be supported with adequate data labeling tools. These tools are specialized according to the type of data your project uses.

In this article, we will describe the most important aspects of optimizing your data labeling process and scaling it quickly:

  1. Data Labeling Project Management
  2. Types of Data Labeling Teams
  3. Estimating Dataset Sizes
  4. Finding and Expanding Datasets
  5. Data Annotation Tools
  6. Advanced Labeling Techniques
  7. Continuous Improvement with Human-in-the-Loop

Data Labeling Project Management

Project management for artificial intelligence ensures that the data labeling process is consistent and produces high-quality results. Once you get past the proof of concept stage, project management will help you scale your application from 1x to 100x. Here, we need to consider five main aspects:

  1. Data Management – determine details such as total data volume, data storage and sharing, whether delivery is one-off or recurring, and data formats: for example, .jpeg and .geotiff for input, and industry-standard annotation formats such as COCO, Pascal VOC, or YOLO for output.
  2. Guidelines – determine the success metric of the algorithm. For example, this could be ‘identify drivable areas based on satellite imagery with 95% accuracy’. Providing existing labeled examples gives a large team of annotators a baseline and helps ensure consistency throughout the dataset.
  3. Tooling – labeling tools can be built in-house, adopted from open-source software, or licensed as third-party applications. One of the most important aspects of tooling is having clear documentation and guidelines so that your labeling team can learn how to use the tool.
  4. Dedicated Project Manager – Appointing an in-house project manager for the data labeling process can drastically streamline the project and ensure data scientists and developers stay focused on their own tasks. The project manager’s responsibilities include planning requirements, timelines, and costs, as well as partner evaluation, stakeholder liaison, data quality review, and progress tracking.
  5. Time & Effort Estimates – The time spent per labeled item depends on complexity. For example, a satellite image can take anywhere between one and four hours to fully annotate. To estimate the cost of in-house labeling, use your local minimum wage as a ballpark figure. For example, labeling 1,000 images at 2 hours per image and a local minimum wage of $9.35/h gives an estimate of 1,000 × 2 × $9.35 = $18,700. To reduce these costs, labeling can be crowdsourced rather than completed in-house. The lower cost, however, comes with some compromises, such as lower accuracy and inconsistent work.
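
The cost arithmetic in item 5 can be written out as a small helper. This is a minimal sketch: the function name and parameters are our own illustration, and it assumes a flat hourly wage with no allowance for review, rework, or overhead.

```python
def estimate_labeling_cost(num_items, hours_per_item, hourly_wage):
    """Ballpark labor cost: number of items x hours per item x hourly wage."""
    if num_items < 0 or hours_per_item < 0 or hourly_wage < 0:
        raise ValueError("inputs must be non-negative")
    # Round to cents to hide floating-point noise in the product.
    return round(num_items * hours_per_item * hourly_wage, 2)


# The example from the text: 1,000 images, 2 hours each, $9.35/h.
cost = estimate_labeling_cost(num_items=1000, hours_per_item=2, hourly_wage=9.35)
print(f"${cost:,.2f}")
```

Running the same helper with one and four hours per image gives a quick lower and upper bound for the satellite imagery case.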

Types of Data Labeling Teams

The data labeling process relies on human operators annotating raw data with representative labels. Typically, artificial intelligence projects use one of the following data labeling team models:

  1. Crowdsourcing – An outsourcing method in which small labeling tasks are delegated to large numbers of people. It’s arguably the cheapest and fastest way of gathering labeled data, with projects typically mediated through third-party platforms such as Amazon Mechanical Turk. It is a good option for simple tasks, such as generic product categorization or age categorization. However, due to the lack of control over labelers’ skills and knowledge, crowdsourcing consistently falls short with more complex tasks.
  2. Non-specialized Outsourced – These vendors, often referred to as business process outsourcing (BPO) providers, have a white-collar workforce which can deliver data labeling services with more accuracy than crowdsourcing. This is typically due to basic training for the workers, the opportunity for workers to specialize in one area, and a formal working environment.
  3. Specialized Outsourced – These are data solution vendors who focus exclusively on providing high-quality data labeling services. Their workforce is highly trained in specific areas and works within a robust framework. In addition to a consistent flow of high-quality labeled data, these vendors can help you estimate how much data you need, establish delivery volumes and time frames, and work with your development team to define good practices and ensure quality.
  4. In-house Labeling – This model offers the greatest level of control over the labeling process. Building an in-house team of labelers enables you to hand-pick labelers based on expertise, scale the workforce accordingly, maintain high-quality communications and consistent training. However, the high entry cost, especially in advanced economies, reserves in-house labeling for the Teslas of the world. Moreover, scaling with in-house labeling can be especially challenging, as labeling tens or hundreds of thousands of data points requires a proportional workforce.

With good project management, AI companies can create a hybrid approach which combines two or more of the models described above. Combining the scalability and cost-effectiveness of crowdsourcing with the accuracy of in-house labeling can help you scale your labeling operations much faster than using a single model alone.
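
One way to picture such a hybrid is to let the crowd label everything and send only the items they disagree on to the in-house team. The sketch below is an assumption on our part rather than a prescribed workflow: the function name, the `min_agreement` threshold of 0.7, and the item ids are all hypothetical.

```python
from collections import Counter


def aggregate_crowd_labels(votes, min_agreement=0.7):
    """Majority-vote crowd labels and escalate items with weak agreement.

    votes maps an item id to the list of labels given by crowd workers.
    Returns (accepted, escalated): accepted maps item id -> winning label,
    escalated lists the item ids to route to the in-house team.
    """
    accepted, escalated = {}, []
    for item_id, labels in votes.items():
        if not labels:
            escalated.append(item_id)
            continue
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_agreement:
            accepted[item_id] = label
        else:
            escalated.append(item_id)
    return accepted, escalated


votes = {
    "img_001": ["cat", "cat", "cat"],   # full agreement: accepted
    "img_002": ["cat", "dog", "bird"],  # no majority: escalated
    "img_003": ["dog", "dog", "cat"],   # 2 of 3 is below 0.7: escalated
}
accepted, escalated = aggregate_crowd_labels(votes)
print(accepted)
print(escalated)
```

Raising or lowering the threshold trades crowdsourcing cost against the volume of work sent to the more expensive in-house labelers.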

Estimating Dataset Sizes

While there are no hard and fast rules for a minimum or recommended amount of data, you can arrive at a workable estimate by applying the following rules of thumb:

  1. Estimate using the rule of 10 – the rule states that the amount of training data you need is 10 times the number of parameters – or degrees of freedom – in the model. This recommendation came about as a way of addressing the totality of outputs available when combining the defined parameters.
  2. Supervised deep learning guideline – According to the Deep Learning book by Goodfellow, Bengio, and Courville, a supervised deep learning algorithm will generally achieve acceptable performance with around 5,000 labeled examples per category, and will match or exceed human performance when trained on a dataset of at least 10 million labeled examples.
  3. Computer vision models guideline – When using deep learning for image classification, a good baseline to start from is 1,000 images per class. The ImageNet classification challenge dataset, with 1,000 categories and roughly 1,000 images per class, was large enough to train early generations of image classifiers such as AlexNet.
  4. 4x your validation data – for deep learning model training, use about 80% of the data for learning and 20% for validation. If you have carried out a successful validation or proof of concept of your algorithm, we suggest using about four times that amount of data to develop your final product.
  5. Determine the upper dataset threshold – if you plot a learning curve of success rate against sample size, the graph will look similar to a log function. If the last two points plotted for your current sample size still show a positive slope, then increasing the dataset is likely to give a better success rate. As the slope approaches zero, increasing the dataset is unlikely to improve the success rate.
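
Several of these rules of thumb reduce to simple arithmetic. The following is a minimal sketch; the function names and the `min_slope` cut-off are hypothetical choices of ours, and what counts as a "flat" learning curve will depend on your project.

```python
def rule_of_ten(num_parameters):
    """Rule of 10: about ten training examples per model parameter."""
    return 10 * num_parameters


def split_80_20(num_examples):
    """Return (training, validation) counts for an 80/20 split."""
    num_validation = num_examples // 5
    return num_examples - num_validation, num_validation


def last_slope(sample_sizes, success_rates):
    """Slope between the last two points of a learning curve."""
    x0, x1 = sample_sizes[-2:]
    y0, y1 = success_rates[-2:]
    return (y1 - y0) / (x1 - x0)


def more_data_likely_helps(sample_sizes, success_rates, min_slope=1e-6):
    """True while the curve still rises by more than min_slope per example."""
    return last_slope(sample_sizes, success_rates) > min_slope


print(rule_of_ten(1_000))
print(split_80_20(1_000))

sizes = [1_000, 2_000, 4_000, 8_000]
still_rising = [0.70, 0.80, 0.86, 0.88]
flattened = [0.70, 0.80, 0.86, 0.86]
print(more_data_likely_helps(sizes, still_rising))
print(more_data_likely_helps(sizes, flattened))
```

In practice each point on the curve comes from training and evaluating the model on a subset of that size, so a handful of points is usually all you can afford.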

Finding and Expanding Datasets

To help you assemble sufficiently comprehensive datasets of unlabeled data, we provide a list of open datasets, as well as recommendations for data synthesis and augmentation.

  1. Open datasets – great sources of data from reputable institutions include: 
    • Registry of Open Data on AWS (RODA) contains public datasets available from AWS resources, such as Allen Institute for Artificial Intelligence (AI2), Digital Earth Africa, Facebook Data for Good, NASA Space Act Agreement, NIH STRIDES, NOAA Big Data Program, Space Telescope Science Institute, and Amazon Sustainability Data Initiative. 
    • DBpedia is a crowd-sourced community effort to extract structured content from the information created in various Wikimedia projects.
    • European Union Open Data Portal contains EU-related data for domains such as economy, employment, science, environment, and education. 
    • The US government provides open data on topics such as agriculture, climate, energy, local government, maritime, ocean and elderly health.
  2. Data augmentation – a technique that applies transformations to an existing dataset to produce new training examples. The clearest example of augmenting data is in computer vision applications, where images can be transformed with a variety of operations, including rotating, cropping, flipping, color shifting, and more.
  3. Data synthesis – produces generated data which has the same characteristics and schema as ‘real’ data. It is particularly useful in the context of transfer learning, where a model can be trained on synthetic data and then re-trained on real-world data. A good example of synthetic data is its application to computer vision, specifically self-driving algorithms. A self-driving AI system can be taught to recognize objects and navigate a simulated environment using a video game engine. Synthetic data has some particular advantages: it is cheap to produce once the synthetic environment is defined, the labels on the generated data are perfectly accurate, and it contains no sensitive information such as personal data.
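
To make the augmentation idea concrete, here is a minimal sketch of three of the image transformations mentioned above. It assumes an image is a plain nested list of pixel rows so that it runs without any imaging library; the function names are our own, and a real project would more likely use a dedicated augmentation library.

```python
def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Reverse the order of the rows (top becomes bottom)."""
    return [row[:] for row in image[::-1]]


def rotate_90_clockwise(image):
    """Rotate 90 degrees clockwise: reverse the rows, then transpose."""
    return [list(column) for column in zip(*image[::-1])]


def augment(image):
    """Return the original image plus three transformed copies."""
    return [
        image,
        flip_horizontal(image),
        flip_vertical(image),
        rotate_90_clockwise(image),
    ]


image = [
    [1, 2, 3],
    [4, 5, 6],
]
for variant in augment(image):
    print(variant)
```

For classification, the label of each copy stays the same as the original. For annotations tied to pixel positions, such as bounding boxes, the coordinates have to be transformed along with the image.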

Data Annotation Tools

Data labeling tools provide labelers with specialized features that streamline the data annotation process. Tools are often use case-specific, with some dedicated to annotating images, videos, text or audio. However, all tools can be assessed against a common set of evaluation criteria, which include file formats, security, pricing, deployment models, API integrations, quality control, documentation and learning curves.

The tool’s file formats and extensions refer to the supported input and output formats of datasets. If your datasets come from multiple sources and have different formats, the annotation tool must be able to support all the file extensions. Similarly, the tool should be able to export the training data into the formats you need. The tool’s pricing should reflect its features and use cases. Where your project requires a single annotation type (e.g. computer vision annotation), the price should be lower than that of a comprehensive, all-purpose tool.
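
Export formats matter because the same bounding box is written differently in each one. As a rough illustration, COCO stores a box as top-left corner plus width and height in pixels, Pascal VOC stores the two opposite corners in pixels, and YOLO stores the box center and size as fractions of the image dimensions. The helper names below are ours, and the sketch ignores class ids, file layout, and off-by-one pixel conventions.

```python
def coco_to_voc(box):
    """(x_min, y_min, width, height) -> (x_min, y_min, x_max, y_max)."""
    x_min, y_min, width, height = box
    return (x_min, y_min, x_min + width, y_min + height)


def voc_to_yolo(box, image_width, image_height):
    """(x_min, y_min, x_max, y_max) in pixels -> normalized
    (x_center, y_center, width, height)."""
    x_min, y_min, x_max, y_max = box
    return (
        (x_min + x_max) / 2 / image_width,
        (y_min + y_max) / 2 / image_height,
        (x_max - x_min) / image_width,
        (y_max - y_min) / image_height,
    )


def coco_to_yolo(box, image_width, image_height):
    """Convert a COCO-style box to a YOLO-style box via the VOC corners."""
    return voc_to_yolo(coco_to_voc(box), image_width, image_height)


coco_box = (100, 50, 200, 100)  # pixels, in a 400 x 200 image
print(coco_to_voc(coco_box))
print(coco_to_yolo(coco_box, image_width=400, image_height=200))
```

A tool that exports directly to the format your training pipeline expects saves you from maintaining conversions like these yourself.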

Security considerations include certifications such as SOC 2, data storage policies, transmission security via VPNs, and user access controls. Especially when working on personal non-anonymized data, security is a must-have. Also consider how the tool will be deployed and where it will run from. Today, the most common deployment models include on-premises appliances (where you buy a dedicated piece of hardware that you need to install or run), virtual appliances (the tool can run as a piece of software either on a local machine, or in a virtual machine hosted in a public or private cloud), and software-as-a-service (where the tools are accessible through a web portal, maintained by the provider).

Every tool should also provide project management and quality control features that cover how datasets are managed, assigned to labelers, tracked for progress, and reviewed for quality assurance.

Specialized Data Annotation Tools

At a high level, there are six types of data which can be used for machine learning: image, video, text, audio, time-series and sensor data. Each has specific annotation requirements depending on its nature.

  • Image annotation tools – To help annotators label images, image annotation tools can provide features such as bounding boxes, cuboids, polygons, semantic segmentation and landmark annotation. For bounding boxes, an annotator places a 2D rectangle around the object that requires labeling and classifies it. Cuboids are similar to bounding boxes but expand into 3D space by adding a depth dimension. Polygons – aka ‘freehand drawing’ – allow annotators to add multiple points around the labeled object to accurately identify its borders without background noise. Semantic segmentation assigns each pixel in an image to an entity, such that every pixel in the image belongs to a specific group. Lastly, landmark annotation is used in applications such as facial recognition to determine the outline of the face and other key identifiers such as the eyes, mouth and nose.
  • Video annotation tools – To support video annotation, we recommend tools which support scene classification and object tracking. Scene classification is the practice of labeling a video by determining what is happening in the clip rather than the objects that are present. Object tracking labels the position of an object in each frame of a video. Object tracking can be done with bounding boxes, cuboids or polygons.
  • Text annotation tools – A tool which supports the annotation of text datasets must provide features for sentiment labeling and parts of speech labeling for NLP. Sentiment analysis helps artificial intelligence algorithms determine whether a piece of text uses positive, negative or neutral phrasing. To do so, the labeled training datasets must identify keywords with respect to the context they are in. Parts of speech (POS) labeling assigns each word within a text its syntactic function, for example labeling actions as verbs and objects as nouns. Parts of speech labeling plays a critical role in disambiguation, so that an ML algorithm can correctly interpret homonyms – words with more than one meaning – in a wider context.
  • Audio annotation tools – These tools should support audio transcription and word tagging. In audio transcription, a labeler listens to an audio file and transcribes the speech into text. Word and phrase tagging, by contrast, requires the labeler to assign a tag to a timestamped section of audio, so that a machine learning algorithm can be trained to recognize the tagged speech.
  • Time-series annotation tools – For an annotation tool to support time series analysis, it should support multichannel labeling. Multichannel labeling refers to having multiple time-series datasets within the tool’s workspace. With multichannel labeling, you can synchronize multiple channels to identify relationships that happen across individual datasets. Considering that time-series data may contain hundreds of thousands of data points, it is important for the data annotation tool to display, zoom and pan across the points without delays or loading times.
  • Sensor annotation tools – The most common and reliable sensor data used nowadays is provided by Light Detection and Ranging (LiDAR) sensors. Sensor annotation tools must support LiDAR labeling, with features such as 3D point cloud annotation and polyline annotation. 3D point cloud annotation segments objects into clusters to help AI models recognize the objects captured in the 3D point clouds. Polyline annotation draws open-ended lines rather than closed shapes, which makes it suitable for labeling long and narrow objects, such as street lines.
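
Labeling an object in every frame of a video is laborious, so one common shortcut for object tracking is to label only selected keyframes and fill in the frames between them. The sketch below shows simple linear interpolation of bounding boxes. It is our own illustration rather than a feature of any particular tool, and it assumes boxes are `(x_min, y_min, x_max, y_max)` tuples and that the object moves smoothly between keyframes.

```python
def interpolate_boxes(start_frame, start_box, end_frame, end_box):
    """Linearly interpolate bounding boxes between two labeled keyframes.

    Returns a dict mapping each frame index from start_frame to end_frame
    (inclusive) to an (x_min, y_min, x_max, y_max) box.
    """
    if end_frame <= start_frame:
        raise ValueError("end_frame must come after start_frame")
    span = end_frame - start_frame
    boxes = {}
    for frame in range(start_frame, end_frame + 1):
        t = (frame - start_frame) / span  # 0.0 at the start, 1.0 at the end
        boxes[frame] = tuple(
            a + t * (b - a) for a, b in zip(start_box, end_box)
        )
    return boxes


# An object moving 20 pixels to the right over four frames.
track = interpolate_boxes(0, (10, 10, 50, 50), 4, (30, 10, 70, 50))
for frame, box in track.items():
    print(frame, box)
```

The interpolated boxes are only a starting point: an annotator still needs to review them and add keyframes wherever the object changes speed or direction.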

Advanced Labeling Techniques

Emerging data labeling tools use advanced techniques to automate parts of the labeling process, such as transfer learning, 3D point cloud annotation, and automatic multi-point selection.

  • Transfer Learning – a machine learning method which mimics the human ability to transfer skills. It uses trained weights from a source model as the initial weights for training on a target dataset. Applied to labeling, it works as follows. First, take an unlabeled image. Then feed the image into multiple specialized neural networks, each able to identify a specific object, such as a car, lamp post, or pedestrian. These source networks create ‘pseudolabels’ on the unlabeled data, and the resulting annotated data can be fed into the target neural network. However, pseudolabels can contain errors, which the target network would then learn. To prevent this and to ensure human-level accuracy, the pseudolabels need to be verified by a trained human labeler who validates the selections and annotations. The human validation process can be further improved by ensuring the labelers are experts in their industry and are able to identify slight errors or misrepresentations.
  • 3D Point Cloud Data Annotation – Light Detection and Ranging (LiDAR) sensors are used to identify 3D entities using point clouds. LiDAR uses light pulses to measure the distance between the sensor and other objects, forming a 3D point cloud from which the outline of an object can be determined. The points must then be segmented according to the object they belong to. As with other labeling techniques, this is very time consuming, especially considering the 3D nature of the data. To speed this up, neural networks can be trained to segment data points into different objects, often by detecting clusters of points. Following segmentation, a human can annotate each object. This removes a time-consuming portion of the work for the human labeler, but still requires the human to label each object correctly.
  • Automatic Multi-Point Selection via Bounding Boxes – Multi-point data annotation requires human labelers to draw the exact boundaries of an object within an image. Compared to bounding boxes – which only require a rectangle to encompass the annotated object – multi-point annotation is more accurate and more time consuming. We can achieve the speed of bounding boxes and the accuracy of multi-point selection by using a trained neural network to detect the edges of an object in an image. A human annotator provides a bounding box, and an application that uses the trained neural network creates a multi-point annotation for the object in the bounding box. The data annotator can then correct any mistakes in the annotation and add the label. This significantly increases the speed at which data annotators can perform data labeling. Multi-point selection contains considerably less noise and is preferred when building models that need accuracy close to ground truth.
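
The claim that multi-point selection contains less noise than a bounding box can be quantified: the noise is the share of the box that is background rather than object. The sketch below estimates it by comparing the area of the polygon with the area of its enclosing box. The function names are hypothetical, and the polygon is assumed to be simple (non self-intersecting) with vertices given in order.

```python
def polygon_area(points):
    """Area of a simple polygon using the shoelace formula."""
    total = 0.0
    # Pair each vertex with the next one, wrapping around at the end.
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        total += x0 * y1 - x1 * y0
    return abs(total) / 2


def bounding_box(points):
    """Smallest axis-aligned box (x_min, y_min, x_max, y_max) around the points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def background_fraction(points):
    """Share of the bounding box area that lies outside the polygon."""
    x_min, y_min, x_max, y_max = bounding_box(points)
    box_area = (x_max - x_min) * (y_max - y_min)
    if box_area == 0:
        raise ValueError("polygon has no area")
    return 1 - polygon_area(points) / box_area


# A right triangle fills only half of its bounding box.
triangle = [(0, 0), (4, 0), (0, 4)]
print(bounding_box(triangle))
print(background_fraction(triangle))
```

For elongated or diagonal objects the background share of a bounding box can be large, which is where the extra effort of multi-point annotation pays off most.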

Continuous Improvement with Human-in-the-Loop

Upon deploying an ML model in production, we can continue refining the model and its handling of edge cases by implementing a human-in-the-loop workflow for continuous improvement. In this workflow, human operators intervene where the machine learning model fails. Especially for edge cases, where the ML model has not seen that instance before and does not know how to react, a human operator can perform the task, log the edge case, and forward the details to the AI project team so that the model can be trained on it. Human operators need to be available in real time to pick up and log these cases whenever required.
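
A minimal sketch of this workflow is shown below. It assumes that the model reports a confidence score and that low confidence is a usable signal for "does not know how to react", which will not hold for every model. The class name, the 0.8 threshold, and the `ask_human` callback are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class HumanInTheLoop:
    """Route low-confidence predictions to a human and log them as edge cases."""

    confidence_threshold: float = 0.8
    edge_cases: list = field(default_factory=list)

    def handle(self, item, prediction, confidence, ask_human):
        """Return the model's prediction, or the human's answer when unsure."""
        if confidence >= self.confidence_threshold:
            return prediction
        human_label = ask_human(item)
        # Keep a record so the project team can retrain on this case later.
        self.edge_cases.append(
            {
                "item": item,
                "model_prediction": prediction,
                "confidence": confidence,
                "human_label": human_label,
            }
        )
        return human_label


def operator(item):
    """Stand-in for a real-time human operator."""
    return "cyclist"


loop = HumanInTheLoop()
print(loop.handle("frame_17", "pedestrian", 0.95, ask_human=operator))
print(loop.handle("frame_18", "pedestrian", 0.40, ask_human=operator))
print(len(loop.edge_cases))
```

The logged edge cases become new labeled training data, which closes the loop between production and the labeling process.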


Data labeling is a complex process and a core pillar of machine learning. To ensure a constant flow of high quality training data, we recommend the following:

  • Create a robust project management framework for data labeling which includes adequate data labeling teams and data labeling tools.
  • Determine the amount of data that your project requires and leverage existing open datasets to access unlabeled data.
  • Understand the differences between labeling for different types of data, such as images, videos, text, and audio.

iMerit’s extensive data labeling services can help you address all of these points. With multiple labeling teams that have industry-specific knowledge and tried-and-tested project management, iMerit can provide a steady stream of high quality training data for your machine learning project.

If you wish to learn more about iMerit’s data labeling services, please contact us to talk to an expert.