Building and Labeling High-Functioning Datasets for Your Model

September 24, 2021

There’s something we need to get off our chest, and it isn’t easy for us to say: optimizing datasets can be a nightmare. There, we said it. 

Ensuring datasets fit our needs involves much due diligence. Data isn’t always clean, clear, or readily actionable. Sometimes there’s just too much of it to even know where to begin. Other times it’s unclear what tools are best for annotating it, or just how to ensure the teams working on the data are up to the task.

To help innovators maximize the potential of their datasets for use in AI/ML models, we focus here on five key considerations: data quality, data volume, annotation guidelines, tools for the job, and talent.

Data Quality – Clean data vs Dirty data

Clean Data                          | Dirty Data
Error-free                          | Irrelevant
Correctly formatted and processed   | Processed erroneously
Relevant to the use case            | Visually unclear

In a perfect world, newly generated data would come neatly annotated and perfectly packaged with a shiny red bow on top. But that isn’t the nature of data; it’s a chaotic beast that requires some elbow grease before it can be deployed. We classify data into two categories: clean and dirty.

In any given dataset, the clean, usable data will only be a subset of the overall data. Defining a filtering mechanism can help you separate the clean data from the dirty data in your available dataset. Dirty data is not inherently useless, as it can be used to supplement training for different but similar use cases.

Dirty data gets included in unlabeled datasets because, at first glance, it may appear to match some minimum criteria (e.g. the image is taken from eye level and contains cars and pedestrians). But if an image within the dataset isn’t visually clear (for instance, it contains blurred objects with hard-to-define edges), it can negatively impact the algorithm’s accuracy.
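
As a rough illustration of the filtering mechanism mentioned above, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `ImageRecord` fields, the precomputed `sharpness` score, the required classes and the threshold are hypothetical and would need to be replaced with whatever your own pipeline records.

```python
from dataclasses import dataclass, field

# Hypothetical record; the field names are illustrative, not from any real tool.
@dataclass
class ImageRecord:
    path: str
    sharpness: float                            # assumed precomputed score, higher = sharper
    classes: set = field(default_factory=set)   # object classes believed to be present
    decoded_ok: bool = True                     # False if the file failed to load or parse

REQUIRED_CLASSES = {"car", "pedestrian"}  # assumed minimum criteria for this use case
MIN_SHARPNESS = 0.5                       # assumed threshold; tune it per dataset

def dirty_reasons(rec):
    """Return the reasons a record counts as dirty (an empty list means clean)."""
    reasons = []
    if not rec.decoded_ok:
        reasons.append("processed erroneously")
    if not REQUIRED_CLASSES <= rec.classes:
        reasons.append("irrelevant")
    if rec.sharpness < MIN_SHARPNESS:
        reasons.append("visually unclear")
    return reasons

def split_clean_dirty(records):
    """Separate records into clean ones and (record, reasons) pairs for dirty ones."""
    clean, dirty = [], []
    for rec in records:
        reasons = dirty_reasons(rec)
        if reasons:
            dirty.append((rec, reasons))
        else:
            clean.append(rec)
    return clean, dirty
```

Keeping the reasons alongside each dirty record, rather than discarding it, makes it easier to reuse that data later for a similar use case.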

Using data relevant to the business function allows the algorithm to be trained to fit that function. For example, when training an algorithm to detect pedestrians crossing roads, the data should focus on roads, pedestrians, and cars, while excluding stationary entities such as buildings, foliage and the like.

Data Volume & Diversity

Data volume can vary widely depending on the nature of your project, industry, use case and data source. In the initial stages of your project, you’ll need to determine how much data you require and how it can be supplied. The method by which data is provided may also alter the behavior, performance and application of the dataset. 

For example, the data may be delivered in a single instance or be streamed continuously across the length of the project. If it is streamed continuously, the delivery could be sporadic, which is common where the industry has peak times, such as retail.  

Data should be as diverse as possible, with different objects and different quantities of each object in a single data sample. A dataset that is diverse in the content it presents helps an algorithm generalize and learn the true patterns in the data. Varying the conditions in which data is captured, such as the weather, time of day and amount of light, helps attune the algorithm to a wide range of scenarios rather than one particular instance. For autonomous vehicles, you need an algorithm that is also trained for snowy, dark nights rather than just dry, sunny afternoons.

To train a model in a cost-efficient way for multiple likely scenarios, you need to strike the right balance between volume and diversity. A high-volume dataset with a low level of diversity will not help to improve an algorithm for real-world usage. Similarly, a small dataset with high variety will not be enough to adequately train a model. Trying to collect a high volume of very diverse data will very likely be too expensive.
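
One way to keep an eye on this balance is to count samples per combination of capture conditions and check how evenly they are spread. The sketch below is only one possible approach and rests on assumptions: the `weather` and `time_of_day` metadata keys are hypothetical, and normalized Shannon entropy is our choice of evenness measure for the example, not a standard prescribed anywhere in this article.

```python
import math
from collections import Counter

def condition_summary(samples, keys=("weather", "time_of_day")):
    """Count samples per combination of capture conditions (keys are assumed metadata)."""
    return Counter(tuple(s[k] for k in keys) for s in samples)

def normalized_entropy(counts):
    """Shannon entropy of the counts scaled to [0, 1]; 1.0 means a perfectly even spread.

    Only combinations that actually occur are considered, so pair this with
    underrepresented() to catch conditions that are missing altogether.
    """
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in counts.values())
    return h / math.log(len(counts))

def underrepresented(counts, expected, min_count):
    """List the expected condition combinations that have fewer than min_count samples."""
    return [combo for combo in expected if counts.get(combo, 0) < min_count]
```

A large dataset with a low evenness score, or with many underrepresented combinations, suggests that the next round of collection should target the missing conditions rather than add more volume.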

Instructions – the Guardrails of Data Annotation

In the data labeling context, instructions refer to all the documentation annotators need to fully understand how to deliver consistent results. Especially when working with larger or outsourced teams, clear instructions will considerably reduce the number of human errors and delays.

Consider the following points when devising instructions:

  • Project guidelines: These need to dictate exactly what should be annotated, how each object should be annotated, and how staff should ensure they do not miss key objects when annotating.
  • Tool configuration and usage: How can the annotation tool be used? Are all features clearly documented and explained? How can the tool be configured for this project’s specific use cases?
  • Images and videos: Rather than expressing all instructions in writing, include visual examples of how data should be annotated. These are much easier to follow regardless of the annotator’s background or native language.
  • Ease of updating instructions: Requirements often change mid-flight, so creating instructions which can be easily updated will help you adjust along the way. Maintain a versioned set of guidelines as well as a versioned dataset, keeping track of changes as you go. Be explicit about how anomalous and unexpected data should be handled.
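
To make the versioning point concrete, here is a small Python sketch of one way to tie annotations to the guideline version that was in force when they were made. The `Annotation` fields and the changelog entries are hypothetical and exist only for this example.

```python
from dataclasses import dataclass

# Hypothetical changelog: guideline version -> what changed in that version.
GUIDELINE_CHANGELOG = {
    1: "Initial guidelines",
    2: "Partially occluded pedestrians must now be labeled",
}

@dataclass(frozen=True)
class Annotation:
    item_id: str
    label: str
    guideline_version: int  # version of the instructions in force when this was labeled

def needs_review(annotations, changed_in_version):
    """Return annotations made before a given guideline change took effect."""
    return [a for a in annotations if a.guideline_version < changed_in_version]
```

Recording the version on every annotation means that a mid-flight change only triggers a review of the affected items rather than a full relabeling pass.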

Tools – Make Your Life Easier

Tools can make or break a project in terms of data quality, management and usage. When purchasing a labeling tool, consider the following criteria:

  • Integration and configuration: Tools must integrate with your application ecosystem so that data can be handled seamlessly and securely. Being able to easily configure a tool for each use case can also speed up the annotation process while ensuring datasets are labeled with the most suitable technique.
  • Data management: The best tools will empower you to manage your data seamlessly, with top-down visibility into each project. Real-time reporting shows how far along your labelers are, so you can easily check the status of a project and monitor whether performance is increasing or decreasing over time.
  • Annotation optimization: The best tools will allow you to optimize annotation processes per use case. For example, you can assign a static class when drawing bounding boxes so annotators can create boxes without having to label each one individually. Filtered annotations let you select only the data you are focusing on, such as the ‘human’ class or the ‘cyclist’ class. The ability to filter annotations in complex use cases helps teams focus on improving specific annotation classes; it also makes investigating datasets far easier.
  • Machine learning: In video annotation, machine learning tools can annotate the first frame of a video and then independently interpolate the rest of the frames, effectively annotating other objects of the same class. When assessing machine-learning-enabled tooling, be sure to ask plenty of questions during the demonstration, such as ‘how does the tool handle failure?’ Just as such a tool can automate much of the annotation for certain tasks, it can also get that annotation wrong across the board. So push the salesperson’s buttons! Don’t just let them wow you with the machine-learning capability of the tool.
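
The static-class and filtering ideas in the annotation optimization point can be sketched in a few lines of Python. This is not how any particular labeling tool works internally; the dictionary layout with `box` and `label` keys is an assumption made for the example.

```python
from collections import Counter

def boxes_with_static_class(boxes, label):
    """Attach one preassigned class to every drawn box, so none is labeled by hand."""
    return [{"box": box, "label": label} for box in boxes]

def filter_by_class(annotations, wanted):
    """Keep only annotations whose label is in the classes under review."""
    wanted = set(wanted)
    return [a for a in annotations if a["label"] in wanted]

def class_counts(annotations):
    """Count annotations per class, a quick first look when investigating a dataset."""
    return Counter(a["label"] for a in annotations)
```

When evaluating a tool, it is worth checking that it supports equivalent operations out of the box and that the results can be exported for your own analysis.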

The Talent – Assemble the Avengers

There are four main aspects involved in optimizing your annotation efforts for your available datasets:

  • Scaling and readiness: Scaling refers to hiring more talent to participate in the annotation process, while readiness refers to a new hire’s experience. If your annotation is being handled by a data service provider, ensure their scaling and readiness processes will meet your needs. It can sometimes help to begin labeling in-house to understand the time, effort, and fatigue the work involves, which can help you navigate the process when assessing annotation providers and setting realistic project expectations.
  • Training and ramp-up time: Similar to scaling and readiness, training and ramp-up time refers to the time labelers need before their work meets the standards you require. In addition, consider how to accelerate and monitor this process.
  • Communications and feedback loop: This refers to the ability to triage issues and communicate adjustments, recommendations and common pitfalls back to team members. Be mindful of the frequency and content of the feedback to keep your team engaged and motivated.
  • Conclusive or regular retrospective: Both throughout a project and between projects, gather input from a range of sources and write up a summary that captures your processes, lessons learnt and opportunities for improvement.
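
Monitoring ramp-up time, as suggested above, can be as simple as scoring each batch of a labeler’s work against a set of trusted reference labels. The Python sketch below assumes such reference labels exist and that quality is measured as plain agreement; both are assumptions for illustration, and the target and streak values are placeholders.

```python
def batch_accuracy(labels, reference):
    """Fraction of a labeler's answers that match the trusted reference labels."""
    if len(labels) != len(reference):
        raise ValueError("labels and reference must be the same length")
    if not labels:
        return 0.0
    return sum(l == r for l, r in zip(labels, reference)) / len(labels)

def ramp_up_batches(accuracies, target, streak=2):
    """Return the 1-based batch number at which the labeler has met `target`
    for `streak` consecutive batches, or None if that never happens."""
    run = 0
    for i, acc in enumerate(accuracies, start=1):
        run = run + 1 if acc >= target else 0
        if run >= streak:
            return i
    return None
```

Requiring a short streak rather than a single good batch avoids declaring a labeler ready on the strength of one easy batch.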

Closing Statement

Maximizing the potential of a dataset is no easy task. It requires a methodical and rigorous approach to communication and project management. Being mindful of these five considerations will give you a great baseline for maintaining high-quality data at a consistent rate.

There’s no shame in needing help either. Data service providers are well placed to take over this crucial responsibility and deliver timely, high-quality datasets that help set your project up for success. For more information on how to take your project to the next level, consider speaking with an iMerit representative today.