Tooling strategies are only as good as the humans behind them.
At the iMerit ML Data Ops summit, Dataloop’s CEO Eran Shlomo and Datasaur’s CEO Ivan Lee discuss strategies for creating reliable datasets using adequate tooling powered by human wisdom.
These two experts emphasize the importance of deeply understanding your data and the labeling process, which often requires AI project leads to work with the data more closely themselves. Lee and Shlomo also discuss technical debt in AI projects and how existing tools and knowledge can get projects off the ground faster.
Employing Adequate Tooling
Just as humanity made progress in science by standing on the shoulders of giants, so can AI engineers make use of existing knowledge and efforts. Many AI projects start off by using non-specialized tools and labeling mundane datasets. For example, a natural language processing (NLP) project will most likely start by using spreadsheets and labeling common words such as Starbucks, London and Monday.
Since this type of work has already been done and optimized, we can go straight to specialized labeling tools and leverage existing models to pre-label and pre-annotate data. There’s no need to reinvent the wheel, and doing so helps you build a reliable tooling infrastructure from the get-go.
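As a rough illustration of pre-labeling, the sketch below suggests labels for known words so that annotators only confirm or correct them. The `GAZETTEER` lookup table, the label names, and the `pre_label` function are all hypothetical; the table simply stands in for whatever existing model or tool a project would really use.

```python
import re

# Hypothetical lookup table standing in for an existing pre-trained model.
GAZETTEER = {
    "starbucks": "ORG",
    "london": "LOC",
    "monday": "DATE",
}

def pre_label(text):
    """Return suggested (start, end, surface, label) spans for known words."""
    suggestions = []
    for match in re.finditer(r"\w+", text):
        label = GAZETTEER.get(match.group().lower())
        if label is not None:
            suggestions.append((match.start(), match.end(), match.group(), label))
    return suggestions

sentence = "Meet me at Starbucks in London on Monday."
suggestions = pre_label(sentence)
for start, end, surface, label in suggestions:
    print(f"{surface!r} [{start}:{end}] -> {label}")
```

The suggestions are only a starting point: a human annotator still reviews each one, which is where the time savings come from.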
In addition, today’s AI projects are slowed down by a legacy workflow in which data is labeled and processed in batches. Moving forward, we need to shift toward continuous streaming, where data is gathered, labeled, and used in training continuously, offering much more granular control over, and visibility into, the performance of the model.
The Human Element
Would you be able to predict whether your project will have data quality issues before the data is labeled? We can approach this chicken-and-egg problem by looking at the expected labeling workflow:
- Once unlabeled data is gathered, a team of labelers annotates relevant elements
- Labeled data is used to train a machine learning model
- Based on the model’s performance, we can retrospectively assess the quality of data
Collective experience tells us that human error is one of the top reasons data is labeled incorrectly. A report published by Cloudfactory and Hivemind quantifies error rates for crowd-sourced and outsourced services, ranging from 7% for a simple transcription task to 80% for rating sentiment from reviews.
With this insight, we can add a quality assurance process before data labeling starts by devising and assessing the instructions given to labelers.
Data labeling issues begin with human-to-human communications. AI project leads usually try to explain to annotators how to label in certain ways, without fully understanding the labeling process themselves. This is where mistakes happen, before data is even labeled. If instructions are unclear, the output quality is low.
AI project leads can considerably increase their project success rate by labeling a few hundred data points themselves. Having a more intimate knowledge of the data will enable you to pick up on many subtleties to better define your labeling requirements and communicate with your labeling team.
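One possible way to assess instructions before full-scale labeling starts is to have the project lead and an annotator label the same small pilot set, then measure how often they agree beyond chance. The sketch below uses Cohen's kappa for this. That metric is an assumption of this example, not something the speakers prescribe, and the function name and sample labels are made up for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labelers on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Fraction of items where both labelers chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each labeler picked labels at their own base rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both labelers used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical pilot set: the project lead's labels vs. one annotator's labels.
lead      = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
annotator = ["pos", "neg", "neg", "neg", "pos", "pos", "pos", "neg"]
kappa = cohens_kappa(lead, annotator)
print(f"kappa = {kappa:.2f}")
```

A low score on the pilot set suggests the instructions, rather than the annotators, need another pass before labeling begins in earnest.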
Human in the Loop
The whole idea of labeling is to capture humans’ ability to recognize and understand abstract notions. In this process, we also capture faults, complexities and a lot of gray areas. This is where data streaming can come into play.
Data streaming allows you to get insights on the go to better understand your dataset and the quality of labeling. You will be able to identify the correlation between human behavior and data quality. With batch processing, insight into the quality of the overall dataset is nice to have, but it is often too little, too late.
A continuous datastream can allow you to improve and adjust processes on-the-fly with a level of control that is simply not available with batch processing.
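To make the contrast with batch processing concrete, here is a minimal sketch of monitoring label quality as items arrive. It assumes each incoming label is spot-checked and marked correct or incorrect; the `RollingQualityMonitor` class, its window size, and its threshold are hypothetical choices for illustration.

```python
from collections import deque

class RollingQualityMonitor:
    """Tracks the share of correct labels over the most recent items."""

    def __init__(self, window=50, threshold=0.9):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def update(self, is_correct):
        """Record one spot-check result; return (rate, alert)."""
        self.window.append(bool(is_correct))
        rate = sum(self.window) / len(self.window)
        # Alert only once the window is full, to avoid noisy early readings.
        alert = len(self.window) == self.window.maxlen and rate < self.threshold
        return rate, alert

# Hypothetical stream of spot-check results (True = label was correct).
stream = [True, True, True, True, True, False, False]
monitor = RollingQualityMonitor(window=5, threshold=0.8)
alerts = [i for i, ok in enumerate(stream) if monitor.update(ok)[1]]
print("alerts raised at stream positions:", alerts)
```

With a batch workflow the same drop in quality would only surface after the whole batch was delivered and reviewed.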
Picture this: one of the data labelers spends considerably more time on one image than on others, say two, three, or even four times as long. This may mean that the labeler was distracted or interrupted. However, it could also mean that the data point is special, an edge case. These cases can now be identified and handled appropriately. This type of metadata, such as time spent on annotating, opens up a whole new world of data quality information.
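A simple way to surface such cases is to compare each annotation's duration with the same labeler's typical time. The sketch below is one possible approach under assumed inputs: the record layout `(item_id, labeler_id, seconds)`, the function name, and the `factor` cutoff are hypothetical.

```python
from statistics import median

def flag_slow_annotations(records, factor=2.0):
    """Return item ids that took at least `factor` times the labeler's median.

    records: iterable of (item_id, labeler_id, seconds) tuples.
    """
    records = list(records)
    times_by_labeler = {}
    for _, labeler_id, seconds in records:
        times_by_labeler.setdefault(labeler_id, []).append(seconds)
    # Use each labeler's own median so naturally slower labelers are not penalized.
    medians = {lab: median(times) for lab, times in times_by_labeler.items()}
    return [
        item_id
        for item_id, labeler_id, seconds in records
        if seconds >= factor * medians[labeler_id]
    ]

# Hypothetical timing metadata collected by the labeling tool.
records = [
    ("img-1", "ann-a", 10), ("img-2", "ann-a", 12),
    ("img-3", "ann-a", 11), ("img-4", "ann-a", 40),
    ("img-5", "ann-b", 30), ("img-6", "ann-b", 28), ("img-7", "ann-b", 33),
]
flagged = flag_slow_annotations(records)
print(flagged)
```

A flag on its own does not say whether the labeler was interrupted or the item is an edge case; it only tells a reviewer where to look.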
Identifying edge cases is a critical component of developing a high-performance model. It is relatively easy to reach 70-80% accuracy. However, we have seen time and again that subsequent data batches (not streams) provide only incremental improvements, with the model struggling to deal with the remainder of the cases. This is where accurately identifying edge cases and developing an approach for labeling them can get a machine learning model to the home stretch.
Modern tooling can also enable mindset shifts. The equivalent of the DevOps infrastructure-as-code approach can be thought of as data-as-code. This can help create different workflows, such as a debug build and a production build. In the debug-type build, we can optimize for labeling quality, whilst the production/release-type build can optimize for cost-efficiency.
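One way to picture data-as-code is to keep the labeling workflow in versioned configuration, with one profile per build type. The sketch below is purely illustrative: the `LabelingProfile` fields, the two profiles, and the effort estimate are assumptions, not settings from any particular tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LabelingProfile:
    """Hypothetical labeling settings kept under version control."""
    name: str
    annotators_per_item: int   # how many people label each item
    review_fraction: float     # share of items sent to a reviewer
    use_prelabels: bool        # start from model suggestions or from scratch

# Debug-type build: favor quality with redundant labeling and full review.
DEBUG = LabelingProfile("debug", annotators_per_item=3, review_fraction=1.0, use_prelabels=False)
# Release-type build: favor cost with single labeling and sampled review.
RELEASE = LabelingProfile("release", annotators_per_item=1, review_fraction=0.1, use_prelabels=True)

def estimated_human_passes(profile, n_items):
    """Rough effort estimate: annotation passes plus reviewed items."""
    return n_items * profile.annotators_per_item + round(n_items * profile.review_fraction)

for profile in (DEBUG, RELEASE):
    print(profile.name, estimated_human_passes(profile, 1000))
```

Because the profiles are plain code, switching between quality-first and cost-first labeling becomes a reviewed, reversible change rather than an ad hoc decision.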
Avoiding Technical Debt
Going from proof of concept to production and then to maintenance, projects may have to sacrifice delivery quality in order to expedite the process, save costs, and deliver a proof of concept in time. The problem arises when technical debt accrues on foundational aspects of the project. Debt may be acceptable in activities that have flexibility and are expected to change as we learn and develop the product.
However, some are straight-up project killers.
The activities that form the foundation of a project, where we should never incur technical debt, include workforce training, instructions, tooling, and a stable data flow infrastructure. Cutting corners on any of those may help get a proof of concept off the ground faster. However, as we move on to production and maintenance, a shaky foundation can make the whole project collapse.
You can look at your ML project from a fresh perspective by thinking of it as a single machine that is part human, part computer. Our goal is to enhance human wisdom and intelligence with machine-like speed and scale. All our efforts around quality data labeling are meant to accurately capture humans’ ability to interpret data.
Why wouldn’t we invest all the right resources into this?