Scaling data labeling with generalists

Creating training data for supervised learning requires humans in the loop: people who carefully study raw datasets and extract or highlight useful information, such as the location of objects in visual data or the sentiment expressed in text.

At iMerit, members of the full-time data workforce are recruited and skilled to perform a range of data annotation tasks requiring varying levels of specialization and subject matter expertise. A robust skilling structure helps produce both labeling specialists and domain experts.

Each data contributor’s skilling journey begins with classroom-oriented training at the start of their tenure and continues on the floor, where an agile Learning & Development team is on hand to address queries and tackle challenges.

How candidates are selected 

To start, promising individuals are recruited directly or through iMerit’s sister non-profit organization, Anudip. Selection criteria center on attributes such as logical and critical thinking, aptitude for digital work, and an eye for detail. Once within the iMerit structure, recruits go through the Learning & Development programs developed internally.

Preliminary skilling for data labeling 

iMerit trainees undergo an induction program that introduces them to emerging areas such as Machine Learning, Computer Vision, Natural Language Processing, sentiment analysis, e-commerce, social media, and geospatial technologies. Formative understanding of these domains is strengthened through hands-on data annotation tasks conducted under the supervision of L&D facilitators. This familiarizes trainees with service delivery concepts such as task instructions, quality, and throughput. After induction, trainees are assessed, and successful employees are placed in client projects for on-the-job training. Labeling and annotation are iterative processes, so a contributor becomes increasingly skilled with each data point they label.

At iMerit, context is key in the skilling program. The training modules are designed to help contributors develop analysis and insight based on the data they’re tackling: where does this work fit in, who is the end user, and what is the impact of erroneous data? In turn, they feel more invested in their work and its future outcomes.

How teams are skilled in specialized domains 

The L&D team has designed dedicated modules for teams entering specialized projects where subject matter expertise is required alongside labeling dexterity. For example, contributors working on finance projects are trained in financial terminology and documents. They can then perform services like financial data extraction, corporate classification, and sentiment analysis. 

For a domain like medicine, the curriculum covers medical lexicon, pathology, spatial orientation, and data manipulation. When a project begins, the specialist is custom-trained using live demos, videos, models, and instruction guides dealing with the specific pathologies of the project.

(Read this blog to learn more about the making of iMerit’s medical data labeling experts) 

In all data labeling projects, and particularly in subject-intensive assignments, edge cases account for a small percentage of the total data points but yield valuable lessons. Each case is carefully documented for the use of all project members and, where relevant, incorporated into future learning materials.

With iMerit’s agile and mobile-friendly Learning Management System, inputs and additional modules are available at the touch of a button, and contributors can build their confidence with specific tasks or topics.