Overcoming the Obstacles of Achieving High-Quality Data

December 07, 2022

Raj Aikat, iMerit’s Chief Technology and Product Officer, and Lucas Chatham, Sr. Product Manager at iMerit, discuss industry trends, insights, and challenges that companies face as they work toward production-ready AI. Learn about the six obstacles holding companies back from achieving high-quality data and the solutions needed to overcome them.

From ML DataOps 1.0 to ML DataOps 2.0

Today’s Cambrian data expansion, with every industry producing thousands or millions of datapoints, needs to be managed by ML Ops. So far, ML ModelOps and ML DataOps have acted largely independently of each other. Moreover, data preparation and annotation have been a human skillset, supported by some tooling such as tracking and analytics.

This approach poses three challenges:

  1. Data at scale – encompasses both the annotation and analytics of data
  2. Production at scale – dealing with edge cases and generating similar scenarios based on real-world cases
  3. Real-time fleet ops – the need for real-time management and monitoring

In ML DataOps 2.0, the paradigm changes from tech-enabled to tech-driven. In this new approach, automated data labeling and preparation are integrated directly into the ML Ops CI/CD flow. Rather than having humans carry out time-consuming tasks, tools do most of the heavy lifting, freeing humans to focus on validation and on tackling anomalies using their domain expertise. The human skillset expands to all stages of ML development and deployment.

“Humans-in-the-loop get converted to experts-in-the-loop”
– Raj Aikat

The Six Obstacles to Data Annotation Workflows

Data annotation is a process with many moving parts and, consequently, many points of failure. The most common obstacles in a data annotation workflow are as follows:

  1. Tool interoperability – using multiple third-party and in-house tools can make it difficult for the tools to communicate with one another
  2. Reporting fragmentation – often a result of tool interoperability issues: when systems don’t talk to each other, you can’t see what’s happening at a macro level
  3. Edge cases – anomalies in the data that sit in the gray area of annotation guidelines
  4. Workforce skill & scale – for high-quality data, annotators require skill, domain expertise, and the ability to get smarter as the project continues. Scaling also poses huge challenges: moving from 10-15 annotators to 2,000 annotators requires watertight frameworks
  5. Managing data security – the supporting tech infrastructure must provide isolation, encryption, and consistent regulatory compliance
  6. Access to real-time insights – without real-time visibility, whole batches of data with undetected issues may need to be redone in their entirety

“High quality data is the north star”
– Lucas Chatham

To overcome these obstacles, the team at iMerit has developed Ground Control, a data annotation management tool designed as a one-stop shop for keeping annotation quality high.

Overcoming Tool Interoperability Issues

To solve tool interoperability issues, a tool must act as a single pane of glass and be able to consume multiple data sources. Ground Control achieves this in several ways. First, Cloud Cover, a Chrome extension, tracks URL changes and captures analytics that are stored in a data lake. Second, API integrations use standard frameworks to ingest and export data. Third, the tool supports the upload of data files such as CSVs or emails.
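
As a rough illustration of what consuming multiple data sources involves, the sketch below normalizes a CSV export and a JSON API payload into one common record shape. The `TaskEvent` schema, the field names, and both source formats are hypothetical, chosen only for illustration; they are not Ground Control's actual schema or API.

```python
import csv
import io
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskEvent:
    """Common record shape shared by every source (hypothetical schema)."""
    task_id: str
    annotator: str
    seconds_spent: float
    source: str


def from_csv(text: str, source: str) -> list[TaskEvent]:
    """Parse a CSV export with task_id, annotator, seconds_spent columns."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        TaskEvent(row["task_id"], row["annotator"], float(row["seconds_spent"]), source)
        for row in reader
    ]


def from_api_json(payload: str, source: str) -> list[TaskEvent]:
    """Parse a JSON payload whose keys and units differ from the CSV (assumed format)."""
    return [
        TaskEvent(str(item["id"]), item["user"], item["duration_ms"] / 1000.0, source)
        for item in json.loads(payload)
    ]


# Two made-up sources, merged into one list that downstream reports can share.
csv_export = "task_id,annotator,seconds_spent\nt1,ana,42.5\nt2,ben,30\n"
api_payload = '[{"id": 3, "user": "ana", "duration_ms": 15000}]'

events = from_csv(csv_export, "in_house_tool") + from_api_json(api_payload, "third_party_tool")
```

Once every source is mapped to the same shape, reporting code only has to understand one format, regardless of how many tools feed it.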

Overcoming Reporting Fragmentation

Once all data is available within the tool, the tool must be able to visualize information according to the stakeholder, the metrics, and the level of detail required. Ground Control’s distinguishing feature is its ability to display information across multiple workflows and dashboards. A standard report typically encompasses one workflow that’s displayed in one dashboard. Ground Control can produce multiple dashboards for a single workflow, so that different stakeholders view only the data they are interested in. The tool also supports multiple workflows consolidated in a single dashboard, which is an excellent way of tracking high-level performance across the whole project.
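
One way to picture the two layouts is as different filtered views over the same metric records. The sketch below is a simplified illustration with made-up workflow names, metrics, and values; the `dashboard` helper is hypothetical and says nothing about how Ground Control is built.

```python
from collections import defaultdict

# Hypothetical metric records: (workflow, metric name, value).
records = [
    ("lidar_boxes", "tasks_done", 120),
    ("lidar_boxes", "accuracy", 0.96),
    ("lane_lines", "tasks_done", 80),
    ("lane_lines", "accuracy", 0.91),
]


def dashboard(records, workflows, metrics):
    """Return {workflow: {metric: value}} restricted to the requested view."""
    view = defaultdict(dict)
    for workflow, metric, value in records:
        if workflow in workflows and metric in metrics:
            view[workflow][metric] = value
    return dict(view)


# One workflow, two stakeholder-specific dashboards.
quality_view = dashboard(records, {"lidar_boxes"}, {"accuracy"})
throughput_view = dashboard(records, {"lidar_boxes"}, {"tasks_done"})

# Several workflows consolidated in one project-level dashboard.
project_view = dashboard(records, {"lidar_boxes", "lane_lines"}, {"tasks_done", "accuracy"})
total_done = sum(metrics["tasks_done"] for metrics in project_view.values())
```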

Overcoming Edge Case Ambiguities

Dealing with edge cases first entails capturing anomalies and then integrating them back into the dataset. These edge cases must be accounted for, and new guidelines must be devised for similar scenarios.

Once edge cases are captured, annotation tools need to provide adequate collaboration and escalation features. These types of anomalies typically require multiple people working together to reach a consensus on how they should be tackled. For more difficult instances, a chain of escalation can bring the case up to the management team to help in the decision-making process.
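
The consensus-then-escalate flow can be sketched in a few lines. This is a minimal illustration assuming a simple majority-share rule; the `resolve_label` function and the 0.75 agreement threshold are hypothetical choices, not a rule taken from the discussion.

```python
from collections import Counter


def resolve_label(votes: list[str], min_agreement: float = 0.75) -> tuple[str, str]:
    """Return ("accepted", label) on consensus, else ("escalated", top label)."""
    if not votes:
        raise ValueError("at least one vote is required")
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= min_agreement:
        return ("accepted", label)
    # Not enough agreement: hand the item to the next level of review.
    return ("escalated", label)


clear_case = resolve_label(["car", "car", "car", "truck"])
edge_case = resolve_label(["car", "truck", "van"])
```

Here the first item is accepted because three of four annotators agree, while the second is escalated because no label reaches the threshold.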

Overcoming Workforce Skill and Scale Issues

To support increasingly challenging tasks and tackle a larger number of datapoints, data annotation tools must work in conjunction with workforce skilling platforms and people platforms. The workforce skilling platform familiarizes annotators with the annotation guidelines so that they can hit the ground running and produce quality labels from the beginning of the project. People platforms give AI projects access to a pool of trained labelers who can help bring the overall workforce up to the required scale.

Overcoming Data Security Challenges

To ensure end-to-end data security, tools need to implement security best practices, including role-based access controls, data encryption, and endpoint security. Compliance with industry-standard regulations such as HIPAA or PCI is an important point to evaluate annotation tools against.
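
For readers less familiar with role-based access control, the idea is that permissions attach to roles rather than to individual users. The sketch below is a deliberately small illustration; the roles, actions, and `is_allowed` helper are hypothetical, and a real deployment would define its own and enforce them server-side.

```python
from enum import Enum


class Role(Enum):
    ANNOTATOR = "annotator"
    REVIEWER = "reviewer"
    ADMIN = "admin"


# Hypothetical permission map; a real deployment would define its own.
PERMISSIONS = {
    Role.ANNOTATOR: {"read_task", "submit_label"},
    Role.REVIEWER: {"read_task", "submit_label", "approve_label"},
    Role.ADMIN: {"read_task", "submit_label", "approve_label", "export_data"},
}


def is_allowed(role: Role, action: str) -> bool:
    """Deny by default: only actions listed for the role are permitted."""
    return action in PERMISSIONS.get(role, set())
```

With this map, `is_allowed(Role.ADMIN, "export_data")` is true, while the same check for `Role.ANNOTATOR` is false.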

Overcoming Real-Time Insights Limitations

Data annotation tools need to update metrics in real time as the project progresses. Changes to guidelines, different batches of data, and scaling of the workflows may all affect performance and quality. Having near-instantaneous views of these changes can accelerate investigations into performance degradation and prevent long-lasting problems.
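
As a small illustration of why continuously updated metrics matter, the sketch below tracks accuracy over a sliding window of recent review results and flags a drop as soon as it appears, rather than after a whole batch is finished. The `RollingAccuracy` class, the window size, and the alert threshold are all hypothetical values chosen for the example.

```python
from collections import deque


class RollingAccuracy:
    """Tracks accuracy over the most recent `window` reviewed labels."""

    def __init__(self, window: int, alert_below: float):
        self.results = deque(maxlen=window)  # old results fall off automatically
        self.alert_below = alert_below

    def accuracy(self) -> float:
        return sum(self.results) / len(self.results)

    def record(self, correct: bool) -> bool:
        """Add one review result; return True if an alert should fire."""
        self.results.append(correct)
        window_full = len(self.results) == self.results.maxlen
        return window_full and self.accuracy() < self.alert_below


monitor = RollingAccuracy(window=4, alert_below=0.75)
alerts = [monitor.record(ok) for ok in [True, True, True, True, False, False]]
```

In this run the alert fires only on the last result, when the window holds two correct and two incorrect labels and accuracy falls to 0.5.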
