Navigating Data Tooling and Expertise To Achieve High-Quality AI Training Data

January 13, 2023

At the iMerit MLDataOps Summit 2022, Avi Yashar, Co-founder and CPO of Dataloop; Chris Karlin, Head of Sales at Superb AI; and Michael Hazard, Product Manager at Applied Intuition, discussed what makes a tool enterprise-grade, why data tools require super users, why one tool doesn’t fit every need, and what lies ahead in the data tooling ecosystem. With Jai Natarajan, iMerit’s VP of Strategic Business Development, moderating the panel, viewers gain the following insights:

  • Collaborative data workflows and a talented tooling workforce are necessary for achieving high-quality AI training data.
  • While industry leaders typically use real data, it’s also important to understand how synthetic data can be used to supplement it.
  • Data annotators are evolving into data developers, and data tooling companies are working to prevent quality issues.

High Availability and System Stability

According to Dataloop’s Avi Yashar, in the next two to five years many enterprises will push their AI solutions from the research phase to production. Therefore, having a tool that can easily scale up without quality issues is crucial. In addition, enterprises continue to look for ways to customize the workflow so that their talented workforce can ensure high data quality while working with the platform. Strict adherence to SLAs will be a core element of taking a great product to enterprise grade.

Working with Larger Customers

As data tooling companies work with larger companies, it’s becoming common to require the implementation of safety cases on top of the technology. According to Michael Hazard at Applied Intuition, the technology needs to be validated and needs to be proven to correlate to the real world. That is what makes the investment in data tooling companies worthwhile.

“They need to be able to build a safety case on top of our technology, which means that the technology needs to be validated. It can’t have any flake.”
– Michael Hazard, Product Manager at Applied Intuition

Collaborative workflows are also more important at these larger companies. Specific roles are handled by dedicated individuals rather than by a jack-of-all-trades. Handing off issues through the workflow, from those who identify an issue to those who triage and solve it, is key to scaling up.

Checklist for Delivering Synthetic Data

Quality assurance for delivering synthetic training data includes:

  1. Automating the system
  2. Using QA tools to run model inference
  3. Ensuring correlation to real data
  4. Tracking failures
  5. Conducting a formal postmortem
  6. Ensuring cross-functional user interoperability of the technology
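As a rough illustration of the "ensuring correlation to real data" and "tracking failures" items above, the following Python sketch compares a model metric measured on paired real and synthetic scenarios. It is not a method described by the panel: the `ScenarioResult` record, the `qa_report` helper, the thresholds, and every score in the example are hypothetical values chosen only to show the shape of such a check.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class ScenarioResult:
    """Hypothetical record: one model metric on paired real and synthetic data."""
    name: str
    real_score: float
    synthetic_score: float


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def qa_report(results, min_correlation=0.9, max_gap=0.1):
    """Check correlation to real data and list scenarios whose scores diverge.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    real = [r.real_score for r in results]
    synth = [r.synthetic_score for r in results]
    corr = pearson(real, synth)
    # A scenario counts as a tracked failure when the two scores differ too much.
    failures = [
        r.name for r in results
        if abs(r.real_score - r.synthetic_score) > max_gap
    ]
    passed = corr >= min_correlation and not failures
    return {"correlation": corr, "passed": passed, "failures": failures}


# Made-up scores, for illustration only.
results = [
    ScenarioResult("highway_day", 0.92, 0.90),
    ScenarioResult("urban_night", 0.71, 0.74),
    ScenarioResult("rain", 0.64, 0.40),
    ScenarioResult("parking_lot", 0.85, 0.83),
]
report = qa_report(results)
print(report)
```

In this made-up run, the "rain" scenario is flagged because its synthetic score diverges from the real one by more than the assumed gap, which is the kind of failure that would then feed into a postmortem.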

Employing DevOps Principles

Bringing ML operations to fruition is the goal. As Chris Karlin states,

“The way we’re seeing Enterprise companies operate today is in effect, in a lot of ways, as a legacy sort of waterfall-type software development mode when it comes to machine learning and we want to get to a place where it’s really sort of mimicking the principles of DevOps.”
– Chris Karlin, Head of Sales at Superb AI

To achieve this, professional data labelers, machine learning engineers, data scientists, and data ops people need to interact with one another. Technology is the way to bring these individuals together.

Where does this fit in?

Jai Natarajan follows up the discussion by asking how these Series A companies balance the many demands of meeting big customers’ needs. All three members of the panel jump in to share their thoughts, which are summarized below:

Know your market

If you’re going to justify every dollar you spend and put everything into a long-term roadmap, you need to know your customer very well, understand your users, and feel confident about the value you can provide. Additionally, it’s not enough to know what you’re currently doing – you also need to look ahead at what is to come in the next phase.

Allowing for different user flows

Depending on what kind of company you’re working with, the design can be different. For instance, working with a professional data annotation company, like iMerit, is very different from working with a data scientist or engineer who wants to use their own code.

Different stakeholders value different features of a tool. Being able to design a product that can be used by different types of stakeholders is what sets a data tooling company apart from those that can’t or won’t work with enterprise-grade clients.

From Data Annotators to Data Developers

Avi Yashar explains his theory of how data annotators will transition to a role that also encompasses data development. Much like software developers, data developers will end up building pipelines because they understand the data so well.

“They see a lot of examples. They understand what’s working and what’s not working. What is easy to annotate? Where are the edge cases? Where are the anomalies?”
– Avi Yashar, Co-founder and CPO of Dataloop

This is the next phase of deep learning. The data developers will emerge as key players in the next two to five years.

Moving Forward with High-Quality AI Training Data

When it comes to generating high-quality AI training data, employing a reputable data annotation company is a good first step. iMerit provides a solution that brings together technology, talent, and techniques to deliver high-quality data and precision at the scale production requires.

To find out how iMerit can help your enterprise, contact us today.