Andrew Ng, the AI pioneer and technology entrepreneur, is a man on a mission. He recently spoke about how the AI ecosystem must move from a model-centric approach to a data-centric approach. This, he says, will have a dramatic impact on the quality of Machine Learning models deployed.
In the lifecycle of a Machine Learning model, training is one small part of the process. Ng points out that sourcing and preparation of high-quality data forms 80% of the Machine Learning process. The actual training only makes up 20%. Despite this, training is the primary focus, both in industry and academic research. This has been changing in recent years. Research and surveys showcase how data annotation and labeling are crucial for innovative and accurate AI. Gartner listed data labeling as an ‘Innovation Trigger’ in its Hype Cycle for AI.
Need for Consistent Labeling
‘Consistency is key’ applies with particular force to the development of Artificial Intelligence. When datasets are labeled, the labels must be consistent across labelers and across batches. Ng uses the example of drawing bounding boxes around iguanas to illustrate how easily errors creep in: each labeler's boxes may be acceptable under the written guidelines, yet differing interpretations of those guidelines leave the data noisy and inconsistent.
Varying Impact of Noisy Labels
Introducing noisy labels into data reduces model performance, but the impact is not the same in every instance (Read: How Noisy Labels Impact Machine Learning Models). In industries like healthcare or agriculture, where training sets are small, noisy data causes more damage. In large datasets with millions of data points, the noise can average out due to the sheer volume.
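A toy simulation, not from Ng's talk (the two Gaussian classes and the 10% flip rate are illustrative assumptions), shows why the same label-noise rate distorts a small training set far more than a large one:

```python
import numpy as np

rng = np.random.default_rng(0)

def boundary_error(n_train, flip_rate=0.1, trials=50):
    """Fit a nearest-centroid boundary on noisily labeled data and
    return how far it drifts, on average, from the true boundary.

    Classes are N(-1, 1) and N(+1, 1), so the ideal decision
    boundary is x = 0. Symmetric label noise pulls the estimated
    centroids toward each other, but with enough data the centroid
    midpoint still lands near 0 -- the noise averages out."""
    errors = []
    for _ in range(trials):
        y = rng.integers(0, 2, n_train)
        x = rng.normal(2.0 * y - 1.0, 1.0)
        # Flip a fraction of the labels (symmetric label noise).
        flip = rng.random(n_train) < flip_rate
        y = np.where(flip, 1 - y, y)
        midpoint = (x[y == 0].mean() + x[y == 1].mean()) / 2.0
        errors.append(abs(midpoint))
    return float(np.mean(errors))

small = boundary_error(30)      # small, noisily labeled training set
large = boundary_error(30000)   # same noise rate, a thousand times more data
print(f"boundary drift with 30 examples: {small:.3f}, with 30000: {large:.3f}")
```

With the same 10% of labels flipped, the boundary learned from 30 examples drifts substantially, while the one learned from 30,000 stays close to the truth, which is the averaging-out effect described above.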
“If 80% of our work is preparing high-quality data, then that is a core part of a Machine Learning engineer’s work”
Overcoming Inconsistency
In the iguana example, the inconsistencies sit in the labeled datasets. In a model-centric approach, the dataset stays fixed while the code is optimized to improve model performance. Ng advocates a systematic data-centric approach instead, where the data itself improves through iteration. His teams tackle consistency by having two independent labelers label the same sample; after surfacing the disparities, they update the labeling instructions with what they learn. This is one approach.
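One way to operationalize that two-labeler check is to measure bounding-box overlap between the labelers and flag images where they disagree. This is a minimal sketch, not Ng's specification: the box format, one-to-one box pairing, and the 0.5 IoU threshold are all illustrative assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def flag_disagreements(labels_a, labels_b, threshold=0.5):
    """Pair up boxes from two labelers per image and flag images where
    any pair's IoU falls below the threshold -- candidates for
    clarifying the labeling instructions."""
    flagged = []
    for image_id in labels_a.keys() & labels_b.keys():
        for box_a, box_b in zip(labels_a[image_id], labels_b[image_id]):
            if iou(box_a, box_b) < threshold:
                flagged.append(image_id)
                break
    return flagged

# Hypothetical example: two labelers box the same iguanas;
# on img2, one labeler drew a much looser box.
a = {"img1": [(10, 10, 50, 50)], "img2": [(0, 0, 40, 40)]}
b = {"img1": [(12, 11, 49, 52)], "img2": [(0, 0, 90, 90)]}
print(flag_disagreements(a, b))  # img2's boxes barely overlap
```

Flagged images, rather than being silently averaged away, become the evidence used to rewrite the labeling guidelines.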
Systematic Data-centric Approach
Ng stresses the need to be systematic and to use tools and processes throughout the ML lifecycle. Structured error analysis is crucial during model training. In another example, a speech recognition model performs worse when there is car noise in the background. The data-centric approach identifies that error and applies a targeted solution: collecting more data with car noise, or improving label consistency on clips with noisy backgrounds. A methodical approach also accounts for data drift and concept drift after the model is deployed in production.
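Structured error analysis of this kind can be sketched as a per-slice error breakdown. The tags and counts below are invented for illustration, not figures from Ng's example:

```python
from collections import defaultdict

def error_by_slice(examples):
    """Group evaluation examples by metadata tag and compute the error
    rate per slice, so the worst-performing slices can be targeted
    with more data or better labels."""
    totals, errors = defaultdict(int), defaultdict(int)
    for ex in examples:
        totals[ex["tag"]] += 1
        if not ex["correct"]:
            errors[ex["tag"]] += 1
    return {tag: errors[tag] / totals[tag] for tag in totals}

# Hypothetical speech-recognition eval set tagged by background condition.
evalset = (
    [{"tag": "clean", "correct": True}] * 90
    + [{"tag": "clean", "correct": False}] * 10
    + [{"tag": "car_noise", "correct": True}] * 12
    + [{"tag": "car_noise", "correct": False}] * 8
)
rates = error_by_slice(evalset)
print(rates)  # the car_noise slice errs far more often than clean audio
```

A breakdown like this turns "the model is 85% accurate" into "the model fails on car noise", which is exactly the kind of finding that points to a targeted data fix.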
“I would love to see a shift in mindsets from Big Data to Good Data.”
It is a key part of an MLOps engineer’s role to ensure a steady stream of high-quality data all through the ML lifecycle, says Ng. He envisions a future where MLOps tools and processes reach a level of maturity. This will make the data-centric approach efficient and effective.
Key takeaways:
- An AI system is made up of its data and its code; both need to perform well for the model to succeed.
- Model training is one part of the Machine Learning lifecycle. The best results lie in optimizing the entire process.
- A data-centric approach improves the quality and consistency of labeled data, strengthening AI models.
- A systematic approach, with data-centric tools and processes, improves model performance.