From Supervised to Unsupervised – The Evolving Role of Training Data in Machine Learning

October 12, 2022

After numerous false dawns and “AI winters” over the decades, we are finally in a sustained golden age of Artificial Intelligence. Researchers are making rapid strides in the development of sophisticated new approaches to machine learning, creating AI that is smarter and more competent than ever before.

Digital assistants, text editors, online shopping suggestions, customer service chatbots, face recognition software – the list of everyday applications backed by AI is growing with each passing day.

All these AI deployments have one crucial thing in common – they were all created using the supervised learning approach. While reliable and simpler than other approaches in machine learning, supervised learning has one major flaw – a massive appetite for high-quality training data.

Why is access to training data a bottleneck for AI development in an era of plentiful data? Why can’t machine learning rely on other approaches like unsupervised learning for most commercial AI projects? What does the future hold for the role of training data in machine learning?

These are some of the pressing questions we will answer in this blog post. Let’s get started with a quick introduction to the different approaches in machine learning and the critical role played by training data.

An Introduction to Machine Learning 

Machine learning is a subset of Artificial Intelligence that focuses on teaching computer programs to behave with a certain degree of autonomy. Traditional programs operate within strictly defined boundaries, following explicit instructions written by developers.

Machine learning uses data and advanced algorithms to create software that imitates the learning process of human beings. At the core of the human learning process lies observation, followed by pattern recognition, and the creation of theories/models that explain those patterns.

Likewise, pattern recognition is at the core of the machine learning process. Developers create an ML algorithm to address a specific problem and then feed it with relevant data. Depending on the use case, the data could be images, text, video, or audio.

The aim is to create a program that improves with each new set of data. In his classic textbook on machine learning, Professor Tom M. Mitchell puts it best:

“The field of machine learning is concerned with the question of how to construct computer programs that automatically improve with experience.”

Data plays a critical role in machine learning. All approaches in machine learning require some form of training data. The “quality” of the data sets and required quantities may vary drastically, but you cannot teach machines to learn without data, period.

ML developers combine algorithms with training data to create machine learning models – the output that represents what the software “learned” during the training process. Though the two terms are often used interchangeably, algorithms and models are not identical. Simply put, you get a model when you run an algorithm on machine learning training data.
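The distinction is easiest to see in code. Below is a minimal sketch using scikit-learn, with toy numbers invented purely for illustration: the LinearRegression object is the algorithm, and the fitted estimator returned by fit() is the model.

```python
# Minimal sketch: algorithm vs. model (illustrative toy data).
from sklearn.linear_model import LinearRegression

X = [[1.0], [2.0], [3.0], [4.0]]  # training inputs (one feature)
y = [2.0, 4.0, 6.0, 8.0]          # training targets

algorithm = LinearRegression()    # the algorithm: a generic learning procedure
model = algorithm.fit(X, y)       # the model: the result of running the
                                  # algorithm on training data

print(model.predict([[5.0]]))     # ~[10.] – predictions come from the model
```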

The creation process behind a machine learning model has the following basic steps (a consolidated code sketch follows Step 4):

  1. Collect and Prepare Training Data

The training data sets used should be similar to the data sets the program will encounter when it is deployed out in the real world. If they are not representative of the real-world data, the resulting ML model will have low predictive accuracy and more errors. Depending on the type of algorithm used, the data sets may be labeled or unlabeled, structured or raw.

  2. Select an Algorithm

The choice is decided by multiple factors, including the nature of the problem the developers are trying to solve, the availability of labeled training data sets, and the use case (prediction/classification vs. clustering/dimensionality reduction). There are dozens of machine learning algorithms to choose from, including various regression algorithms, neural networks, Naïve Bayes classification, KNN, and SVM.

  3. Train the Algorithm

This is an iterative process of feeding the training data and letting the algorithm analyze it for patterns. The process involves a lot of tuning, tweaking, and optimization of variables, most of which are automated and performed by the software without human intervention. The human role is predominantly in the selection and optimization of training data and the coding of the algorithms.

  4. Validate the Model

Once the machine learning algorithm has processed enough training data, it is time to test the resulting model against hitherto untouched data sets. Developers will usually set aside a portion of the training data as a testing/validation data set. This serves as a final evaluation of the model before field deployment.
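To make the four steps concrete, here is a hedged end-to-end sketch in Python with scikit-learn; the synthetic dataset and the choice of KNN are illustrative stand-ins, not recommendations for any particular project.

```python
# End-to-end sketch of the four steps above on synthetic data (illustrative).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Step 1: collect and prepare training data (a synthetic stand-in here).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Hold out a portion of the data for validation, as described above.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: select an algorithm suited to the problem (classification here).
clf = KNeighborsClassifier(n_neighbors=5)

# Step 3: train the algorithm on the labeled training split.
clf.fit(X_train, y_train)

# Step 4: validate the resulting model on the held-out data.
print("Validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))
```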

The myriad ML algorithms can be broadly classified into the following categories:

  • Supervised Learning
  • Unsupervised Learning
  • Reinforcement Learning

Supervised Machine Learning Comes with Many Options 

The closest approximation of supervised machine learning from a human perspective is the classroom. The students have a structured knowledge base with clearly labeled items and concepts in the form of textbooks. The teachers feed the knowledge to students and provide all the questions and answers.

The algorithms in supervised learning work on the same principles. Here, they are the students, while the training data collected and organized by the developers performs the role of the textbook. This approach requires high-quality data where all the pertinent variables are clearly labeled and identified.

Ultimately, the aim here is to empower the program to use the experience gained from the training in identifying similar patterns in raw, unstructured data sets. If the quality of the input data is poor, the ML model created in the process will also turn out to be inaccurate.

Supervised Learning Algorithms broadly fall into two categories based on use cases:

  • Regression
  • Classification

Regression Algorithms

When you have to analyze the relationship between dependent and independent variables in the data, regression algorithms are the first choice. They are commonly used for making projections and forecasting future trends in financial markets and business, and even for drug response modeling.

The following are some of the popular regression algorithms used in supervised learning (a short code sketch follows the list):

  • Linear Regression – used to model the relationship between a dependent variable and one or more independent variables, this algorithm is frequently used in predictions involving everything from portfolio performance to traffic flow, real estate pricing, and salary forecasting.
  • Logistic Regression – another widely used technique, suited to scenarios where the outcome is categorical – fraud detection, credit scoring, and clinical trials – or anywhere you need the probability of a discrete outcome based on multiple variables.
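To see the difference in code, here is a minimal sketch on toy data (the numbers are invented for illustration): linear regression returns a continuous value, while logistic regression returns class probabilities.

```python
# Regression sketch on toy data (numbers invented for illustration).
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1], [2], [3], [4], [5]], dtype=float)

# Linear regression: predict a continuous value (e.g., a price).
y_price = np.array([105.0, 198.0, 310.0, 395.0, 502.0])
lin = LinearRegression().fit(X, y_price)
print(lin.predict([[6]]))        # a continuous forecast

# Logistic regression: predict a category (e.g., fraud vs. not fraud).
y_fraud = np.array([0, 0, 0, 1, 1])
log = LogisticRegression().fit(X, y_fraud)
print(log.predict_proba([[6]]))  # probabilities of each class
```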

Classification Algorithms

As the name suggests, a classification algorithm is used when you want to recognize specific entities in the data set and sort them into predefined categories. Facial recognition, image recognition, and text recognition for search engines/chatbots all use classification algorithms.

These are a few of the major classification algorithms commonly used in supervised learning (a brief comparison sketch follows the list):

  • Support Vector Machines (SVMs) – a very versatile technique that is equally viable for classification and regression, SVMs are widely used in image categorization, face detection, geospatial analysis, bioinformatics, etc.
  • Naïve Bayes Classification – based on the Bayes Theorem, this approach is predominantly used in spam identification, personalized recommendations, and many scenarios where some form of text classification is required.
  • K-Nearest Neighbor (KNN) – this algorithm classifies data points based on their proximity and links with other data points. It is frequently used in recommendation software and for image analytics automation.
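For a feel of how these three compare in practice, here is a short sketch that runs each on the classic Iris dataset with near-default settings; the parameters are illustrative, not tuned.

```python
# Quick comparison of SVM, Naive Bayes, and KNN on the Iris dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

for name, clf in [
    ("SVM", SVC(kernel="rbf")),
    ("Naive Bayes", GaussianNB()),
    ("KNN", KNeighborsClassifier(n_neighbors=5)),
]:
    # 5-fold cross-validation gives a quick accuracy estimate per algorithm.
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```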

Unsupervised Machine Learning Is the Future of AI

In supervised learning, the algorithm is given both the input data and desired output data. This empowers the model to deliver accurate results after deployment. The performance of the machine can also be clearly gauged since the correct and wrong answers are already spelled out in the training data sets.

Unsupervised learning (UL) is an approach that stands at the polar opposite of supervised machine learning in this regard. The algorithms are only provided input data, often in raw, unstructured form. The onus is on the model to identify hidden patterns in the unclassified data.

Take the example of an infant who sees a dog for the first time. After a few days, if the infant sees another breed of dog, she will most likely recognize it as a dog, based on the common features between the two animals – fur, four legs, wagging tail, etc.

In supervised learning, the AI would require training data sets containing labeled images of hundreds or even thousands of different dog breeds to reach the same level of accuracy as a human infant. UL tries to avoid this burden by using more complex algorithms that can detect common patterns without any handholding or spoon-feeding.

Though still in its early stages, the unsupervised approach is widely recognized as the future of AI research. The ultimate aim of the UL approach is to create autonomous AI that replicates the human capability of self-learning – making intuitive connections and recognizing hidden patterns on its own.

In any attempt to build a “strong AI” – software capable of getting anywhere close to the human mind in autonomy and problem-solving capabilities – UL algorithms will certainly play a critical role, along with recent advances in techniques like deep learning neural networks.

The following are some of the most prominent techniques used in UL (a short sketch of two of them follows the list):

  • Clustering – a data mining technique that groups data points based on shared features or dissimilarities.
  • Association – a technique that uses various rules to identify how certain variables in the dataset are related to each other.
  • Dimensionality Reduction – if the given dataset has too many features, this approach is used to reduce the number to a manageable level without compromising the overall data integrity.
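Here is a minimal sketch of two of these techniques working together – dimensionality reduction followed by clustering; the synthetic data and parameter choices are purely illustrative. Note that no labels are supplied at any point.

```python
# Unsupervised sketch: dimensionality reduction, then clustering (no labels).
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Unlabeled data: 300 points in 10 dimensions, drawn from 3 hidden groups.
X, _ = make_blobs(n_samples=300, n_features=10, centers=3, random_state=0)

# Dimensionality reduction: compress 10 features down to 2.
X_2d = PCA(n_components=2).fit_transform(X)

# Clustering: group the points by proximity, with no labels provided.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)
print(labels[:10])  # cluster assignments discovered by the algorithm
```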

Data Differences Between Supervised and Unsupervised ML 

Both approaches to ML require data inputs. However, the two approaches have radically different philosophies regarding the nature of data used. In supervised learning, success hinges heavily on the quality of data sets earmarked for the project.

Supervised algorithms cannot work with raw data. They require both input and output data, divided into separate training and validation/test sets. More importantly, the data sets have to be clearly labeled, tagged, and annotated, preferably by professionals or domain experts.

In stark contrast, unsupervised learning does not require enriched or labeled data sets. Instead, the algorithms are fed raw, unstructured data. This is one of the reasons why there is considerable interest in shifting from supervised to unsupervised learning – the burden of training data collection and preparation is virtually absent in UL.
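In code terms, the difference comes down to what each training record contains; the sample record below is invented purely for illustration.

```python
# Supervised learning consumes (input, label) pairs...
supervised_record = {
    "input": "Congratulations, you have won a free prize!",
    "label": "spam",  # annotated by a human labeler
}

# ...while unsupervised learning consumes inputs alone.
unsupervised_record = "Congratulations, you have won a free prize!"
# No label: the algorithm must discover structure (e.g., clusters) by itself.
```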

Supervised Learning Enjoys Widespread Adoption 

Despite being heralded by AI experts as the future of AI development, unsupervised models still lag behind supervised models in terms of real-world deployments. Surveys suggest that nearly 80% of AI development at present is focused on supervised algorithms. There are several possible reasons for this.

Over the years, AI research has followed a cyclical trajectory, with breakthroughs followed by long spells of dormancy – a phenomenon labeled “AI winter.” In the past, a key reason for false starts in AI evolution was excessive hype, followed by a lack of practical real-world applications.

Funding is essential for any scientific field and in many instances, it depends on the market potential of the underlying technologies. In the earlier cycles, neither the technology nor the wider ecosystem was prepared for the deployment of AI solutions.

But that is no longer the case. With more compact and powerful processors, smarter devices, high-speed internet connectivity, and easier access to data, AI has unlimited potential. However, reliability and accuracy are important for the successful deployment of an AI-powered solution.

An error-prone self-driving AI could cause fatalities on the roads. A geospatial AI making an erroneous prediction about weather/pests/soil productivity could lead to massive losses for large agri-businesses. Inaccurate forecasting models can wreak havoc on financial markets.

At present, only supervised learning algorithms combined with high-quality machine learning data sets can guarantee the level of accuracy demanded by businesses, regulators, and other customers at large. Current experiments with UL algorithms have been restricted to more abstract problems with limited real-world application.

Ideally, the perfect AI deployment should combine the strengths of both approaches – Supervised algorithms that handle all predictable outcomes, and unsupervised AI to handle the edge cases which are inevitable in the real world!

The Training Data Challenges in Supervised Learning

Despite its widespread appeal in AI development at present, supervised learning is beset with multiple challenges related to training data. Industry estimates indicate that anywhere from 50% to 80% of the total production time in supervised AI is taken up by data selection, cleaning, and labeling.

For many smaller AI startups and even mid-sized corporations, in-house data labeling is often prohibitively expensive due to the following factors:

  • Labor Intensive – creating high-quality training data sets from raw data within acceptable time frames requires a dedicated team of data labeling professionals. Only big tech firms like Google, IBM, and Amazon have the financial muscle to devote entire divisions to data labeling.
  • Domain Expertise – data labeling services for certain fields like geospatial analysis, meteorology, healthcare AI and medical diagnostics require the knowledge and skills of a certified domain expert. Hiring trained diagnostics experts to tag thousands of X-rays or CT scans is prohibitively expensive, and not an optimal utilization of their valuable skills.
  • Communication – Supervised learning is an iterative process where the algorithms undergo frequent tweaking and optimization between model training cycles. The training data sets will also have to be refreshed to mirror these changes. This requires constant communication and coordination between the developer team and the labeling teams.
  • Rapid Scaling – AI deployments can often result in boosts to productivity and rapid growth. In response to the dynamic changes in the work environment, demand for training data can fluctuate rapidly. Data labeling teams have to be highly agile and elastic to keep pace with the changing demands.
  • Data Security – on top of everything, firms also have to deal with corporate data safety and compliance issues when using sensitive data. This could include customer data, patient health data, or data with implications for national security.

In some contexts, like defense, law & order, and national security, agencies often have no option but to keep all data labeling tasks in-house. But for the majority of supervised ML projects in the market, this may not be an option from a financial perspective.

Open-source data sets vary widely in quality, ranging from excellent to abysmal, and may lack specific features that a particular project demands. Utilizing freelancers and crowdsourcing platforms like Amazon Mechanical Turk may be affordable, but the quality of the output can be a mixed bag. Then there are also the security implications of sharing data with strangers online.

Semi-Supervised Learning – A Happy Compromise?

The two approaches to machine learning occupy uncomfortable extremes. While supervised learning is often costly and time-consuming, unsupervised learning suffers from extremely low accuracy/success rates in predictions or clustering. 

As a hybrid approach that straddles the middle path between the two, semi-supervised machine learning may offer a way forward, at least in some use cases. In this approach, instead of relying on large labeled data sets, the developer employs a small quantity of high-quality labeled training data.

These data sets will be fed to an unsupervised learning algorithm, alongside a large quantity of unlabelled data. To make logical connections between the labeled and unlabeled data, the algorithm may be programmed with one of several assumptions:

  • Continuity Assumption – data points close to each other are likely to share the same label
  • Cluster Assumption – the data divides into discrete clusters, and points in the same cluster share an output label
  • Manifold Assumption – using distances/densities, the data points are assumed to fall on a manifold with fewer dimensions than the input data space

The semi-supervised approach retains the “human-in-the-loop” paradigm from supervised learning, combining it with a sophisticated UL model for data efficiency. The approach has already shown promise in the following AI use cases – speech analysis, text classification, protein sequencing, and webpage ranking.

Here is what a standard semi-supervised learning process looks like (a code sketch follows the steps):

Step 1 – train the algorithm with the small, high-quality labeled training data.

Step 2 – once the algorithm has achieved a satisfactory level of consistency, use it to assign pseudo-labels to the unlabeled data sets.

Step 3 – while the pseudo-labels may lack accuracy at this point, this step still creates links between the input data, the pseudo-labels, and the unlabeled data.

Step 4 – continue training the algorithm with the new combined input. With each iteration, the overall model performance will improve, bringing down the errors in the process.
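Below is a hedged sketch of that loop in Python with scikit-learn; the confidence threshold, the choice of logistic regression, and the synthetic data are all illustrative assumptions. (scikit-learn also ships a ready-made wrapper for this pattern, sklearn.semi_supervised.SelfTrainingClassifier.)

```python
# Pseudo-labeling (self-training) sketch on synthetic data (illustrative).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y_true = make_classification(n_samples=2000, random_state=0)

# Step 1: only the first 100 points carry human labels; -1 marks "unlabeled".
y = np.full(len(X), -1)
y[:100] = y_true[:100]
clf = LogisticRegression(max_iter=1000).fit(X[y != -1], y[y != -1])

# Steps 2-4: repeatedly pseudo-label confident points and retrain.
for _ in range(5):
    unlabeled = np.flatnonzero(y == -1)
    if unlabeled.size == 0:
        break
    proba = clf.predict_proba(X[unlabeled])
    confident = proba.max(axis=1) > 0.95  # illustrative confidence threshold
    if not confident.any():
        break
    # Attach pseudo-labels to the confident points and retrain on the pool.
    y[unlabeled[confident]] = proba[confident].argmax(axis=1)
    clf = LogisticRegression(max_iter=1000).fit(X[y != -1], y[y != -1])

print("Points labeled after self-training:", int((y != -1).sum()))
```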

Conclusion – The Enduring Relevance of Labeled Training Data

Expert estimates regarding the arrival of Strong AI vary widely, with the earliest projections set around 2030. Meanwhile, more conservative estimates place it in the 2060s, or several decades away at the very least.

At the current pace of AI evolution, we still have a long way to go before unsupervised machine learning models with artificial neural networks dethrone supervised learning as the leading paradigm. Until the evolution of fully autonomous learning software, humans will remain a vital cog in the workflow for the creation of high-quality AI training data.

Meanwhile, the global market for AI is growing at a frenetic pace. Starting at a valuation of $65 billion in 2020, the sector is expected to reach $1.5 trillion by 2030, with a CAGR of 38%. Firms cannot afford to wait around for the next big step in autonomous AI algorithms.

Third-party data labeling services will play a vital role in the coming decade. Combining scalability with domain expertise, efficient feedback loops, and flexible/affordable pricing, iMerit offers AI startups and corporations a path to supercharging their ML efforts without breaking the bank.

An industry-leading service provider with domain expertise across computer vision, natural language processing, and other content services, iMerit offers a wide range of services including image annotation, video labeling, audio transcription, sentiment analysis, image segmentation, and text annotation.

Our labeling professionals have extensive experience collaborating with clients from industries as diverse as agritech, autonomous vehicles, medical AI, geospatial scanning, and commerce/finance.

If you wish to learn how iMerit can augment your machine learning projects, please contact us to talk to an expert.