
Machine Learning with Unlabeled Training Data

June 01, 2021

Most machine learning relies on supervised learning, which uses labeled training data. However, unsupervised learning, which uses unlabeled training data, can supplement supervised learning and improve ML system performance.

Unsupervised learning uses unlabeled training samples to model basic characteristics of an ML system’s input data. These characteristics can be a useful starting point for supervised learning, and they can be used to extrapolate what is learned from labeled training data.

Modeling Input Data

[Figure: Modeling Input Data]

An ML system’s input data, n-dimensional vectors of measurements, form a collection of points in n-dimensional measurement space. Clustering is an unsupervised learning technique that sorts these points according to proximity in measurement space. The example above shows data in a two-dimensional measurement space sorted into three clusters.

Clustering does not require training data labels, and it is widely used in unsupervised learning applications today. It can be accomplished either with stand-alone algorithms such as k-means or by optimizing parameters as part of an ML system.

Clustering is a very useful way to model input data. As shown by the dotted lines in the example above, clustering partitions the measurement space into regions associated with each cluster. After unlabeled training data is clustered, a new input data point can be assigned to one of the clusters. This assignment to a cluster, or embedding, brings with it information about similarities with other data points that can be quite useful to subsequent supervised learning.
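To make this concrete, here is a minimal sketch of clustering unlabeled data and then assigning a new point to a cluster, using scikit-learn's k-means on synthetic two-dimensional data (the three-cluster setup and all the numbers are purely illustrative):

    # Minimal k-means sketch: cluster unlabeled 2-D points, then assign a new point.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)

    # Unlabeled training data: points scattered around three arbitrary centers.
    centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
    data = np.vstack([c + rng.normal(scale=0.8, size=(100, 2)) for c in centers])

    # Partition the measurement space into three clusters.
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)

    # A new, unlabeled input point is assigned to the nearest cluster.
    new_point = np.array([[4.5, 4.8]])
    print("assigned cluster:", kmeans.predict(new_point)[0])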

Let’s look at two applications that use clustering to improve supervised learning.

Semi-Supervised Learning

Semi-supervised learning combines unsupervised and supervised learning by using a relatively small labeled training set together with a larger unlabeled training set. The labeled set provides initial training, which is used to infer labels for the unlabeled data; those inferred labels can then refine the training.
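As a simple illustration of the general idea (not the LLP method discussed next), here is a sketch using scikit-learn's LabelSpreading, which propagates a handful of known labels to nearby unlabeled points in synthetic data:

    # Semi-supervised sketch: a small labeled set plus a larger unlabeled set.
    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.semi_supervised import LabelSpreading

    # Synthetic data: 300 points in three blobs.
    X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)

    # Pretend only 15 labels are known; mark the rest as -1 (unlabeled).
    rng = np.random.default_rng(0)
    y_train = np.full_like(y_true, -1)
    labeled_idx = rng.choice(len(y_true), size=15, replace=False)
    y_train[labeled_idx] = y_true[labeled_idx]

    # Labels propagate from labeled points to nearby unlabeled points.
    model = LabelSpreading(kernel="knn", n_neighbors=10).fit(X, y_train)
    print("inferred-label accuracy:", (model.transduction_ == y_true).mean())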

A recently developed approach to semi-supervised learning is  Local Label Propagation (LLP), which has been applied to image recognition.  LLP trains simultaneously on both labeled and unlabeled training samples, using the architecture shown below.

[Figure: Semi-Supervised Learning]

Input images are initially processed through a standard convolutional neural network (ConvNet, in this case ResNet). The ConvNet output is then processed by a second network that is trained to produce two outputs: an embedding vector, which creates clusters in a reduced-dimensionality space, and an image class label.
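A minimal PyTorch sketch of that two-headed arrangement might look like the following; the ResNet-18 backbone and layer sizes are illustrative stand-ins, not the actual LLP configuration:

    # Backbone plus a head that outputs both an embedding and class logits.
    import torch
    import torch.nn as nn
    from torchvision.models import resnet18

    class EmbeddingClassifier(nn.Module):
        def __init__(self, embed_dim=128, num_classes=1000):
            super().__init__()
            backbone = resnet18(weights=None)
            feat_dim = backbone.fc.in_features         # 512 for ResNet-18
            backbone.fc = nn.Identity()                # keep only the ConvNet features
            self.backbone = backbone
            self.embed_head = nn.Linear(feat_dim, embed_dim)    # low-dimensional embedding
            self.class_head = nn.Linear(feat_dim, num_classes)  # class label scores

        def forward(self, images):
            features = self.backbone(images)
            embedding = nn.functional.normalize(self.embed_head(features), dim=1)
            logits = self.class_head(features)
            return embedding, logits

    model = EmbeddingClassifier()
    embedding, logits = model(torch.randn(4, 3, 224, 224))
    print(embedding.shape, logits.shape)    # [4, 128] and [4, 1000]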

Training is a repeated two-step process. First, known labels are ‘propagated’ to unlabeled data based on cluster proximity, creating pseudolabels. Then the parameters of the system are updated to minimize a loss function that rewards statistically consistent clusters and high confidence in the pseudolabels. This parameter update changes the location of unlabeled data points in the embedding, so the process starts again with a new set of pseudolabels. This is repeated until training converges to a desired loss function value.
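The sketch below mimics that repeated two-step structure with a simple self-training loop on synthetic data; it is only a rough analogue, not LLP's actual embedding-space propagation or loss function:

    # Rough pseudolabeling analogue: infer labels, retrain, repeat.
    import numpy as np
    from sklearn.datasets import make_blobs
    from sklearn.linear_model import LogisticRegression

    X, y_true = make_blobs(n_samples=300, centers=3, random_state=1)
    rng = np.random.default_rng(1)
    labeled_idx = rng.choice(len(y_true), size=15, replace=False)
    unlabeled_idx = np.setdiff1d(np.arange(len(y_true)), labeled_idx)
    X_lab, y_lab = X[labeled_idx], y_true[labeled_idx]
    X_unlab = X[unlabeled_idx]

    clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    for _ in range(5):
        # Step 1: assign pseudolabels to the unlabeled data.
        probs = clf.predict_proba(X_unlab)
        pseudo = probs.argmax(axis=1)
        confident = probs.max(axis=1) > 0.9       # keep only high-confidence pseudolabels

        # Step 2: update the model on labeled data plus confident pseudolabeled data.
        X_train = np.vstack([X_lab, X_unlab[confident]])
        y_train = np.concatenate([y_lab, pseudo[confident]])
        clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    print("pseudolabel accuracy:", (clf.predict(X_unlab) == y_true[unlabeled_idx]).mean())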

The LLP system was tested on the ImageNet benchmark. Using a training set where only 10% of the samples were labeled, it was able to achieve 88.53% top-5 accuracy on the test set. This is in contrast to an 80.43% accuracy obtained by supervised training based on the same labeled samples.

Self-Supervised Learning

Self-supervised learning uses unlabeled training samples to pre-train an ML system, which is then further trained using labeled training samples. A recent example is the SElf-supERvised (SEER) model, which has been applied to image recognition.

Like the LLP example above, the SEER system uses a ConvNet (in this case RegNet) and learns embeddings to map ConvNet outputs to clusters. However, rather than using clusters to assign labels to unlabeled data, SEER exploits a basic property of image recognition: an ML system should recognize different viewpoints of an object as the same object.

SEER does this by learning embeddings that assign multiple views of the same image to the same cluster. These multiple views are produced using data augmentation, where an original image is modified, for example as shown on the right. This learning can be unsupervised because it does not depend on knowing the object’s label. And because it is unsupervised, it can take advantage of the almost unlimited number of images on the internet.
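Here is a minimal sketch of that idea, in which two augmented views of the same image are pushed toward similar embeddings; the tiny encoder, augmentations, and cosine-similarity loss are placeholders for illustration, not SEER's actual architecture or training objective:

    # Augmentation invariance: two views of one image should get similar embeddings.
    import torch
    import torch.nn as nn
    import torchvision.transforms as T

    augment = T.Compose([
        T.RandomResizedCrop(64, scale=(0.5, 1.0)),
        T.RandomHorizontalFlip(),
        T.ColorJitter(brightness=0.4, contrast=0.4),
    ])

    encoder = nn.Sequential(                 # stand-in for a large ConvNet such as RegNet
        nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, 128),                  # 128-dimensional embedding
    )

    images = torch.rand(8, 3, 64, 64)        # a batch of synthetic "unlabeled" images
    view1, view2 = augment(images), augment(images)

    z1 = nn.functional.normalize(encoder(view1), dim=1)
    z2 = nn.functional.normalize(encoder(view2), dim=1)

    # The loss rewards the two views of each image for having nearby embeddings.
    loss = -(z1 * z2).sum(dim=1).mean()
    print("invariance loss:", loss.item())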

A 1.3 billion parameter SEER system was pre-trained using one billion randomly selected, unlabeled images from the internet. The pre-trained system was then further trained using labeled training samples from ImageNet. Performance on ImageNet test data showed that pre-training with randomly selected images allowed SEER to consistently outperform previous pre-trained systems.

[Figure: Self-Supervised Learning]

Takeaway: Unlabeled Data Can Sometimes Supplement Labeled Data

Unsupervised learning can discover basic input data characteristics that can be useful in supervised learning. This makes it possible to enrich labeled training data with unlabeled data, provided the labeled data is sufficiently representative of the combined data so that cluster proximity yields valid label assignments.

Unsupervised learning can also be used in conjunction with training data augmentation, to create embeddings that preserve invariance over multiple appearances of the same object. This can be applied not only to image recognition from multiple viewpoints, but also, for example, to recognizing audio samples with different types of background noise.

If you wish to learn more about how to create an effective data annotation process that can benefit from unsupervised learning, please contact us to talk to an expert.