Natural Language Processing with Images and Text

In a previous blog we talked about how early Natural Language Processing (NLP) systems tried to represent word meaning, but were outperformed by systems that rely on the distinctive statistical patterns of language. We noted that this approach to NLP is quite different from the way humans process language. For us, words are not just arbitrary symbols – they represent things, ideas, and relationships in the real world.

While NLP based on statistical language patterns has worked very well, a new approach is attempting to move closer to human language understanding. Systems based on this new approach are trained not just on the symbols of language, but also on images associated with these symbols. With well-crafted training sets that match words to images, these new systems can outperform NLP systems that work only with words.

Annotation image drom COCO database — Annotated image from the COCO database

Training with Words and Images

A new system called the Vokanizer adds image patterns to linguistic patterns to interpret words. This diagram shows the Vokanizer processing the word sequence ‘Humans learn language by listening, speaking, …’.

The Vokanizer processing a word sequence

The Vokenizer has three main elements:

A Masked Language Model – this component is based on the BERT Transformer Model, a widely used NLP model that predicts sequences of output tokens (words) from sequences of input tokens. The Vokenizer uses a pre-trained BERT model to extract patterns from the sequence of words in the language input, and it learns to use these patterns to predict the correct output word sequences, in conjunction with the other two Vokanizer components
A Vokanization processor – this component creates vokens (visualized tokens) using image patterns extracted by a ResNeXt convolutional neural network. It learns to associate the patterns of individual images to individual words using image/word training sets
A Voken Classification Task – like the first component, this one uses a BERT transformer model. However, this component learns to predict the correct output sequence of images, based on the language input.

The Vokenizer system integrates these three components by adding extra trainable layers to the basic BERT and ResNeXt networks. These extra layers are jointly trained with words and images so that the three components work together to effectively predict language outputs from language and image inputs.

Toward Language Understanding

Training the Vokanizer results in an NLP system that predicts an output linguistic sequence using both language and images. This moves NLP a bit closer to the way humans process language – it uses both symbols and a part of what the symbols represent.

Of course, the Vokenizer ‘understands’ words in a very crude sense compared to humans. Here is an example of the images it associates with the sentence ‘Humans learn language by listening, speaking, writing, reading.’

Although the Vokanizer’s ‘understanding’ of word meaning is crude, it turns out to be useful. On seven NLP benchmarks, the Vokanizer trained on both words and images outperformed comparable systems trained only with words.

Quality Image Grounded Training Sets are Key

NLP systems such as the Vokenizer rely on training data sets that demonstrate how words relate to images. Since these relationships can be subtle and complex, it is essential that training data be prepared by teams such as iMerit’s: annotators experienced in multiple media, following a high-quality annotation process.

To find out how to create high-quality training data for multimedia training, contact us to talk to an expert.

TALK TO AN EXPERT

Post

How Natural Language Processing Combines Words and Images

Training with Words and Images

The Vokenizer has three main elements:

Toward Language Understanding

Quality Image Grounded Training Sets are Key

To find out how to create high-quality training data for multimedia training, contact us to talk to an expert.

Training with Words and Images

The Vokenizer has three main elements:

Toward Language Understanding

Quality Image Grounded Training Sets are Key

To find out how to create high-quality training data for multimedia training, contact us to talk to an expert.

Subscribe to our newsletter