
Learning Common Sense from Video

September 29, 2021

Today, a big gap between human and machine intelligence is common sense. When we humans interpret language or visual scenes, we draw on vast knowledge of how the physical world works: for example, that eggs can become chicks or omelets, or that a bat, a ball, and a window can become broken glass.

Video shows how the world works

Common sense makes humans very efficient learners, so machine learning researchers have been working on ways to imbue machines with at least some ‘common sense’. In a previous blog we discussed using pictures to train natural language processing systems, in a sense giving the systems partial ‘knowledge’ of what words represent in the physical world. ML systems can get even closer to common sense with a little help from human annotators and video ML models.

Video offers a great opportunity for machines to learn about the physical world. Videos can reveal basic things humans take for granted, such as cause and effect, how an object is used, or whether an object is rigid or flexible.

ML systems can get even closer to common sense with a little help from human annotators and video ML models

Let’s explore how video machine learning (ML) systems can learn common sense. First, we’ll review how ML models can be structured to capture both spatial and temporal patterns in video. Then we’ll discuss an innovative training approach for common sense learning.

Capturing Motion Patterns in Video

Video ML systems are ubiquitous in applications such as self-driving cars and license plate readers. These systems typically work by analyzing objects frame by frame, using techniques such as convolutional neural networks that were developed for recognizing objects in still images.

License plate reader for automated toll collection
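
To make the frame-by-frame idea concrete, here is a minimal PyTorch sketch. It uses torchvision's resnet18 purely as a stand-in image model (the weights argument assumes a recent torchvision version), and the clip is random data for illustration: the 2D classifier is simply applied to each frame independently, with nothing connecting one frame's prediction to the next.

```python
import torch
from torchvision.models import resnet18

# Frame-by-frame analysis: an ordinary 2D image model is applied to each
# frame on its own, producing one independent prediction per frame.
model = resnet18(weights="IMAGENET1K_V1").eval()

# A short clip treated as a batch of frames: 16 frames, each a 224x224 RGB image.
video = torch.randn(16, 3, 224, 224)

with torch.no_grad():
    per_frame_logits = model(video)   # shape (16, 1000): one prediction per frame
```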

More complex video ML systems recognize not just objects in each frame, but also patterns of action across multiple frames. Extracting this additional information enables applications such as traffic pattern analysis and detection of criminal activity, and it is a prerequisite to learning basic facts about how the physical world works.

A number of approaches have been developed to capture video motion patterns. One approach uses 3D convolution. This extends standard 2D convolutional neural networks, extensively used for single image analysis, by analyzing the 3D patterns formed by stacking sequences of 2D frames.
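
As a rough illustration of the idea (not any particular published architecture), the sketch below applies a single 3D convolution to a stack of frames. Because the kernel spans the time axis as well as height and width, each filter can respond to short motion patterns rather than appearance alone; all sizes are illustrative.

```python
import torch
import torch.nn as nn

# A clip is a stack of frames: (batch, channels, frames, height, width).
clip = torch.randn(1, 3, 16, 112, 112)

# The kernel covers 3 frames x 7x7 pixels, so it slides through time as well as space.
conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=(3, 7, 7),
                   stride=(1, 2, 2), padding=(1, 3, 3))

features = conv3d(clip)   # shape (1, 64, 16, 56, 56): spatio-temporal feature maps
```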

A second approach can handle longer video sequences by taking advantage of ML sequence models developed for language processing, such as the LSTM or the Transformer. In this approach, 2D convolutional neural networks first extract features from each frame to create a sequence of ‘tokens’ representing a video clip. These tokens are then processed by a sequence model in a way analogous to the processing of a sequence of linguistic tokens, or words.
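
The sketch below illustrates this two-stage idea with assumed, illustrative sizes: a 2D backbone turns each frame into a feature vector (a ‘token’), and a small Transformer encoder then models how those tokens evolve over time.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# 16 frames of one clip, each a 224x224 RGB image.
frames = torch.randn(16, 3, 224, 224)

# Stage 1: a 2D CNN backbone turns each frame into a 512-d feature vector.
backbone = resnet18(weights=None)
backbone.fc = nn.Identity()            # drop the classifier head, keep the features

# Stage 2: a Transformer encoder models the sequence of frame tokens over time.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
temporal_model = nn.TransformerEncoder(encoder_layer, num_layers=2)

tokens = backbone(frames).unsqueeze(0)    # (1, 16, 512): one token per frame
clip_features = temporal_model(tokens)    # tokens contextualized across time
clip_embedding = clip_features.mean(dim=1)  # (1, 512): could feed a classifier
```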

The TimeSformer borrows the concept of attention from sequence models to capture patterns in both space and time

A third, more recently developed approach has shown improvements in accuracy and efficiency over both 3D convolution and the 2D convolution + sequence model approach. The TimeSformer borrows the concept of attention from sequence models to capture patterns in both space and time.

The TimeSformer (from ‘time-space transformer’) divides each video frame into a grid of non-overlapping patches (sixteen per frame in this simplified example). The example illustrates the TimeSformer processing the single patch colored blue, which is part of a three-frame video. The red and green coloring illustrates the idea of attention: the red and green patches are what the analysis of the blue patch ‘pays attention to’.

TimeSformer ‘paying attention’

Paying attention in this case means that the TimeSformer’s analysis of the blue patch is based on calculations that combine the characteristics of the blue patch, the red (spatially relevant) patches, and the green (temporally relevant) ones. The analysis of a whole frame combines the characteristics of all the patches in the frame. Although this example shows three frames being used to analyze the blue patch, the operational system actually uses up to 96 frames to analyze every patch in every frame.
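
The following is a minimal sketch of this ‘divided’ space-time attention idea, not the TimeSformer implementation itself; tensor shapes and layer sizes are illustrative. Each patch embedding first attends across frames at the same spatial position (temporal attention), then across the other patches in its own frame (spatial attention).

```python
import torch
import torch.nn as nn

# Patch embeddings for a clip: (batch, frames, patches per frame, feature dim).
B, T, P, D = 2, 8, 196, 512
x = torch.randn(B, T, P, D)

temporal_attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)
spatial_attn = nn.MultiheadAttention(D, num_heads=8, batch_first=True)

# Temporal attention: fold patch positions into the batch and attend across frames,
# so each patch looks at the same position in the other frames.
xt = x.permute(0, 2, 1, 3).reshape(B * P, T, D)
xt, _ = temporal_attn(xt, xt, xt)
x = x + xt.reshape(B, P, T, D).permute(0, 2, 1, 3)

# Spatial attention: fold frames into the batch and attend across patches,
# so each patch looks at the other patches in its own frame.
xs = x.reshape(B * T, P, D)
xs, _ = spatial_attn(xs, xs, xs)
x = x + xs.reshape(B, T, P, D)   # updated patch features combining space and time
```

Splitting attention into these two cheaper passes, rather than attending over every patch in every frame at once, is what keeps the computation manageable as the number of frames grows.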

In tests on a video dataset depicting 400 human action classes, the TimeSformer achieved higher classification accuracy than the best previous result. In addition, although the TimeSformer uses a relatively large number of parameters (121 million) to represent all the interrelationships among frames and patches, its calculations are particularly efficient because they make good use of high-speed vector/matrix processors. As a result, the TimeSformer used up to 70 percent less computation for inference and 90 percent less for training than the alternative approaches described above.

Common Sense Training

ML systems such as the TimeSformer can learn patterns of motion from video, patterns that can reflect basic facts about the physical world. To explore whether this might enable ML systems to learn ‘common sense’, researchers have created the ‘Something-Something’ (SS) database.

The SS training set is intended to teach a video ML system the basic language of movement, a component of ‘common sense’.

This database is a training set of videos showing multiple instances of generic actions such as ‘Moving something from left to right’ or ‘Putting something into something’. The SS training set is intended to teach a video ML system the basic language of movement, a component of ‘common sense’. The expectation is that, with this basic training, the system will be able to learn more specialized tasks more quickly.

Examples from the SS database

The database was created using a novel approach to training set development. Rather than gathering and annotating videos from the web, a team of workers created the videos according to generic action templates. For example, for the generic action template ‘Putting something into something’, workers would record themselves performing the action using a variety of specific instances, such as ‘Putting a white remote into a cardboard box’ or ‘Putting a sweater into a drawer’. They annotated the video clips simply by identifying the ‘somethings’ they used for the template placeholders. While the actual annotation was simple, the creation of the training data relied on the annotators’ deep understanding of the physical world.
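
Here is a small sketch of what such template-based labels might look like in code. The field names are illustrative, not the dataset’s actual schema: each entry pairs a generic action template with the specific ‘somethings’ a worker filled in, and the full caption can be reconstructed from the two.

```python
# Illustrative template-plus-placeholder annotations (hypothetical field names).
annotations = [
    {
        "template": "Putting [something] into [something]",
        "placeholders": ["a white remote", "a cardboard box"],
    },
    {
        "template": "Moving [something] from left to right",
        "placeholders": ["a coffee mug"],
    },
]

# Reconstruct the specific caption by filling each placeholder in order.
for item in annotations:
    label = item["template"]
    for value in item["placeholders"]:
        label = label.replace("[something]", value, 1)
    print(label)   # e.g. "Putting a white remote into a cardboard box"
```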

While the actual annotation was simple, the creation of the training data relied on the annotators’ deep understanding of the physical world

The SS database has been used to test several video ML systems. So far, state-of-the-art ML systems such as TimeSformer have been able to achieve recognition accuracies in the 60 percent range. While this work is in the preliminary stages, it shows how innovative training can begin to give ML systems ‘common sense’, potentially increasing their efficiency in learning subsequent specialized tasks.

Takeaway

Video provides a rich source of information about the physical world. Frame-by-frame video analysis is useful in many applications; however, more sophisticated video analysis can be performed by ML systems that can learn complex temporal as well as spatial patterns. With proper training, these systems can begin to learn basics about the physical world, potentially leading to more efficient machine learning across a variety of more specialized tasks. iMerit’s broad experience in all facets of training set development makes them an exceptional partner to help you take advantage of advances like this in machine learning.

To find out how to teach your ML system the basics and specifics of your application, contact us to talk to an expert.