As videos have become easier to distribute and consume, we have seen considerable developments in machine learning algorithms that can analyze their semantic content. Video classification has a wide range of use cases, most notably search engine optimization and video summarization.
Compared to images, videos also have a time dimension. In essence, they are a sequence of individual images, which makes image classification a solid baseline to build upon. With video, we can typically make the assumption that subsequent frames in a video are correlated with respect to their semantic contents.
Before making a start, we will need to differentiate between convolutional neural networks (CNNs) and recurrent neural networks (RNNs):
- CNNs are commonly used for solving problems related to spatial data, such as images. The size of the input and the resulting output are fixed: a CNN receives images of a fixed size and outputs the appropriate label, along with the confidence level of its prediction.
- RNNs are better suited to analyzing temporal, sequential data, such as text. In RNNs, the size of the input and the resulting output may vary.
Video Classification vs Image Classification
To achieve state-of-the-art performance in image classification, researchers have leveraged deep convolutional neural networks (CNNs). This architecture has been demonstrated to be an effective class of models for image recognition, segmentation, detection and retrieval. The key enabling factors behind these results were techniques for scaling up the networks to tens of millions of parameters and massive labeled datasets that can support the learning process.
To perform image classification, you input an image to a convolutional neural network (CNN), typically one pretrained on a large dataset such as ImageNet, obtain the predictions from the network, and select the label with the largest corresponding probability.
Since a video is just a series of frames, a naive video classification method would be to pass each frame from a video file through a CNN, classify each frame individually and independently of the others, choose the label with the largest corresponding probability for each frame, and assign the most frequently predicted frame label to the video.
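As a concrete illustration, here is a minimal sketch of this naive per-frame approach, assuming an ImageNet-pretrained Keras model (MobileNetV2 is chosen purely for illustration), OpenCV for frame decoding, and a simple majority vote over frame labels; the sampling interval and function name are placeholders.

```python
# Minimal sketch: naive per-frame video classification with a pretrained image CNN.
import cv2                                   # OpenCV, used here to decode video frames
import numpy as np
from collections import Counter
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input, decode_predictions

model = MobileNetV2(weights="imagenet")      # any ImageNet-pretrained CNN would work

def classify_video_naive(video_path, sample_every=10):
    """Classify sampled frames independently, then majority-vote the video label."""
    cap = cv2.VideoCapture(video_path)
    frame_labels, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_every == 0:
            rgb = cv2.cvtColor(cv2.resize(frame, (224, 224)), cv2.COLOR_BGR2RGB)
            x = preprocess_input(np.expand_dims(rgb.astype("float32"), axis=0))
            preds = model.predict(x, verbose=0)
            frame_labels.append(decode_predictions(preds, top=1)[0][0][1])  # class name
        idx += 1
    cap.release()
    # The video label is simply the most frequently predicted frame label.
    return Counter(frame_labels).most_common(1)[0][0] if frame_labels else None
```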
Training neural networks on video is a very challenging task due to the large amount of data involved. Typical approaches take an image-based network, and train it on all the frames from all videos in the training dataset.
However, applying simple image classification frame by frame produces “prediction flickering”, where the label for the video changes rapidly as individual scenes get labeled differently. To reduce this flickering, you can use a rolling prediction average that smooths out sudden changes between labels: compute the average of the last n predictions and choose the label with the largest corresponding averaged probability. The assumption here is that subsequent frames in a video have similar semantic contents; if that holds, the averaging takes advantage of the temporal nature of videos, smooths out the predictions, and makes for a better video classifier.
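A rolling average over the per-frame probability vectors can be sketched as follows; the window size n, the helper name, and the assumption that predictions arrive as probability vectors are all illustrative choices.

```python
# Minimal sketch: smoothing per-frame predictions with a rolling average over the last n frames.
from collections import deque
import numpy as np

def smooth_predictions(per_frame_probs, class_names, n=30):
    """Average the last n probability vectors before choosing a label for each frame."""
    window = deque(maxlen=n)                 # keeps only the most recent n predictions
    smoothed_labels = []
    for probs in per_frame_probs:            # probs: 1-D array of class probabilities
        window.append(probs)
        avg = np.mean(window, axis=0)        # average over up to the last n frames
        smoothed_labels.append(class_names[int(np.argmax(avg))])
    return smoothed_labels
```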
A successful video classifier not only provides accurate frame labels, but also best describes the entire video given the features and annotations of its individual frames. For example, a video might contain a tree in some frames, but the label that is central to the video might be something else. The granularity of the labels needed to describe the frames and the video depends on the task.
If we are able to take advantage of the temporal nature of videos, we can improve our video classification results using more advanced neural network architectures such as recurrent neural networks (RNNs) and, in particular, long short-term memory (LSTM) networks. These models are well suited to time series data, but they are also resource-hungry and time-consuming to train over thousands of video files.
By using a combined CNN-RNN architecture, we can get the best of both worlds. The images of a video are fed to a CNN model to extract high-level features. The features are then fed to an RNN layer and the output of the RNN layer is connected to a fully connected layer to get the classification output. The goal of RNN models is to extract the temporal correlation between the images by keeping a memory of past images.
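One possible realization of such a combined CNN-RNN model is sketched below in Keras, under the assumption of fixed-length clips and a frozen ImageNet backbone; the clip length, backbone choice, layer sizes, and number of classes are illustrative, not a prescribed architecture.

```python
# Minimal sketch: a CNN feature extractor applied per frame, followed by an LSTM classifier.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_FRAMES, H, W, C = 20, 224, 224, 3        # fixed-length clips of 20 frames (assumption)
NUM_CLASSES = 10                             # placeholder number of activity classes

# CNN backbone applied to every frame independently to extract high-level features.
backbone = tf.keras.applications.MobileNetV2(
    include_top=False, weights="imagenet", pooling="avg", input_shape=(H, W, C))
backbone.trainable = False                   # keep the pretrained feature extractor frozen

inputs = layers.Input(shape=(NUM_FRAMES, H, W, C))
frame_features = layers.TimeDistributed(backbone)(inputs)     # (batch, frames, features)
x = layers.LSTM(64)(frame_features)          # RNN layer models temporal correlations
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)  # fully connected classifier

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```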
Challenges
CNNs require extensively long training times to effectively optimize the millions of parameters that parametrize the model. This difficulty is further compounded when extending the connectivity of the architecture in time, because the network must process not just one image but several frames of video at a time. One effective approach to speeding up CNN runtime is to modify the architecture to contain two separate streams of processing: a context stream that learns features on low-resolution frames and a high-resolution fovea stream that operates only on the middle portion of the frame.
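Below is a rough sketch of how the two input streams might be prepared from a single frame; the output size and the extent of the center crop are assumptions made for illustration rather than the dimensions of any particular published model.

```python
# Rough sketch: preparing a low-resolution context view and a high-resolution center (fovea) crop.
import cv2

def make_two_stream_inputs(frame, out_size=112):
    """Return (context, fovea) views of a single frame, both resized to out_size x out_size."""
    h, w = frame.shape[:2]
    # Context stream: the entire frame, downsampled to a low resolution.
    context = cv2.resize(frame, (out_size, out_size))
    # Fovea stream: the middle portion of the frame, kept at higher effective resolution.
    cy, cx = h // 2, w // 2
    half = min(h, w) // 4                    # size of the center crop (illustrative)
    fovea = frame[cy - half:cy + half, cx - half:cx + half]
    fovea = cv2.resize(fovea, (out_size, out_size))
    return context, fovea
```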
Similarly, because CNNs require inputs of a fixed size, another challenge of training video classifiers is figuring out how to feed the videos to the network. Since a video is an ordered sequence of frames, we could extract all the frames and stack them into a single tensor, but the number of frames differs from video to video, which would prevent us from stacking clips into batches. As an alternative, we can sample video frames at a fixed interval until a maximum frame count is reached.
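A minimal sketch of this fixed-interval sampling, assuming OpenCV for decoding; the interval, maximum frame count, frame size, and the choice to pad short clips with blank frames are all illustrative.

```python
# Minimal sketch: sample frames at a fixed interval up to a maximum count, padding short clips.
import cv2
import numpy as np

def extract_frames(video_path, max_frames=20, interval=5, size=(224, 224)):
    """Read every `interval`-th frame until `max_frames` frames are collected."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % interval == 0:
            frames.append(cv2.cvtColor(cv2.resize(frame, size), cv2.COLOR_BGR2RGB))
        idx += 1
    cap.release()
    # Pad with blank frames so every clip stacks into a tensor of the same shape.
    while frames and len(frames) < max_frames:
        frames.append(np.zeros_like(frames[0]))
    return np.array(frames)                  # shape: (max_frames, height, width, 3)
```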
From a practical standpoint, there are currently no video classification benchmarks that match the scale and variety of existing image datasets because videos are significantly more difficult to collect, annotate and store.
Video Classification and Human Activity Recognition
Human Activity Recognition refers to classifying or predicting the action performed by one or more people. It is a type of time series classification problem where you need data from a series of timestamps to correctly classify the action being performed. In contrast with object detection, which can operate on a single image, human action recognition is based on video, where the ML algorithm needs a series of data points to correctly predict the action.
Imagine a person doing a backflip. If we only took a single frame of that person mid-air, a deep learning model may inaccurately label it as ‘falling’. However, feeding a whole video with a person jumping, flipping, and landing will offer the model enough information to make an accurate prediction.
The model also learns the environmental context. With enough examples, the model learns that a person with a running pose on a football field is most likely playing football, and that a person with the same pose on a track or a road is probably running. However, it is often the case that a trained model learns to classify videos from the environmental context instead of the actual action sequence, a form of overfitting. Consider the actions of opening a door and closing a door: in both, the individual frames are almost the same, and the difference is the order of the frame sequence, so we need temporal information to correctly distinguish these actions.
Types of Activity Recognition Problems
We can classify human activity recognition into three main categories:
- Simple Activity Recognition – This type refers to simple videos with a single human clearly performing an action, where the video length only encapsulates the activity and nothing else. The examples we have described earlier such as running, playing football, or opening a door, all fall within this category.
- Temporal Activity Recognition/Localization – This includes longer videos that contain more than a single activity. Temporal activity localization requires models whose architecture can localize each individual action into temporal proposals, and can also classify each video clip or temporal proposal.
- Spatio-Temporal Detection – This type of video includes not only a single human performing multiple actions, but multiple humans within the same video, where each one may be performing different actions. To tackle this, the algorithm has to detect and localize each person in the video and classify the activities being performed by each individual. Additionally, it also needs to note the time span of each action being performed, just like in temporal activity recognition.
Video Classification Methods
These video classification methods represent different approaches for labeling a short video clip with the human activity being performed in that clip.
- Single-Frame CNN – As discussed above, a single-frame CNN is a simple model which uses an image classification algorithm. It assigns a label to a video by averaging the predictions assigned to each individual frame within the video to get the final probabilities vector.
- Late Fusion – This method is similar to single-frame CNN in the sense that it labels each individual frame of a video. The difference is that in single-frame CNN the averaging of the predictions is conducted after labeling is complete, whereas in late fusion the averaging is built into the network itself. Because of this, the temporal structure of the frame sequence is also taken into account. This is done by implementing a fusion layer that combines the outputs of separate networks that operate on temporally distant frames. The approach enables the model to learn spatial as well as temporal information about the appearance and movement of the objects in a scene. Each stream performs image (frame) classification on its own, and in the end the predicted scores are merged using the fusion layer.
- Early Fusion – In contrast to late fusion, this approach fuses the temporal and channel dimensions of the video before passing it to the model, which allows the first layer to operate over frames and learn to identify local pixel motion between adjacent frames. An input video typically has the following dimensions: time, three color channels (red, green, and blue), height, and width. After fusion, the time dimension is merged into the channel dimension.
- Using CNNs with long short-term memory (LSTM) networks – LSTM networks are a type of recurrent neural network capable of learning order dependence in sequence prediction problems. This approach uses convolutional networks to extract local features of each frame. The outputs of these independent convolutional networks are fed to a many-to-one multilayer LSTM network to fuse the extracted information temporally.
- Using Optical Flow and CNNs – Optical flow, a technique used for motion tracking, is a per-pixel prediction which assumes brightness constancy to estimate how pixels move across the screen over time. Using two parallel streams of convolutional networks, we define a spatial stream, which takes a single frame from the video and runs CNN kernels to make a prediction based on its spatial information, and a temporal stream, which merges frames using the early fusion technique and computes the optical flow between adjacent frames to make a prediction. Lastly, this technique averages the predicted probabilities from both streams to get the final probabilities. The downside of this technique is its reliance on an optical flow algorithm external to the main network to compute optical flow for each video.
- Using Slow-Fast Networks – This technique also uses two parallel streams. The first one, called the slow branch, operates on a low temporal frame rate version of the video and has many channels at every layer for detailed processing of each frame. The second stream, the fast branch, has fewer channels and operates on a high temporal frame rate version of the same video. The two streams are connected so that information from the fast branch is merged into the slow branch at multiple stages.
- Using 3D CNNs / Slow Fusion – This approach processes temporal and spatial information together with a three-dimensional convolutional network. Unlike early and late fusion, this method fuses temporal and spatial information slowly at each CNN layer throughout the entire network. A four-dimensional tensor per clip, containing two spatial dimensions, one channel dimension and one temporal dimension, is passed through the model, allowing it to learn all types of temporal interactions between adjacent frames. Operating on a higher number of input dimensions also increases the computational and memory requirements (a minimal sketch follows this list).
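As an example of the 3D CNN approach, here is a minimal Keras sketch; the clip length, layer widths, and pooling schedule are illustrative choices and do not reproduce any specific published slow-fusion architecture.

```python
# Minimal sketch: a small 3D CNN that convolves over time and space jointly.
from tensorflow.keras import layers, models

NUM_FRAMES, H, W, C = 20, 112, 112, 3        # fixed-length clips (assumption)
NUM_CLASSES = 10                             # placeholder number of classes

model = models.Sequential([
    layers.Input(shape=(NUM_FRAMES, H, W, C)),             # (time, height, width, channels)
    layers.Conv3D(32, kernel_size=(3, 3, 3), activation="relu", padding="same"),
    layers.MaxPooling3D(pool_size=(1, 2, 2)),               # pool spatially, keep time early on
    layers.Conv3D(64, kernel_size=(3, 3, 3), activation="relu", padding="same"),
    layers.MaxPooling3D(pool_size=(2, 2, 2)),               # gradually fuse temporal information
    layers.Conv3D(128, kernel_size=(3, 3, 3), activation="relu", padding="same"),
    layers.GlobalAveragePooling3D(),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```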
Data Preparation and Preprocessing
Considering that video classification machine learning algorithms are highly dependent on labeled data, the preparation process will have a considerable impact on the performance of the algorithm.
Human activity recognition is one of the most important use cases for video classification, and will be used as the example for preparing data.
Human activities themselves can be divided into three main categories. These include:
- Human-Object Interaction – for example, playing a musical instrument, opening a door, throwing a ball
- Body-Motion Only – typical activities include walking, running, sitting down and standing up, among many others
- Human-Human Interaction – involves more than one human within the same video such as playing sports, shaking hands, and talking
It’s also worth noting that major activities such as playing musical instruments and sports deserve their own categories for the purpose of training a video classification model.
Data consistency – the main challenge in assembling a suitable dataset for video classification is consistency, where the selected video clips should have the following characteristics:
- Video length is the same or within a tolerance of a few frames. To address any discrepancies between video lengths, a good guiding principle is to take the longest video in the set and note its frame count; all shorter videos then have buffer frames added at the end so that every video reaches the same length (see the padding sketch after this list).
- Videos do not contain any jump cuts or major changes. This is particularly difficult to extract from videos in the wild, which is one of the reasons why getting longer videos (>5 seconds) may not be feasible.
- The human subject in the video is well positioned within the frame to capture the activity’s movements. For example, playing guitar may only require the upper half of the body and the instrument to be captured, whilst fencing requires the framing of both subjects’ full bodies.
- The videos have some consistent characteristics, in that the clips belonging to one group share common features such as the background or actors. For example, all football-related videos should be captured on a playing field.
- Videos should be normalized and resized so that all clips have consistent frame sizes and resolutions.
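A minimal padding sketch is shown below, assuming each clip has already been decoded into an array of frames with consistent spatial dimensions; repeating the last frame is one reasonable choice for the buffer frames, and blank frames would work as well.

```python
# Minimal sketch: pad every clip to the frame count of the longest video in the set.
import numpy as np

def pad_clips_to_longest(clips):
    """clips: list of arrays shaped (num_frames, height, width, 3) with equal spatial sizes."""
    max_len = max(clip.shape[0] for clip in clips)   # frame count of the longest video
    padded = []
    for clip in clips:
        deficit = max_len - clip.shape[0]
        if deficit > 0:
            # Repeat the last frame as buffer padding so every clip reaches the same length.
            buffer = np.repeat(clip[-1:], deficit, axis=0)
            clip = np.concatenate([clip, buffer], axis=0)
        padded.append(clip)
    return np.stack(padded)                  # shape: (num_videos, max_len, height, width, 3)
```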
One of the most used pre-labeled datasets for human activity recognition is the UCF101 dataset, which is known to not contain extreme variations in objects and actions across frames.
To convert the video data into a trainable format, each video needs to be converted into a sequence of image frames, so that every frame can be processed and labeled by itself, especially for single-frame CNN methods. This needs to be accompanied by a CSV file which contains reference information for each video, such as its class, whether it belongs to the training or test set, and its frame count.
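The metadata file might look like the following sketch; the column names and example rows are illustrative (the file names mimic the UCF101 naming style, and the frame counts are made up).

```python
# Minimal sketch: writing the metadata CSV that accompanies the extracted frames.
import csv

rows = [
    # video file name, class label, train/test split, number of extracted frames (illustrative)
    {"video_name": "v_Basketball_g01_c01.avi", "class": "Basketball", "split": "train", "frame_count": 164},
    {"video_name": "v_Fencing_g02_c03.avi", "class": "Fencing", "split": "test", "frame_count": 131},
]

with open("video_metadata.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["video_name", "class", "split", "frame_count"])
    writer.writeheader()
    writer.writerows(rows)
```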
Following industry best practices, the data should be split into roughly 80% for training and 20% for validation, and should be randomized across the two splits to ensure a fair distribution of classes and to prevent any leakage from the test set into the training set.
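A minimal sketch of such a split using scikit-learn, assuming the metadata CSV from the previous sketch; splitting at the video level (one row per video) keeps frames from the same video from leaking into both sets.

```python
# Minimal sketch: an 80/20 stratified split over the video-level metadata.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("video_metadata.csv")       # one row per video, with a "class" column
train_df, val_df = train_test_split(
    df,
    test_size=0.2,                           # 20% held out for validation
    stratify=df["class"],                    # keep the class distribution similar in both splits
    shuffle=True,
    random_state=42,                         # fixed seed for reproducibility
)
```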
Video Labeling-as-a-Service
For large-scale video classification use cases which go beyond the UCF101 and other open video datasets, iMerit offers video labeling services to provide you with a consistent flow of expertly-labeled data. Our tried-and-tested methodology and long-term talent pool ensure that we keep consistent labeling practices across time and projects.