YOLO (You Only Look Once) is a fast and effective deep neural network (DNN) architecture that can identify and locate multiple objects in video, in real time. YOLO is a great example of innovative architectural elements combined to create a state-of-the-art machine learning system, in this case one for computer vision applications such as autonomous driving. The original version of this real-time object detection algorithm was developed in 2015 and described in You only look once: unified, real-time object detection, a paper by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. Although in subsequent years the algorithm has evolved through a number of YOLO versions, its basic approach has remained the same. In this article we describe the main features of YOLO’s approach.
The basic YOLO architecture is summarized in the figure below. (The image is from the original YOLO paper.)
The input to YOLO is an image and the output is the identification of objects in the image and their locations, as depicted by the labels and colored boxes in the diagram above. YOLO is fast enough to identify objects in real time from video, as shown in this demonstration.
YOLO uses a convolutional neural network (ConvNet) as its basic machine learning model, and it includes the following design features:
- YOLO uses a fixed grid superimposed on the input image to define sub-images, or grid cells. In the example above, the grid is 7 x 7, resulting in 49 grid cells. The YOLO ConvNet is applied simultaneously to all the grid cells, producing a single output for the entire input image.
- The output of YOLO’s ConvNet includes both object classification information (e.g., whether an object is a dog or a bicycle), and localization information (coordinates of the bounding box for the object).
- YOLO uses anchor boxes to allow detection of multiple objects within a grid cell.
- YOLO uses non-max suppression to consolidate multiple detections of the same object.
We explain these features in detail below. But first, we present a brief tutorial on convolutional neural networks, the basis for the machine learning model used in YOLO.
ConvNets like YOLO’s are based on convolution, which in the context of machine learning means multiplying small patches of an image by filters (also called ‘kernels’). These filters are systematically scanned across the image to produce an output that reflects the presence of relevant details detected by the filters at various locations in the image.
The example below shows the convolution of a 5 x 5 image with a 3 x 3 filter. The filter is first superimposed on the upper left-hand corner (shaded) of the image. We perform an element-by-element multiplication of the filter and the image, and sum the products. This gives the number 15, which becomes the first element of the 3 x 3 output of the convolution. The full convolution is completed by shifting the filter one pixel at a time horizontally and vertically, until the entire image has been covered and all the values of the 3 x 3 output have been calculated.
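The sliding-window computation described above can be sketched in a few lines of NumPy. (The image and filter values here are illustrative, not the exact numbers from the figure.)

```python
import numpy as np

def conv2d(image, kernel):
    """Valid (no-padding) 2-D convolution with stride 1, as in the
    worked example: slide the filter over the image one pixel at a
    time, multiplying element-by-element and summing the products."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Element-wise multiply the patch under the filter, then sum
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# Illustrative 5 x 5 image and 3 x 3 filter
image = np.arange(25).reshape(5, 5)
kernel = np.ones((3, 3))
print(conv2d(image, kernel).shape)  # (3, 3), as in the example
```

A 5 x 5 image convolved with a 3 x 3 filter yields a 3 x 3 output, because the filter fits in only three positions horizontally and three vertically.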
ConvNets use machine learning to set the values of their filters. Training data sets composed of images annotated with ground truth are presented to the convolutional network, and the backpropagation algorithm adjusts the values in the filters to produce the desired output. For an image classifier, the desired output would be, for example, the correct classification of an image as a dog.
ConvNets have proven to work very well for computer vision. One reason is that convolutional filters can learn the shapes, textures, and other spatial features that are important to interpreting what’s in an image. Another reason is that the filter values that are learned, nine in the example above, are applicable across the entire input image, no matter what its size. In the example above, if the image were input directly to a standard neural network, the first layer would need to learn 25 weights instead of just nine. For a larger image, say 225 x 225, the difference would be nine weights versus over 50,000.
Deep learning ConvNets such as YOLO’s extend the basic two-dimensional convolution shown above to three dimensions by adding a ‘channel’ dimension to the two spatial dimensions. For example, a color version of the example image above could be processed using 3 x 3 filters for each of the red, green, and blue channels, as illustrated below:
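The extension to three dimensions can be sketched by adding a channel axis to the 2-D version: the products are summed over channels as well, so one filter yields a single 2-D map, and stacking the maps from a bank of filters produces the 3-D output. (The sizes here are illustrative.)

```python
import numpy as np

def conv2d_multichannel(image, kernel):
    """Valid convolution of an H x W x C image with a kh x kw x C
    filter. Products are summed over the channel dimension too, so
    one filter yields a single 2-D output map."""
    ih, iw, c = image.shape
    kh, kw, kc = kernel.shape
    assert c == kc, "filter must span all input channels"
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw, :] * kernel)
    return out

rgb = np.ones((5, 5, 3))              # illustrative 5 x 5 RGB image
filters = [np.ones((3, 3, 3))] * 4    # a bank of 4 filters
# Stacking each filter's 2-D map gives a 3 x 3 x 4 output tensor
out = np.stack([conv2d_multichannel(rgb, f) for f in filters], axis=-1)
print(out.shape)  # (3, 3, 4)
```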
In this case the input image, the filter, and the convolution output are all three-dimensional arrays of numbers, or tensors. Tensors show up in all types of DNNs, and working with them has become much easier with the availability of toolkits such as TensorFlow and PyTorch, and of special-purpose hardware such as Graphics Processing Units (GPUs) to speed up processing. In addition, general computer science progress in languages, development environments, and repositories such as GitHub has made data science experimentation and building on prior results much more efficient. The ability to build upon previously developed software and data sets and efficiently experiment with many different options – filter sizes and other hyperparameters – has been essential to YOLO’s success.
The YOLO model uses much larger tensors than those in the simple example above. A typical YOLO ConvNet layer might use filters that are 3 x 3 in the spatial dimensions; applying 192 such filters produces an output tensor with 192 channels. The use of this many channels increases the capacity of the ConvNet, improving its ability to distinguish shapes and patterns in the input image.
The capacity of the YOLO ConvNet is also increased by using multiple layers, 24 layers in the first version of YOLO, YOLOv1. It’s easy to extend convolution to multiple layers – the output of one layer can be processed by the next layer as if it were an input ‘image’. In a typical ConvNet, as the data goes from layer to layer the spatial dimensions get smaller and smaller, and the channel dimension gets larger. You can see an example of this in this diagram of YOLOv1:
As is typical of ConvNets, YOLO uses another type of layer, the maxpool layer, inserted between convolutional layers. A maxpool layer is a simplified version of a convolutional layer: instead of multiplying the pixels underneath it by a filter, it simply outputs the maximum value of the pixels it covers. Maxpool layers reduce the spatial dimensions of the data, making the representation more robust to small shifts and helping match the dimensions of one layer’s output tensor with the dimensions required for the next layer’s input tensor.
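A minimal sketch of max pooling, using the 2 x 2 window with stride 2 that is common between convolutional layers (the input values are illustrative):

```python
import numpy as np

def maxpool2d(x, size=2, stride=2):
    """2 x 2 max pooling with stride 2: each output pixel is the
    maximum of the window it covers, halving each spatial dimension."""
    h, w = x.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = x[i*stride:i*stride+size,
                          j*stride:j*stride+size].max()
    return out

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 0],
              [2, 1, 8, 7],
              [0, 3, 4, 9]])
print(maxpool2d(x))  # [[6. 4.] [3. 9.]]
```

Note how a 4 x 4 input becomes a 2 x 2 output: this is how spatial dimensions shrink from layer to layer while the channel dimension grows.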
YOLO also uses a third type, fully connected layers, to produce its final output. These layers pull together the information extracted through the multiple convolution layers to produce the object classifications and locations that are YOLO’s output.
Because YOLO’s output includes both identification and localization information, YOLO can process a video frame in a single pass, at rates of 65 frames per second (FPS) or more. (Two-stage object detection models, such as Fast R-CNN and Faster R-CNN, use a first pass for segmentation/object localization and a second pass for object classification, and tend to be slower than YOLO. Other one-stage detectors, such as RetinaNet and the Single-Shot MultiBox Detector (SSD), take an approach closer to YOLO’s.)
Let’s look at how YOLO’s output is organized to allow one-pass operation.
The image classification part of YOLO’s output follows a format typically used for ConvNets. Here is a simple example: a ConvNet built to classify an image as a dog, cat, parrot, or none of these. This ConvNet has three units in the output layer which use a SoftMax activation function to estimate class probabilities. The following are the target output vectors for different types of input image:
ConvNet output vector training targets for simple 3-class example
After this ConvNet is trained, it might produce an output like this, given a dog image from ImageNet:
Simple ConvNet classifying an input image
The output vector is highest in the first position, indicating a .79 probability that the image is a dog.
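Class probabilities like these come from the SoftMax activation in the output layer, which converts raw scores into probabilities that sum to 1. A minimal sketch (the logit values are made up for illustration, not taken from a trained network):

```python
import numpy as np

def softmax(z):
    """Convert raw output-layer scores (logits) into probabilities
    that are positive and sum to 1."""
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

# Hypothetical logits for the 3-class example: dog, cat, parrot
logits = np.array([2.0, 0.5, 0.1])
probs = softmax(logits)
print(probs.round(2))  # highest probability in the "dog" position
```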
YOLO follows this format to identify the classifications of its objects. YOLOv1 was trained using the Pascal VOC dataset (Visual Object Classes) to recognize 20 different object classes, so instead of three elements in the example above, the classification part of its output vector had 20 elements, c1, c2, c3, … , c20. (Later versions of YOLO were trained to recognize 80 different object classes using the COCO dataset (Common Objects in Context).)
Adding bounding box information to its output vector is what makes YOLO a one-pass object detector. Here is what the classification plus bounding box output vector looks like, for one of YOLO’s grid cells (image from the Stanford Car Dataset):
Adding localization to the YOLO ConvNet output vector
The first element of this output vector, pc, is trained to be “1” if there is an object in the grid cell, “0” otherwise. This simplifies both training and object detection by allowing the ConvNet to ignore the rest of the output vector for grid cells containing no objects.
The next set of components in the output vector are the bounding box coordinates. These use the coordinate system defined by an origin (0,0) at the upper left-hand corner of the grid cell, and the coordinates (1,1) at the lower right-hand corner.
In the output vector, (bx, by) are the coordinates of the midpoint of the bounding box. Each of these coordinates will be a number between 0 and 1; in the example above, their values would be (.55, .71). (bh, bw) are the height and width of the bounding box, defined in the same coordinate system. If an object extends beyond the boundary of the grid cell, bh and bw can take on values greater than 1.
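A sketch of building one grid cell’s target vector, assuming the layout described in the text: pc first, then the four box coordinates, then the 20 class elements. (The class index and the bh/bw values below are hypothetical, chosen only to illustrate the encoding.)

```python
def make_target(obj_present, box=None, class_id=None, num_classes=20):
    """Encode one grid cell's target vector as
    [pc, bx, by, bh, bw, c1 ... c20].
    box = (bx, by, bh, bw) in grid-cell coordinates: (0,0) is the
    cell's upper left, (1,1) its lower right; bh and bw may exceed 1
    if the object spills outside the cell."""
    v = [0.0] * (5 + num_classes)
    if obj_present:
        v[0] = 1.0                 # pc: an object is centered here
        v[1:5] = list(box)         # bx, by, bh, bw
        v[5 + class_id] = 1.0      # one-hot class label
    return v

# The car example: midpoint (.55, .71); hypothetical size and class index
target = make_target(True, box=(0.55, 0.71, 0.48, 0.91), class_id=6)
print(target[:5])  # [1.0, 0.55, 0.71, 0.48, 0.91]
```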
Objects Spanning Multiple Grid cells
In the car example above, the object is fully contained within a single grid cell. However, YOLO often encounters objects that overlap one or more grid cells. YOLO handles this situation by assigning an object to the grid cell that contains the midpoint of its bounding box. This trains the ConvNet to detect objects that are centered on its grid cell. Here’s an example.
Labeling multi-grid cell objects
In this example a training image of a dog has been divided into 3 x 3 grid cells. An annotator has drawn the bounding box for the dog, shown in green. The midpoint of the dog’s bounding box falls in the center grid cell, so that grid cell is assigned to detect the dog. The annotator specifies the target output vector for this grid cell as shown here, reflecting the detection of the dog and its bounding box. In contrast, the target output vector for the grid cell below the assigned grid cell reflects no object is to be detected. The “?” entries for this grid cell mean these values are to be ignored in training and in object detection.
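The midpoint rule above amounts to a one-line calculation during annotation: find which cell of the grid contains the bounding-box midpoint. A sketch, with midpoint coordinates normalized to the whole image:

```python
def assign_grid_cell(x_mid, y_mid, grid_size=3):
    """Return (row, col) of the grid cell containing the bounding-box
    midpoint; that cell is responsible for detecting the object."""
    col = min(int(x_mid * grid_size), grid_size - 1)
    row = min(int(y_mid * grid_size), grid_size - 1)
    return row, col

print(assign_grid_cell(0.5, 0.5))   # (1, 1): the center cell, as for the dog
print(assign_grid_cell(0.9, 0.2))   # (0, 2): the top-right cell
```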
Multiple Objects in One Grid cell
YOLO’s grid cell size is generally set to be about the size of expected objects, so a grid cell will usually contain just a single object. However, sometimes multiple objects can appear in a single grid cell, for example, if a person is standing in front of a car. YOLO handles multiple objects like this using anchor boxes.
Anchor boxes are rectangular boxes selected to approximate the height and width of typical objects to be found in the input images of a particular YOLO application. In this example we show a single grid cell with a boy standing in front of a car. Here YOLO uses two anchor boxes: anchor box 1 is tall and thin, like a person, and anchor box 2 is shorter and wider like a car. (Additional anchor boxes could be defined to represent different sized vehicles, traffic signs, etc.)
To take advantage of anchor boxes, YOLO’s output is expanded to include a separate classification/ bounding box vector for each anchor box. This allows YOLO’s ConvNet to learn to look, in each grid cell, for objects matching the size and shape of each anchor box. This allows the detection of multiple objects in a grid cell, as long as each object is associated with a different anchor box. The use of anchor boxes can also improve accuracy, by training parts of the ConvNet to become specialized in detecting objects of a particular size and shape.
During training sample annotation, an object’s bounding box is drawn, its classification is noted, and it is assigned to the anchor box most similar to its bounding box, as measured by Intersection over Union (IoU) – the ratio of the area of overlap between two boxes to the area of their union. In the example above, the boy is assigned to the first anchor box A1, and the car is assigned to the second anchor box A2. The target output then shows the boy’s classification (c1 = 1, “person”) and his bounding box coordinates in the A1 target output vector, and the car’s classification and bounding box in the A2 target output vector.
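A sketch of this assignment step, matching each object to the anchor box it overlaps best. (The anchor and bounding-box coordinates are hypothetical, chosen to mimic the tall-thin-boy example; boxes are given as corner coordinates.)

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as
    (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Hypothetical anchors: A1 tall and thin (person-like),
# A2 short and wide (car-like)
anchors = {"A1": (0.4, 0.0, 0.6, 1.0), "A2": (0.1, 0.5, 0.9, 1.0)}
boy = (0.42, 0.1, 0.58, 0.95)   # a tall, thin bounding box
best = max(anchors, key=lambda a: iou(anchors[a], boy))
print(best)  # "A1": the boy matches the tall, thin anchor
```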
So now we’ve described all the components of the YOLO output: classifications and bounding boxes, for each anchor box (A1, A2, A3, A4), for each grid cell (G1, G2, … G49). This is depicted in the diagram below:
Complete output tensor for YOLO ConvNet
The output tensor produced by YOLO’s ConvNet is subject to a final post-processing step, non-max suppression. This is required to eliminate redundant object detections and bounding boxes that occur when two or more grid cells detect the same object. This can happen when an object overlaps multiple grid cells, as in the example on the right. Here the dog covers multiple grid cells, and it’s not surprising that several of them detect a dog and estimate its bounding box.
Multiple bounding box prediction
Fortunately, YOLO’s ConvNet usually assigns a higher probability to the grid cell containing the object’s midpoint, since that’s the way it was trained. This is used by the non-max suppression algorithm to eliminate redundant detections and their bounding boxes. Here is a description of the algorithm.
The input to the algorithm is O, the complete YOLO output for an input image, including output vectors for all anchor boxes from all the grid cells. The algorithm proceeds as follows:
1. First, eliminate all the output vectors in O having a prediction probability pc < Tc. This gets rid of output vectors that are not likely to be associated with actual objects. (Tc is typically set to 0.6.) The output of this step is a set C of output vectors associated with candidate objects.
2. Cycle through C, performing the following steps, until there are no more output vectors left in C:
   a. Find the output vector Vmax in C with the highest pc. Move Vmax from C to A, the set of output vectors accepted as objects.
   b. Remove from C and discard any output vector whose bounding box overlaps Vmax’s bounding box with an IoU > 0.5. (See the definition of Intersection over Union above.)
   c. Go back to step a., find the next Vmax, and repeat.
3. The output vectors remaining in A are the final YOLO output.
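The steps above can be sketched as follows, representing each output vector simply as a (pc, bounding box) pair. (The detection values are made up to illustrate several overlapping detections of one object.)

```python
def iou(a, b):
    """Intersection over Union of boxes (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def non_max_suppression(detections, tc=0.6, iou_thresh=0.5):
    """detections: list of (pc, box) pairs, one per output vector."""
    C = [d for d in detections if d[0] >= tc]     # step 1: threshold on pc
    A = []
    while C:                                      # step 2: cycle through C
        v_max = max(C, key=lambda d: d[0])        # 2a: highest-pc candidate
        C.remove(v_max)
        A.append(v_max)
        # 2b: discard candidates overlapping v_max too heavily
        C = [d for d in C if iou(d[1], v_max[1]) <= iou_thresh]
    return A                                      # step 3: accepted objects

# Three detections of one dog (two redundant) plus a weak false alarm
dets = [(0.9, (0.2, 0.2, 0.6, 0.6)),
        (0.7, (0.25, 0.25, 0.65, 0.65)),
        (0.3, (0.8, 0.8, 0.9, 0.9))]
print(non_max_suppression(dets))  # only the 0.9 detection survives
```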
In this article we have explained YOLO object detection, one of the success stories in the modern deep learning approach to artificial intelligence. The YOLO algorithm classifies and locates multiple objects in one pass through an input image, making it fast enough to process real-time video.
YOLO is built upon a number of engineering innovations. It starts with a convolutional neural network, itself an important innovation, and adds elements to the ConvNet’s output that specify the identity and location of multiple objects of various shapes and sizes.
Since YOLOv1 was described in the original 2015 paper, additional versions have been released that improve upon the original design, specialize the architecture for various applications, and take advantage of transfer learning and incompletely labeled training data.
In particular, YOLOv2 improved upon the original by taking advantage of higher resolution images, using batch normalization, and making greater use of anchor boxes. YOLOv3 added criteria for improved selection of bounding boxes, used variable levels of granularity to improve detection of small objects, and moved to Darknet-53, an improved ConvNet architecture.
YOLOv4 was the result of experimenting with and adopting a variety of modifications to make YOLO more efficient, such as batch normalization and Mish activation. YOLOv5 followed the YOLOv4 architecture but came in four versions – small, medium, large, and extra-large – offering tradeoffs between training/run time and accuracy.
Of course, YOLO depends on training data – images annotated with object bounding boxes and classifications. To learn how to efficiently develop accurate training data sets, call iMerit and talk to an expert!