
YOLO Made Simple: Interpreting the You Only Look Once Paper

source link: https://towardsdatascience.com/yolo-made-simple-interpreting-the-you-only-look-once-paper-55f72886ab73?gi=4a1c289ad1cc

Going through the nitty-gritty details in the paper and facts that are often overlooked explained simply.

Feb 2 · 10 min read

YOLO: You Only Look Once (Source: Kanielse on Pixabay)

Unlike the state-of-the-art R-CNN models of the time, "You Only Look Once: Unified, Real-Time Object Detection", or "YOLOv1", presents an end-to-end solution to object detection and classification. Meaning we can train a single, fully differentiable model that detects and classifies directly from the input image. Traditional methods of object detection, by contrast, run a classifier on different parts of an image and at different scales. Neat, right?! All you need is just a classifier.

As simple as it sounds, it’s extremely inefficient to run classifiers hundreds of times on a single image to localize objects. But YOLOv1 deals with it smartly. It detects and classifies with a single forward pass of the image and runs in real-time. Hence the name “You Only Look Once”. We’ll look at everything described by the paper in detail.

Introduction

The authors compare YOLO's working to human perception. We humans glance at a scene and instantly get an overview of what's present, where, who's doing what and a whole lot more. The human visual cortex is amazing, isn't it? YOLOv1 predicts what objects are present and where they are in the image in just one go, by treating object detection and classification as a regression problem. Simply put, you give an image to the YOLO model, it passes through a bunch of layers, and the final output is the class predictions and bounding box coordinates. Here, the authors crisply define YOLO's working as

Straight from image pixels to bounding box coordinates and class probabilities.

YOLO’s Way of Object Detection

YOLO deals with object detection by using an elegant process of dividing the image into a grid of S x S cells. And YOLO restricts the input to square images only.


The S x S grid

Each cell produces class and bounding box predictions for an object if the object's centre falls inside that particular cell. This method is powerful as it enables YOLO to detect multiple objects in an image and classify them simultaneously. Also, dividing the image into a greater number of cells produces more fine-grained predictions. Each cell in the grid is responsible for predicting the bounding box parameters, the confidence that an object is present and the class probabilities. The resulting bounding box prediction consists of the x and y coordinates of the box's centre, sqrt(width), sqrt(height) and an object probability score.

Note: YOLOv1 predicts the square root of the bounding box’s width and height relative to the image. The reason is explained in the Loss Function section below.

If there are 20 classes (C=20), the output of a cell is [x, y, w, h, object probability, C1, C2, C3, …, C20]. The class probabilities here are conditional class probabilities. To clarify, each one is the probability that the object belongs to a particular class given that an object is present in the cell. Of course, every cell in the grid predicts a similar list of items.

But there’s one more thing. In YOLOv1, each cell predicts not one but B bounding boxes. And each of these bounding boxes has [x, y, w, h, object probability]. However, YOLO predicts the class probabilities only once per cell irrespective of the number of bounding boxes. Consequently, each cell’s output now has more items. To illustrate, if B =2 and C =20, the output grows to become [x1, y1, w1, h1, obj. prob1, x2, y2, w2, h2, obj. prob2, C1, C2,…., C20 ].

For detection on PASCAL VOC with 20 classes, YOLOv1 predicts a 7 x 7 grid of cells with 2 bounding boxes for each of them. Considering the predictions of all the cells, the output is a cuboidal volume of dimensions (7 x 7 x 30). The two bounding boxes contribute ten terms and the class probabilities contribute twenty terms since C=20. This sums up to thirty, which explains the "30" in the third dimension.
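To make the bookkeeping concrete, here is a minimal Python sketch of how the per-cell vector and the full output volume are sized (the variable names are mine, not the paper's):

```python
# Size of YOLOv1's output volume for PASCAL VOC: S=7 grid, B=2 boxes, C=20 classes
S, B, C = 7, 2, 20

per_cell = B * 5 + C              # 2 x (x, y, w, h, obj. prob) + 20 class probs = 30
output_shape = (S, S, per_cell)   # (7, 7, 30)

print(per_cell, output_shape, S * S * per_cell)   # 30 (7, 7, 30) 1470
```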

Interpreting YOLO’s Output Prediction

The x and y coordinates of the centre of the bounding box are relative to the top-left corner of that grid cell rather than relative to the image's top-left corner. Each cell predicts the coordinates relative to its own position, and the coordinates act as offsets to the cell's position.


Photo by Charis Gegelman on Unsplash

If we divide an image into 3 x 3 grid cells, as shown above, the centre of the object falls inside the centre grid cell. If we further assume that each grid cell's width and height is A, the coordinates of the object's centre are (0.6A, 0.6A) relative to that cell's top-left corner. The model predicts each coordinate as a value between 0 and 1, i.e. a fraction of A. Therefore, coordinates (0.6, 0.6) denote 60% of A's length to the right and 60% down. These coordinates can be converted to be relative to the whole image since we know which cell predicts the box and its relative coordinates.


The coordinates of the object’s centre relative to the image

For the above example, the box’s centre relative to the image is (A+0.6*A, A+0.6*A). The former A that’s added to 0.6*A is the distance of the cell’s top-left corner from that of the image’s top-left corner. Thus, the sum gives us the coordinates of the box’s centre relative to the whole image. But the height and width of the bounding boxes are predicted relative to the whole image. For the above “cat” example, the bounding box’s height is almost two-thirds of the image’s height. And the box’s width is one-third of the image’s width.
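Here is a small helper (the name and layout are mine, not from the paper) for the conversion just described; it returns the centre as a fraction of the image rather than in multiples of A, and assumes 0-based row/column indexing of the grid:

```python
def cell_to_image_coords(row, col, x, y, w, h, S=3):
    """Convert a cell-relative prediction to image-relative coordinates.

    row, col: 0-based index of the grid cell
    x, y:     box centre as a fraction of the cell's side length
    w, h:     box width/height as a fraction of the whole image
    Returns (cx, cy, w, h), all as fractions of the image.
    """
    cx = (col + x) / S   # cell offset plus within-cell offset, rescaled to the image
    cy = (row + y) / S
    return cx, cy, w, h

# The example above: centre cell of a 3 x 3 grid, centre at (0.6, 0.6) inside the cell
print(cell_to_image_coords(1, 1, 0.6, 0.6, 0.33, 0.66))
# -> (0.533..., 0.533..., 0.33, 0.66), i.e. (A + 0.6A) out of an image that is 3A wide
```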

Therefore, YOLO will predict the width and height as 1/3rd and 2/3rd of the image's width (W) and height (H) respectively, and since the square roots are used, the stored values are √0.33 and √0.66. Finally, the probability that an object is present is also represented as a number between 0 and 1. This object probability is multiplied by the Intersection over Union (IoU) of the predicted box with the ground truth to give the confidence score. The IoU is a score that tells how much the predicted box overlaps with the ground-truth box. Its value also falls between 0 and 1, denoting no overlap and complete overlap respectively.

A confidence score of 1 represents 100% confidence and 0 represents 0% confidence. The higher this value, the more confident the cell is that there's an object. This confidence score is multiplied by the conditional class probability to produce the score that a given class is present.
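As a rough sketch of how these scores combine (the boxes and probabilities below are made-up numbers, purely for illustration):

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Confidence = P(object) * IoU(predicted box, ground-truth box)
obj_prob = 0.9
confidence = obj_prob * iou((10, 10, 60, 60), (12, 8, 58, 62))

# Class-specific score = confidence * conditional class probability
p_cat_given_object = 0.8
cat_score = confidence * p_cat_given_object
```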

Ultimately, the predicted bounding box parameters should be (0.6, 0.6, 0.33, 0.66, 1), representing (x, y, width, height, obj. prob).

The Network Architecture

YOLOv1 architecture (Source: https://arxiv.org/pdf/1506.02640.pdf). The paper's authors own this image; I'm just using it to illustrate their work!

YOLOv1 came out in 2015 and follows a typical convolutional architecture of the time, but it innovated in the way it predicts the output. It has 24 convolutional layers, 4 max-pooling layers and two fully connected layers, one with 4,096 neurons and the other with 1,470 neurons. The model takes colour input images of size 448 x 448 for object detection. As we saw earlier, YOLOv1 predicts a cuboidal output from its final fully connected layer. That's done by reshaping the output of the last fully connected layer, with its 1,470 neurons, into a (7 x 7 x 30) cuboid for PASCAL VOC. Explicitly, the final layer has 1,470 neurons because it needs to be reshaped into 7 x 7 x 30 = 1,470 values.
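In NumPy terms, that reshape step is simply the following (illustrative only, using a dummy output vector):

```python
import numpy as np

# Pretend output of the final fully connected layer for one image
fc_out = np.zeros(1470)

# Reshape into the (S, S, B*5 + C) = (7, 7, 30) prediction volume for PASCAL VOC
predictions = fc_out.reshape(7, 7, 30)

cell = predictions[3, 4]                   # the 30 values predicted by cell (row 3, col 4)
boxes, class_probs = cell[:10], cell[10:]  # 2 boxes x 5 values, then 20 class probabilities
```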

The feature extractor is built with convolution layers of different filter sizes, with max-pooling layers after some of them for spatial reduction. The usual stuff! Only the first convolution layer in the YOLO model uses 7 x 7 filters; the remaining convolution layers use 3 x 3 filters. And rather than just alternating 3 x 3 convolutions and max-pool layers, the network also uses 1 x 1 convolutions.

The authors mention that their architecture was inspired by the GoogLeNet which introduced the Inception module.

Our network architecture is inspired by the GoogLeNet model for image classification

But unlike GoogLeNet, YOLOv1 doesn't use inception blocks. Instead, it employs 1 x 1 convolutions to reduce the channel depth of the feature maps produced by the many 3 x 3 filters. The 1 x 1 filters have a very small receptive field (just a single pixel), and they're used mainly to reduce the computation load on the 3 x 3 convolution layers that follow them. They also introduce an extra non-linearity without changing the receptive field.

Don't quite get it? We'll look in detail at why it's beneficial.

1 x 1 Convolution

The concept of 1 x 1 convolution was introduced in the paper "Network In Network" by Lin et al. Take a look at the paper here: https://arxiv.org/abs/1312.4400

Note: The below example explains 1 x 1 convolution with a small model with convolution layers. Do not confuse this example model with YOLO’s architecture

For example, consider an input image of size (100,100,3) fed to a 3 x 3 convolution layer with 128 filters and zero padding. We zero pad to produce output feature maps with the same spatial dimensions as the input (100,100). Let's ignore the batch dimension for simplicity. Each filter produces a (100,100,1) output feature map after convolving with the input image. Since we have 128 such filters, the filter outputs stack along the channel dimension to produce an output of shape (100,100,128). The weights of this layer have size (3, 3, 3, 128), which is (filter_x_size, filter_y_size, input_channels, number_of_filters).

So far so good (just the normal convolution!).

Now, if we feed this output again to the next 3 x 3 convolution layer with 128 filters, its weights will have to be (3, 3, 128, 128).

Do you see it now? The first layer has just 3,456 parameters (3x3x3x128=3,456). But the second layer, since it operates on an input with 128 channels, has a whopping 147,456 parameters (3x3x128x128=147,456)! Now think how many parameters the subsequent layers would have. To rein in this explosion, a 1 x 1 convolution is applied before feeding the (100,100,128) output of the first convolution layer to the next layer.

Applying 32 1 x 1 filters to the (100,100,128) output reduces the channel depth, giving (100,100,32). Now the next 3 x 3 convolution layer's weights have shape (3,3,32,128). The number of parameters has come down from 147,456 to 36,864 (3x3x32x128). In addition, the 1 x 1 convolution layer itself has 128x32=4,096 parameters. In total, there are now only 40,960 parameters, roughly 3.6 times fewer!
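The arithmetic in one place (a throwaway helper, biases ignored):

```python
def conv_params(k, c_in, c_out):
    """Weight count of a k x k convolution layer (biases ignored)."""
    return k * k * c_in * c_out

second_layer_direct = conv_params(3, 128, 128)                       # 147,456
with_bottleneck = conv_params(1, 128, 32) + conv_params(3, 32, 128)  # 4,096 + 36,864 = 40,960

print(second_layer_direct / with_bottleneck)   # 3.6
```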

The YOLOv1 model uses dropout between the two fully connected layers to prevent overfitting, but it doesn't use other techniques like Batch Normalization that could accelerate training.
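For illustration only (PyTorch is used here for brevity, not the authors' original Darknet implementation), the two fully connected layers with dropout in between could look roughly like this, with the 7 * 7 * 1024 flattened feature size taken from the architecture figure:

```python
import torch.nn as nn

# A rough sketch of YOLOv1's detection head: two fully connected layers with
# dropout (rate 0.5, as reported in the paper) in between.
head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(7 * 7 * 1024, 4096),
    nn.LeakyReLU(0.1),
    nn.Dropout(0.5),
    nn.Linear(4096, 7 * 7 * 30),   # later reshaped into the (7, 7, 30) prediction volume
)
```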

Now that we've seen what a 1 x 1 convolution is, let's move on to cover the rest of the network.

Training

We saw at the start that the network has 24 convolution layers, 4 max-pool layers and 2 fully connected layers.


The authors pre-trained the first twenty convolution layers on the ImageNet dataset at an input resolution of 224 x 224, only half the resolution of YOLOv1's detection input of 448 x 448. That's because 224 x 224 is the standard input size for ImageNet classification.

We pre-train the convolutional layers on the ImageNet classification task at half the resolution (224 × 224 input image) and then double the resolution for detection — The YOLOv1 Authors

Since only the first 20 convolution layers are carried over for transfer learning, and convolution layers can operate on input images of any resolution, the change in input size is not a problem.

The pretraining helps the convolution filters to learn patterns from the ImageNet dataset. As it contains a huge number of images belonging to over a thousand classes, convolution layers can learn a lot of useful features. Pretraining and transfer learning give a good performance boost to YOLO for detection.

For pretraining we use the first 20 convolutional layers from the Figure followed by a average-pooling layer and a fully connected layer — The YOLOv1 Authors

After pretraining on ImageNet, the average-pooling layer and the fully connected prediction layer are removed. They're replaced with four 3 x 3 convolution layers and two fully connected layers.

Note: The output dimension is 7 x 7 x 30 only for the PASCAL VOC dataset with S=7, B=2 and C=20. If any of the parameters S (the grid is S x S cells), B or the number of classes C changes, the output shape, S x S x (B*5 + C), changes accordingly.

These cascaded convolutions and max-pool layers reduce the spatial dimension of the feature map from 448 x 448 to the required 7 x 7 size .
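Layer placement aside, the spatial arithmetic works out as six halvings: per the architecture figure, the initial 7 x 7 stride-2 convolution, the four max-pool layers and one stride-2 3 x 3 convolution each cut the spatial size in half.

```python
# Six stride-2 stages each halve the spatial size of the feature map
size = 448
for _ in range(6):
    size //= 2
print(size)   # 448 -> 224 -> 112 -> 56 -> 28 -> 14 -> 7
```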

All the layers in the network except the final fully connected layer use the Leaky ReLU activation function, and the final layer has a linear activation.

Leaky ReLU's mathematical expression: φ(x) = x if x > 0, and 0.1x otherwise.

Leaky ReLU's graph is slightly different from that of the standard Rectified Linear Unit (ReLU): instead of being flat at zero for negative inputs, it has a small slope of 0.1.
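In code, that amounts to nothing more than the following (written out for clarity):

```python
def leaky_relu(x, slope=0.1):
    """Leaky ReLU as used in YOLOv1: identity for positive inputs,
    a small slope (0.1) instead of ReLU's hard zero for negative inputs."""
    return x if x > 0 else slope * x

print(leaky_relu(2.0), leaky_relu(-2.0))   # 2.0 -0.2 (plain ReLU would give 0.0 for -2.0)
```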

One important thing to notice is the final layer’s output. As we noted before, it has a linear activation and its output is reshaped to form a 7 x 7 x 30 tensor. Finally, the YOLOv1 model was trained for 135 epochs on the PASCAL VOC dataset. Some of the training techniques followed by the authors are quoted below.

Throughout the training, we use a batch size of 64, a momentum of 0.9 and a decay of 0.0005. For the first epochs, we slowly raise the learning rate from 10⁻³ to 10⁻². We continue training with 10⁻² for 75 epochs, then 10⁻³ for 30 epochs, and finally 10⁻⁴ for 30 epochs.
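That schedule could be sketched as below. The length of the warm-up is my assumption, since the paper only says "the first epochs"; here it is folded into the first 75-epoch phase so the total stays at 135 epochs.

```python
def learning_rate(epoch):
    """Piecewise learning-rate schedule described in the quote above."""
    if epoch < 5:                                 # assumed warm-up length
        return 1e-3 + (1e-2 - 1e-3) * epoch / 5   # slowly raise 1e-3 -> 1e-2
    if epoch < 75:
        return 1e-2
    if epoch < 105:
        return 1e-3
    return 1e-4                                   # epochs 105-134
```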
