Beginner’s Guide to Object Detection Algorithms

When I first joined Centelon, the Director for Data Science, Mr. Prabhash Thakur, assigned me an object detection proposition. I was completely new to the field back then, so he told me about three main algorithms that are used in the industry:

Faster R-CNN
YOLO
SSD

Many more algorithms are in use, and I had to figure out which one to pick, because every algorithm has its pros and cons. These are the algorithms that I found online:

R-CNN

Region-based CNN (R-CNN) was one of the state-of-the-art CNN-based deep learning approaches to object detection when it was introduced. Fast R-CNN and Faster R-CNN build on it to make detection faster.

Conventionally, a sliding window is used to search every position within each image. It is a simple solution. However, different objects, or even objects of the same kind, can have different aspect ratios and sizes depending on the object size and its distance from the camera, and different image sizes also affect the effective window size. This process becomes extremely slow if we run a deep learning CNN for image classification at each location.
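
To get a feel for how quickly the number of windows grows, here is a minimal sketch that enumerates sliding windows. The image size, the three window shapes and the stride of 16 pixels are all made-up values for illustration, and `sliding_windows` is a hypothetical helper, not part of any library.

```python
def sliding_windows(image_w, image_h, window_sizes, stride):
    """Yield (x, y, w, h) for every window that fits fully inside the image."""
    for w, h in window_sizes:
        for y in range(0, image_h - h + 1, stride):
            for x in range(0, image_w - w + 1, stride):
                yield (x, y, w, h)


# Assumed settings: an 800x600 image and only three window shapes.
window_sizes = [(128, 128), (128, 256), (256, 128)]
windows = list(sliding_windows(800, 600, window_sizes, stride=16))

# Each of these windows would need its own CNN forward pass.
print(len(windows))
```

Even with only three window shapes the count runs into the thousands, and covering more sizes and aspect ratios multiplies it further.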

First, R-CNN uses selective search to generate about 2000 region proposals, i.e. bounding boxes for image classification.
Then, each bounding box is classified by a CNN.
Finally, each bounding box can be refined using regression.
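
The refinement step in the list above can be sketched as follows. This uses the box parameterisation from the R-CNN paper, where the regressor predicts a centre shift relative to the box size and a log-space scale change. The function name `refine_box` and the numbers are hypothetical, chosen only for illustration.

```python
import math


def refine_box(proposal, deltas):
    """Apply regression offsets (tx, ty, tw, th) to a proposal (cx, cy, w, h)."""
    cx, cy, w, h = proposal
    tx, ty, tw, th = deltas
    new_cx = cx + tx * w        # shift the centre by a fraction of the width
    new_cy = cy + ty * h        # shift the centre by a fraction of the height
    new_w = w * math.exp(tw)    # scale in log-space so the size stays positive
    new_h = h * math.exp(th)
    return (new_cx, new_cy, new_w, new_h)


# Made-up proposal and regressor outputs.
proposal = (100.0, 80.0, 50.0, 40.0)
deltas = (0.1, -0.05, 0.0, math.log(1.5))
refined = refine_box(proposal, deltas)
print(refined)
```

Here the box moves slightly right and up, keeps its width, and becomes 1.5 times taller.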

Problems with R-CNN

It takes a huge amount of time to train the network as you would have to classify 2000 region proposals per image.
It cannot be implemented in real time as it takes around 47 seconds for each test image.
The selective search algorithm is a fixed algorithm. Therefore, no learning is happening at that stage. This could lead to the generation of bad candidate region proposals.

Fast R-CNN

The approach of Fast R-CNN is similar to that of R-CNN. But instead of feeding the region proposals to the CNN, we feed the whole input image to the CNN to generate a convolutional feature map. From the convolutional feature map, we identify the region proposals and, using an RoI pooling layer, reshape each of them into a fixed size so that it can be fed into a fully connected layer. From the resulting RoI feature vector, a softmax layer predicts the class of the proposed region, and a parallel regression output predicts the offset values for the bounding box.
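
A minimal sketch of RoI max pooling on a single-channel feature map may make the "fixed size" idea concrete. It is plain Python for readability, and `roi_max_pool`, the 6x6 feature map and the two regions are hypothetical examples; real implementations work on multi-channel tensors and map image coordinates to feature-map coordinates first.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool roi = (x0, y0, x1, y1), end-exclusive, into an out_h x out_w grid."""
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Row range of bin i; every bin covers at least one cell.
        ys = y0 + (i * roi_h) // out_h
        ye = max(ys + 1, y0 + ((i + 1) * roi_h) // out_h)
        row = []
        for j in range(out_w):
            xs = x0 + (j * roi_w) // out_w
            xe = max(xs + 1, x0 + ((j + 1) * roi_w) // out_w)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# A made-up 6x6 feature map whose value at (y, x) is y * 6 + x.
feature_map = [[y * 6 + x for x in range(6)] for y in range(6)]

# Two regions of different shapes both come out as 2x2.
pooled_a = roi_max_pool(feature_map, (1, 1, 5, 6), 2, 2)
pooled_b = roi_max_pool(feature_map, (0, 0, 3, 2), 2, 2)
print(pooled_a, pooled_b)
```

Because every region ends up with the same output shape, the same fully connected layers can be applied to all of them.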

The advantages of Fast R-CNN

Faster than R-CNN, because you don’t have to feed 2000 region proposals to the convolutional neural network every time.
The convolution operation is done only once per image and a feature map is generated from it.

Faster R-CNN

Similar to Fast R-CNN, the image is provided as an input to a convolutional network, which produces a convolutional feature map. Instead of using the selective search algorithm to identify the region proposals, a separate network, the region proposal network (RPN), is used to predict them. The predicted region proposals are then reshaped using an RoI pooling layer, and the pooled features are used to classify the image within the proposed region and to predict the offset values for the bounding boxes.

Anchors play an important role in Faster R-CNN. An anchor is a box. In the default configuration of Faster R-CNN, there are 9 anchors at each position of an image. The following graph shows 9 anchors at the position (320, 320) of an image with size (600, 800).

Let’s look closer:

Three colors represent three scales or sizes: 128×128, 256×256, 512×512.
Let’s single out the red boxes/anchors. The three boxes have height-width ratios 1:1, 1:2 and 2:1 respectively.
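
The nine anchors at one position can be generated with a few lines of code. This is a sketch under one assumption that the text does not state: each anchor keeps the area of its scale (for example 128×128) while the height-width ratio changes. `make_anchors` is a hypothetical helper name.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Return anchors (x0, y0, x1, y1) centred on (cx, cy).

    Each ratio is height / width, so 0.5 means 1:2 and 2.0 means 2:1.
    Assumption: every anchor keeps the area scale * scale.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(320, 320)
for box in anchors:
    print([round(v, 1) for v in box])
```

Some of the larger anchors extend beyond the image border. This sketch does not clip or discard them.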

If we choose one position at every stride of 16, there will be 1989 (39×51) positions. This leads to 17901 (1989 × 9) boxes to consider. That is hardly fewer than the combination of sliding window and image pyramid. On the other hand, you can argue this is why its coverage is as good as that of other state-of-the-art methods. The bright side is that the region proposal network, the method introduced in Faster R-CNN, significantly reduces this number.

YOLO — You Only Look Once

All of the previous object detection algorithms use regions to localize the object within the image. The network does not look at the complete image, only at the parts of the image that have a high probability of containing the object. YOLO, or You Only Look Once, is an object detection algorithm that is quite different from the region-based algorithms seen above. In YOLO, a single convolutional network predicts the bounding boxes and the class probabilities for these boxes.

YOLO takes an image and splits it into an S×S grid, and within each grid cell it takes m bounding boxes. For each bounding box, the network outputs a class probability and offset values for the bounding box. The bounding boxes whose class probability is above a threshold value are selected and used to locate the object within the image.
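
Two pieces of this description are easy to sketch: finding the grid cell that contains a box centre, and keeping only the boxes above a probability threshold. The function names, the 448×448 image, the 7×7 grid and the three predictions are all hypothetical values used only for illustration.

```python
def responsible_cell(cx, cy, image_w, image_h, S):
    """Return (row, col) of the grid cell that contains the point (cx, cy)."""
    col = min(int(cx * S / image_w), S - 1)
    row = min(int(cy * S / image_h), S - 1)
    return row, col


def keep_confident(predictions, threshold):
    """Keep the predictions whose class probability is above the threshold."""
    return [p for p in predictions if p["prob"] > threshold]


# Made-up example: a 448x448 image split into a 7x7 grid.
cell = responsible_cell(300, 100, 448, 448, S=7)

# Made-up network outputs; each box is (cx, cy, w, h).
predictions = [
    {"label": "dog", "prob": 0.82, "box": (300, 100, 120, 90)},
    {"label": "cat", "prob": 0.15, "box": (310, 110, 100, 80)},
    {"label": "bike", "prob": 0.55, "box": (150, 300, 200, 140)},
]
kept = keep_confident(predictions, threshold=0.5)
print(cell, [p["label"] for p in kept])
```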

Advantages and disadvantages of YOLO

YOLO is much faster (around 45 frames per second) than the region-based algorithms above.
The limitation of the YOLO algorithm is that it struggles with small objects within the image; for example, it might have difficulty detecting a flock of birds. This is due to the spatial constraints of the algorithm.

SSD- Single Shot MultiBox Detector

The tasks of object localization and classification are done in a single forward pass of the network. MultiBox is the name of a technique for bounding box regression. The network is an object detector that also classifies those detected objects.

SSD Architecture

When using Single Shot Detectors (SSDs) you have components and sub-components such as:

MultiBox
Priors
Fixed priors

The base network is just one of the many components that fit into the overall deep learning object detection framework — the figure at the top of this section depicts the VGG16 base network inside the SSD framework.

Typically, “network surgery” is performed on the base network. This modification:

Makes it fully convolutional (i.e., able to accept arbitrary input dimensions).
Eliminates CONV/POOL layers deeper in the base network architecture and replaces them with a series of new layers (SSD), new modules (Faster R-CNN), or some combination of the two.

The term “network surgery” is a colloquial way of saying that we remove some of the original layers of the base network architecture and supplant them with new layers. Network surgery is also very tactical: we remove the parts of the network we do not need and replace them with a new set of components. Then, when we train the framework to perform object detection, the weights of both the new layers/modules and the base network are modified.

Advantages of SSD

SSD strikes a better balance between speed and accuracy. SSD runs a convolutional network on the input image only once and computes a feature map.
SSD also uses anchor boxes at a variety of aspect ratios, similar to Faster R-CNN, and learns offsets to those boxes rather than learning the boxes themselves. To handle scale, SSD predicts bounding boxes after multiple convolutional layers. Since each of these layers operates at a different scale, SSD is able to detect objects of various sizes.
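
To make the multi-scale idea concrete, the sketch below counts the default (anchor) boxes for the SSD300 configuration described in the SSD paper and computes the linearly spaced box scales from that paper. These numbers come from the paper rather than from this article, and released implementations differ in some details, so treat this as an illustration only. The names `SSD300_FEATURE_MAPS`, `count_default_boxes` and `box_scales` are hypothetical.

```python
# (feature map side length, default boxes per location) for SSD300,
# as reported in the SSD paper. Not taken from this article.
SSD300_FEATURE_MAPS = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]


def count_default_boxes(feature_maps):
    """Total number of default boxes predicted over all feature maps."""
    return sum(side * side * boxes for side, boxes in feature_maps)


def box_scales(num_maps, s_min=0.2, s_max=0.9):
    """Scale of the boxes on each map, relative to the input image size.

    Linearly spaced from s_min (first, finest map) to s_max (last, coarsest map).
    """
    step = (s_max - s_min) / (num_maps - 1)
    return [s_min + step * k for k in range(num_maps)]


total_boxes = count_default_boxes(SSD300_FEATURE_MAPS)
scales = box_scales(len(SSD300_FEATURE_MAPS))
print(total_boxes, [round(s, 2) for s in scales])
```

The large early feature maps contribute many small boxes, while the small late feature maps contribute a few large ones.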

R-FCN

For traditional region-proposal-based approaches such as R-CNN, Fast R-CNN, and Faster R-CNN, region proposals are generated first: by selective search in R-CNN and Fast R-CNN, and by a region proposal network (RPN) in Faster R-CNN. Then ROI pooling is done, and the result goes through fully connected (FC) layers for classification and bounding box regression.

The processing after ROI pooling (the FC layers) is not shared among ROIs and takes time, which makes these approaches slow. The FC layers also increase the number of connections (parameters), which adds to the complexity.

Advantages of R-FCN

In R-FCN, we still have an RPN to obtain region proposals, but unlike the R-CNN series, the FC layers after ROI pooling are removed. Instead, all major complexity is moved before ROI pooling to generate the score maps.
All region proposals, after ROI pooling, make use of the same set of score maps to perform average voting, which is a simple calculation. Thus there is no learnable layer after the ROI layer, and that stage is nearly cost-free. As a result, R-FCN is even faster than Faster R-CNN.
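
The average voting over shared score maps can be sketched for a single class as follows. It assumes k×k score maps, one per bin of the ROI, with each bin reading only from its own map. `ps_roi_vote` and the toy 4×4 maps are hypothetical and simplified from what a real R-FCN layer does.

```python
def ps_roi_vote(score_maps, roi, k):
    """Position-sensitive pooling followed by average voting, for one class.

    score_maps: list of k * k 2-D maps, ordered row-major by bin.
    roi: (x0, y0, x1, y1), end-exclusive, in score-map cells.
    """
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0, x1 - x0
    bin_scores = []
    for i in range(k):
        ys = y0 + (i * roi_h) // k
        ye = max(ys + 1, y0 + ((i + 1) * roi_h) // k)
        for j in range(k):
            xs = x0 + (j * roi_w) // k
            xe = max(xs + 1, x0 + ((j + 1) * roi_w) // k)
            bin_map = score_maps[i * k + j]  # bin (i, j) reads only its own map
            cells = [bin_map[y][x] for y in range(ys, ye) for x in range(xs, xe)]
            bin_scores.append(sum(cells) / len(cells))  # average within the bin
    return sum(bin_scores) / len(bin_scores)  # average vote over the k * k bins


# Toy example: k = 2, so four 4x4 score maps with constant values 1, 2, 3, 4.
k = 2
score_maps = [[[float(v)] * 4 for _ in range(4)] for v in (1, 2, 3, 4)]
score = ps_roi_vote(score_maps, (0, 0, 4, 4), k)
print(score)
```

There are no weights in this step, only averaging, which is why the per-ROI cost is so small.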

Conclusion

These were some of the algorithms that I found online. I would like to give credit to all the bloggers who posted about these algorithms and helped me combine them into one article.
