
Real-time object detection [PyTorch][YOLO]

source link: https://towardsdatascience.com/real-time-object-detection-pytorch-yolo-f7fec35afb64?gi=6342b29dad6e

Unified real-time object detection with YOLO, implemented with PyTorch in Python


Photo by Alex Knight on Unsplash

Can computers understand what they see? Can they tell a dog from a cat, a man from a woman, or a car from a bike? Let’s find out here!

Object detection and recognition is one of the leading areas of computer vision research today. Researchers keep finding new ways to make computers understand what they see, and new state-of-the-art models beat their predecessors by large margins. But computers are still far from truly ‘seeing’ what they see.

In this article, I am going to elaborate on YOLO [You Only Look Once], a research work from 2016 that set a new bar for real-time object detection. This article is a brief description and implementation of the YOLO model to get you started in the field of computer vision and object detection.

YOLO: An introduction

YOLO re-frames object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. Thus, one of the main plus points of YOLO is the speed at which it can process frames. Even at this speed, YOLO manages to achieve more than twice the mean average precision (mAP) of other real-time systems!

Mean average precision (mAP) is the mean of the average precision (AP) computed for each class. In other words, mAP is the average precision averaged over all classes:
$$\text{mAP} = \frac{1}{N} \sum_{i=1}^{N} \text{AP}_i$$

equation for mean average precision, where N is the number of classes and AP_i is the average precision of class i
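As a quick sanity check, here is a tiny Python sketch (the per-class AP values are made up purely for illustration):

```python
# mAP is simply the mean of the per-class average precisions.
# The AP values below are invented for illustration only.
average_precisions = {"dog": 0.82, "cat": 0.76, "car": 0.91, "bike": 0.65}

mean_ap = sum(average_precisions.values()) / len(average_precisions)
print(f"mAP = {mean_ap:.3f}")  # mAP = 0.785
```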

Detection

YOLO is a unified object detection pipeline. Unified? YOLO folds the several tasks involved in object detection into a single neural network. The network takes the entire image into consideration while making bounding box predictions, and it is thus able to predict all the bounding boxes across all classes at the same time.

At first, the image is divided into an S x S grid. If the center of an object falls in a grid cell, that grid cell is responsible for detecting the object. Now, what do we mean by detection? To be precise, the grid cell predicts B bounding boxes, each with a confidence score that tells us how confident the model is that the box contains an object and how accurate the box is in terms of covering the object. Each box thus gives us 5 predictions: x, y, w, h and the confidence score. The grid cell gives us B such sets of predictions.

Along with the predictions on the bounding boxes, the grid cell also gives us C conditional class probabilities. These are the probabilities of each class, given that the grid cell contains an object. In other words, given that an object is present, the conditional class probabilities tell us which class the object is most likely to belong to. Only one set of class probabilities is predicted per grid cell, regardless of the number of bounding boxes.


An overview of what the model does — https://arxiv.org/abs/1506.02640
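To make the shapes concrete, here is a minimal PyTorch sketch of how one cell’s predictions are laid out in the output tensor. The S = 7, B = 2, C = 20 defaults come from the paper; everything else (the dummy tensor, the variable names) is just my illustration:

```python
import torch

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (PASCAL VOC defaults from the paper)

# The network outputs an S x S x (B*5 + C) tensor: for every grid cell,
# B boxes of (x, y, w, h, confidence) plus C conditional class probabilities.
predictions = torch.rand(S, S, B * 5 + C)  # dummy output, for illustration only

cell = predictions[3, 4]                 # one grid cell
boxes = cell[: B * 5].view(B, 5)         # B rows of (x, y, w, h, confidence)
class_probs = cell[B * 5:]               # C conditional class probabilities Pr(class | object)

# Class-specific confidence per box = Pr(class | object) * box confidence
class_scores = class_probs.unsqueeze(0) * boxes[:, 4:5]   # shape (B, C)
print(boxes.shape, class_probs.shape, class_scores.shape)
```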

Network

The model is inspired by the GoogLeNet model for image classification. It has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, it uses 1 × 1 reduction layers followed by 3 × 3 convolutional layers.


Model architecture overview — https://arxiv.org/abs/1506.02640

The alternating 1 × 1 convolutional layers reduce the feature space coming from the preceding layers. The entire network is pre-trained on the ImageNet classification task at half the resolution (224 × 224 input), and the resolution is then doubled to 448 × 448 for detection.
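As a rough sketch of one of those reduction blocks (not the paper’s exact code, just an illustration of the 1 × 1 then 3 × 3 pattern, using the Leaky-ReLU activation discussed in the next section):

```python
import torch
import torch.nn as nn

class ReductionBlock(nn.Module):
    """A 1x1 'reduction' convolution followed by a 3x3 convolution,
    the pattern YOLO uses in place of GoogLeNet's inception modules."""

    def __init__(self, in_channels, mid_channels, out_channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=1),  # squeeze channels
            nn.LeakyReLU(0.1),
            nn.Conv2d(mid_channels, out_channels, kernel_size=3, padding=1),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        return self.block(x)

# Example: squeeze 512 channels down to 256, then expand back to 512.
x = torch.rand(1, 512, 28, 28)
print(ReductionBlock(512, 256, 512)(x).shape)  # torch.Size([1, 512, 28, 28])
```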

Loss and activation

The final layer uses a linear activation function whereas all the other layers use the Leaky-ReLU activation function. The Leaky-ReLU activation can be expressed as:

$$\phi(x) = \begin{cases} x, & \text{if } x > 0 \\ 0.1x, & \text{otherwise} \end{cases}$$

Leaky-ReLU activation

The loss function is the simple sum squared error, written as —

$$\text{SSE} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$$

Sum-squared error

However, if we define the loss function like this, it clubs classification and localization error together and attaches equal importance to both of them from the viewpoint of the model.

To prevent this from happening, the loss from bounding box coordinate predictions is weighted up, and the loss from confidence predictions for boxes that don’t contain objects is weighted down. Two parameters, λcoord and λnoobj, are used to accomplish this: λcoord is set to 5 and λnoobj to 0.5 (you can see where these hyper-parameters appear in the full loss function equation below).

SSE also brings in the problem of weighting errors in large and small bounding boxes equally. To visualize this problem, look at the following image:


Dog and bowl image (for illustration only) — Google images

Here we can see that the dog and the bowl have both been annotated as objects in the image: the dog’s bounding box is large and the bowl’s is small. Now, if the bounding boxes of the dog and of the bowl were each to shrink by the same number of pixels, the bowl’s annotation would look far worse than the dog’s. Why is that? Because bounding box accuracy, in terms of shift and size, should be judged relative to the bounding box size and not the image size.

Now, coming back to our loss function, we can see that it pays no specific attention to bounding box size. To overcome this hurdle, the network does not predict the bounding box height and width directly, but their square roots. This makes the same absolute error count for less on a large box than on a small one.
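A quick numerical sketch (the pixel values are made up) shows the effect:

```python
import math

# The same 10-pixel error on the predicted width of a big box and a small box:
large_w, large_w_pred = 200.0, 210.0   # big box (the dog)
small_w, small_w_pred = 20.0, 30.0     # small box (the bowl)

# Plain squared error treats both mistakes identically:
print((large_w - large_w_pred) ** 2, (small_w - small_w_pred) ** 2)  # 100.0 100.0

# Squared error on the square roots penalizes the small-box mistake much more:
print((math.sqrt(large_w) - math.sqrt(large_w_pred)) ** 2,
      (math.sqrt(small_w) - math.sqrt(small_w_pred)) ** 2)           # ~0.12 ~1.01
```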

Taking all of this into consideration, the multi-part loss function can be written as —

$$
\begin{aligned}
\mathcal{L} ={} & \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
& + \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left[ \left( \sqrt{w_i} - \sqrt{\hat{w}_i} \right)^2 + \left( \sqrt{h_i} - \sqrt{\hat{h}_i} \right)^2 \right] \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left( C_i - \hat{C}_i \right)^2 \\
& + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left( C_i - \hat{C}_i \right)^2 \\
& + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \,\in\, \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
$$

Total loss function of the network — https://arxiv.org/abs/1506.02640
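To see how the five terms fit together in code, here is a simplified PyTorch sketch of the loss. It is my own illustration, not the paper’s reference implementation, and it assumes the grid-cell matching and the ‘responsible predictor’ masks have already been computed:

```python
import torch

LAMBDA_COORD, LAMBDA_NOOBJ = 5.0, 0.5

def yolo_loss(pred_boxes, true_boxes, pred_conf, true_conf,
              pred_probs, true_probs, obj_mask):
    """Simplified multi-part YOLO loss (illustration only).

    pred_boxes, true_boxes: (N, 4) tensors of (x, y, w, h) per box predictor
    pred_conf,  true_conf:  (N,) confidence scores (the target is the IOU for object cells)
    pred_probs, true_probs: (M, C) class probabilities for the cells that contain objects
    obj_mask:               (N,) boolean, True where a predictor is responsible for an object
    """
    noobj_mask = ~obj_mask

    # 1) Localization: (x, y) plus sqrt(w), sqrt(h), both weighted by lambda_coord
    xy_loss = ((pred_boxes[obj_mask, :2] - true_boxes[obj_mask, :2]) ** 2).sum()
    wh_loss = ((pred_boxes[obj_mask, 2:].sqrt() - true_boxes[obj_mask, 2:].sqrt()) ** 2).sum()

    # 2) Confidence: full weight for object cells, down-weighted for empty cells
    obj_conf_loss = ((pred_conf[obj_mask] - true_conf[obj_mask]) ** 2).sum()
    noobj_conf_loss = ((pred_conf[noobj_mask] - true_conf[noobj_mask]) ** 2).sum()

    # 3) Classification: only for grid cells that actually contain an object
    class_loss = ((pred_probs - true_probs) ** 2).sum()

    return (LAMBDA_COORD * (xy_loss + wh_loss)
            + obj_conf_loss
            + LAMBDA_NOOBJ * noobj_conf_loss
            + class_loss)
```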

Now that we have completed the basics of the YOLO object detection model, let's dive into the code!

Build it!

The YOLO network is simple and easy to build. The problem is with the bounding boxes: drawing the bounding boxes and saving the images, writing out confidence scores and labels, and configuring the entire training code would make this article unnecessarily long. I will thus just implement the model as is. Take it upon yourself to complete the code as a project this weekend and tell me about the results in the comment section! I will upload a sequel to this article featuring your code and your results with full credits ;)

The YOLO network module
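Here is a condensed PyTorch sketch of such a module. It is a compressed stand-in for the full 24-layer network (far fewer layers, abbreviated channel counts), but it follows the same recipe: convolutions with 1 × 1 reductions and Leaky-ReLU, then two fully connected layers with a linear output reshaped into the S x S x (B*5 + C) prediction grid:

```python
import torch
import torch.nn as nn

class TinyYOLO(nn.Module):
    """Condensed YOLO-style network: a small convolutional stack (standing in for the
    24-layer backbone) followed by two fully connected layers that output the
    S x S x (B*5 + C) prediction tensor."""

    def __init__(self, S=7, B=2, C=20):
        super().__init__()
        self.S, self.B, self.C = S, B, C
        act = nn.LeakyReLU(0.1)
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), act, nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 192, kernel_size=3, padding=1), act, nn.MaxPool2d(2, 2),
            nn.Conv2d(192, 128, kernel_size=1), act,              # 1x1 reduction
            nn.Conv2d(128, 256, kernel_size=3, padding=1), act,
            nn.Conv2d(256, 256, kernel_size=1), act,              # 1x1 reduction
            nn.Conv2d(256, 512, kernel_size=3, padding=1), act, nn.MaxPool2d(2, 2),
            nn.Conv2d(512, 1024, kernel_size=3, padding=1), act, nn.MaxPool2d(2, 2),
            nn.Conv2d(1024, 1024, kernel_size=3, stride=2, padding=1), act,
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(1024 * 7 * 7, 4096), act,
            nn.Linear(4096, S * S * (B * 5 + C)),                 # final layer: linear activation
        )

    def forward(self, x):
        x = self.head(self.features(x))
        return x.view(-1, self.S, self.S, self.B * 5 + self.C)

# A 448 x 448 input, as in the paper, gives a 7 x 7 x 30 prediction grid.
out = TinyYOLO()(torch.rand(1, 3, 448, 448))
print(out.shape)  # torch.Size([1, 7, 7, 30])
```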

Here’s a short video about YOLOv3 (an updated and improved version of YOLO) and its results:

YOLOv3

So, excited huh? Well, you should be, because computer vision is advancing by leaps and bounds! This is the time to get started. Let me know in the comment section if you get stuck with the implementation of YOLO. Here to help :)

Hmrishav Bandyopadhyay is a 2nd year Undergraduate at the Electronics and Telecommunication department of Jadavpur University, India. His interests lie in Deep Learning, Computer Vision, and Image Processing. He can be reached at [email protected].

