
Conv2d: Finally Understand What Happens in the Forward Pass

source link: https://towardsdatascience.com/conv2d-to-finally-understand-what-happens-in-the-forward-pass-1bbaafb0b148?gi=fbc4e98407d5

A visual and mathematical explanation of the 2D convolution layer and its arguments

May 2 · 9 min read

Introduction

Deep learning libraries and platforms such as TensorFlow, Keras, PyTorch, Caffe, or Theano help us in our daily lives, so that every day new applications make us think "Wow!". We all have our favorite framework, and what they all have in common is that they make things easy for us, with easy-to-use functions that can be configured as needed. But we still need to understand what the available arguments are in order to take full advantage of the power these frameworks give us.

In this post, I will try to list all these arguments. This post is for you if you want to see their impact on the computation time, the number of trainable parameters, and the size of the convolved output channels.

Input Shape: (3, 7, 7) — Output Shape: (2, 3, 3) — K: (3, 3) — P: (1, 1) — S: (2, 2) — D: (2, 2) — G: 1

All GIFs are made with Python. You will be able to test each of these arguments and visualize their impact yourself with the scripts pushed to my GitHub (or make your own GIFs).

The parts of this post will be divided according to the following arguments, which can be found in the PyTorch documentation of the Conv2d module:

  • in_channels (int) — Number of channels in the input image
  • out_channels (int) — Number of channels produced by the convolution
  • kernel_size (int or tuple) — Size of the convolving kernel
  • stride (int or tuple, optional) — Stride of the convolution. Default: 1
  • padding (int or tuple, optional) — Zero-padding added to both sides of the input. Default: 0
  • dilation (int or tuple, optional) — Spacing between kernel elements. Default: 1
  • groups (int, optional) — Number of blocked connections from input channels to output channels. Default: 1
  • bias (bool, optional) — If True, adds a learnable bias to the output. Default: True

Finally, we will have all the keys to calculate the size of the output channels according to the arguments and the size of the input channels.

What is a Kernel?


Convolution between an input image and a kernel

Let me introduce what a kernel (or convolution matrix) is. A kernel describes a filter that we are going to pass over an input image. To make it simple, the kernel will move over the whole image, from left to right and from top to bottom, applying a convolution product. The output of this operation is called a filtered image.

Convolution product

Input Shape: (1, 9, 9) — Output Shape: (1, 7, 7) — K: (3, 3) — P: (0, 0) — S: (1, 1) — D: (1, 1) — G: 1

To take a very basic example, let's imagine a 3 by 3 convolution kernel filtering a 9 by 9 image. The kernel moves over the whole image, capturing every square of the same size (3 by 3). The convolution product is an element-wise (or point-wise) multiplication; the sum of the result is the resulting pixel in the output (or filtered) image.

If you are not already familiar with filters and convolution matrices, then I strongly advise you to take a little more time to understand the convolution kernels. They are the core of the 2D convolution layer .
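The sliding operation described above can be written in a few lines of plain Python. This is a minimal sketch for a single channel (no padding, stride 1), not the optimized PyTorch implementation; the helper name `conv2d_single` is mine:

```python
def conv2d_single(image, kernel):
    """Slide a 2D kernel over a 2D image (no padding, stride 1).
    Each output pixel is the sum of the element-wise product
    between the kernel and the image patch under it."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            s = sum(image[i + a][j + b] * kernel[a][b]
                    for a in range(kh) for b in range(kw))
            row.append(s)
        out.append(row)
    return out

# A 9 by 9 image filtered by a 3 by 3 kernel yields a 7 by 7 output,
# exactly as in the GIF above.
image = [[1] * 9 for _ in range(9)]
kernel = [[1, 0, -1]] * 3  # a simple vertical-edge-style filter (values are illustrative)
filtered = conv2d_single(image, kernel)
print(len(filtered), len(filtered[0]))  # 7 7
```

On a constant image, this particular kernel outputs zeros everywhere: the +1 and -1 columns cancel out, which is exactly what an edge filter is supposed to do on a flat region.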

Trainable Parameters and Bias

The trainable parameters , which are also simply called “parameters”, are all the parameters that will be updated when the network is trained. In a Conv2d, the trainable elements are the values that compose the kernels . So for our 3 by 3 convolution kernel, we have 3*3=9 trainable parameters.


To be more complete, we can include a bias or not. The role of the bias is to be added to the sum of the convolution product. This bias is also a trainable parameter, which raises the number of trainable parameters for our 3 by 3 kernel to 10.
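This count is easy to verify. Here is a small sketch of the arithmetic; `conv2d_params` is a hypothetical helper of mine, not part of PyTorch:

```python
def conv2d_params(in_ch, out_ch, kh, kw, bias=True):
    """Trainable parameters of a Conv2d: one (in_ch, kh, kw) weight
    stack per output channel, plus one bias term per output channel."""
    n = out_ch * in_ch * kh * kw
    return n + (out_ch if bias else 0)

# Our single 3 by 3 kernel (1 input channel, 1 output channel):
print(conv2d_params(1, 1, 3, 3, bias=False))  # 9
print(conv2d_params(1, 1, 3, 3, bias=True))   # 10
```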

Number of Input and Output Channels

Input Shape: (1, 7, 7) — Output Shape: (4, 5, 5) — K: (3, 3) — P: (0, 0) — S: (1, 1) — D: (1, 1) — G: 1

The benefit of using a layer is being able to perform several similar operations at the same time. In other words, if we want to apply 4 different filters of the same size to an input channel, then we will have 4 output channels. These channels are the result of 4 different filters, and thus of 4 distinct kernels.

In the previous section, we saw that the trainable parameters are what make up the convolution kernels. So the number of parameters increases linearly with the number of convolution kernels, and hence linearly with the number of desired output channels. Note that the computation time also varies proportionally with the size of the input channel and with the number of kernels.


Note that the curves in the Parameters graph are the same

The same principle applies to the number of input channels. Let's consider the situation of an RGB-encoded image. This image has 3 channels: red, green, and blue. We can decide to extract information with filters of the same size on each of these 3 channels to obtain 4 new channels. The operation is thus performed 3 times, once per input channel, for each of the 4 output channels.

Input Shape: (3, 7, 7) — Output Shape: (4, 5, 5) — K: (3, 3) — P: (0, 0) — S: (1, 1) — D: (1, 1) — G: 1

Each output channel is the sum of the filtered input channels. For 4 output channels and 3 input channels, each output channel is the sum of 3 filtered input channels. In other words, the convolution layer is composed of 4*3=12 convolution kernels.

As a reminder, the number of parameters and the computation time change proportionally with the number of output channels, because each output channel has its own kernels, distinct from those of the other channels. The same is true for the number of input channels: the calculation time and the number of parameters grow proportionally.
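As a quick sanity check of the 4*3=12 figure, here is the arithmetic spelled out (variable names are mine, chosen to mirror the Conv2d arguments):

```python
# A Conv2d with 3 input channels and 4 output channels, 3x3 kernels, with bias.
in_channels, out_channels, k = 3, 4, 3

num_kernels = out_channels * in_channels         # one 2D kernel per (output, input) pair
num_params = num_kernels * k * k + out_channels  # weights + one bias per output channel

print(num_kernels, num_params)  # 12 112
```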


Kernel size

So far, all examples have been given with 3 by 3 kernels. In fact, the choice of kernel size is entirely up to you: it is possible to create a convolution layer with a kernel of size 1*1 or 19*19.

Input Shape: (3, 7, 9) — Output Shape: (2, 3, 9) — K: (5, 2) — P: (0, 0) — S: (1, 1) — D: (1, 1) — G: 1

But it is also absolutely possible not to have square kernels: you can decide to have kernels with different heights and widths. This is often the case in signal or image analysis. If we know that we are going to scan the image of a signal, of a sound, then we may prefer a 5*1 kernel, for example.

Finally, you will have noticed that all sizes are defined by an odd number. It is just as acceptable to define an even kernel size, but in practice this is rarely done: an odd-size kernel is usually chosen because it is symmetric around a central pixel.


Since all the (classical) trainable parameters of a convolution layer are in the kernels, the number of parameters grows linearly with the size (i.e. the area) of the kernels. The computation time also varies proportionally.
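This growth with the kernel area can be checked for a single-channel layer; `kernel_params` is an illustrative helper of mine:

```python
def kernel_params(kh, kw, bias=True):
    """Parameters of a 1-input, 1-output Conv2d: kernel area plus optional bias."""
    return kh * kw + (1 if bias else 0)

print(kernel_params(3, 3))  # 10
print(kernel_params(5, 1))  # 6   (a non-square kernel, as discussed above)
print(kernel_params(5, 5))  # 26
```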

Strides

By default, the kernels move from left to right and from top to bottom, pixel by pixel. But this movement can be changed, and it is often used to downsample the output channel. For example, with strides of (1, 3), the filter is shifted 3 pixels at a time horizontally and 1 pixel at a time vertically. This produces output channels downsampled by 3 horizontally.

Input Shape: (3, 9, 9) — Output Shape: (2, 7, 3) — K: (3, 3) — P: (0, 0) — S: (1, 3) — D: (1, 1) — G: 1

The strides have no impact on the number of parameters, but the calculation time, logically, decreases linearly with the strides.
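The downsampling effect is easy to verify with the output-size formula applied per dimension (a sketch; `out_size` is my own helper, not a PyTorch function):

```python
def out_size(n, k, s=1, p=0, d=1):
    """Output length along one dimension of a Conv2d
    (input size n, kernel k, stride s, padding p, dilation d)."""
    return (n + 2 * p - d * (k - 1) - 1) // s + 1

# A 9x9 input with a 3x3 kernel and strides (1, 3) gives a 7x3 output,
# as in the GIF above: downsampled by 3 horizontally only.
print(out_size(9, 3, s=1), out_size(9, 3, s=3))  # 7 3
```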


Note that the curves in the Parameters graph are the same

Padding

The padding defines the number of pixels added to the sides of the input channels before their convolution filtering. Usually, the padding pixels are set to zero. The input channel is extended .

Input Shape: (2, 7, 7) — Output Shape: (1, 7, 7) — K: (3, 3) — P: (1, 1) — S: (1, 1) — D: (1, 1) — G: 1

This is very useful when you want the size of the output channels to be equal to the size of the input channels. To make it simple, when the kernel is 3*3 then the output channel size decreases by one on each side. To overcome this problem we can use a padding of 1.

The padding therefore has no impact on the number of parameters, but it generates additional calculation time proportional to the size of the padding. Generally speaking, though, the padding is small enough compared to the size of the input channel to consider that it has no impact on the computation time.
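Here is the "same size" effect checked numerically with a per-dimension output-size sketch (`out_size` is an illustrative helper of mine):

```python
def out_size(n, k, s=1, p=0, d=1):
    """Output length along one dimension of a Conv2d."""
    return (n + 2 * p - d * (k - 1) - 1) // s + 1

# Without padding, a 3x3 kernel shrinks a 7x7 input to 5x5;
# a padding of 1 keeps the output at 7x7 (the "same" size).
print(out_size(7, 3, p=0), out_size(7, 3, p=1))  # 5 7
```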


Note that the curves in the Parameters graph are the same

Dilation

The dilation is, in a way, the stretch of the kernel. Equal to 1 by default, it corresponds to the spacing between two adjacent elements of the kernel on the input channel during the convolution.

Input Shape: (2, 7, 7) — Output Shape: (1, 1, 5) — K: (3, 3) — P: (1, 1) — S: (1, 1) — D: (4, 2) — G: 1

I exaggerated a bit in my GIF, but if we take the example of a dilation of (4, 2), then the kernel's receptive field on the input channel spans 4*(3-1)+1=9 pixels vertically and 2*(3-1)+1=5 pixels horizontally (for a 3 by 3 kernel).

Just like padding, dilation has no impact on the number of parameters and very limited impact on the calculation time.
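The span of a dilated kernel follows a one-line formula, sketched here (`dilated_span` is my own naming):

```python
def dilated_span(k, d):
    """Number of input pixels spanned by a k-wide kernel with dilation d:
    d * (k - 1) + 1 along one axis."""
    return d * (k - 1) + 1

# A 3x3 kernel with dilation (4, 2) spans 9 pixels vertically
# and 5 pixels horizontally on the input channel.
print(dilated_span(3, 4), dilated_span(3, 2))  # 9 5
```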


Note that the curves in the Parameters graph are the same

Groups

Groups can be very useful in specific cases, for example when we have several concatenated data sources that do not need to be treated dependently on each other. The input channels can be grouped and processed independently, and the output channels are concatenated at the end.

If there are 2 input channels and 4 output channels with 2 groups, then this is like dividing the input channels into two groups (so 1 input channel per group) and passing each through a convolution layer with half as many output channels. The output channels are then concatenated.

Input Shape: (2, 7, 7) — Output Shape: (4, 5, 5) — K: (3, 3) — P: (2, 2) — S: (2, 2) — D: (1, 1) — G: 2

It is important to note two things. Firstly, the number of groups must divide both the number of input channels and the number of output channels (a common divisor). Secondly, each group has its own kernels: a kernel in one group only sees the input channels assigned to that group.

The number of weight parameters is therefore divided by the number of groups. Concerning the computation time, the PyTorch implementation is optimized for groups and should therefore reduce the computation time. However, one must also account for the time spent forming the groups and concatenating the output channels.
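The division of the weight count can be checked with a small sketch (`grouped_conv_params` is a hypothetical helper, not a PyTorch API):

```python
def grouped_conv_params(in_ch, out_ch, k, groups=1, bias=True):
    """With groups, each output channel only convolves in_ch/groups inputs,
    so the weight count is divided by the number of groups."""
    assert in_ch % groups == 0 and out_ch % groups == 0
    weights = out_ch * (in_ch // groups) * k * k
    return weights + (out_ch if bias else 0)

# 2 input channels, 4 output channels, 3x3 kernels:
print(grouped_conv_params(2, 4, 3, groups=1))  # 76 = 4*2*9 + 4
print(grouped_conv_params(2, 4, 3, groups=2))  # 40 = 4*1*9 + 4
```

Note that the bias (one term per output channel) is not divided, only the weights are.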


Output Channel Size

With the knowledge of all the arguments, the size of the output channels can be calculated from the size of the input channels.

Following the PyTorch documentation, for an input of size (H_in, W_in) with kernel size K, padding P, stride S, and dilation D:

H_out = floor((H_in + 2*P[0] - D[0]*(K[0] - 1) - 1) / S[0] + 1)
W_out = floor((W_in + 2*P[1] - D[1]*(K[1] - 1) - 1) / S[1] + 1)
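The standard PyTorch Conv2d output-size formula can be sketched as a small helper (`conv2d_out_hw` is my own naming) and checked against the very first GIF of this post:

```python
import math

def conv2d_out_hw(h, w, k, s=(1, 1), p=(0, 0), d=(1, 1)):
    """Conv2d output size, one floor-formula per dimension:
    out = floor((in + 2p - d*(k - 1) - 1) / s) + 1"""
    h_out = math.floor((h + 2 * p[0] - d[0] * (k[0] - 1) - 1) / s[0]) + 1
    w_out = math.floor((w + 2 * p[1] - d[1] * (k[1] - 1) - 1) / s[1]) + 1
    return h_out, w_out

# The first GIF: a 7x7 input with K=(3, 3), P=(1, 1), S=(2, 2), D=(2, 2)
# produces a 3x3 output.
print(conv2d_out_hw(7, 7, (3, 3), s=(2, 2), p=(1, 1), d=(2, 2)))  # (3, 3)
```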

Sources

Deep Learning Tutorial, Y. LeCun

torch.nn documentation, PyTorch

Convolutional Neural Networks, CS231n

Convolutional Layers, Keras

All the images are homemade.

All computation time tests have been run with PyTorch, on my GPU (GeForce GTX 960M), and are available in this GitHub repository if you want to run them yourself or perform alternative tests.

