
A visualization of the basic elements of a Convolutional Neural Network

source link: https://towardsdatascience.com/a-visualization-of-the-basic-elements-of-a-convolutional-neural-network-75fea30cd78d?gi=ed29810b3dd

May 26 · 8 min read

Visualization is a great tool for understanding rich concepts, especially for beginners in the field. In this article, we will go through the basic elements of a convolutional neural network (CNN) using visual aids. The article begins by providing a visual template for a basic CNN with its different building blocks, and then discusses the most commonly used elements of each building block.

Basic CNN Template:

A basic CNN consists of three kinds of layers: input, hidden, and output, as shown below. The data enters the CNN through the input layer and passes through various hidden layers before reaching the output layer. The output layer is the prediction of the network. The output of the network is compared to the actual labels in terms of a loss or error. For the network to learn, the partial derivatives of this loss w.r.t. the trainable weights are calculated via backpropagation, and the weights are updated using one of several gradient-based methods.

The complete visual template for a basic CNN can be seen below.


Template for a basic CNN

Hidden Layers of CNN

The hidden layers in the network provide the basic building blocks that transform the data (the input layer or the output of the previous hidden layer). Most of the commonly used hidden layers (not all) follow a pattern: they begin by applying a function to the input, move on to pooling, then normalization, and finally apply an activation before the result is fed as input to the next layer. Thus, each layer can be decomposed into the following 4 sub-functions:

  1. Layer function: Basic transforming function such as a convolutional or fully connected layer.
  2. Pooling: Used to change the spatial size of the feature map, either increasing (up-sampling) or decreasing (most common) it. For example, max pooling, average pooling, and unpooling.
  3. Normalization: This sub-function normalizes the data to have zero mean and unit variance. This helps in coping with problems such as vanishing gradients and internal covariate shift (more information). The two most common normalization techniques are local response normalization and batch normalization.
  4. Activation: Applies a non-linearity and bounds the output, preventing it from getting too high or too low.

We will go through each of the sub-functions explaining their most common examples.

There are far more complex CNN architectures out there, with various other layers and rather intricate structure. Not all CNN architectures follow this template.

1. Layer functions

The most commonly used layer functions are the fully connected, convolutional, and transposed convolutional (wrongfully known as deconvolutional) layers.


a. Fully Connected Layers:

These layers consist of a linear function between the input and the output. For i input nodes and j output nodes, the trainable weights are w_ij and b_j. For example, a fully connected layer between 3 input nodes and 2 output nodes has 3x2 weights and 2 biases.
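As a rough illustration, here is a minimal NumPy sketch (not from the article; the weight values are random placeholders) of such a fully connected layer with 3 input and 2 output nodes:

```python
import numpy as np

# Minimal sketch of a fully connected layer: y_j = sum_i(w_ij * x_i) + b_j
x = np.array([1.0, 2.0, 3.0])        # input (i = 3 nodes)
W = np.random.randn(3, 2) * 0.1      # trainable weights w_ij
b = np.zeros(2)                      # trainable biases b_j

y = x @ W + b                        # output (j = 2 nodes)
print(y.shape)                       # (2,)
```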

b. Convolutional Layers:

These layers are applied to 2D (and 3D) input feature maps. The trainable weights are a 2D (or 3D) kernel/filter that moves across the input feature map, computing dot products with the overlapping region of the input feature map. The following 3 parameters define a convolutional layer:

  • Kernel Size K: The size of the sliding kernel or filter.
  • Stride Length S: Defines how far the kernel slides across the input between successive dot products, i.e. between one output pixel and the next.
  • Padding P: The frame size of zeros inserted around the input feature map.

The animations below visually explain the convolutional layer on an input of size (i) 5x5, for a kernel size (k) of 3x3 and varying strides (s) and padding (p).


Animated convolutional layer (Source: Aqeel Anwar)

The stride and padding, along with the input feature map size, control the size of the output feature map. The output size is given by o = floor((i - k + 2p) / s) + 1.
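The following minimal NumPy sketch (an illustration, not the article's code) implements a single-channel convolution with an explicit kernel size, stride, and padding; its output shape follows the formula above:

```python
import numpy as np

def conv2d(x, kernel, stride=1, padding=0):
    """Minimal single-channel 2D convolution (cross-correlation) sketch."""
    x = np.pad(x, padding)                       # frame of zeros (padding p)
    k = kernel.shape[0]
    out = (x.shape[0] - k) // stride + 1         # o = floor((i - k + 2p)/s) + 1
    y = np.zeros((out, out))
    for r in range(out):
        for c in range(out):
            patch = x[r*stride:r*stride+k, c*stride:c*stride+k]
            y[r, c] = np.sum(patch * kernel)     # dot product with the overlap
    return y

x = np.arange(25, dtype=float).reshape(5, 5)     # input feature map, i = 5x5
k = np.ones((3, 3))                              # kernel, k = 3x3
print(conv2d(x, k, stride=2, padding=1).shape)   # (3, 3)
```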

c. Transposed Convolutional (DeConvolutional) Layer:

Usually used to increase the size of the output feature map (up-sampling). The idea behind the transposed convolutional layer is to (approximately) undo the convolutional layer. Like the convolutional layer, it is defined by a kernel size, stride length, and padding. If we apply a convolution with the given kernel size, stride, and padding to the output, it generates a feature map of the same size as the input.


Transposed Convolutional Layer (Source: Aqeel Anwar)

In order to generate the output, two operations are carried out:

  • zero insertion ( z ): The number of zeros inserted between the rows and columns of the original input.
  • padding ( p’ ): The frame size of zeros inserted around the zero-inserted input feature map.

The animations below visually explain the transposed convolutional layer on inputs of varying size (i), for a kernel size (k) of 3x3 and varying strides (s) and padding (p), while the output (o) is fixed to 5x5.


Animated transposed convolutional layer (Source: Aqeel Anwar)
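Putting the two operations together, the sketch below builds a transposed convolution from zero insertion, padding, and an ordinary stride-1 convolution. It assumes the commonly used relations z = s - 1 and p' = k - p - 1 (these are standard but not stated in the article) and a square single-channel input:

```python
import numpy as np
from scipy.signal import correlate2d

def transposed_conv2d(x, kernel, stride=2, padding=1):
    """Sketch: transposed convolution as zero insertion + plain convolution."""
    i, k = x.shape[0], kernel.shape[0]
    z = stride - 1                         # zeros inserted between rows/cols (z)
    p_prime = k - padding - 1              # padding of the expanded input (p')
    expanded = np.zeros((i + (i - 1) * z,) * 2)
    expanded[::stride, ::stride] = x       # spread the input pixels apart
    expanded = np.pad(expanded, p_prime)   # frame of zeros around it
    return correlate2d(expanded, kernel, mode="valid")  # stride-1 convolution

x = np.arange(9, dtype=float).reshape(3, 3)                # input, i = 3x3
k = np.ones((3, 3))                                        # kernel, k = 3x3
print(transposed_conv2d(x, k, stride=2, padding=1).shape)  # (5, 5): o = s(i-1) + k - 2p
```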

In-depth details on transposed convolutional layers can be found here

2. Pooling

The most commonly used pooling layers are max pooling, average pooling, and max/average unpooling.

Max/Average Pooling:

A non-trainable layer used to decrease the spatial size of the input layer based on selecting the maximum/average value in a receptive field defined by the kernel. A kernel is slid across the input feature map with a given stride. For each position, the maximum/average value of the part of the input feature map overlapping the kernel is the corresponding output pixel.


Animated Max pooling layer (Source: Aqeel Anwar)
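A minimal NumPy sketch of max pooling (illustrative only; a square single-channel input is assumed) looks like this:

```python
import numpy as np

def max_pool2d(x, kernel=2, stride=2):
    """Minimal max-pooling sketch on a single-channel feature map."""
    out = (x.shape[0] - kernel) // stride + 1
    y = np.zeros((out, out))
    for r in range(out):
        for c in range(out):
            # output pixel = max over the receptive field covered by the kernel
            y[r, c] = x[r*stride:r*stride+kernel, c*stride:c*stride+kernel].max()
    return y

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(x))          # 2x2 output, each pixel the max of a 2x2 window
```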

UnPooling:

A non-trainable layer used to increase the spatial size of the input layer by placing each input pixel at a certain index in the receptive field of the output, as defined by the kernel. For an unpooling layer, there needs to be a corresponding pooling layer earlier in the network. The index of the maximum/average value from the corresponding pooling layer is saved and reused: each input pixel is placed in the output at the index where the maximum/average occurred in the pooling layer, while the other pixels are set to zero.
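The sketch below pairs a max-pooling pass that records the argmax indices with an unpooling pass that scatters values back to those indices. The helper names are made up for illustration:

```python
import numpy as np

def max_pool_with_indices(x, k=2):
    """Max pooling that also records where each maximum came from."""
    out = x.shape[0] // k
    y = np.zeros((out, out))
    idx = np.zeros((out, out, 2), dtype=int)
    for r in range(out):
        for c in range(out):
            patch = x[r*k:(r+1)*k, c*k:(c+1)*k]
            m = np.unravel_index(patch.argmax(), patch.shape)
            y[r, c] = patch[m]
            idx[r, c] = (r*k + m[0], c*k + m[1])   # index in the original map
    return y, idx

def max_unpool(y, idx, size):
    """Place each pooled value back at its recorded index; the rest is zero."""
    x = np.zeros((size, size))
    for r in range(y.shape[0]):
        for c in range(y.shape[1]):
            x[idx[r, c, 0], idx[r, c, 1]] = y[r, c]
    return x

x = np.random.rand(4, 4)
pooled, idx = max_pool_with_indices(x)
print(max_unpool(pooled, idx, 4))   # sparse 4x4 map with the maxima restored
```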

3. Normalization

Normalization is usually applied just before the activation function to keep unbounded activations from driving the output values too high. Two types of normalization techniques are commonly used:

a. Local Response Normalization (LRN):

LRN is a non-trainable layer that square-normalizes the pixel values in a feature map within a local neighborhood. There are two types of LRN based on how the neighborhood is defined, inter-channel and intra-channel, as shown in the figure below.


Left: Intra-channel LRN. Right: Inter-channel LRN
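As an illustrative sketch of inter-channel LRN, using the AlexNet-style formula and its usual constants (the article does not specify these values):

```python
import numpy as np

def inter_channel_lrn(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Sketch of inter-channel LRN over `n` neighbouring channels.
    `a` has shape (channels, height, width); constants are common defaults."""
    c = a.shape[0]
    b = np.zeros_like(a)
    for i in range(c):
        lo, hi = max(0, i - n // 2), min(c, i + n // 2 + 1)
        denom = (k + alpha * np.sum(a[lo:hi] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom                 # square-normalize across channels
    return b

a = np.random.rand(8, 5, 5)        # 8 feature maps of size 5x5
print(inter_channel_lrn(a).shape)  # (8, 5, 5)
```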

b. Batch Normalization (BN):

BN, on the other hand, is a trainable approach to normalizing the data. In batch normalization, the output of the hidden neurons is processed in the following manner before being fed to the activation function (a short sketch follows the steps below).

  1. Normalize the entire mini-batch B to zero mean and unit variance
  • Calculate the mean of the entire mini-batch output: u_B
  • Calculate the variance of the entire mini-batch output: sigma_B
  • Normalize the mini-batch by subtracting the mean and dividing by the standard deviation
  2. Introduce two trainable parameters (Gamma: scale_variable and Beta: shift_variable) to scale and shift the normalized mini-batch output
  3. Feed this scaled and shifted normalized mini-batch to the activation function.
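The three steps above can be sketched in a few lines of NumPy (training mode only; the running statistics used at inference time are omitted):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Minimal batch-normalization sketch over a mini-batch (training mode).
    x has shape (batch, features); gamma and beta are the trainable scale/shift."""
    mu = x.mean(axis=0)                       # step 1a: mini-batch mean u_B
    var = x.var(axis=0)                       # step 1b: mini-batch variance sigma_B
    x_hat = (x - mu) / np.sqrt(var + eps)     # step 1c: zero mean, unit variance
    return gamma * x_hat + beta               # step 2: scale and shift

x = np.random.randn(32, 4) * 3 + 7            # mini-batch far from zero mean/unit var
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ~0 and ~1 per feature
```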

A summary of the two normalization techniques can be seen below

A detailed article on these normalization techniques can be found here

4. Activation

The main purpose of activation functions is to introduce non-linearity, so the CNN can efficiently learn complex non-linear mappings between the input and output. Multiple activation functions are available and are chosen based on the underlying requirements.

  • Non-parametric/Static functions: Linear, ReLU
  • Parametric functions: ELU, tanh, sigmoid, Leaky ReLU
  • Bounded functions: tanh, sigmoid

The gif below visually explains the nature of the most commonly used activation functions.


Animated Activation functions (Source: Aqeel Anwar)

The most commonly used activation function is ReLU. Bounded activation functions such as tanh and sigmoid suffer from the vanishing-gradient problem in deeper neural networks and are normally avoided.
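For reference, minimal element-wise NumPy sketches of the activation functions mentioned above:

```python
import numpy as np

# Element-wise activation functions (illustrative sketches).
def relu(x):               return np.maximum(0.0, x)
def leaky_relu(x, a=0.01): return np.where(x > 0, x, a * x)
def elu(x, a=1.0):         return np.where(x > 0, x, a * (np.exp(x) - 1))
def sigmoid(x):            return 1.0 / (1.0 + np.exp(-x))   # bounded to (0, 1)
def tanh(x):               return np.tanh(x)                 # bounded to (-1, 1)

x = np.linspace(-3, 3, 7)
print(relu(x))
print(sigmoid(x))
```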

5. Loss Calculation

Once you have defined your CNN, a loss function needs to be picked that quantifies how far off the CNN prediction is from the actual labels. This loss is then used in the gradient descent method to train the network variables. Like the activation functions, there are multiple candidates available for loss functions.

Regression Loss Functions

  • Mean Absolute Error: The estimated value and labels are real numbers
  • Mean Square Error: The estimated value and labels are real numbers
  • Huber Loss: The estimated value and labels are real numbers

Classification Loss Functions

  • Cross-Entropy: The estimated values and labels are probabilities in (0, 1)
  • Hinge Loss: The estimated value is a real number and the labels are ±1

The details on these loss functions can be seen in the plot below


Animated ML Loss Functions (Source: Aqeel Anwar)
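Minimal NumPy sketches of these loss functions (batch-averaged; the binary cross-entropy and ±1 hinge-loss conventions are assumptions, as the article does not spell them out):

```python
import numpy as np

def mae(y_true, y_pred):   return np.mean(np.abs(y_true - y_pred))      # Mean Absolute Error
def mse(y_true, y_pred):   return np.mean((y_true - y_pred) ** 2)       # Mean Square Error

def huber(y_true, y_pred, delta=1.0):
    err = np.abs(y_true - y_pred)
    return np.mean(np.where(err <= delta,
                            0.5 * err ** 2,                  # quadratic near zero
                            delta * (err - 0.5 * delta)))    # linear for large errors

def cross_entropy(p_true, p_pred, eps=1e-12):                # probabilities in (0, 1)
    p_pred = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(p_true * np.log(p_pred) + (1 - p_true) * np.log(1 - p_pred))

def hinge(y_true, y_pred):                                   # labels in {-1, +1}
    return np.mean(np.maximum(0.0, 1.0 - y_true * y_pred))

y, y_hat = np.array([1.0, -1.0]), np.array([0.8, 0.3])
print(mse(y, y_hat), hinge(y, y_hat))
```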

6. Backpropagation

Backpropagation is not a structural element of the CNN; rather, it is the methodology through which we learn the underlying problem, updating the weights in the direction opposite to the gradient of the loss (gradient descent). In-depth details on different gradient descent algorithms can be found here .
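As a toy illustration (a tiny hand-derived example, not the article's method), here are a few gradient-descent steps on a single fully connected layer with an MSE loss:

```python
import numpy as np

lr = 0.1                                          # learning rate
x, y_true = np.array([1.0, 2.0]), np.array([1.0])
W, b = np.random.randn(2, 1) * 0.1, np.zeros(1)

for step in range(5):
    y_pred = x @ W + b                            # forward pass
    loss = np.mean((y_pred - y_true) ** 2)        # scalar loss

    grad_y = 2 * (y_pred - y_true) / y_true.size  # dLoss/dy_pred
    grad_W = np.outer(x, grad_y)                  # dLoss/dW via the chain rule
    grad_b = grad_y                               # dLoss/db

    W -= lr * grad_W                              # move opposite to the gradient
    b -= lr * grad_b
    print(step, float(loss))                      # the loss shrinks step by step
```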

Summary:

In this article, animated visualizations of the different elements of a basic CNN have been presented, which should help in understanding their functions better.

