
WTF is image classification?

Photo by Micheile Henderson on Unsplash

“One thing that struck me early is that you don’t put into a photograph what’s going to come out. Or, vice versa, what comes out is not what you put in.”

Diane Arbus

A notification pops up on your favorite social network that someone posted a picture that might have you in it.

It’s right.

It’s the worst picture of you ever.


How did that happen?

Image classification!

The convolutional neural network (CNN) is a class of deep learning neural networks. CNNs represent a huge breakthrough in image recognition. They’re most commonly used to analyze visual imagery and frequently work behind the scenes in image classification. They can be found at the core of everything from Facebook’s photo tagging to self-driving cars, and they’re hard at work in everything from healthcare to security.

They’re fast and they’re efficient. But how do they work?

Image classification is the process of taking an input (like a picture) and outputting a class (like “cat”) or a probability that the input is a particular class (“there’s a 90% probability that this input is a cat”). You can look at a picture and know that you’re looking at a terrible shot of your own face, but how can a computer learn to do that?

With a convolutional neural network!

A CNN has

  • Convolutional layers
  • ReLU layers
  • Pooling layers
  • a Fully connected layer

A classic CNN architecture would look something like this:

Input -> Convolution -> ReLU -> Convolution -> ReLU -> Pooling ->
ReLU -> Convolution -> ReLU -> Pooling -> Fully Connected
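
Just to make that stack concrete, here’s a minimal PyTorch sketch of it (PyTorch shows up again later in this post). The channel counts, kernel sizes, input size, and the 10-class output below are illustrative assumptions, not anything prescribed by the architecture itself.

import torch.nn as nn

# A rough sketch of the classic stack above; all sizes are made-up examples
classic_cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # Convolution
    nn.ReLU(),                                    # ReLU
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # Convolution
    nn.ReLU(),                                    # ReLU
    nn.MaxPool2d(2),                              # Pooling
    nn.ReLU(),                                    # ReLU
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # Convolution
    nn.ReLU(),                                    # ReLU
    nn.MaxPool2d(2),                              # Pooling
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 10),                    # Fully connected (assumes 32x32 RGB inputs, 10 classes)
)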

A CNN convolves (not convolutes…) learned features with input data and uses 2D convolutional layers. This means that this type of network is ideal for processing 2D images. Compared to other image classification algorithms, CNNs actually use very little preprocessing. This means that they can learn the filters that have to be hand-made in other algorithms. CNNs can be used in tons of applications from image and video recognition, image classification, and recommender systems to natural language processing and medical image analysis.

CNNs are inspired by biological processes. They’re based on some cool research done by Hubel and Wiesel in the 60s regarding vision in cats and monkeys. The pattern of connectivity in a CNN comes from their research regarding the organization of the visual cortex. In a mammal’s eye, individual neurons respond to visual stimuli only in the receptive field, which is a restricted region. The receptive fields of different regions partially overlap so that the entire field of vision is covered. This is the way that a CNN works!

Image by NatWhitePhotography on Pixabay

CNNs have an input layer, an output layer, and hidden layers. The hidden layers usually consist of convolutional layers, ReLU layers, pooling layers, and fully connected layers.

  • Convolutional layers apply a convolution operation to the input. This passes the information on to the next layer.
  • Pooling combines the outputs of clusters of neurons into a single neuron in the next layer.
  • Fully connected layers connect every neuron in one layer to every neuron in the next layer.

In a convolutional layer, neurons only receive input from a subarea of the previous layer. In a fully connected layer, each neuron receives input from every element of the previous layer.

A CNN works by extracting features from images. This eliminates the need for manual feature extraction. The features are not hand-designed! They’re learned while the network trains on a set of images. This makes deep learning models extremely accurate for computer vision tasks. CNNs learn feature detection through tens or hundreds of hidden layers. Each layer increases the complexity of the learned features.


A CNN

  • starts with an input image
  • applies many different filters to it to create a feature map
  • applies a ReLU function to increase non-linearity
  • applies a pooling layer to each feature map
  • flattens the pooled images into one long vector.
  • inputs the vector into a fully connected artificial neural network.
  • processes the features through the network. The final fully connected layer provides the “voting” of the classes that we’re after.
  • trains through forward propagation and backpropagation for many, many epochs (see the sketch just after this list). This repeats until we have a well-defined neural network with trained weights and feature detectors.
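
Here’s what that last training step might look like in PyTorch, as a hedged sketch: classic_cnn is the model sketched earlier and train_loader is a hypothetical DataLoader that yields (image, label) batches; the loss, optimizer, and epoch count are arbitrary choices.

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()                               # classification loss
optimizer = torch.optim.SGD(classic_cnn.parameters(), lr=0.01)

for epoch in range(10):                                         # many, many epochs
    for images, labels in train_loader:                         # hypothetical DataLoader
        optimizer.zero_grad()
        outputs = classic_cnn(images)                           # forward propagation
        loss = criterion(outputs, labels)
        loss.backward()                                         # backpropagation
        optimizer.step()                                        # update the weights / feature detectors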

So what does that mean?

At the very beginning of this process, an input image is broken down into pixels.


For a black and white image, those pixels are interpreted as a 2D array (for example, 2x2 pixels). Every pixel has a value between 0 and 255. (Zero is completely black and 255 is completely white. The greyscale exists between those numbers.) Based on that information, the computer can begin to work on the data.

For a color image, this is a 3D array with a blue layer, a green layer, and a red layer. Each one of those colors has its own value between 0 and 255. The color can be found by combining the values in each of the three layers.
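
In NumPy terms, a tiny made-up example of both cases looks like this (the pixel values are arbitrary):

import numpy as np

# A 2x2 grayscale image: one value per pixel, 0 = black, 255 = white
gray = np.array([[  0, 255],
                 [128,  64]], dtype=np.uint8)
print(gray.shape)    # (2, 2)

# A 2x2 color image: a 3D array with one 0-255 value per channel
color = np.zeros((2, 2, 3), dtype=np.uint8)
color[0, 0] = [255, 0, 0]    # one pixel, one value per channel
print(color.shape)   # (2, 2, 3)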

What are the basic building blocks of a CNN?

Convolution

The main purpose of the convolution step is to extract features from the input image. The convolutional layer is always the first step in a CNN.

You have an input image, a feature detector, and a feature map. You take the filter and apply it pixel block by pixel block to the input image. You do this through the multiplication of the matrices.

Let’s say you have a flashlight and a sheet of bubble wrap. Your flashlight shines a 5-bubble x 5-bubble area. To look at the entire sheet, you would slide your flashlight across each 5x5 square until you’d seen all the bubbles.

Photo by stux on Pixabay

The light from the flashlight here is your filter and the region you’re sliding over is the receptive field. The light sliding across the receptive fields is your flashlight convolving. Your filter is an array of numbers (also called weights or parameters). The distance the light from your flashlight slides as it travels (are you moving your filter over one row of bubbles at a time? Two?) is called the stride. For example, a stride of one means that you’re moving your filter over one pixel at a time. The convention is a stride of two.

The depth of the filter has to be the same as the depth of the input, so if we were looking at a color image, the depth would be 3. That makes the dimensions of this filter 5x5x3. In each position, the filter multiplies the values in the filter with the original values in the pixel. This is element-wise multiplication. The multiplications are summed up, creating a single number. If you started at the top left corner of your bubble wrap, this number is representative of the top left corner. Now you move your filter to the next position and repeat the process all around the bubble wrap. The array you end up with is called a feature map or an activation map! You can use more than one filter, which will do a better job of preserving spatial relationships.
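
Here’s a bare-bones NumPy sketch of that multiply-and-sum process (single channel, no padding; convolve2d is just an illustrative helper, not a library function):

import numpy as np

def convolve2d(image, kernel, stride=1):
    # Slide the filter over the image, multiply element-wise, and sum
    k_h, k_w = kernel.shape
    out_h = (image.shape[0] - k_h) // stride + 1
    out_w = (image.shape[1] - k_w) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride + k_h, j*stride:j*stride + k_w]
            feature_map[i, j] = np.sum(patch * kernel)   # one number per filter position
    return feature_map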


You’ll specify parameters like the number of filters, the filter size, the architecture of the network, and so on. The CNN learns the values of the filters on its own during the training process. You have a lot of options that you can work with to make the best image classifier possible for your task. You can choose to pad the input matrix with zeros (zero padding) to apply the filter to bordering elements of the input image matrix. This also allows you to control the size of the feature maps. Adding zero padding is wide convolution. Not adding zero padding is narrow convolution.
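
You can see the effect of zero padding directly in the output sizes. With input width W, filter width F, padding P, and stride S, the output width is (W - F + 2P)/S + 1. A quick check with made-up numbers:

import torch
import torch.nn as nn

x = torch.randn(1, 1, 28, 28)                       # a made-up 28x28 grayscale input
narrow = nn.Conv2d(1, 1, kernel_size=5)             # no zero padding ("narrow")
wide = nn.Conv2d(1, 1, kernel_size=5, padding=2)    # zero padding ("wide")
print(narrow(x).shape)   # torch.Size([1, 1, 24, 24])  ->  (28 - 5 + 0)/1 + 1 = 24
print(wide(x).shape)     # torch.Size([1, 1, 28, 28])  ->  (28 - 5 + 4)/1 + 1 = 28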

This is basically how we detect images! We don’t look at every single pixel of an image. We see features like a hat, a red dress, a tattoo, and so on. There’s so much information going into our eyes at all times that we couldn’t possibly deal with every single pixel of it. We’re allowing our model to do the same thing.

The result of this is the convolved feature map. It’s smaller than the original input image. This makes it easier and faster to deal with. Do we lose information? Some, yes. But at the same time, the purpose of the feature detector is to detect features, which is exactly what this does.

We create many feature maps to get our first convolutional layer. This allows us to identify many different features that the program can use to learn.

Feature detectors can be set up with different values to get different results. For example, a filter can be applied that sharpens and focuses an image, or blurs an image (a blur gives equal importance to all the neighboring values). You can do edge enhancement, edge detection, and more. You would do that by applying different feature detectors to create different feature maps. The computer is able to determine which filters make the most sense and apply them.
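
For instance, a blur and a sharpen kernel could be applied by hand with OpenCV’s filter2D, something like this (the image path is a placeholder):

import numpy as np
import cv2

img = cv2.imread('some_image.jpg', cv2.IMREAD_GRAYSCALE)   # placeholder path

# Averaging (blur) kernel: gives equal importance to all the values
blur_kernel = np.ones((3, 3), dtype=np.float32) / 9.0

# A common sharpening kernel: boost the center pixel, subtract the neighbors
sharpen_kernel = np.array([[ 0, -1,  0],
                           [-1,  5, -1],
                           [ 0, -1,  0]], dtype=np.float32)

blurred = cv2.filter2D(img, -1, blur_kernel)
sharpened = cv2.filter2D(img, -1, sharpen_kernel)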

The primary purpose here is to find features in your image, put them into a feature map, and still preserve the spatial relationship between pixels. That’s important so that the pixels don’t get all jumbled up.

Let’s visualize this stuff!

Say hello to my little friend:

Photo by Kirgiz03 on Pixabay

We’re going to use this guy for our input image.

We’ll make him black and white

import cv2
import matplotlib.pyplot as plt
%matplotlib inline
img_path = 'data/pixabay_Kirgiz03.jpg'
# Load color image 
bgr_img = cv2.imread(img_path)
# Convert to grayscale
gray_img = cv2.cvtColor(bgr_img, cv2.COLOR_BGR2GRAY)
# Normalize, rescale entries to lie in [0,1]
gray_img = gray_img.astype("float32")/255
# Plot image
plt.imshow(gray_img, cmap='gray')
plt.show()

Let’s define and visualize our filters

import numpy as np
filter_vals = np.array([[-1, -1, 1, 1], [-1, -1, 1, 1], [-1, -1, 1, 1], [-1, -1, 1, 1]])
print('Filter shape: ', filter_vals.shape)

Filter shape: (4, 4)

# Define four different filters, all of which are linear combinations of the `filter_vals` defined above
filter_1 = filter_vals
filter_2 = -filter_1
filter_3 = filter_1.T
filter_4 = -filter_3
filters = np.array([filter_1, filter_2, filter_3, filter_4])
# Print out the values of filter 1 as an example
print('Filter 1: \n', filter_1)

and we see:

Filter 1: 
[[-1 -1 1 1]
[-1 -1 1 1]
[-1 -1 1 1]
[-1 -1 1 1]]

Here’s a visualization of our four filters


Now let’s define a convolutional layer (I’m loving PyTorch right now, so that’s what we’re using here.)

import torch
import torch.nn as nn
import torch.nn.functional as F
    
# Neural network with one convolutional layer with four filters
class Net(nn.Module):

    def __init__(self, weight):
        super(Net, self).__init__()
        # Initializes the weights of the convolutional layer to be the weights of the 4 defined filters
        k_height, k_width = weight.shape[2:]
        # Assumes there are 4 grayscale filters
        self.conv = nn.Conv2d(1, 4, kernel_size=(k_height, k_width), bias=False)
        self.conv.weight = torch.nn.Parameter(weight)

    def forward(self, x):
        # Calculates the output of a convolutional layer pre- and post-activation
        conv_x = self.conv(x)
        activated_x = F.relu(conv_x)

        # Returns both layers
        return conv_x, activated_x
    
# Instantiate the model and set the weights
weight = torch.from_numpy(filters).unsqueeze(1).type(torch.FloatTensor)
model = Net(weight)
# Print out the layer in the network
print(model)

We’ll see

Net(
(conv): Conv2d(1, 4, kernel_size=(4, 4), stride=(1, 1), bias=False)
)

Add a little more code

def viz_layer(layer, n_filters=4):
    fig = plt.figure(figsize=(20, 20))
    
    for i in range(n_filters):
        ax = fig.add_subplot(1, n_filters, i+1, xticks=[], yticks=[])
        # Grab layer outputs
        ax.imshow(np.squeeze(layer[0,i].data.numpy()), cmap='gray')
        ax.set_title('Output %s' % str(i+1))

Then a little more

# Plot original image
plt.imshow(gray_img, cmap='gray')
# Visualize all of the filters
fig = plt.figure(figsize=(12, 6))
fig.subplots_adjust(left=0, right=1.5, bottom=0.8, top=1, hspace=0.05, wspace=0.05)
for i in range(4):
    ax = fig.add_subplot(1, 4, i+1, xticks=[], yticks=[])
    ax.imshow(filters[i], cmap='gray')
    ax.set_title('Filter %s' % str(i+1))
# Convert the image into an input tensor
gray_img_tensor = torch.from_numpy(gray_img).unsqueeze(0).unsqueeze(1)
# Get the convolutional layer (pre and post activation)
conv_layer, activated_layer = model(gray_img_tensor)
# Visualize the output of a convolutional layer
viz_layer(conv_layer)

And we can visualize the output of a convolutional layer before a ReLU activation function is applied!


Now let’s create a custom kernel using a Sobel operator as an edge detection filter. The Sobel filter is very commonly used in edge detection. It does a good job of finding patterns in intensity in an image. Applying a Sobel filter to an image is a way of taking an approximation of the derivative of the image separately in the x- or y-direction.

We’ll convert our little dude to grayscale for filtering

gray = cv2.cvtColor(bgr_img, cv2.COLOR_BGR2GRAY)
plt.imshow(gray, cmap='gray')

Here we go!

# 3x3 Sobel kernels for edge detection
sobel_y = np.array([[-1, -2, -1],
                    [ 0,  0,  0],
                    [ 1,  2,  1]])
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])
  
filtered_image = cv2.filter2D(gray, -1, sobel_y)
plt.imshow(filtered_image, cmap='gray')

Want to check out the math? Take a look at Introduction to Convolutional Neural Networks by Jianxin Wu.

ReLU layer

The ReLU (rectified linear unit) layer is another step after our convolution layer. You’re applying an activation function to your feature maps to increase non-linearity in the network. This is because images themselves are highly non-linear! It removes negative values from an activation map by setting them to zero.

Convolution is a linear operation with things like element-wise matrix multiplication and addition. The real-world data we want our CNN to learn will be non-linear. We can account for that with an operation like ReLU. You can use other operations like tanh or sigmoid. ReLU, however, is a popular choice because it can train the network faster without any major penalty to generalization accuracy.
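
ReLU itself is nothing more than max(0, x) applied element-wise. A tiny illustration with made-up values:

import numpy as np

feature_map = np.array([[ 2.0, -1.5],
                        [-0.3,  4.0]])

relu_map = np.maximum(0, feature_map)   # negative values become zero
print(relu_map)
# [[2. 0.]
#  [0. 4.]]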

Check out C.-C. Jay Kuo’s Understanding Convolutional Neural Networks With a Mathematical Model.

Want to dig deeper? Try Kaiming He et al., Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.

If you need a little more info about the absolute basics of activation functions, you can find that here!

Here’s how our little buddy is looking after a ReLU activation function turns all of the negative pixel values black

viz_layer(activated_layer)
