

Visual Attention in Deep Learning
source link: https://medium.com/@sunnerli/visual-attention-in-deep-learning-77653f611855

Introduction
It’s well known that CNNs became the dominant model after AlexNet beat its competitors and won the ImageNet competition in 2012. However, the sliding-window style of convolution has not changed since then. DeepMind instead proposed the idea of attention, trying to imitate the way humans actually look at images. In this article, I will walk from the original attention model to more recent ones, in order.
RAM: an interesting idea
In a convolution layer, the kernel slides over the whole feature map in a sliding-window fashion. Receptive fields and shared weights are the two main advantages of the CNN; the slow, exhaustive sliding window is the major disadvantage.
DeepMind tried to describe how a human actually looks at an object in front of them. Take the following picture as an example: what is the process by which you know there is a dog in the picture?
Figure: an image containing a dog.
In an ordinary CNN, the kernel starts scanning from the top-left corner and slides all the way to the right edge, row by row. Even if the kernel has already seen the dog in the middle of the image, it keeps sliding to the right edge as usual. But is that really how you recognized the dog?
I think the answer is no. After a person recognizes the dog’s head, they will probably look at the body or the tail next rather than at the white floor on the right. In other words, a human decides where to look next instead of scanning every remaining region.
DeepMind gave this habit a creative name: attention. They tried to imitate the way people pay attention to the object they are focusing on. As a result, the recurrent attention model (RAM) was proposed [1].
Figure: the structure of the recurrent attention model (re-drawn).
The figure above, which I re-drew myself, shows the structure of RAM. There are four parts in RAM: the glimpse network, the core network, the action network and the location network. In DeepMind’s formulation, the model has only limited bandwidth to observe the image, so it should learn to generate a more accurate location to look at in the next time step. This idea reduces the computation compared with sliding over every region.
Another property of this paper is that DeepMind turns this vision problem into a reinforcement learning task! In reinforcement learning terms, the agent interacts with the environment (the whole image). At each step, the agent classifies the small observed patch into a label and chooses the next action (the next location). The agent’s goal is to maximize its reward, i.e. to classify the label correctly.
Figure: the process of the glimpse sensor.
In RAM, the image and a location coordinate are first fed into the glimpse sensor to produce a retina-like representation. Second, the location coordinate and the retina-like representation are merged in the glimpse network to produce a glimpse vector. Next, an LSTM unit computes the new state from the previous state vector and the glimpse vector. Finally, the action network and the location network generate the predicted label and the next location coordinate, respectively. The figure above shows how the glimpse sensor generates the retina-like representation.
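As a rough illustration of the retina-like representation, here is a minimal NumPy sketch of a glimpse sensor: it crops patches of increasing size around the requested location and down-samples them to a common resolution. The function name, patch sizes and the crude strided down-sampling are my own choices, not the paper’s.

```python
import numpy as np

def glimpse_sensor(image, center, base_size=8, num_scales=3):
    """Crop multi-resolution patches around `center` and down-sample them to a
    common size, mimicking the retina-like representation.  `image` is a 2-D
    array and `center` is an integer (row, col) position in pixels."""
    r, c = center
    patches = []
    for s in range(num_scales):
        size = base_size * 2 ** s          # each scale doubles the field of view
        half = size // 2
        padded = np.pad(image, half)       # zero-pad so crops near the border stay valid
        crop = padded[r:r + size, c:c + size]
        step = 2 ** s                      # crude strided down-sampling back to base_size
        patches.append(crop[::step, ::step])
    # concatenate the scales into one flat "retina-like" vector
    return np.concatenate([p.ravel() for p in patches])

# e.g. an 8x8 + 16x16 + 32x32 view around pixel (30, 40) of a 60x60 image
retina = glimpse_sensor(np.random.rand(60, 60), (30, 40))
print(retina.shape)                        # (3 * 8 * 8,) = (192,)
```

The glimpse network then combines this vector with an embedding of the location coordinate before it is fed to the LSTM.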
In this last paragraph about RAM, I want to briefly talk about the loss function. The goal of RAM is to maximize the log-likelihood of taking the correct actions. However, the expectation involved is hard to compute exactly, so Monte Carlo sampling is adopted to approximate it.
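Concretely, the policy-gradient (REINFORCE) estimate used in [1] looks roughly like this, where M episodes are sampled, u_t^i is the action taken at step t of episode i, and R^i is the reward of that episode:

$$
\nabla_\theta J(\theta) \;\approx\; \frac{1}{M}\sum_{i=1}^{M}\sum_{t=1}^{T} \nabla_\theta \log \pi\!\left(u_t^{i} \mid s_{1:t}^{i};\, \theta\right) R^{i}
$$

In practice a learned baseline is subtracted from R^i to reduce the variance of this estimate.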
DRAM: make the task more general
In RAM, the glimpse is defined as multi-resolution crops of the image, and the combination of deep learning and reinforcement learning was a creative idea at the time. However, there is a critical limitation: RAM was only shown to solve very simple classification problems. The comment in [2] criticizing RAM is blunt:
While RAM was shown to learn successful gaze strategies on cluttered digit classification tasks and on a toy visual control problem it was not shown to scale to real-world image tasks or multiple objects.
Simply put, DeepMind wanted to broaden the use case this time: the goal is to recognize multiple objects in a single image. As a result, the deep recurrent attention model (DRAM) was born [2].
Figure: the structure of the deep recurrent attention model.
There are five parts in DRAM: the glimpse network, the recurrent network, the context network, the classification network and the emission network. The biggest difference between RAM and DRAM is that DRAM uses two stacked LSTM units: the first is responsible for the classification task and the second for predicting the location.
Another point you should be aware of: the initial state of the second LSTM unit is generated by the context network. The context network is composed of three convolution layers applied to a down-sampled, coarse version of the image. This coarse image provides hints for emitting the location coordinate. On the other hand, we want the model to predict the label from the glimpses alone, so a zero initial state is adopted only for the first LSTM unit.
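A minimal sketch of such a context network is shown below; the coarse-image size and filter counts are my own guesses, not the paper’s.

```python
import tensorflow as tf

# Rough sketch of a context network: three convolutions over the down-sampled
# coarse image are flattened into the vector that initialises the second
# (location) LSTM.  All layer sizes here are illustrative only.
context_net = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 1)),               # coarse, low-resolution image
    tf.keras.layers.Conv2D(16, 5, strides=2, activation="relu"),
    tf.keras.layers.Conv2D(32, 3, strides=2, activation="relu"),
    tf.keras.layers.Conv2D(64, 3, strides=2, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128),                      # initial state of the location LSTM
])

init_state = context_net(tf.zeros([1, 32, 32, 1]))   # shape (1, 128)
```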
In RAM, the model only predicts a single label for the whole image; in DRAM, the model emits a label sequence for the multiple objects, much like a seq2seq model. The detection process continues until the recurrent network produces a stop signal. A structural sketch of this decoding loop is shown below.
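The following is only a structural sketch of that loop: plain tanh recurrences stand in for the paper’s LSTM cells, and every matrix, size and the glimpse_fn callable are placeholders of my own, not the actual DRAM architecture.

```python
import numpy as np

D, C = 128, 11                         # hidden size; 10 digit classes + 1 "stop" label
rng = np.random.default_rng(0)
P = {name: rng.normal(scale=0.1, size=shape) for name, shape in {
    "W_g": (D, D), "W_c": (D, D), "W_u": (D, D),
    "W_l": (D, D), "W_y": (C, D), "W_e": (2, D)}.items()}

def dram_decode(glimpse_fn, context_vec, max_steps=8, stop_label=C - 1):
    """Emit a label sequence until the stop label (or max_steps) is reached."""
    h_cls = np.zeros(D)                # classification RNN starts from zeros
    h_loc = np.tanh(context_vec)       # location RNN starts from the context network
    loc, labels = np.zeros(2), []
    for _ in range(max_steps):
        g = glimpse_fn(loc)                          # glimpse network output at `loc`
        h_cls = np.tanh(P["W_g"] @ g + P["W_c"] @ h_cls)
        h_loc = np.tanh(P["W_u"] @ h_cls + P["W_l"] @ h_loc)
        label = int(np.argmax(P["W_y"] @ h_cls))     # classification network
        loc = np.tanh(P["W_e"] @ h_loc)              # emission network: next location
        if label == stop_label:                      # stop signal ends decoding
            break
        labels.append(label)
    return labels

print(dram_decode(lambda loc: np.ones(D), np.ones(D)))
```

The point to notice is the division of labour: h_cls (started from zeros) drives the predicted labels, while h_loc (started from the context network output) drives the next glimpse location.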
STN: make the CNN more robust
In classical computer vision, spatial invariance is an important property. The following image shows Haar-like features, popular hand-crafted patterns associated with spatial invariance. What about deep learning models: are they capable of spatial invariance?
Figure: Haar-like features.
The answer is yes, but not very well. In the max-pooling mechanism, the model selects the most representative pixel in each window, so if the same feature shifts to a neighbouring row or column inside the window, the pooled output does not change; that is where the spatial invariance comes from. However, the lower layers cannot learn this property, and since the pooling window is very small, each layer has only a limited ability to absorb spatial differences. The toy example below makes this concrete.
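A tiny NumPy demonstration of this limit (the arrays and the helper are mine, purely for illustration): a one-pixel shift inside a 2x2 pooling window is absorbed, while a shift across the window boundary changes the output.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2-D array."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.zeros((4, 4)); a[0, 0] = 1.0   # a feature at (0, 0)
b = np.zeros((4, 4)); b[1, 1] = 1.0   # the same feature shifted inside its 2x2 window
c = np.zeros((4, 4)); c[2, 2] = 1.0   # shifted across the window boundary

print(max_pool_2x2(a))                # identical to max_pool_2x2(b): the shift is absorbed
print(max_pool_2x2(b))
print(max_pool_2x2(c))                # different: pooling only tolerates small shifts
```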
This time, DeepMind asked another question: can we add a layer that helps the CNN learn spatial invariance? The answer is yes, and the spatial transformer network (STN) is the result [3]. DeepMind’s idea is to apply an image transformation (e.g. an affine transformation) to the feature map.
Figure: the structure of the spatial transformer network.
The job of the spatial transformer is to transform a feature map into another representation. There are three parts in the STN: the localisation network, the grid generator and the sampler. The localisation network, composed of fully-connected or convolution layers, regresses the transformation parameters.
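For instance, a toy localisation network might look like the sketch below (the layer sizes are my own choice, not the paper’s); in practice its last layer is usually initialised so that the predicted transform starts as the identity.

```python
import tensorflow as tf

# Toy localisation network: it regresses the six affine parameters theta
# from the input feature map.
loc_net = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(32, activation="tanh"),
    tf.keras.layers.Dense(6),          # theta = [t11, t12, t13, t21, t22, t23]
])

theta = loc_net(tf.zeros([1, 28, 28, 1]))   # shape (1, 6)
```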
Figure: the spatial transformation process.
The second part is the grid generator. Once we have the parameters of the affine transformation, we can compute the corresponding coordinate in the opposite direction. Note that the input of the transformation function is a coordinate of the target feature map! But why the target coordinate? After all, it is the source feature map whose pixel values we know, not the target’s!
Figure: the computing formula of the affine transformation.
This is the tricky part. DeepMind computes, for every target pixel, where it came from in the source. So the only task of the grid generator is to compute, for each target pixel, the corresponding coordinate in the source feature map.
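Written out (roughly following the notation of [3]), the mapping computed by the grid generator is:

$$
\begin{pmatrix} x^{s} \\ y^{s} \end{pmatrix}
= A_{\theta}\begin{pmatrix} x^{t} \\ y^{t} \\ 1 \end{pmatrix}
= \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix}
\begin{pmatrix} x^{t} \\ y^{t} \\ 1 \end{pmatrix}
$$

where (x^t, y^t) is a coordinate of the regular grid over the target feature map and (x^s, y^s) is the source coordinate it is sampled from. Padding the target coordinate with a trailing 1 is what lets θ13 and θ23 act as a translation.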
Figure: the computing formula of bi-linear interpolation.
So what intensity should each pixel of the target feature map take? That is the sampler’s duty: it generates the intensity at each target coordinate by bi-linear interpolation.
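Roughly following the notation of [3], the sampled value at target location i is a weighted sum over the source pixels:

$$
V_{i} = \sum_{n=1}^{H}\sum_{m=1}^{W} U_{nm}\,\max\!\left(0,\, 1-\left|x_{i}^{s}-m\right|\right)\max\!\left(0,\, 1-\left|y_{i}^{s}-n\right|\right)
$$

where H and W are the height and width of the source feature map, U_{nm} is the intensity of the source pixel at row n and column m, and (x_i^s, y_i^s) is the source coordinate produced by the grid generator. Although the sum formally runs over every source pixel, the max(0, 1 − |·|) terms are non-zero only for the pixels immediately around (x_i^s, y_i^s).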
A simple demonstration of the STN computation
Here we show a simple example so that you can understand how the sampler generates the pixel intensity.
Figure: the example image.
The image above is a 4*4 image, which we assume is gray-scale. The right part of the figure shows the actual image corresponding to the values on the left.
Figure: the intensity matrix of the example image.
To simplify the explanation, we can rewrite the image in matrix form. The figure above shows the 4*4 matrix of intensities.
Figure: the coordinate of each grid cell.
We assume the coordinates shown above, with the top-left cell at position (0, 0).
We can regard each grid cell as a single pixel, with its center point representing that pixel’s coordinate. As the image above shows, the pink points at the centers of the cells represent the pixels.
Figure: the example of the affine transformation.
Suppose we have already determined the theta parameters of the affine transformation. In the image above, the left-most matrix is the transformation matrix and the middle one is the target coordinate. To allow a shift, we pad the coordinate with a trailing 1 so that it becomes a vector of length 3. After applying the affine transformation, the source coordinate is [2.5, 2.5].
The image above illustrates this computation again: the affine transformation maps the target coordinate [1, 1] to the source coordinate [2.5, 2.5]. However, there is no pink point at that position! The coordinate is a fractional value, so how do we determine the intensity there? We use bi-linear interpolation.
You can read the bi-linear interpolation formula from the image above: each pixel contributes some weight to this fractional point. Traditional bi-linear interpolation only considers the nearest neighbours around the coordinate, but the formulation DeepMind writes down is more general: the sum runs over all source pixels, so in principle every pixel can influence this point (with the bi-linear kernel, only the nearest neighbours actually receive non-zero weight).
Figure: the idea of bi-linear interpolation over all pixels, drawn in 3D.
I redraw the situation as a 3D image, where the z-axis represents each pixel’s level of influence on the point. Simply put, the intensity of the point is the weighted sum, over all pink points, of each pixel’s intensity multiplied by its distance-based weight.
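To make the arithmetic concrete, here is a small NumPy reproduction of this example. The intensity values and the theta parameters are my own assumptions (the originals only appear in the figures); theta is chosen so that the target coordinate (1, 1) lands on the source coordinate (2.5, 2.5), and coordinates are treated as (x, y) = (column, row) with the top-left pixel at (0, 0).

```python
import numpy as np

# The 4x4 intensity matrix in the figure is not reproduced in the text,
# so the values below are made up for illustration only.
U = np.array([[0.0, 0.2, 0.4, 0.6],
              [0.1, 0.3, 0.5, 0.7],
              [0.2, 0.4, 0.6, 0.8],
              [0.3, 0.5, 0.7, 0.9]])

# An assumed affine matrix that sends target (1, 1) to source (2.5, 2.5):
# identity scaling plus a translation of 1.5 along both axes.
theta = np.array([[1.0, 0.0, 1.5],
                  [0.0, 1.0, 1.5]])

x_t, y_t = 1.0, 1.0
x_s, y_s = theta @ np.array([x_t, y_t, 1.0])          # grid generator -> (2.5, 2.5)

# Sampler: every source pixel gets weight max(0, 1-|dx|) * max(0, 1-|dy|),
# which is non-zero only for the four neighbours of (2.5, 2.5).
rows, cols = np.meshgrid(np.arange(4), np.arange(4), indexing="ij")
weights = np.maximum(0, 1 - np.abs(cols - x_s)) * np.maximum(0, 1 - np.abs(rows - y_s))
value = np.sum(weights * U)

print(x_s, y_s)    # 2.5 2.5
print(value)       # 0.75: the average of the four pixels around (2.5, 2.5)
```

For this theta the four neighbours of (2.5, 2.5) each get weight 0.25, so the sampled value is simply their average.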
EDRAM: throw the RL away!
In 2017, a group from NTU in Singapore combined the STN with DRAM and proposed a new model: the enriched deep recurrent attention model (EDRAM) [4].
Figure: the structure of the enriched deep recurrent attention model.
The image above shows the structure of EDRAM. As you can see, there is no big structural change; the main difference is that EDRAM adds the attention mechanism (a spatial transformer).
Figure: the loss function of the enriched deep recurrent attention model.
The loss function is the other big change. In EDRAM, the proposed loss is fully differentiable, so the whole model no longer needs reinforcement learning at all, which I think is a big contribution to the visual attention territory. The loss is shown above: the left-hand term is composed of the middle term and the right-hand term.
The classification part of the loss is just cross entropy. The location part is a weighted sum of squared errors. However, to my surprise, the location loss requires ground-truth theta values to measure the error, which I think is not ideal since ground-truth transformation parameters are hard to obtain.
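Paraphrasing the description above (this is my reading of it, not the paper’s exact formula), the per-sequence objective has the shape:

$$
\mathcal{L} \;=\; \underbrace{-\sum_{t}\log p\!\left(y_{t} \mid \mathrm{image}\right)}_{\text{classification: cross entropy}}
\;+\; \underbrace{\sum_{t}\sum_{j}\lambda_{j}\left(\theta_{t,j}-\hat{\theta}_{t,j}\right)^{2}}_{\text{location: weighted squared error}}
$$

where \(\hat{\theta}_{t,j}\) are the ground-truth transformation parameters and \(\lambda_{j}\) are per-parameter weights. Every term is differentiable with respect to the network outputs, which is why no REINFORCE-style estimator is needed.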
Conclusion
This article has traced the progress of visual attention. The applications are still mostly demonstrated on the MNIST dataset, but the attention mechanism may well be used much more widely in vision problems in the future.
On the other hand, I have re-written RAM and EDRAM with the newest TensorFlow version (1.3.0). There might be some errors, but I think it can serve as a typical example for understanding the structure of these models.
Reference
[1] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, “Recurrent Models of Visual Attention,” arXiv:1406.6247v1 [cs.LG], June 2014.
[2] J. L. Ba, V. Mnih, and K. Kavukcuoglu, “Multiple Object Recognition with Visual Attention,” arXiv:1412.7755v2 [cs.LG], Apr. 2015.
[3] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial Transformer Networks,” arXiv:1506.02025v3 [cs.CV], Feb. 2016.
[4] A. Ablavatski, S. Lu, and J. Cai, “Enriched Deep Recurrent Visual Attention Model for Multiple Object Recognition,” arXiv:1706.03581v1 [cs.CV], June 2017.