

Visual Attention in Deep Learning
source link: https://medium.com/@sunnerli/visual-attention-in-deep-learning-77653f611855

Introduction
It’s well known that CNNs became the dominant model after AlexNet beat its competitors and won the ImageNet competition in 2012. However, the sliding-window style of convolution has not changed since then. DeepMind instead proposed the idea of attention, trying to imitate the way humans actually look at images. In this article, I will walk from the original attention model to more recent ones, in order.
RAM: an interesting idea
In a convolution layer, the kernel slides over the whole feature map in a sliding-window fashion. Receptive fields and shared weights are the two main advantages of the CNN; the slow, exhaustive sliding window is the major disadvantage.
DeepMind tried to describe how a human actually looks at an object in front of them. Take the following picture as an example: what is the process by which you know there is a dog in the picture?
Figure: an image containing a dog.
In an ordinary CNN, the kernel starts scanning from the top-left corner and slides all the way to the right edge, row by row. Even if the kernel has already seen the dog in the middle of the image, it keeps sliding to the right edge as usual. But is that really how you recognized the dog?
I think the answer is no. After a person recognizes the dog’s head, they will probably look at the body or the tail next rather than at the white floor on the right. In other words, a human decides where to look next instead of scanning every remaining region.
DeepMind gave this habit a creative name: attention. They tried to imitate the way people pay attention to the object they are focusing on. As a result, the recurrent attention model (RAM) was proposed [1].
Figure: the structure of the recurrent attention model (re-drawn).
The figure above, which I re-drew myself, shows the structure of RAM. There are four parts in RAM: the glimpse network, the core network, the action network and the location network. In DeepMind’s formulation, the model has only limited bandwidth to observe the image, so it should learn to generate a more accurate location to look at in the next time step. This idea reduces the computation compared with sliding over every region.
Another property of this paper is that DeepMind turns this vision problem into a reinforcement learning task! In reinforcement learning terms, the agent interacts with the environment (the whole image). At each step, the agent classifies the small observed patch into a label and chooses the next action (the next location). The agent’s goal is to maximize its reward, i.e. to classify the label correctly.
Figure: the process of the glimpse sensor.
In RAM, the image and a location coordinate are first fed into the glimpse sensor to produce a retina-like representation. Second, the location coordinate and the retina-like representation are merged in the glimpse network to produce a glimpse vector. Next, an LSTM unit computes the new state from the previous state vector and the glimpse vector. Finally, the action network and the location network generate the predicted label and the next location coordinate, respectively. The figure above shows how the glimpse sensor generates the retina-like representation.
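As a rough illustration of the retina-like representation, here is a minimal NumPy sketch of a glimpse sensor: it crops patches of increasing size around the requested location and down-samples them to a common resolution. The function name, patch sizes and the crude strided down-sampling are my own choices, not the paper’s.

```python
import numpy as np

def glimpse_sensor(image, center, base_size=8, num_scales=3):
    """Crop multi-resolution patches around `center` and down-sample them to a
    common size, mimicking the retina-like representation.  `image` is a 2-D
    array and `center` is an integer (row, col) position in pixels."""
    r, c = center
    patches = []
    for s in range(num_scales):
        size = base_size * 2 ** s          # each scale doubles the field of view
        half = size // 2
        padded = np.pad(image, half)       # zero-pad so crops near the border stay valid
        crop = padded[r:r + size, c:c + size]
        step = 2 ** s                      # crude strided down-sampling back to base_size
        patches.append(crop[::step, ::step])
    # concatenate the scales into one flat "retina-like" vector
    return np.concatenate([p.ravel() for p in patches])

# e.g. an 8x8 + 16x16 + 32x32 view around pixel (30, 40) of a 60x60 image
retina = glimpse_sensor(np.random.rand(60, 60), (30, 40))
print(retina.shape)                        # (3 * 8 * 8,) = (192,)
```

The glimpse network then combines this vector with an embedding of the location coordinate before it is fed to the LSTM.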
In this last paragraph about RAM, I want to briefly talk about the loss function. The goal of RAM is to maximize the log-likelihood of taking the correct actions. However, the expectation involved is hard to compute exactly, so Monte Carlo sampling is adopted to approximate it.
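Concretely, the policy-gradient (REINFORCE) estimate used in [1] looks roughly like this, where M episodes are sampled, u_t^i is the action taken at step t of episode i, and R^i is the reward of that episode:

$$
\nabla_\theta J(\theta) \;\approx\; \frac{1}{M}\sum_{i=1}^{M}\sum_{t=1}^{T} \nabla_\theta \log \pi\!\left(u_t^{i} \mid s_{1:t}^{i};\, \theta\right) R^{i}
$$

In practice a learned baseline is subtracted from R^i to reduce the variance of this estimate.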
DRAM: make the task more general
In RAM, the glimpse is defined as multi-resolution crops of the image, and the combination of deep learning and reinforcement learning was a creative idea at the time. However, there is a critical limitation: RAM was only shown to solve very simple classification problems. The comment in [2] criticizing RAM is blunt:
While RAM was shown to learn successful gaze strategies on cluttered digit classification tasks and on a toy visual control problem it was not shown to scale to real-world image tasks or multiple objects.
Simply put, DeepMind wanted to broaden the use case this time: the goal is to recognize multiple objects in a single image. As a result, the deep recurrent attention model (DRAM) was born [2].
Figure: the structure of the deep recurrent attention model.
There are five parts in DRAM: the glimpse network, the recurrent network, the context network, the classification network and the emission network. The biggest difference between RAM and DRAM is that DRAM uses two stacked LSTM units: the first is responsible for the classification task and the second for predicting the location.
Another point you should be aware of: the initial state of the second LSTM unit is generated by the context network. The context network is composed of three convolution layers applied to a down-sampled, coarse version of the image. This coarse image provides hints for emitting the location coordinate. On the other hand, we want the model to predict the label from the glimpses alone, so a zero initial state is adopted only for the first LSTM unit.
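A minimal sketch of such a context network is shown below; the coarse-image size and filter counts are my own guesses, not the paper’s.

```python
import tensorflow as tf

# Rough sketch of a context network: three convolutions over the down-sampled
# coarse image are flattened into the vector that initialises the second
# (location) LSTM.  All layer sizes here are illustrative only.
context_net = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 1)),               # coarse, low-resolution image
    tf.keras.layers.Conv2D(16, 5, strides=2, activation="relu"),
    tf.keras.layers.Conv2D(32, 3, strides=2, activation="relu"),
    tf.keras.layers.Conv2D(64, 3, strides=2, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128),                      # initial state of the location LSTM
])

init_state = context_net(tf.zeros([1, 32, 32, 1]))   # shape (1, 128)
```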
In RAM, the model only predicts a single label for the whole image; in DRAM, the model emits a label sequence for the multiple objects, much like a seq2seq model. The detection process continues until the recurrent network produces a stop signal. A structural sketch of this decoding loop is shown below.
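The following is only a structural sketch of that loop: plain tanh recurrences stand in for the paper’s LSTM cells, and every matrix, size and the glimpse_fn callable are placeholders of my own, not the actual DRAM architecture.

```python
import numpy as np

D, C = 128, 11                         # hidden size; 10 digit classes + 1 "stop" label
rng = np.random.default_rng(0)
P = {name: rng.normal(scale=0.1, size=shape) for name, shape in {
    "W_g": (D, D), "W_c": (D, D), "W_u": (D, D),
    "W_l": (D, D), "W_y": (C, D), "W_e": (2, D)}.items()}

def dram_decode(glimpse_fn, context_vec, max_steps=8, stop_label=C - 1):
    """Emit a label sequence until the stop label (or max_steps) is reached."""
    h_cls = np.zeros(D)                # classification RNN starts from zeros
    h_loc = np.tanh(context_vec)       # location RNN starts from the context network
    loc, labels = np.zeros(2), []
    for _ in range(max_steps):
        g = glimpse_fn(loc)                          # glimpse network output at `loc`
        h_cls = np.tanh(P["W_g"] @ g + P["W_c"] @ h_cls)
        h_loc = np.tanh(P["W_u"] @ h_cls + P["W_l"] @ h_loc)
        label = int(np.argmax(P["W_y"] @ h_cls))     # classification network
        loc = np.tanh(P["W_e"] @ h_loc)              # emission network: next location
        if label == stop_label:                      # stop signal ends decoding
            break
        labels.append(label)
    return labels

print(dram_decode(lambda loc: np.ones(D), np.ones(D)))
```

The point to notice is the division of labour: h_cls (started from zeros) drives the predicted labels, while h_loc (started from the context network output) drives the next glimpse location.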
STN: make the CNN more robust
In classical computer vision, spatial invariance is an important property. The following image shows Haar-like features, popular hand-crafted patterns associated with spatial invariance. What about deep learning models: are they capable of spatial invariance?
Figure: Haar-like features.
The answer is yes, but not very well. In the max-pooling mechanism, the model selects the most representative pixel in each window, so if the same feature shifts to a neighbouring row or column inside the window, the pooled output does not change; that is where the spatial invariance comes from. However, the lower layers cannot learn this property, and since the pooling window is very small, each layer has only a limited ability to absorb spatial differences. The toy example below makes this concrete.
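A tiny NumPy demonstration of this limit (the arrays and the helper are mine, purely for illustration): a one-pixel shift inside a 2x2 pooling window is absorbed, while a shift across the window boundary changes the output.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2-D array."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

a = np.zeros((4, 4)); a[0, 0] = 1.0   # a feature at (0, 0)
b = np.zeros((4, 4)); b[1, 1] = 1.0   # the same feature shifted inside its 2x2 window
c = np.zeros((4, 4)); c[2, 2] = 1.0   # shifted across the window boundary

print(max_pool_2x2(a))                # identical to max_pool_2x2(b): the shift is absorbed
print(max_pool_2x2(b))
print(max_pool_2x2(c))                # different: pooling only tolerates small shifts
```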
This time, DeepMind asked another question: can we add a layer that helps the CNN learn spatial invariance? The answer is yes, and the spatial transformer network (STN) is the result [3]. DeepMind’s idea is to apply an image transformation (e.g. an affine transformation) to the feature map.
Figure: the structure of the spatial transformer network.
The job of the spatial transformer is to transform a feature map into another representation. There are three parts in the STN: the localisation network, the grid generator and the sampler. The localisation network, composed of fully-connected or convolution layers, regresses the transformation parameters.
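For instance, a toy localisation network might look like the sketch below (the layer sizes are my own choice, not the paper’s); in practice its last layer is usually initialised so that the predicted transform starts as the identity.

```python
import tensorflow as tf

# Toy localisation network: it regresses the six affine parameters theta
# from the input feature map.
loc_net = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(32, activation="tanh"),
    tf.keras.layers.Dense(6),          # theta = [t11, t12, t13, t21, t22, t23]
])

theta = loc_net(tf.zeros([1, 28, 28, 1]))   # shape (1, 6)
```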
Figure: the spatial transformation process.
The second part is the grid generator. Once we have the parameters of the affine transformation, we can compute the corresponding coordinate in the opposite direction. Note that the input of the transformation function is a coordinate of the target feature map! But why the target coordinate? After all, it is the source feature map whose pixel values we know, not the target’s!
Figure: the computing formula of the affine transformation.
This is the tricky part. DeepMind computes, for every target pixel, where it came from in the source. So the only task of the grid generator is to compute, for each target pixel, the corresponding coordinate in the source feature map.
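Written out (roughly following the notation of [3]), the mapping computed by the grid generator is:

$$
\begin{pmatrix} x^{s} \\ y^{s} \end{pmatrix}
= A_{\theta}\begin{pmatrix} x^{t} \\ y^{t} \\ 1 \end{pmatrix}
= \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix}
\begin{pmatrix} x^{t} \\ y^{t} \\ 1 \end{pmatrix}
$$

where (x^t, y^t) is a coordinate of the regular grid over the target feature map and (x^s, y^s) is the source coordinate it is sampled from. Padding the target coordinate with a trailing 1 is what lets θ13 and θ23 act as a translation.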
Figure: the computing formula of bi-linear interpolation.
So what intensity should each pixel of the target feature map take? That is the sampler’s duty: it generates the intensity at each target coordinate by bi-linear interpolation.
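Roughly following the notation of [3], the sampled value at target location i is a weighted sum over the source pixels:

$$
V_{i} = \sum_{n=1}^{H}\sum_{m=1}^{W} U_{nm}\,\max\!\left(0,\, 1-\left|x_{i}^{s}-m\right|\right)\max\!\left(0,\, 1-\left|y_{i}^{s}-n\right|\right)
$$

where H and W are the height and width of the source feature map, U_{nm} is the intensity of the source pixel at row n and column m, and (x_i^s, y_i^s) is the source coordinate produced by the grid generator. Although the sum formally runs over every source pixel, the max(0, 1 − |·|) terms are non-zero only for the pixels immediately around (x_i^s, y_i^s).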
A simple demonstration of the STN computation
Here we show a simple example so that you can understand how the sampler generates the pixel intensity.
Figure: the example image.
The image above is a 4*4 image, which we assume is gray-scale. The right part of the figure shows the actual image corresponding to the values on the left.
Figure: the intensity matrix of the example image.
To simplify the explanation, we can rewrite the image in matrix form. The figure above shows the 4*4 matrix of intensities.
Figure: the coordinate of each grid cell.
We assume the coordinates shown above, with the top-left cell at position (0, 0).
We can regard each grid cell as a single pixel, with its center point representing that pixel’s coordinate. As the image above shows, the pink points at the centers of the cells represent the pixels.
Figure: the example of the affine transformation.
Suppose we have already determined the theta parameters of the affine transformation. In the image above, the left-most matrix is the transformation matrix and the middle one is the target coordinate. To allow a shift, we pad the coordinate with a trailing 1 so that it becomes a vector of length 3. After applying the affine transformation, the source coordinate is [2.5, 2.5].
The image above illustrates this computation again: the affine transformation maps the target coordinate [1, 1] to the source coordinate [2.5, 2.5]. However, there is no pink point at that position! The coordinate is a fractional value, so how do we determine the intensity there? We use bi-linear interpolation.
You can read the bi-linear interpolation formula from the image above: each pixel contributes some weight to this fractional point. Traditional bi-linear interpolation only considers the nearest neighbours around the coordinate, but the formulation DeepMind writes down is more general: the sum runs over all source pixels, so in principle every pixel can influence this point (with the bi-linear kernel, only the nearest neighbours actually receive non-zero weight).
Figure: the idea of bi-linear interpolation over all pixels, drawn in 3D.
I redraw the situation as a 3D image, where the z-axis represents each pixel’s level of influence on the point. Simply put, the intensity of the point is the weighted sum, over all pink points, of each pixel’s intensity multiplied by its distance-based weight.
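To make the arithmetic concrete, here is a small NumPy reproduction of this example. The intensity values and the theta parameters are my own assumptions (the originals only appear in the figures); theta is chosen so that the target coordinate (1, 1) lands on the source coordinate (2.5, 2.5), and coordinates are treated as (x, y) = (column, row) with the top-left pixel at (0, 0).

```python
import numpy as np

# The 4x4 intensity matrix in the figure is not reproduced in the text,
# so the values below are made up for illustration only.
U = np.array([[0.0, 0.2, 0.4, 0.6],
              [0.1, 0.3, 0.5, 0.7],
              [0.2, 0.4, 0.6, 0.8],
              [0.3, 0.5, 0.7, 0.9]])

# An assumed affine matrix that sends target (1, 1) to source (2.5, 2.5):
# identity scaling plus a translation of 1.5 along both axes.
theta = np.array([[1.0, 0.0, 1.5],
                  [0.0, 1.0, 1.5]])

x_t, y_t = 1.0, 1.0
x_s, y_s = theta @ np.array([x_t, y_t, 1.0])          # grid generator -> (2.5, 2.5)

# Sampler: every source pixel gets weight max(0, 1-|dx|) * max(0, 1-|dy|),
# which is non-zero only for the four neighbours of (2.5, 2.5).
rows, cols = np.meshgrid(np.arange(4), np.arange(4), indexing="ij")
weights = np.maximum(0, 1 - np.abs(cols - x_s)) * np.maximum(0, 1 - np.abs(rows - y_s))
value = np.sum(weights * U)

print(x_s, y_s)    # 2.5 2.5
print(value)       # 0.75: the average of the four pixels around (2.5, 2.5)
```

For this theta the four neighbours of (2.5, 2.5) each get weight 0.25, so the sampled value is simply their average.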
EDRAM: throw the RL away!
In 2017, a group from NTU in Singapore combined the STN with DRAM and proposed a new model: the enriched deep recurrent attention model (EDRAM) [4].
Figure: the structure of the enriched deep recurrent attention model.
The image above shows the structure of EDRAM. As you can see, there is no big structural change; the main difference is that EDRAM adds the attention mechanism (a spatial transformer).
Figure: the loss function of the enriched deep recurrent attention model.
The loss function is the other big change. In EDRAM, the proposed loss is fully differentiable, so the whole model no longer needs reinforcement learning at all, which I think is a big contribution to the visual attention territory. The loss is shown above: the left-hand term is composed of the middle term and the right-hand term.
The classification part of the loss is just cross entropy. The location part is a weighted sum of squared errors. However, to my surprise, the location loss requires ground-truth theta values to measure the error, which I think is not ideal since ground-truth transformation parameters are hard to obtain.
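Paraphrasing the description above (this is my reading of it, not the paper’s exact formula), the per-sequence objective has the shape:

$$
\mathcal{L} \;=\; \underbrace{-\sum_{t}\log p\!\left(y_{t} \mid \mathrm{image}\right)}_{\text{classification: cross entropy}}
\;+\; \underbrace{\sum_{t}\sum_{j}\lambda_{j}\left(\theta_{t,j}-\hat{\theta}_{t,j}\right)^{2}}_{\text{location: weighted squared error}}
$$

where \(\hat{\theta}_{t,j}\) are the ground-truth transformation parameters and \(\lambda_{j}\) are per-parameter weights. Every term is differentiable with respect to the network outputs, which is why no REINFORCE-style estimator is needed.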
Conclusion
This article has traced the progress of visual attention. The applications are still mostly demonstrated on the MNIST dataset, but the attention mechanism may well be used much more widely in vision problems in the future.
On the other hand, I have re-written RAM and EDRAM with the newest TensorFlow version (1.3.0). There might be some errors, but I think it can serve as a typical example for understanding the structure of these models.
Reference
[1] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu, “Recurrent Models of Visual Attention,” arXiv:1406.6247v1 [cs.LG], June 2014.
[2] J. L. Ba, V. Mnih, and K. Kavukcuoglu, “Multiple Object Recognition with Visual Attention,” arXiv:1412.7755v2 [cs.LG], Apr. 2015.
[3] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, “Spatial Transformer Networks,” arXiv:1506.02025v3 [cs.CV], Feb. 2016.
[4] A. Ablavatski, S. Lu, and J. Cai, “Enriched Deep Recurrent Visual Attention Model for Multiple Object Recognition,” arXiv:1706.03581v1 [cs.CV], June 2017.