
Light on Math ML: Attention with Keras


Why Keras?

With the unveiling of TensorFlow 2.0 it is hard to ignore the conspicuous attention (no pun intended!) given to Keras. There has been a strong push to advocate Keras as the way to implement deep networks. Keras in TensorFlow 2.0 comes with three powerful APIs for implementing deep networks.

  • Sequential API — This is the simplest API, where you first call model = Sequential() and keep adding layers, e.g. model.add(Dense(...)).
  • Functional API — A more advanced API where you can create custom models with arbitrary inputs/outputs. Defining a model needs to be done a bit carefully, as there is more to be done on the user’s end. A model can be defined using model = Model(inputs=[...], outputs=[...]).
  • Subclassing API — Another advanced API where you define a model as a Python class. Here you define the forward pass of the model in the class, and Keras automatically computes the backward pass. The resulting model can then be used as you would use any other Keras model. A minimal sketch of all three APIs follows below.
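
To make the distinction concrete, here is a minimal sketch of the same one-layer model written with each API (the layer sizes and input shapes are arbitrary, purely for illustration):

from tensorflow import keras
from tensorflow.keras.layers import Dense, Input

# Sequential API: stack layers one after another
seq_model = keras.Sequential()
seq_model.add(Dense(1, input_shape=(10,)))

# Functional API: wire inputs to outputs explicitly
inp = Input(shape=(10,))
out = Dense(1)(inp)
func_model = keras.Model(inputs=inp, outputs=out)

# Subclassing API: define the forward pass in call(); Keras handles the backward pass
class MyModel(keras.Model):
    def __init__(self):
        super().__init__()
        self.dense = Dense(1)

    def call(self, x):
        return self.dense(x)

sub_model = MyModel()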

For more information, get first-hand details from the TensorFlow team. However, remember that while the more advanced APIs give you more “wiggle room” for implementing complex models, they also increase the chances of blunders and various rabbit holes.

Why this post?

Recently I was looking for a Keras-based attention layer implementation or library for a project I was working on. I grappled with several repos out there that had already implemented attention. However, my efforts to get them to work with later TF versions were in vain, for several reasons:

  • The way attention was implemented lacked modularity (attention was implemented for the full decoder instead of for individual unrolled steps of the decoder)
  • They used deprecated functions from earlier TF versions

They are great efforts and I respect all those contributors. But I thought I would step in and implement an AttentionLayer that works at a more atomic level and is up to date with newer TF versions. The repository is available here.

Note: This is an article from the Light on Math Machine Learning A-Z series. You can find the previous blog posts linked to the letters below.

A B C D * E F G H I J K L * M N O P Q R S T U V W X Y Z

Introduction

In this article, you will first grok what a sequence to sequence model is, and then see why attention is important for sequential models. Next you will learn the nitty-gritty of the attention mechanism. The post ends by explaining how to use the attention layer.

Sequence to Sequence models

Sequence to sequence models are a powerful family of deep learning models designed to take on some of the wildest problems in the realm of ML. For example:

  • Machine translation
  • Chatbots
  • Text summarization

These problems have very unique and niche challenges attached to them. For example, machine translation has to deal with different word-order typologies (i.e. subject-verb-object order). So sequence to sequence models are an indispensable weapon for combating complex NLP problems.

Let’s have a look at how a sequence to sequence model might be used for an English-French machine translation task.

A sequence to sequence model has two components, an encoder and a decoder. The encoder encodes the source sentence into a concise vector (called the context vector), and the decoder takes the context vector as an input and computes the translation using that encoded representation.

[Figure: Sequence to sequence model]

Problem with this approach?

There is a huge bottleneck in this approach. The context vector is given the responsibility of encoding all the information in a given source sentence into a vector of a few hundred elements. To give a bit of context, this vector needs to preserve:

  • Information about subject, object and verb
  • Interactions between these entities

This can be quite daunting especially for long sentences. Therefore a better solution was needed to push the boundaries.

Enter Attention!

What if, instead of relying just on the context vector, the decoder had access to all the past states of the encoder? That’s exactly what attention does. At each decoding step, the decoder gets to look at any particular state of the encoder. Here we will be discussing Bahdanau attention. The following figure depicts the inner workings of attention.

[Figure: Sequence to sequence with attention]

So as the figure depicts, the context vector becomes a weighted sum of all the past encoder states.
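
For those who want the math, here is a sketch of Bahdanau attention in the usual notation, where h_i is the i-th encoder state and s_{t-1} is the previous decoder state (the exact parameterization inside the layer may differ slightly):

e_{t,i} = v_a^{\top} \tanh(W_a s_{t-1} + U_a h_i)

\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j} \exp(e_{t,j})}

c_t = \sum_{i} \alpha_{t,i} h_i

Here W_a, U_a and v_a are learned weights, \alpha_{t,i} are the attention energies (how strongly decoding step t looks at encoder position i), and c_t is the context vector for that step, i.e. the weighted sum mentioned above.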

Introducing attention_keras

It can be quite cumbersome to get some of the attention layers available out there to work, for the reasons I explained earlier. attention_keras takes a more modular approach: it implements attention at a more atomic level (i.e. for each decoding step of a given decoder RNN/LSTM/GRU).

Using the AttentionLayer

You can use it like any other Keras layer. For example,

attn_out, attn_states = AttentionLayer(name='attention_layer')([encoder_out, decoder_out])

I have also provided a toy Neural Machine Translator (NMT) example showing how to use the attention layer in an NMT (nmt.py). Let me walk you through some of the details here.

Implementing an NMT with Attention

Here I will briefly go through the steps for implementing an NMT with Attention.
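
Before the snippets below will run, you need the relevant Keras imports plus the AttentionLayer itself. A minimal sketch, assuming the layer is importable from the repository's layers/attention.py module (adjust the import path to wherever you keep the file), and assuming batch_size, hidden_size and the en_/fr_ timestep and vocabulary sizes are defined elsewhere:

from tensorflow.keras.layers import Input, GRU, Dense, Concatenate, TimeDistributed
from tensorflow.keras.models import Model

from layers.attention import AttentionLayer  # path assumed from the attention_keras repository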

First define encoder and decoder inputs (source/target words). Both are of shape (batch_size, timesteps, vocabulary_size).

encoder_inputs = Input(batch_shape=(batch_size, en_timesteps, en_vsize), name='encoder_inputs')
decoder_inputs = Input(batch_shape=(batch_size, fr_timesteps - 1, fr_vsize), name='decoder_inputs')

Define the encoder (note that return_sequences=True )

encoder_gru = GRU(hidden_size, return_sequences=True, return_state=True, name='encoder_gru')
encoder_out, encoder_state = encoder_gru(encoder_inputs)

Define the decoder (note that return_sequences=True )

decoder_gru = GRU(hidden_size, return_sequences=True, return_state=True, name='decoder_gru')
decoder_out, decoder_state = decoder_gru(decoder_inputs, initial_state=encoder_state)

Define the attention layer. The inputs to the attention layer are encoder_out (the sequence of encoder outputs) and decoder_out (the sequence of decoder outputs).

attn_layer = AttentionLayer(name='attention_layer')
attn_out, attn_states = attn_layer([encoder_out, decoder_out])

Concatenate the attn_out and decoder_out as an input to the softmax layer.

decoder_concat_input = Concatenate(axis=-1, name='concat_layer')([decoder_out, attn_out])

Define TimeDistributed Softmax layer and provide decoder_concat_input as the input.

dense = Dense(fr_vsize, activation='softmax', name='softmax_layer')
dense_time = TimeDistributed(dense, name='time_distributed_layer')
decoder_pred = dense_time(decoder_concat_input)

Define full model.

full_model = Model(inputs=[encoder_inputs, decoder_inputs], outputs=decoder_pred)
full_model.compile(optimizer='adam', loss='categorical_crossentropy')
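
Training is then the standard Keras routine. A minimal sketch, assuming en_onehot_seq, fr_onehot_seq_in and fr_onehot_seq_out are the one-hot encoded source/target arrays you have prepared (these names are illustrative, not from nmt.py):

full_model.fit([en_onehot_seq, fr_onehot_seq_in], fr_onehot_seq_out,
               batch_size=batch_size, epochs=10)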

That’s it!

It even supports attention visualization…

Not only does this implement attention, it also gives you a way to peek under the hood of the attention mechanism quite easily. This is possible because the layer returns both,

  • Attention context vector (used as an extra input to the Softmax layer of the decoder)
  • Attention energy values (Softmax output of the attention mechanism)

for each decoding step. So by visualizing attention energy values you get full access to what attention is doing during training/inference. Below, I’ll talk about some details of this process.

Inferring from the NMT and getting attention weights

Inferring from the NMT is cumbersome, because you have to:

  • Get the encoder output
  • Define a decoder that performs a single decoding step (because we need to provide that step’s prediction as the input to the next step)
  • Use the encoder output as the initial state of the decoder
  • Perform decoding until we get an invalid word or <EOS> as the output, or for a fixed number of steps

I’m not going to talk about the model definition here. Please refer to nmt.py for details. Let’s jump into how to use this for getting the attention weights.

for i in range(20):

    # run a single decoding step: returns the word probabilities, the attention
    # energies for this step and the new decoder state
    dec_out, attention, dec_state = decoder_model.predict([enc_outs, dec_state, test_fr_onehot_seq])
    # pick the most probable word index (greedy decoding)
    dec_ind = np.argmax(dec_out, axis=-1)[0, 0]

    ...

    # collect the attention energies along with the predicted word index
    attention_weights.append((dec_ind, attention))

So as you can see, we collect the attention weights for each decoding step.
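
The elided part of the loop is where the predicted word is fed back as the input to the next decoding step. A minimal sketch of how that might look, assuming fr_index2word maps indices back to words and fr_vsize is the target vocabulary size (the exact code in nmt.py may differ):

    # convert the predicted index back into a one-hot vector for the next step
    test_fr_onehot_seq = np.zeros((1, 1, fr_vsize), dtype='float32')
    test_fr_onehot_seq[0, 0, dec_ind] = 1
    # optionally map the index back to the actual French word
    attended_word = fr_index2word[dec_ind]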

Then you just have to pass this list of attention weights, along with a few other arguments, to plot_attention_weights (in nmt.py) to get the attention heatmap. The plotted output will look like the figure below.

[Figure: Attention heatmap produced by plot_attention_weights]
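
If you prefer to build the heatmap yourself, here is a minimal sketch with matplotlib, assuming attention_weights is the list of (dec_ind, attention) tuples collected above and each attention array has shape (1, 1, en_timesteps):

import numpy as np
import matplotlib.pyplot as plt

# stack the per-step attention energies into a (decoder_steps, encoder_steps) matrix
attn_mat = np.concatenate([a.reshape(1, -1) for _, a in attention_weights], axis=0)

plt.imshow(attn_mat, cmap='hot')   # rows: decoder steps, columns: source positions
plt.xlabel('Encoder timestep')
plt.ylabel('Decoder timestep')
plt.show()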

Conclusion

In this article, I introduced you to an implementation of the AttentionLayer. Attention is very important for sequential models, and even for other types of models. However, the current implementations out there are either not up to date or not very modular. Therefore, I dug in a little and implemented an attention layer using Keras backend operations. I hope you’ll be able to do great things with this layer. If you have any questions or find any bugs, feel free to submit an issue on GitHub.

Welcoming contributors

I would be very grateful to have contributors fixing bugs or implementing new attention mechanisms. So contributions are welcome!

