
Implementing Neural Machine Translation with Attention using TensorFlow


A step-by-step explanation of a TensorFlow implementation of neural machine translation (NMT) using Bahdanau attention.

In this article, you will learn how to implement sequence-to-sequence (seq2seq) neural machine translation (NMT) using the Bahdanau attention mechanism. We will implement the code in TensorFlow 2.0 using Gated Recurrent Units (GRUs).


Prerequisites

  • Sequence to Sequence Model using Attention Mechanism
  • An Intuitive explanation of Neural Machine Translation

Neural Machine Translation (NMT) is the task of converting a sequence of words in a source language, like English, into a sequence of words in a target language, like Hindi or Spanish, using deep neural networks.

NMT is implemented with a sequence-to-sequence (seq2seq) model consisting of an Encoder and a Decoder. The Encoder encodes the complete information of the source sequence into a single real-valued vector, known as the context vector, which is passed to the Decoder to produce the output sequence in the target language, like Hindi or Spanish.

The context vector has to summarize the entire input sequence in a single vector, which is a bottleneck, so we use the Attention mechanism.

The basic idea of the Attention mechanism is to avoid learning a single vector representation for each sentence; instead, the Decoder pays attention to specific vectors of the input sequence based on attention weights.

For implementation purposes, we will use English as the source language and Spanish as the target language. The code will be implemented using TensorFlow 2.0, and the data can be downloaded from here.

Steps for implementing NMT with an Attention mechanism

  • Load the data and preprocess it by removing spaces, special characters, etc.
  • Create the dataset
  • Create the Encoder, Attention layer and Decoder
  • Create the Optimizer and Loss function
  • Train the model
  • Make inferences

Import required libraries

import pandas as pd
import numpy as np
import string
from string import digits
import matplotlib.pyplot as plt
%matplotlib inline
import tensorflow as tf
import matplotlib.ticker as ticker
from sklearn.model_selection import train_test_split
import re
import os
import io
import time

Read the data from the file

Read the file of English-Spanish translations, which can be downloaded from here.

I have stored the file in “spa.txt.”

data_path = "spa.txt"

# Read the data
lines_raw = pd.read_table(data_path, names=['source', 'target', 'comments'])
lines_raw.sample(5)
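
Each line in spa.txt is tab-separated, which is why three column names are passed to pd.read_table above: the English sentence, the Spanish sentence, and an attribution comment. Illustratively, a line looks roughly like this (not an exact row from the file):

Go. <TAB> Ve. <TAB> CC-BY 2.0 (France) Attribution: tatoeba.org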

Clean and preprocess the source and target sentences.

We apply the following text cleaning:

  • Convert the text to lower case
  • Remove quotes
  • Remove digits from the source and target sentences. If the source or the target language uses different symbols for numbers, remove those symbols too
  • Collapse multiple spaces into one
  • Add a space between a word and punctuation like "?"
  • Add a "start_" tag at the start of the sentence and an "_end" tag at the end of the sentence
def preprocess_sentence(sentence):
    num_digits = str.maketrans('', '', digits)

    sentence = sentence.lower()
    sentence = re.sub("'", '', sentence)
    sentence = sentence.translate(num_digits)
    # put a space between each word and its punctuation: "minutes?" -> "minutes ?"
    sentence = re.sub(r"([?.!,¿])", r" \1 ", sentence)
    # collapse the extra spaces introduced above
    sentence = re.sub(" +", " ", sentence)
    sentence = sentence.strip()
    sentence = 'start_ ' + sentence + ' _end'

    return sentence

Let’s take one of the English sentences and preprocess it:

print(preprocess_sentence("Can you do it in thirty minutes?"))
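
This prints:

start_ can you do it in thirty minutes ? _end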

Preprocess the source and target sentences to create word pairs in the format: [ENGLISH, SPANISH].

def create_dataset(path, num_examples):
    lines = io.open(path, encoding='UTF-8').read().strip().split('\n')
    # keep only the first two tab-separated columns (English, Spanish)
    word_pairs = [[preprocess_sentence(w) for w in l.split('\t')[:2]] for l in lines[:num_examples]]
    return zip(*word_pairs)

sample_size = 60000
source, target = create_dataset(data_path, sample_size)

Tokenize source and target sentences

We need to vectorize the text corpus: each text is converted into a sequence of integers.

We first create the tokenizer and then apply it to the source sentences:

# create a tokenizer for the source sentences
source_sentence_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
# Fit the tokenizer on the source sentences
source_sentence_tokenizer.fit_on_texts(source)

We now transform each source sentence into a sequence of integers by replacing each word with its corresponding integer value.

Only words known by the tokenizer will be taken into account

#Transforms each text in texts to a sequence of integers.
source_tensor = source_sentence_tokenizer.texts_to_sequences(source)

We need all sequences to have the same length, so we post-pad the shorter sequences with 0.

#Sequences that are shorter than num_timesteps, padded with 0 at the end.
source_tensor = tf.keras.preprocessing.sequence.pad_sequences(source_tensor, padding='post')
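
As a quick illustration of what tokenizing and padding do together, here is a standalone sketch on a toy corpus (not the article's data; the exact integer ids depend on word frequencies):

toy_texts = ['start_ hello world _end', 'start_ hello _end']
toy_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
toy_tokenizer.fit_on_texts(toy_texts)
toy_seqs = toy_tokenizer.texts_to_sequences(toy_texts)
print(toy_seqs)  # e.g. [[1, 2, 4, 3], [1, 2, 3]]
print(tf.keras.preprocessing.sequence.pad_sequences(toy_seqs, padding='post'))
# [[1 2 4 3]
#  [1 2 3 0]]  <- the shorter sequence is post-padded with 0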

Tokenize the target sentences in a similar way

# create the target sentence tokenizer
target_sentence_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
# Fit the tokenizer on the target sentences
target_sentence_tokenizer.fit_on_texts(target)
# convert the target texts to sequences of integers
target_tensor = target_sentence_tokenizer.texts_to_sequences(target)
# Post pad the shorter sequences with 0
target_tensor = tf.keras.preprocessing.sequence.pad_sequences(target_tensor, padding='post')

Create training and test dataset

Split the dataset into train and test sets: 80% of the data is used for training and 20% for testing the model.

source_train_tensor, source_test_tensor, target_train_tensor, target_test_tensor = train_test_split(source_tensor, target_tensor, test_size=0.2)
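
A quick sanity check of the split sizes (not part of the original article):

print(len(source_train_tensor), len(source_test_tensor))
# expect 48000 12000 for the 60,000-sample run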

When the dataset is big, we want to create the dataset in memory for efficiency. We use the tf.data.Dataset.from_tensor_slices() method to get slices of the arrays in the form of a dataset object.

The dataset is created in batches of 64.

# setting the BATCH SIZE
BATCH_SIZE = 64

# Create the in-memory dataset and shuffle it with a buffer covering the
# full training set (a buffer of only BATCH_SIZE would barely shuffle the data)
dataset = tf.data.Dataset.from_tensor_slices((source_train_tensor, target_train_tensor)).shuffle(len(source_train_tensor))
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

We iterate through the elements of the dataset. The returned iterator implements the Python iterator protocol and can therefore only be used in eager execution.

# Create an iterator over the dataset and extract the next element
source_batch, target_batch = next(iter(dataset))
print(source_batch.shape)

Each batch of source data will have the shape (BATCH_SIZE, max_source_length), and each batch of target data the shape (BATCH_SIZE, max_target_length). In our case, max_source_length is 11 and max_target_length is 16.

Create the sequence-to-sequence model with Bahdanau attention using a Gated Recurrent Unit (GRU)

Differences between a seq2seq model with attention and a seq2seq model without attention:

  • All hidden states of the Encoder (forward and backward) and the Decoder are used to generate the context vector, unlike seq2seq without attention, which uses only the last Encoder hidden state.
  • The attention mechanism aligns the input and output sequences, with an alignment score parameterized by a feed-forward network. It helps the model pay attention to the most relevant information in the source sequence.
  • The seq2seq attention model predicts a target word based on the context vectors associated with the source positions and the previously generated target words, unlike seq2seq without attention, which encodes the whole source sequence into a single context vector.

Setting a few parameters for the model:

BUFFER_SIZE = len(source_train_tensor)
steps_per_epoch = len(source_train_tensor)//BATCH_SIZE
embedding_dim = 256
units = 1024
source_vocab_size = len(source_sentence_tokenizer.word_index)+1
target_vocab_size = len(target_sentence_tokenizer.word_index)+1

Create the Encoder

The Encoder takes the source tokens as input and passes them to an embedding layer for a dense vector representation, which is then passed to the GRU.

Set return_sequences and return_state to True for the GRU. By default, return_sequences is set to False. When return_sequences is set to True, the GRU returns the entire sequence of outputs from all the time steps in the Encoder. When return_sequences is set to False, only the output of the last Encoder time step is returned.

seq2seq without Attention has the Encoder's return_sequences set to False. seq2seq with Attention has return_sequences set to True for the Encoder.

To return the internal state of the GRU, we set return_state to True.

class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, encoder_units, batch_size):
        super(Encoder, self).__init__()
        self.batch_size = batch_size
        self.encoder_units = encoder_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(encoder_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')

    def call(self, x, hidden):
        # pass the input x to the embedding layer
        x = self.embedding(x)
        # pass the embeddings and the hidden state to the GRU
        output, state = self.gru(x, initial_state=hidden)
        return output, state

    def initialize_hidden_state(self):
        return tf.zeros((self.batch_size, self.encoder_units))

Testing the Encoder class and printing the dimensions of the Encoder’s output and hidden state

encoder = Encoder(source_vocab_size, embedding_dim, units, BATCH_SIZE)
sample_hidden = encoder.initialize_hidden_state()
sample_output, sample_hidden = encoder(source_batch, sample_hidden)
print('Encoder output shape: (batch size, sequence length, units) {}'.format(sample_output.shape))
print('Encoder Hidden state shape: (batch size, units) {}'.format(sample_hidden.shape))
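
With the parameters above (BATCH_SIZE=64, units=1024, max_source_length=11), this should print shapes of (64, 11, 1024) for the output sequence and (64, 1024) for the hidden state.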

Create the Bahdanau Attention layer

The Attention layer consists of:

  • Alignment Score
  • Attention weights
  • Context vector

We will implement these simplified equations in the Attention layer. Written in the notation of the code below, Bahdanau's attention equations are:

score = V(tanh(W1(encoder_output) + W2(decoder_hidden)))
attention_weights = softmax(score, axis=1)
context_vector = sum(attention_weights * encoder_output, axis=1)
class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)  # applied to the encoder output
        self.W2 = tf.keras.layers.Dense(units)  # applied to the decoder hidden state
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # query: decoder hidden state, values: encoder output.
        # add a time axis so the hidden state broadcasts against the
        # encoder output: (batch_size, 1, units)
        hidden_with_time_axis = tf.expand_dims(query, 1)

        # calculate the attention score
        score = self.V(tf.nn.tanh(self.W1(values) + self.W2(hidden_with_time_axis)))

        # attention_weights shape == (batch_size, max_length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)

        # context vector: weighted sum of the encoder output across the time axis
        context_vector = attention_weights * values
        context_vector = tf.reduce_sum(context_vector, axis=1)
        return context_vector, attention_weights

Test the Bahdanau attention layer with ten units

attention_layer= BahdanauAttention(10)
attention_result, attention_weights = attention_layer(sample_hidden, sample_output)
print("Attention result shape: (batch size, units) {}".format(attention_result.shape))
print("Attention weights shape: (batch_size, sequence_length, 1) {}".format(attention_weights.shape))

Create the Decoder

The Decoder has an embedding layer, a GRU layer, and a fully connected layer.

To predict the target word, the Decoder uses:

  • The context vector: the Attention-weighted sum of the Encoder output
  • The Decoder's output from the previous time step, and
  • The previous Decoder hidden state
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, decoder_units, batch_sz):
        super(Decoder, self).__init__()
        self.batch_sz = batch_sz
        self.decoder_units = decoder_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(decoder_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')

        # Fully connected layer projecting to the target vocabulary
        self.fc = tf.keras.layers.Dense(vocab_size)

        # attention layer
        self.attention = BahdanauAttention(self.decoder_units)

    def call(self, x, hidden, encoder_output):
        # compute the context vector from the decoder hidden state and the encoder output
        context_vector, attention_weights = self.attention(hidden, encoder_output)

        # pass the previous target token through the embedding layer
        x = self.embedding(x)

        # concatenate the context vector and the embedding
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

        # pass the concatenated vector to the GRU
        output, state = self.gru(x)

        # output shape == (batch_size * 1, hidden_size)
        output = tf.reshape(output, (-1, output.shape[2]))

        # pass the output through the fully connected layer
        x = self.fc(output)
        return x, state, attention_weights

Testing the Decoder

decoder= Decoder(target_vocab_size, embedding_dim, units, BATCH_SIZE)
sample_decoder_output, _, _= decoder(tf.random.uniform((BATCH_SIZE,1)), sample_hidden, sample_output)

print ('Decoder output shape: (batch_size, vocab size) {}'.format(sample_decoder_output.shape))
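
This should print (64, target vocab size): one vector of logits over the target vocabulary per batch element.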

Define the optimizer

We use the Adam optimizer here; you can try RMSprop too.

#Define the optimizer and the loss function
optimizer = tf.keras.optimizers.Adam()

Define the loss function

Use SparseCategoricalCrossentropy to compute the loss between the actual and the predicted output.

If the output were a one-hot encoded vector, we would use categorical_crossentropy. Use the SparseCategoricalCrossentropy loss for word2index vectors containing integers.

SparseCategoricalCrossentropy is computationally and memory efficient, as it uses a single integer for a class, like [3], rather than a whole one-hot vector, like [0 0 0 1].

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')

def loss_function(real, pred):
    # mask out the padding token (index 0) so it does not contribute to the loss
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_mean(loss_)
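
A small sanity check of the masking behavior (illustrative values, not part of the original article):

real = tf.constant([0])            # a padding position
pred = tf.random.uniform((1, 10))  # logits over a 10-word toy vocabulary
print(loss_function(real, pred))   # tf.Tensor(0.0, ...): padding is masked out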

Train the dataset

To train the dataset using the Encoder-Decoder model:

  1. Pass the encoded source sentences through the Encoder and return the Encoder output sequences and the hidden states
  2. The Encoder output, Encoder hidden state, and Decoder input are passed to the Decoder. At time step 0, the Decoder takes 'start_' as the input.
  3. The Decoder returns the predicted word and the Decoder hidden state
  4. The Decoder hidden state is passed back to the model, and the predicted word is used for calculating the loss
  5. For training, we use teacher forcing: we pass the actual word to the Decoder at each time step
  6. During inference, we pass the predicted word from the previous time step as the input to the Decoder
  7. Calculate the gradients, apply them with the optimizer, and backpropagate

TensorFlow records operations on tf.Variables so it can compute gradients. To train, we use tf.GradientTape, which lets us control the parts of the code where gradient information is needed. The loss is computed on the Decoder's output, and the gradients are taken with respect to both the Encoder's and the Decoder's trainable variables.

def train_step(inp, targ, enc_hidden):
    loss = 0
    with tf.GradientTape() as tape:
        # run the encoder over the source batch
        enc_output, enc_hidden = encoder(inp, enc_hidden)
        dec_hidden = enc_hidden

        # the first input to the decoder is the 'start_' token
        dec_input = tf.expand_dims(
            [target_sentence_tokenizer.word_index['start_']] * BATCH_SIZE, 1)

        # Teacher forcing - feeding the target as the next input
        for t in range(1, targ.shape[1]):
            # passing enc_output to the decoder
            predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_output)
            # accumulate the masked loss defined above
            loss += loss_function(targ[:, t], predictions)
            # using teacher forcing
            dec_input = tf.expand_dims(targ[:, t], 1)

    batch_loss = (loss / int(targ.shape[1]))
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    return batch_loss

Training the Encoder-Decoder model with attention for multiple epochs:

EPOCHS = 20

for epoch in range(EPOCHS):
    start = time.time()
    enc_hidden = encoder.initialize_hidden_state()
    total_loss = 0

    # train the model using the data in batches
    for (batch, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
        batch_loss = train_step(inp, targ, enc_hidden)
        total_loss += batch_loss
        if batch % 100 == 0:
            print('Epoch {} Batch {} Loss {}'.format(epoch + 1, batch, batch_loss.numpy()))

    print('Epoch {} Loss {}'.format(epoch + 1, total_loss / steps_per_epoch))
    print('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

Making Inferences for the test data

Making inferences is similar to training, except that we do not know the actual word used in teacher forcing, so we pass the predicted word from the previous time step as input to the Decoder.

We calculate the Attention weights at each time step as it helps to pay attention to the most relevant information in the source sequence that is used to make the prediction.

We stop predicting words either when we have reached the maximum target sentence length or when we have encountered the "_end" tag.

#Calculating the max length of the source and target sentences
max_target_length= max(len(t) for t in target_tensor)
max_source_length= max(len(t) for t in source_tensor)
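
Since source_tensor and target_tensor are already padded, every row has the same length, so these are simply the padded sequence lengths (11 and 16 in our run).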

For making an inference:

  • Pass in the source sentence
  • Preprocess the sentence: convert it to lower case, remove extra spaces and special characters, put a space between words and punctuation, etc.
  • Convert the sentence to a sequence of integers using the word2index dictionary
  • Post-pad the source sequence with 0 to the same length as the max source sentence
  • Create the input tensors
  • Pass the input tensor to the Encoder along with the hidden state. The initial hidden state is set to zero
  • The first input to the Decoder is the "start_" tag. The initial hidden state of the Decoder is the Encoder hidden state
  • Pass the Decoder input, Decoder hidden state, and Encoder output to the Decoder
  • Store the Attention weights and, from the Decoder output, find the integer of the word with the maximum probability
  • Convert the integer to a word and keep appending the predicted words to form the target sentence, until we encounter the '_end' tag or reach the max target sequence length
def evaluate(sentence):
    attention_plot = np.zeros((max_target_length, max_source_length))

    # preprocess the sentence
    sentence = preprocess_sentence(sentence)

    # convert the sentence to a sequence of integers based on the word2index dictionary
    inputs = [source_sentence_tokenizer.word_index[i] for i in sentence.split(' ')]

    # pad the sequence
    inputs = tf.keras.preprocessing.sequence.pad_sequences([inputs], maxlen=max_source_length, padding='post')

    # convert to tensors
    inputs = tf.convert_to_tensor(inputs)

    result = ''

    # run the encoder with a zero initial hidden state
    hidden = [tf.zeros((1, units))]
    encoder_output, encoder_hidden = encoder(inputs, hidden)

    # the decoder starts with the encoder hidden state and the 'start_' token
    decoder_hidden = encoder_hidden
    decoder_input = tf.expand_dims([target_sentence_tokenizer.word_index['start_']], 0)

    for t in range(max_target_length):
        predictions, decoder_hidden, attention_weights = decoder(decoder_input, decoder_hidden, encoder_output)

        # store the attention weights for plotting
        attention_weights = tf.reshape(attention_weights, (-1,))
        attention_plot[t] = attention_weights.numpy()

        prediction_id = tf.argmax(predictions[0]).numpy()
        result += target_sentence_tokenizer.index_word[prediction_id] + ' '

        if target_sentence_tokenizer.index_word[prediction_id] == '_end':
            return result, sentence, attention_plot

        # the predicted id is fed back as input to the decoder
        decoder_input = tf.expand_dims([prediction_id], 0)

    return result, sentence, attention_plot

A function to plot the Attention weights between the source words and the target words. The plot helps us understand which source word was given greater attention when predicting each target word.

def plot_attention(attention, sentence, predicted_sentence):
    fig = plt.figure(figsize=(10, 10))
    ax = fig.add_subplot(1, 1, 1)
    ax.matshow(attention, cmap='Greens')
    fontdict = {'fontsize': 10}

    ax.set_xticklabels([''] + sentence, fontdict=fontdict, rotation=90)
    ax.set_yticklabels([''] + predicted_sentence, fontdict=fontdict)
    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))
    plt.show()

Translate the source sentence into the target language

To translate a source sentence into the target language, we call the evaluate function, which runs the trained Encoder, Decoder, and Attention layer:

def translate(sentence):
    result, sentence, attention_plot = evaluate(sentence)

    print('Input: %s' % (sentence))
    print('Predicted sentence: {}'.format(result))

    attention_plot = attention_plot[:len(result.split(' ')), :len(sentence.split(' '))]
    plot_attention(attention_plot, sentence.split(' '), result.split(' '))

The final prediction

translate(u'I am going to work.')

The attention plot for the translated sentence


In the plot, we see that "going" was given greater attention when predicting "voy"; similarly, "work" was given higher attention when predicting "trabajar".

The code is available on GitHub.

References

Bahdanau attention: https://arxiv.org/pdf/1409.0473.pdf

https://www.tensorflow.org/tutorials/text/nmt_with_attention

https://www.tensorflow.org/guide/keras/rnn

