
Recurrent Neural Networks (RNN) Explained — the ELI5 way

source link: https://towardsdatascience.com/recurrent-neural-networks-rnn-explained-the-eli5-way-3956887e8b75?gi=f7553712f676



Photo by Michael Fruehmann on Unsplash

Sequence modeling is the task of predicting what word/letter comes next. Sequence models compute the probability of occurrence of a number of words in a particular sequence. Unlike a feed-forward network (FNN) or a CNN, in sequence modeling the current output depends not only on the current input but also on the previous inputs. In a sequence model, the length of the input is not fixed.
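For instance, a sequence model can score a whole sentence by chaining conditional word probabilities (a standard illustration of the idea, not an example from the lectures):

$$
P(\text{the cat sat}) = P(\text{the}) \cdot P(\text{cat} \mid \text{the}) \cdot P(\text{sat} \mid \text{the, cat})
$$

Each factor conditions on all the words that came before it, which is exactly the "current output depends on previous inputs" behaviour an RNN is built to capture.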

Citation Note: The content and the structure of this article are based on my understanding of the deep learning lectures from One-Fourth Labs — PadhAI.

Recurrent Neural Networks

Recurrent Neural Networks (RNNs) are a type of neural network where the output from the previous step is fed as input to the current step.


RNNs are mainly used for:

  • Sequence Classification — Sentiment Classification & Video Classification
  • Sequence Labelling — Part of speech tagging & Named entity recognition
  • Sequence Generation — Machine translation & Transliteration

Sequence Classification

In this section, we will discuss how we can use an RNN for the task of sequence classification. In sequence classification, we are given a corpus of sentences and the corresponding labels, i.e. the sentiment of each sentence, either positive or negative.


In this scenario, we don’t need an output after every word of the input; rather, we just need to determine the mood after reading the entire sentence, i.e. either positive or negative.


As you can see from the above figure, the input sentences are not of equal length. Before we feed the data into the RNN, we need to pre-process it so that the input sequences are of equal length (the input matrix will then have a fixed dimension of m×n). The input words should also be converted into one-hot representation vectors.

Pre-Processing Data

In pre-processing, we define a few special tokens, such as start-of-sequence and end-of-sequence.


Every input sequence is prefixed with a “start-of-sequence” token <sos> to indicate the beginning of the sequence, and an “end-of-sequence” token <eos> is appended to mark its end. Since all sequences must have the same length, as defined by the corresponding input layer, padding is applied where needed.

The way we apply padding is as follows:

  • Find the maximum input length across all the sequences (say, 10).
  • Append the special token <pad> to all shorter sequences so that they reach the same length (10, in this case).

Once we are done with the pre-processing (adding the special tokens), we convert these words, including the special tokens, into one-hot vector representations and feed them into the network.

Important points to note about padding:

  • Padding is done only to ensure that the input sequences are of uniform size.
  • The computations in the RNN are performed only up to the “end-of-sequence” token, i.e. the padding is not treated as real input by the network.
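As a concrete illustration, here is a minimal sketch of this pre-processing step in plain Python; the tiny vocabulary, the sentences, and the helper function are hypothetical toy examples and only mirror the steps described above:

```python
# Minimal sketch of padding + one-hot encoding for RNN input.
# The vocabulary and sentences are toy examples, not from the article.

sentences = [
    ["the", "movie", "was", "great"],
    ["awful", "film"],
]

SOS, EOS, PAD = "<sos>", "<eos>", "<pad>"

# 1. Wrap every sequence with <sos> ... <eos>.
wrapped = [[SOS] + s + [EOS] for s in sentences]

# 2. Pad all sequences to the maximum length.
max_len = max(len(s) for s in wrapped)
padded = [s + [PAD] * (max_len - len(s)) for s in wrapped]

# 3. Build a vocabulary and convert each word to a one-hot vector.
vocab = sorted({w for s in padded for w in s})
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

# Shape: (num_sentences, max_len, vocab_size)
encoded = [[one_hot(w) for w in s] for s in padded]

# During the forward pass, computation would stop at <eos>; the <pad>
# positions exist only to keep the input matrix rectangular.
```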

Sequence Labelling

Part-of-speech tagging is the task of labeling (predicting) the part-of-speech tag for each word in a sequence. Again in this problem, the output at the current time step depends not only on the current input (the current word) but also on the previous input. For example, the probability of tagging the word ‘movie’ as a noun would be higher if we know that the previous word is an adjective.

Unlike the problem of sequence classification, in sequence labeling we have to predict an output at each time step, for every word occurring in the sequence. As we can see from the image, since we have 6 words in the first sequence, we get 6 predictions for their parts of speech based on the structure of the sentence.

Since our input sequences are of varying length, we have to pre-process the data so that the input sequences are of equal length. Remember that the RNN will process the sequence of words only after it encounters the “start-of-sequence” <sos> token, and the “end-of-sequence” <eos> token signals to the network that the input has ended and the output needs to be finalized.

Model

In the previous sections, we discussed some of the tasks where an RNN can be used, along with the pre-processing steps to perform before feeding data into the model. In this section, we will discuss how to model (i.e. approximate) the true relationship between input and output.

Sequence Classification

As we already know, in sequence classification the output depends on the entire sequence, e.g. predicting the sentiment of a movie by analyzing its reviews.


The input to the function is denoted in orange and represented as xᵢ. The weights associated with the input are denoted by the matrix U, and the hidden representation sᵢ of the word is computed as a function of the output of the previous time step and the current input, along with a bias. The hidden representation is computed up to the length of the sequence (sₜ).
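Written out, the hidden-state update described above is the standard RNN recurrence (here W denotes the recurrent hidden-to-hidden weights and σ a non-linearity such as tanh; this is the conventional formulation, reconstructed from the description rather than taken verbatim from the figure):

$$
s_i = \sigma(U x_i + W s_{i-1} + b)
$$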


The final output ŷ (y_hat) from the network is a softmax function of the hidden representation and the weights associated with it, along with a bias.
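Putting the recurrence and the softmax output together, a minimal NumPy sketch of the forward pass for sequence classification could look like the following; the dimensions are arbitrary toy values, and the parameter names U, W, V, b, c simply follow the notation above:

```python
import numpy as np

# Toy dimensions: vocabulary size, hidden size, number of classes.
vocab_size, hidden_size, num_classes = 10, 8, 2

rng = np.random.default_rng(0)
U = rng.normal(size=(hidden_size, vocab_size))   # input -> hidden
W = rng.normal(size=(hidden_size, hidden_size))  # hidden -> hidden
V = rng.normal(size=(num_classes, hidden_size))  # hidden -> output
b = np.zeros(hidden_size)
c = np.zeros(num_classes)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(sequence):
    """sequence: list of one-hot word vectors of length vocab_size."""
    s = np.zeros(hidden_size)                  # s_0
    for x in sequence:
        s = np.tanh(U @ x + W @ s + b)         # s_i = σ(U x_i + W s_{i-1} + b)
    return softmax(V @ s + c)                  # ŷ = softmax(V s_T + c)

# Example: a random sequence of 5 one-hot word vectors.
seq = [np.eye(vocab_size)[i] for i in rng.integers(0, vocab_size, size=5)]
print(classify(seq))   # probability over the two sentiment classes
```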

Sequence Labeling

In sequence labeling, we have to predict the output at each time step, unlike sequence classification, where the prediction is made only at the end.


The mathematical formulation varies slightly from sequence classification: in this approach, we predict the output after each time step.


Once we compute the hidden representation, the output yᵢ at a particular time step is a softmax function of the hidden representation and the weights associated with it, along with a bias. In the same way, we compute the hidden state and the predicted output for each and every time step in the sequence.
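In equation form, with the hidden-state recurrence unchanged from the classification case, the per-step output is (again the conventional formulation):

$$
\hat{y}_i = \text{softmax}(V s_i + c)
$$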

Loss Function

The purpose of the loss function is to quantify how far the model’s predictions are from the true labels, so that the learning process knows how much correction is needed.

In the context of the sequence classification problem, to compare two probability distributions (the true distribution and the predicted distribution) we use the cross-entropy loss function. The loss is the negative sum, over classes, of the true probability multiplied by the log of the predicted probability.
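For a single example with true distribution p and predicted distribution q, this is:

$$
\mathcal{L} = -\sum_{c} p_c \log q_c
$$

Because the true distribution is one-hot (all of its probability mass sits on the correct class), this reduces to minus the log of the predicted probability of the true class.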


For m training samples, the total loss is the average of the per-example losses (where c indicates the correct, or true, class).
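In symbols, writing ŷ⁽ⁱ⁾ for the predicted distribution of the i-th example and c for its true class:

$$
\mathcal{L}_{\text{total}} = -\frac{1}{m} \sum_{i=1}^{m} \log \hat{y}^{(i)}_{c}
$$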


In the sequence labeling problem, we have to make a prediction at every time step, which means that at every time step we have a true distribution and a predicted distribution.


Since we are predicting labels at every time step, there is a possibility of making an error at each of them. So we have to compare the true probability distribution and the predicted probability distribution at every time step to calculate the loss of the model.

In effect, for all m training examples and for all T time steps, we try to minimize the cross-entropy loss between the predicted distribution and the true distribution.
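Extending the classification loss across time steps (with cₜ denoting the true label at time step t; whether one also averages over T is a normalization choice):

$$
\mathcal{L}_{\text{total}} = -\frac{1}{m} \sum_{i=1}^{m} \sum_{t=1}^{T} \log \hat{y}^{(i)}_{t,\,c_t}
$$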


Learning Algorithm

The objective of the learning algorithm is to determine the best possible values for the parameters, such that the overall loss (the cross-entropy loss defined above) of the model is minimized as much as possible. Here goes the learning algorithm:


We initialize W, U, V and b randomly. We then iterate over all the observations in the data; for each observation, we find the predicted outcome using the RNN equations and compute the overall loss. Based on the loss value, we update the weights such that the overall loss of the model at the new parameters is lower than the current loss.

We keep performing the update operation until we are satisfied, where “satisfied” could mean any of the following:

  • The overall loss of the model becomes zero.
  • The overall loss of the model becomes a very small value closer to zero.
  • Iterating for a fixed number of passes based on computational capacity.
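A compact sketch of this loop is shown below. Since the article does not prescribe a framework, this version uses PyTorch so that the gradients (backpropagation through time) come from autograd; the data, learning rate, and number of passes are placeholder toy values:

```python
import torch
import torch.nn as nn

# Toy setup mirroring the article's notation: U (input -> hidden),
# W (hidden -> hidden), V (hidden -> output), plus biases.
vocab_size, hidden_size, num_classes = 10, 8, 2

U = torch.randn(hidden_size, vocab_size, requires_grad=True)
W = torch.randn(hidden_size, hidden_size, requires_grad=True)
V = torch.randn(num_classes, hidden_size, requires_grad=True)
b = torch.zeros(hidden_size, requires_grad=True)
c = torch.zeros(num_classes, requires_grad=True)
params = [U, W, V, b, c]

def forward(sequence):
    s = torch.zeros(hidden_size)
    for x in sequence:                       # x: one-hot word vector
        s = torch.tanh(U @ x + W @ s + b)    # s_i = σ(U x_i + W s_{i-1} + b)
    return V @ s + c                         # logits; softmax happens inside the loss

# A couple of fake (sequence, label) pairs just to make the loop runnable.
data = [
    ([torch.eye(vocab_size)[i] for i in (1, 4, 2)], torch.tensor(0)),
    ([torch.eye(vocab_size)[i] for i in (3, 3, 7, 5)], torch.tensor(1)),
]

loss_fn = nn.CrossEntropyLoss()              # cross-entropy, as in the loss section
lr, num_passes = 0.05, 100                   # "until satisfied": fixed number of passes

for _ in range(num_passes):
    for sequence, label in data:
        logits = forward(sequence)
        loss = loss_fn(logits.unsqueeze(0), label.unsqueeze(0))
        loss.backward()                      # gradients via backprop through time
        with torch.no_grad():
            for p in params:
                p -= lr * p.grad             # step toward a lower loss
                p.grad = None
```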
