Recurrent Neural Networks (RNN) Explained — the ELI5 way
source link: https://towardsdatascience.com/recurrent-neural-networks-rnn-explained-the-eli5-way-3956887e8b75?gi=f7553712f676
Nov 16 ·8min read
Sequence modeling is the task of predicting what word or letter comes next. Sequence models compute the probability of a number of words occurring in a particular sequence. Unlike in feedforward (FNN) and convolutional (CNN) networks, in sequence modeling the current output depends not only on the current input but also on the previous inputs. The length of the input is also not fixed.
Citation Note: The content and the structure of this article are based on my understanding of the deep learning lectures from One-Fourth Labs (PadhAI).
Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are a type of neural network in which the output from the previous step is fed as input to the current step.
RNNs are mainly used for:
- Sequence Classification — Sentiment Classification & Video Classification
- Sequence Labelling — Part of speech tagging & Named entity recognition
- Sequence Generation — Machine translation & Transliteration
Sequence Classification
In this section, we will discuss how to use an RNN for sequence classification. In sequence classification, we are given a corpus of sentences and their corresponding labels, i.e., the sentiment of each sentence, either positive or negative.
In this scenario, we don’t need to produce an output after every word of the input; we only need to judge the mood after reading the entire sentence, i.e., either positive or negative.
As you can see from the above figure, the input sentences are not of equal length. Before we feed the data into the RNN, we need to pre-process it so that the input sequences are of equal length (the input matrix will have a fixed dimension of m × n). The input words should also be converted into one-hot vectors.
Pre-Processing Data
In pre-processing, we define a few special tokens, such as start-of-sequence and end-of-sequence.
Every input sequence is prefixed with the “Start-of-sequence” token <sos> to mark its beginning and suffixed with the “End-of-sequence” token <eos> to mark its end. Since all sequences must have the same length, as defined by the corresponding input layer, padding is applied where needed.
Padding is applied as follows:
- Find the maximum input length across all the sequences (say, 10)
- Add special word <pad> to all shorter sequences so that they become of the same length (10, in this case).
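The padding steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the article's own code: the function name `pad_sequences` and the example sentences are made up, and the maximum length is measured after <sos> and <eos> have been added.

```python
# Special tokens named after the article's <sos>, <eos>, and <pad>.
SOS, EOS, PAD = "<sos>", "<eos>", "<pad>"

def pad_sequences(sentences):
    """Wrap each tokenized sentence in <sos>/<eos>, then right-pad to a common length."""
    wrapped = [[SOS] + list(tokens) + [EOS] for tokens in sentences]
    max_len = max(len(seq) for seq in wrapped)  # the longest sequence sets the length
    return [seq + [PAD] * (max_len - len(seq)) for seq in wrapped]

# Hypothetical toy corpus of two tokenized sentences.
sentences = [
    ["the", "movie", "was", "great"],
    ["awful"],
]
padded = pad_sequences(sentences)
for seq in padded:
    print(seq)
```

After this step every sequence has the same number of tokens, so the corpus can be stored as a fixed-size matrix.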
Once we are done with the pre-processing (adding the special tokens), we convert the words, including the special tokens, into one-hot vectors and feed them into the network.
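As a rough sketch of the one-hot conversion, assuming NumPy is available: the helper names `build_vocab` and `one_hot_encode` and the two example sequences are hypothetical, and the vocabulary is simply every distinct token in the padded data.

```python
import numpy as np

def build_vocab(padded_sequences):
    """Map every distinct token (special tokens included) to an integer index."""
    tokens = sorted({tok for seq in padded_sequences for tok in seq})
    return {tok: idx for idx, tok in enumerate(tokens)}

def one_hot_encode(padded_sequences, vocab):
    """Return an array of shape (num_sequences, seq_len, vocab_size)."""
    num_seqs, seq_len = len(padded_sequences), len(padded_sequences[0])
    encoded = np.zeros((num_seqs, seq_len, len(vocab)))
    for i, seq in enumerate(padded_sequences):
        for t, tok in enumerate(seq):
            encoded[i, t, vocab[tok]] = 1.0  # a single 1 at the token's index
    return encoded

# Hypothetical, already padded input.
padded = [
    ["<sos>", "good", "movie", "<eos>"],
    ["<sos>", "bad", "<eos>", "<pad>"],
]
vocab = build_vocab(padded)
x = one_hot_encode(padded, vocab)
print(x.shape)
```

Each row of `x[i]` is the one-hot vector for one time step of sequence `i`.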
Important points to note about padding:
- Padding is done only to ensure that the input sequences are of uniform size.
- The computations in the RNN are performed only up to the “End-of-sequence” token, i.e., padding is not treated as an input to the network.
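One simple way to honor the second point is to find where <eos> sits and stop there. The sketch below assumes each padded sequence contains exactly one <eos>; the helper name `effective_length` is made up for illustration.

```python
EOS = "<eos>"

def effective_length(padded_sequence):
    """Number of positions the RNN should process: everything up to and including <eos>."""
    return padded_sequence.index(EOS) + 1

padded = ["<sos>", "bad", "<eos>", "<pad>", "<pad>"]
steps = effective_length(padded)
processed = padded[:steps]  # the tokens the recurrence actually sees
print(steps, processed)
```

The trailing <pad> tokens are stored only to keep the input matrix rectangular and never enter the recurrence.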
Sequence Labelling
Part-of-speech tagging is the task of labeling (predicting) the part-of-speech tag for each word in a sequence. Again, in this problem the output at the current time step depends not only on the current input (the current word) but also on the previous input. For example, the probability of tagging the word ‘movie’ as a noun is higher if we know that the previous word is an adjective.
Unlike in sequence classification, in sequence labeling we have to predict an output at each time step, for every word in the sequence. As we can see from the image, since the first sequence has 6 words, we get 6 part-of-speech predictions, based on the structure of the sentence.
Since our input sequences are of varying length, we have to pre-process the data so that they are of equal length. Remember that the RNN starts processing the sequence of words only after it encounters the “Start-of-sequence” token <sos>, and the “End-of-sequence” token <eos> signals to the network that the input has ended and the output needs to be finalized.
Model
In the previous sections, we have discussed some of the tasks where RNN can be used along with the pre-processing steps to perform before feeding data into the model. In this section, we will discuss how to model (approximation function) the true relationship between input and output.
Sequence Classification
As we already know, in sequence classification the output depends on the entire sequence, e.g., predicting the pulse of a movie by analyzing its reviews.
The input to the function is shown in orange and denoted xᵢ. The weights associated with the input are denoted by a matrix U, and the hidden representation (sᵢ) of the word is computed as a function of the output of the previous time step and the current input, along with a bias. The hidden representation is computed at every time step up to the end of the sequence (sₜ).
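In symbols, the description above matches the usual recurrence below. Some names here are assumptions rather than taken from the text: W stands for the recurrent weights applied to the previous hidden state, b for the bias, σ for an elementwise nonlinearity such as tanh or the sigmoid, T for the sequence length, and s₀ for an initial state (commonly a zero vector).

```latex
s_i = \sigma\left(U x_i + W s_{i-1} + b\right), \qquad i = 1, \dots, T
```

The same U, W, and b are reused at every time step, which is what lets the network handle sequences of different lengths.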
The final output (y_hat) of the network is a softmax function of the final hidden representation and the weights associated with it, along with a bias.
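A minimal NumPy sketch of this forward pass for a single sequence follows. Several choices are assumptions made for illustration: tanh as the nonlinearity, a zero initial state, the names W and c for the recurrent weights and the output bias, and the small layer sizes.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_classify(xs, U, W, b, V, c):
    """Run the recurrence over one sequence and classify from the last hidden state.

    xs: array of shape (T, vocab_size), one one-hot row per time step.
    """
    s = np.zeros(W.shape[0])            # assumed initial state s_0 = 0
    for x in xs:
        s = np.tanh(U @ x + W @ s + b)  # s_i from the current input and previous state
    return softmax(V @ s + c)           # y_hat is computed from the final state only

rng = np.random.default_rng(0)
vocab_size, hidden_size, num_classes = 5, 4, 2  # illustrative sizes
U = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)
V = rng.normal(scale=0.1, size=(num_classes, hidden_size))
c = np.zeros(num_classes)

xs = np.eye(vocab_size)[[0, 2, 3, 1]]  # a 4-step sequence of one-hot vectors
y_hat = rnn_classify(xs, U, W, b, V, c)
print(y_hat)
```

With untrained random weights the two class probabilities come out close to each other; only the shape of the computation matters here.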
Sequence Labeling
In sequence labeling, we have to predict the output at each time step, unlike in sequence classification, where a single prediction is made at the end.
The mathematical formulation varies slightly from sequence classification: in this approach, we predict the output after each time step.
Once we compute the hidden representation, the output (yᵢ) of the network at that particular time step is a softmax function of the hidden representation and the weights associated with it, along with a bias. Similarly, we compute the hidden state and the predicted output for every time step in the sequence.
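The per-step prediction can be sketched as follows, again assuming NumPy, a tanh nonlinearity, a zero initial state, and made-up sizes. The only structural difference from classification is that the softmax is applied inside the loop, once per time step.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_label(xs, U, W, b, V, c):
    """Return one predicted tag distribution per time step, shape (T, num_tags)."""
    s = np.zeros(W.shape[0])                 # assumed initial state s_0 = 0
    outputs = []
    for x in xs:
        s = np.tanh(U @ x + W @ s + b)       # hidden state for this time step
        outputs.append(softmax(V @ s + c))   # prediction for this time step
    return np.stack(outputs)

rng = np.random.default_rng(0)
vocab_size, hidden_size, num_tags = 5, 4, 3  # illustrative sizes
U = rng.normal(scale=0.1, size=(hidden_size, vocab_size))
W = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b = np.zeros(hidden_size)
V = rng.normal(scale=0.1, size=(num_tags, hidden_size))
c = np.zeros(num_tags)

xs = np.eye(vocab_size)[[0, 2, 3, 1]]  # a 4-word sequence of one-hot vectors
y_hats = rnn_label(xs, U, W, b, V, c)
print(y_hats.shape)
```

Row `t` of `y_hats` is the predicted distribution over tags for the word at position `t`.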
Loss Function
The purpose of the loss function is to tell the model that some correction needs to be done in the learning process.
In the context of the sequence classification problem, we use the cross-entropy loss function to compare two probability distributions (the true distribution and the predicted distribution). The loss is the negative of the sum, over all classes, of the true probability multiplied by the log of the predicted probability.
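Written out for a single example, with y the true distribution, ŷ the predicted distribution, and k ranging over the classes (symbols chosen here for illustration), the cross-entropy is shown below. When y is one-hot on the correct class c, every term except one vanishes.

```latex
\mathcal{L}(y, \hat{y}) = -\sum_{k} y_k \log \hat{y}_k = -\log \hat{y}_c
```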
For ‘m’ training samples, the total loss is the average of the per-sample losses (where c indicates the correct, or true, class).
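A short NumPy sketch of this averaged loss, assuming the true labels are given as class indices (equivalent to one-hot true distributions). The function name and the example probabilities are invented for illustration.

```python
import numpy as np

def mean_cross_entropy(y_hat, true_classes):
    """Average of -log(predicted probability of the correct class) over m samples.

    y_hat: array of shape (m, num_classes), each row a predicted distribution.
    true_classes: integer array of shape (m,), the correct class c per sample.
    """
    m = y_hat.shape[0]
    correct_probs = y_hat[np.arange(m), true_classes]  # pick y_hat[i, c_i] per row
    return -np.mean(np.log(correct_probs))

# Hypothetical predictions for three samples and two classes.
y_hat = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.5, 0.5]])
true_classes = np.array([0, 1, 0])
loss = mean_cross_entropy(y_hat, true_classes)
print(loss)
```

Confident correct predictions contribute values near zero, while the uncertain third sample contributes the most.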
In the sequence labeling problem, we have to make a prediction at every time step, which means that at every time step we have a true distribution and a predicted distribution.
Since we are predicting a label at every time step, there is a possibility of making an error at each one. So we have to compare the true and predicted probability distributions at every time step to calculate the loss of the model.
In effect, over all the training examples (m of them) and all the time steps (T), we try to minimize the cross-entropy loss between the true distribution and the predicted distribution.
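One way to compute such a loss is sketched below with NumPy. The normalization is an assumption: the per-step losses are summed over the T time steps and averaged over the m examples. The sketch also assumes every sequence has exactly T real time steps, so padding is not handled.

```python
import numpy as np

def sequence_labeling_loss(y_hat, true_tags):
    """Cross-entropy summed over time steps and averaged over training examples.

    y_hat: array of shape (m, T, num_tags), predicted distribution per time step.
    true_tags: integer array of shape (m, T), the correct tag at each time step.
    """
    m, T, _ = y_hat.shape
    rows = np.arange(m)[:, None]                  # shape (m, 1)
    cols = np.arange(T)[None, :]                  # shape (1, T)
    correct_probs = y_hat[rows, cols, true_tags]  # shape (m, T)
    return -np.log(correct_probs).sum(axis=1).mean()

# Hypothetical predictions: 2 examples, 2 time steps, 3 tags.
y_hat = np.array([[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
                  [[0.3, 0.3, 0.4], [0.6, 0.3, 0.1]]])
true_tags = np.array([[0, 1],
                      [2, 0]])
loss = sequence_labeling_loss(y_hat, true_tags)
print(loss)
```

Dividing by T as well would give a per-token average instead; either choice leads to the same minimizer when all sequences have the same length.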
Learning Algorithm
The objective of the learning algorithm is to determine the best possible values for the parameters, such that the overall loss (here, the cross-entropy loss defined above) of the model is minimized as much as possible. Here goes the learning algorithm:
We initialize w, u, v and b randomly. We then iterate over all the observations in the data and, for each observation, find the predicted outcome using the RNN equation and compute the overall loss. Based on the loss value, we update the weights so that the overall loss of the model at the new parameters is less than the current loss.
We keep repeating the update operation until we are satisfied. “Satisfied” could mean any of the following:
- The overall loss of the model becomes zero.
- The overall loss of the model becomes a very small value closer to zero.
- Iterating for a fixed number of passes based on computational capacity.
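The loop described above can be sketched end to end on a toy problem. This is an illustration under several assumptions, not the article's method: gradients are estimated with central finite differences rather than backpropagation through time (slow, but short enough to read), the nonlinearity is tanh, the data, sizes, and learning rate are invented, and stopping uses the third criterion, a fixed number of passes.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def overall_loss(params, sequences, labels):
    """Mean cross-entropy of a small tanh RNN classifier over the whole dataset."""
    U, W, b, V, c = params
    total = 0.0
    for xs, label in zip(sequences, labels):
        s = np.zeros(W.shape[0])
        for x in xs:
            s = np.tanh(U @ x + W @ s + b)
        total -= np.log(softmax(V @ s + c)[label])
    return total / len(sequences)

def numerical_gradients(params, sequences, labels, eps=1e-5):
    """Central finite differences, used here only to keep the sketch short."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(p.shape):
            original = p[idx]
            p[idx] = original + eps
            loss_plus = overall_loss(params, sequences, labels)
            p[idx] = original - eps
            loss_minus = overall_loss(params, sequences, labels)
            p[idx] = original  # restore the parameter
            g[idx] = (loss_plus - loss_minus) / (2 * eps)
        grads.append(g)
    return grads

rng = np.random.default_rng(0)
vocab_size, hidden_size, num_classes = 4, 3, 2  # illustrative sizes
eye = np.eye(vocab_size)
# Hypothetical toy data: variable-length one-hot sequences with binary labels.
sequences = [eye[[0, 1, 2]], eye[[0, 3]], eye[[1, 1, 2, 0]], eye[[3, 3]]]
labels = [1, 0, 1, 0]

# Random initialization of all parameters: U, W, b, V, c.
shapes = [(hidden_size, vocab_size), (hidden_size, hidden_size), (hidden_size,),
          (num_classes, hidden_size), (num_classes,)]
params = [rng.normal(scale=0.1, size=shape) for shape in shapes]

learning_rate, max_passes = 0.1, 100
initial_loss = overall_loss(params, sequences, labels)
for _ in range(max_passes):
    grads = numerical_gradients(params, sequences, labels)
    for p, g in zip(params, grads):
        p -= learning_rate * g  # step against the gradient to reduce the loss
final_loss = overall_loss(params, sequences, labels)
print(initial_loss, final_loss)
```

In practice the gradients would come from backpropagation through time or an automatic differentiation library, and the stopping rule could instead check whether the loss has fallen below a small threshold.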