Recurrent Neural Networks Explained

 4 years ago
source link: https://towardsdatascience.com/recurrent-neural-networks-explained-ffb9f94c5e09?gi=b4e149f77616

An entertaining, illustrated guide to the intuition behind recurrent neural networks.

This article gently introduces recurrent units, how their memory works and how they are used to handle sequence data such as text and time series. Have you ever thought of a recurrent neural network as a time machine?

Why care about sequences?

In our previous article, Why Deep Learning Works, we showcased a few artificial neural networks for predicting the harvest that a farmer is likely to produce on their fields.

The model could improve its predictive power by looking not only at one year (e.g. 2019), but at a sequence of years (e.g. 2017, 2018, 2019) at once.

As illustrated below, the model would predict the fourth value given three consecutive values. The first vertical column shows the full sequence of values. The next columns show examples of what the prediction would be (colored in blue).

[Figure: overlapping windows over the sequence, with the predicted values in blue]

P. Protopapas, CS109b, Harvard FAS

This is called an overlapping windowed dataset, since we slide a window over the observations to create new samples. We can imagine building a fully connected neural network with three layers and ReLU activation functions for predicting the next number when given three consecutive numbers.
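As a minimal sketch of the windowing idea, the snippet below builds such a dataset with NumPy. The helper name `make_windows` and the harvest numbers are made up for illustration; they do not come from the original article.

```python
import numpy as np

def make_windows(series, window=3):
    """Split a 1-D series into overlapping (inputs, target) pairs.

    Each row of X holds `window` consecutive values and the matching
    entry of y is the value that immediately follows them.
    """
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - window
    if n_samples <= 0:
        raise ValueError("series must be longer than the window")
    X = np.stack([series[i:i + window] for i in range(n_samples)])
    y = series[window:]
    return X, y

# Hypothetical yearly harvest values, invented for this example.
harvest = [10, 12, 15, 13, 17, 20]
X, y = make_windows(harvest, window=3)

for inputs, target in zip(X, y):
    print(inputs, "->", target)
```

Consecutive rows of `X` share two of their three values, which is what makes the windows overlapping.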

[Figure: fully connected network predicting the next number from three consecutive numbers]
P. Protopapas, CS109b, Harvard FAS

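There are several problems with this approach. A fully connected network treats each input position as just another feature and has no built-in notion of order. If we rearrange the input sequence in the same way for every sample, the network will learn to predict the same result, so it is missing the information carried by the ordering.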

[Figure: rearranged input sequence fed to the fully connected network]
P. Protopapas, CS109b, Harvard FAS

Moreover, a fully connected network requires a fixed input and output size. If we decide to look at 5 consecutive numbers to make our prediction, we will have to build another model.

Another drawback is that a fully connected network cannot produce outputs at multiple places in the sequence. We cannot, for example, predict the next 4 values at once when the network was designed to predict 3 values.

The order of inputs matters. This is true for most sequential data.

We want a machine learning model to understand sequences, not isolated samples. This is particularly important for time series data, where data is intrinsically linked to the notion of time. We provide additional illustrative examples of sequence data below.

[Figure: examples of sequence data]

Sequences are also found in natural language, where sentences are sequences of words. A natural language processing model requires contextual information to understand the subject (he) and the direct object (it) in the sentence below. Given the sequential information, the model would understand that the subject is Joe’s brother and the direct object is the sweater.

[Figure: example sentence with its subject and direct object]

P. Protopapas, CS109b, Harvard FAS

The memory game

In the previous section, we came up with the idea of using context, history and even the future for better predictions. Now let’s look at the concept of memory.

We have probably all been beaten by kids at Pexeso, a game of concentration and memory.

When playing Pexeso, the goal is to find all pairs of identical cards in as few moves as possible. Somehow, kids are excellent at remembering things. Why don’t we give this same super-human capability to a neuron in an artificial neural network?

[Figure: the Pexeso memory game]

Basically, we need to upgrade the neuron into a new computational unit that is able to remember what it has seen before. We’ll call this the unit’s state, hidden state or memory.

[Figure: a neuron extended with a memory state]

P. Protopapas, CS109b, Harvard FAS

How can we build a unit that remembers the past? The memory or state could be written to a file, but it is better to keep it inside the unit, in an array or a vector. When the unit reads an input, it also reads the content of the memory. Using both pieces of information, it makes a prediction and, more importantly, it updates the memory.

We can find this principle in the Pexeso game as well. A kid opens a card, tries to recall the location of the matching card if it was opened earlier, and then decides which card to open next. After opening that card, if it is not a match, the kid updates their memory by recording the positions of the cards just revealed. In the next round of the game, the kid repeats this process, as represented graphically in the image below.

[Figure: the Pexeso player's loop of reading, recalling, deciding and updating memory]

P. Protopapas, CS109b, Harvard FAS

The recurrent unit

In mathematics, this dependence of the current value (event or word) on the previous event(s) is called recurrence and is expressed using recurrence equations.

[Figure: recurrence equations]

A recurrent neural network can be thought of as multiple copies of the same node, each passing a message to a successor. One way to represent the above-mentioned recurrence relationships is to use the diagram below. The little black square indicates that the state used is obtained from the previous time step, in other words a loop where the previous state is fed into the current state.

[Figure: recurrent unit drawn with a loop on its state]

P. Protopapas, CS109b, Harvard FAS

Each unit has three sets of weights: one for the inputs, another for the output of the previous time step (t–1) and a third for the output of the current time step (t). Those weights are just like any other weights that we dealt with in normal artificial neural networks, and they are eventually determined by the training process.

The schematic below shows the inside of a unit that turns 3 numbers into 4 numbers. The input is a 3-number vector. The hidden state or memory is a 5-number vector. The unit uses an internal network of 5 neurons (A1–5) to turn the input into a 5-number vector, combines it with the current state and passes the result through an activation function. The resulting 5-number vector passes through another internal network of 5 neurons (B1–5) to produce the new state. It simultaneously goes through another network of 4 neurons (C1–4) to produce a 4-number output vector.

[Figure: inside of a recurrent unit with sub-networks A1–5, B1–5 and C1–4]

P. Protopapas, CS109b, Harvard FAS
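To make the shapes concrete, here is a minimal NumPy sketch of one step of such a unit with a 3-number input, a 5-number state and a 4-number output. It follows the common formulation in which the input and the previous state are each projected, summed and passed through an activation, rather than wiring every sub-network of the schematic individually. The names `V`, `U`, `W` mirror the notation the article introduces later; the `tanh` activation, the random initialisation and the function name `rnn_step` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 5, 4

# Randomly initialised parameters; training would adjust these values.
V = rng.normal(scale=0.1, size=(n_hidden, n_in))      # input -> state
U = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # previous state -> state
W = rng.normal(scale=0.1, size=(n_out, n_hidden))     # state -> output
b_h = np.zeros(n_hidden)                              # bias of the state neurons
b_y = np.zeros(n_out)                                 # bias of the output neurons

def rnn_step(x, h_prev):
    """Read one input and the previous memory, return new memory and output."""
    h = np.tanh(V @ x + U @ h_prev + b_h)  # combine input with memory
    y = W @ h + b_y                        # read the output from the new memory
    return h, y

h0 = np.zeros(n_hidden)            # empty memory before the first input
x = np.array([1.0, 2.0, 3.0])      # made-up 3-number input
h1, y1 = rnn_step(x, h0)
print(h1.shape, y1.shape)
```

Calling `rnn_step` again with `h1` as the previous state is what carries the memory forward to the next input.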

By combining the current input with the previous state and passing this through an activation function, the network will not just remember the previous inputs. The state will always be a bigger picture of things that have already been seen by the network.

In the Pexeso game, instead of recording the (x, y) coordinates of the cards on the table, a kid would vaguely remember which cards are next to one another, which ones are close to the center, which ones are rather on the left part of the board, which ones have more colors, etc.

Similarly, a unit would not just remember the last 2 or 3 words of a sentence. It would learn how many words, which words, and especially which representation of those words to remember in order to achieve the best prediction.

This is a very simple illustration of how we humans also choose to remember some things (e.g. our mum’s anniversary) and to forget others (our boss’s anniversary).

Backpropagation through time

The whole unit in our previous example has 5+5+4=14 neurons, and therefore 14 biases plus one weight per incoming connection (3×5+5×5+5×4=60 weights) to learn during training. How do we find those weights? Will typical gradient descent and backpropagation also work here?

Let’s have some ugly mathematics now. First, we will call V the weights applied to the input, U the weights applied to the state and W the weights applied to the output. Let’s call h the state. We have two activation functions: g_h, which serves as the activation for the hidden state, and g_y, which is the activation of the output.

[Figure: recurrent unit annotated with the weights V, U and W]

P. Protopapas, CS109b, Harvard FAS

The recurrent unit is fully described by the following two equations, where b and b’ represent the biases of the output neurons and state neurons respectively.

[Figure: the two equations of the recurrent unit]
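The equation image is not reproduced in this copy. Based on the definitions in the text (V for the input weights, U for the state weights, W for the output weights, h for the state, g_h and g_y for the activations, b for the output bias and b’ for the state bias), the two equations most likely take the standard form below. The symbols x_t for the input and \hat{y}_t for the output at time step t are assumed names.

```latex
\begin{aligned}
h_t       &= g_h\!\left(V x_t + U h_{t-1} + b'\right) \\
\hat{y}_t &= g_y\!\left(W h_t + b\right)
\end{aligned}
```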

In order to find the weights by applying stochastic gradient descent, as we covered in our other article, we need to calculate the loss function and its derivative with respect to the weights.

Below you can see how this is done for W, the output weights. The loss for the whole network is the sum of the individual losses of the recurrent units, one per time step.

[Figure: derivative of the loss with respect to the output weights W]

Similarly, we can calculate the derivative of the loss function with respect to the state weights U as follows.

[Figure: derivative of the loss with respect to the state weights U]
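The derivative images are not reproduced in this copy. As a hedged reconstruction, the textbook form of backpropagation through time is written below for a unit with state h_t = g_h(V x_t + U h_{t-1} + b') and output \hat{y}_t = g_y(W h_t + b). The per-step loss symbol L_t and the indices k and j are assumed notation, and the last factor in the second derivative denotes the immediate dependence of h_k on U, with h_{k-1} held fixed.

```latex
\begin{aligned}
L &= \sum_{t} L_t \\[4pt]
\frac{\partial L}{\partial W}
  &= \sum_{t} \frac{\partial L_t}{\partial \hat{y}_t}\,
              \frac{\partial \hat{y}_t}{\partial W} \\[4pt]
\frac{\partial L}{\partial U}
  &= \sum_{t} \sum_{k=1}^{t}
     \frac{\partial L_t}{\partial \hat{y}_t}\,
     \frac{\partial \hat{y}_t}{\partial h_t}
     \left( \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right)
     \frac{\partial h_k}{\partial U}
\end{aligned}
```

The output weights only affect the loss through the output of the same time step, whereas the state weights influence every later state, which is where the product over time steps comes from.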

That wasn’t so bad. We now know how to train a recurrent neural network. But there is a problem.

In the highlighted component of the derivative above, you can see a long product of terms. Over many time steps, this product tends to become either very large or very small, causing exploding or vanishing gradients.
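To get a feel for how quickly such a product grows or shrinks, here is a toy sketch. It assumes, purely for illustration, that every time step contributes the same scalar factor (0.9 or 1.1); in a real network the factors are matrices that differ from step to step.

```python
def product_of_factors(factor, steps):
    """Multiply the same per-step factor `steps` times.

    This stands in for the chain of state-to-state derivatives that
    appears in backpropagation through time.
    """
    result = 1.0
    for _ in range(steps):
        result *= factor
    return result

for steps in (10, 50, 100):
    shrinking = product_of_factors(0.9, steps)  # factor slightly below 1
    growing = product_of_factors(1.1, steps)    # factor slightly above 1
    print(f"{steps:3d} steps: 0.9 -> {shrinking:.2e}, 1.1 -> {growing:.2e}")
```

Even factors this close to 1 drift by several orders of magnitude within 100 steps, which is why long sequences are the hard case.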

When gradients explode, an update may overshoot the minimum and undo a lot of the work that was already done. Therefore, it is common practice to clip the gradients to an acceptable interval and to choose an activation function that does not produce overly small gradients.

In practice, plain RNNs struggle to learn long-range dependencies because of this problem, as we will demonstrate in our practical experimentation with classifying real-world customer reviews. If you want to implement your own RNN from scratch in Python, check this excellent article by Victor Zhou.
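Clipping can be sketched in a few lines of NumPy. The first helper clips each gradient component to an interval, which matches the description above; the second rescales the whole gradient by its norm, a common alternative that the article does not mention. The function names and the thresholds are hypothetical.

```python
import numpy as np

def clip_by_value(grad, limit):
    """Clip every component of the gradient to the interval [-limit, limit]."""
    return np.clip(grad, -limit, limit)

def clip_by_norm(grad, max_norm):
    """Rescale the gradient so its Euclidean norm is at most max_norm.

    Unlike clipping by value, this keeps the direction of the gradient.
    """
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

g = np.array([30.0, 40.0])       # made-up exploding gradient, norm 50
print(clip_by_value(g, 1.0))
print(clip_by_norm(g, 5.0))
```

Gradients that are already within the limit pass through both helpers unchanged.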

Recurrent Neural Network

Now, we know how a single recurrent unit works. By chaining several units one after another, we are able to process a sequence of inputs.

Below, we illustrate how a recurrent neural network would take a sequence of observations and predict if it will rain or not.

At time t, the network is presented with the information ‘dog barking’ and its memory is empty. It therefore predicts rain with probability 0.3. The network stores a representation of ‘dog barking’ in its memory for the next step.

At time t+1, it receives a new piece of information, ‘white shirt’, which decreases the likelihood of rain to 0.1. Now the memory has a representation of both ‘dog barking’ and ‘white shirt’.

Next, at time t+2, the network receives ‘apple pie’ as information. This does not change its prediction, but the memory is updated by pushing out ‘white shirt’.

At time t+3, the input ‘knee hurts’ increases the prediction to 0.6 and overwrites ‘apple pie’ in the memory. The final input of the sequence is ‘get dark’, which pushes the final prediction to 0.9.

[Figure: rain prediction and memory content at each step of the input sequence]

P. Protopapas, CS109b, Harvard FAS

The example intuitively illustrates how the network intentionally keeps and forgets certain information in its memory when learning from the input sequence. This is a typical many-to-one scenario.

Recurrent neural networks also come in the different flavors depicted below. The many-to-many architecture is typically used to translate a text from one language to another.
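The many-to-one pattern can be sketched as a loop that reuses the same weights at every time step and keeps only the last output. This is an untrained toy with random weights and made-up inputs, so the numbers it produces carry no meaning. The function name `run_rnn`, the `tanh` state activation and the sigmoid on the output are assumptions for illustration.

```python
import numpy as np

def run_rnn(inputs, V, U, W, b_h, b_y):
    """Feed a sequence through one recurrent unit.

    Returns the output at every time step and the final state.
    """
    h = np.zeros(U.shape[0])                 # empty memory before the first input
    outputs = []
    for x in inputs:                         # same weights reused at each step
        h = np.tanh(V @ x + U @ h + b_h)     # update the memory
        logits = W @ h + b_y
        outputs.append(1.0 / (1.0 + np.exp(-logits)))  # squash to (0, 1)
    return np.array(outputs), h

rng = np.random.default_rng(1)
n_in, n_hidden, n_out = 3, 5, 1
V = rng.normal(scale=0.5, size=(n_hidden, n_in))
U = rng.normal(scale=0.5, size=(n_hidden, n_hidden))
W = rng.normal(scale=0.5, size=(n_out, n_hidden))
b_h, b_y = np.zeros(n_hidden), np.zeros(n_out)

sequence = rng.normal(size=(5, n_in))        # 5 time steps of made-up features
outputs, h_last = run_rnn(sequence, V, U, W, b_h, b_y)
final_prediction = outputs[-1]               # many-to-one: keep the last output

# Feeding the same observations in reverse order leaves a different memory.
_, h_reversed = run_rnn(sequence[::-1], V, U, W, b_h, b_y)
print(final_prediction, np.allclose(h_last, h_reversed))
```

Keeping all of `outputs` instead of only the last row would turn the same loop into a many-to-many setup.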

[Figure: the different flavors of recurrent network architectures]
