
Building A Mental Model for Backpropagation


Backpropagation is the beating heart of deep learning, and a solid understanding of it is required for any practitioner. Although there are already plenty of good resources on the internet that explain it, they approach it from very different angles and each suits a particular type of audience. In this post, I combine intuition, animated graphs, and code so that beginner and intermediate students of deep learning can digest it more easily. A good test of whether you understand an algorithm is whether you can code it yourself from scratch. After reading this post, you should have an idea of how to implement your own version of backpropagation in Python.

Real-valued circuits, and a force of correction

Mathematically, backpropagation is the process of computing gradients for the components of a function by applying the chain rule. In the case of neural networks, the function of interest is the loss function. I like the interpretation Andrej Karpathy uses in CS231n: view the compute graph as a real-valued circuit made of gates, analogous to the logic gates in a digital circuit. The gates are the operations in the function, e.g. addition, multiplication, exponentiation, matrix multiplication, etc.

This is a great mental model because it means backpropagation is a local process. Every gate in the circuit diagram gets some inputs and can right away compute:

(1) its output value

(2) the local gradient of its output with respect to its inputs

Keep in mind that this is still the forward pass of the neural net. Quoting the CS231n note:

In the forward pass, the gates can do this completely independently without being aware of any of the details of the full circuit that they are embedded in.

However, once the forward pass is over, during the backward pass (backpropagation) the gate learns the gradient of the final output of the entire circuit with respect to its own output value. It then applies the chain rule: it takes that gradient and multiplies it into each of the local gradients it computed for its inputs. The backward pass can be implemented recursively, traversing from the output of the circuit back to all of the inputs.

Intuitively, the final effect of backprop and its associated weight updates is that the circuit “wants” to output a value that’s closer to whatever target value we have. Take the gradient of the add gate (-4) in the graph above as an example: it means that changing q by +1 will result in a change of -4 in f. If we’d like a higher f, we could make q lower. This is essentially what a gradient is. People sometimes call it “sensitivity”. Another great metaphor is a force of correction: the sign of the gradient indicates the direction of correction, and the magnitude indicates the strength.
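
To make this concrete, here is a tiny numerical sketch of such a circuit. I’m assuming the classic f = (x + y) * z example from CS231n, with z = -4 so that it matches the -4 gradient above:

x, y, z = -2.0, 5.0, -4.0

# forward pass
q = x + y                # add gate: q = 3
f = q * z                # multiply gate: f = -12

# backward pass (chain rule), starting from df/df = 1
df_dq = z                # -4: nudging q up by 1 pulls f down by 4
df_dz = q                #  3
df_dx = 1.0 * df_dq      # the add gate routes the gradient to both of its inputs
df_dy = 1.0 * df_dq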

Backpropagation can thus be thought of as gates communicating to each other (through the gradient signal) whether they want their outputs to increase or decrease (and how strongly), so as to make adjustments to the final output value.

From function to compute graph

One of the best ways to visualize backprop is to draw the compute graph of the function. Let’s look at this weird function below to demonstrate how to draw its compute graph and then do backprop on it manually.

To calculate its gradients, we can decompose it into add, sigmoid, square gates as shown below:

Concretely, the process consists of three high-level steps:

  1. Build the compute graph from operations (gates)
  2. Run the forward pass through each operation
  3. Run the backward pass based on (1) the values computed in the forward pass, and (2) the backward function for each gate to calculate their local gradients

As shown in the animation above, you can follow along and manually calculate the values. I’ll show how to implement it in the last section, but now let’s look at a trick that will help us simplify the process.

Staged computation

Any kind of differentiable function can act as a gate, and we can group multiple gates into a single gate whenever it is convenient.

Since we never have to derive the gradient of the full expression analytically, choosing which pieces of the function to treat as gates becomes a design decision. Take the sigmoid function for example:

We can decompose it into add, multiply, negate, exponentiate, reciprocal gates individually as shown in the following compute graph:

A simple sigmoid already involves this many operations and gradients, which seems unnecessarily complex. What we could do instead is have a single sigmoid gate, applied to the output of the red box, together with a function that calculates its gradient. The gradient of the sigmoid is very simple:
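
dσ(x)/dx = σ(x) (1 − σ(x)), i.e. it can be computed directly from the sigmoid’s own output.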

This way we avoid a lot of unnecessary computation. It saves time, space, and energy; it also makes the code more modular, easier to read, and less prone to numerical problems.

The code for the sigmoid gate could look something like:
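
import math

class SigmoidGate:
    # a rough sketch: one gate with a forward output and a local-gradient function
    # (the class and method names here are illustrative, not a definitive implementation)
    def forward(self, x):
        self.out = 1.0 / (1.0 + math.exp(-x))    # cache the output for the backward pass
        return self.out

    def backward(self, upstream_grad):
        # local gradient sigma * (1 - sigma), chained with the upstream gradient
        return self.out * (1.0 - self.out) * upstream_grad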

It has both the forward pass output and the function to compute the local gradient for the backward pass in just a few lines of code. In the next section, I will fit this into the bigger picture and show how to write a mini autograd library with such components.

Autograd: a recursive approach

To let a computer calculate the gradients of any function expressed as a directed acyclic graph (DAG) using the chain rule, we need to write code for the three high-level steps mentioned before. Such a program is often called automatic differentiation, or autograd. As you’ll see next, we can structure our code around a Tensor class that defines the data and operations, so that it can not only build dynamic compute graphs on the fly but also backpropagate through them recursively.

Forward pass: building the graph

A Tensor object has data, grad, a _backward() function, a _prev set of Tensor nodes, and an _op operation. When we execute an expression, it builds the compute graph on the fly, since we have overridden Python operators such as +, * and ** with customized dunder methods. The current Tensor’s _backward(), _prev and _op are defined by its parent(s), i.e. the Tensor(s) that produced it. For example, c in c = a + b has its _backward() defined in __add__, with _prev = {a, b} and _op = '+'. This way we can define any operation we want and let Python construct the graph.
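
Here is a minimal sketch of such a Tensor class, modeled on micrograd (scalar values only, and only a few operations shown):

class Tensor:
    def __init__(self, data, _prev=(), _op=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None    # overwritten by the operation that produces this node
        self._prev = set(_prev)          # the parent Tensor(s) that produced this one
        self._op = _op

    def __add__(self, other):
        other = other if isinstance(other, Tensor) else Tensor(other)
        out = Tensor(self.data + other.data, (self, other), '+')
        def _backward():
            self.grad += out.grad        # d(out)/d(self) = 1
            other.grad += out.grad       # d(out)/d(other) = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Tensor) else Tensor(other)
        out = Tensor(self.data * other.data, (self, other), '*')
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __pow__(self, n):                # n is a plain number in this sketch
        out = Tensor(self.data ** n, (self,), f'**{n}')
        def _backward():
            self.grad += n * self.data ** (n - 1) * out.grad
        out._backward = _backward
        return out

    def __neg__(self):                   # -a  is  a * -1
        return self * -1

    def __sub__(self, other):            # a - b  is  a + (-b)
        return self + (-other)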

Here we are talking about neural networks, so the expression we care about is the loss function. Take the MSE loss for example (using a single scalar data point for simplicity): MSELoss = (wx - y)**2. The graph is:

Notice that subtraction is actually negation and addition. I named the intermediate nodes for illustration purposes only. With the graph ready, we can implement backprop!
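
Before we do, here is what building this graph could look like with the Tensor sketch above (the scalar values are arbitrary):

w, x, y = Tensor(2.0), Tensor(3.0), Tensor(5.0)
pred = w * x                 # _op = '*', _prev = {w, x}
loss = (pred - y) ** 2       # subtraction expands into negation + addition
print(loss.data)             # (2*3 - 5)**2 = 1.0

Each operator call creates a new Tensor whose _prev points back at its parents, which is exactly the graph drawn above.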

Backward pass: topological sort

backward should compute gradients one by one, starting from the current Tensor node and moving to its ancestors. The order of traversal needs to be topologically sorted to make sure each node’s gradient is complete before it is propagated further back. One simple way to implement this topological sort is a depth-first search. Here’s an animation showing how it’s executed for the MSELoss example.
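
Assuming the Tensor sketch from the previous section, backward could look something like this:

def backward(self):
    # topologically sort the graph with a depth-first search
    topo, visited = [], set()
    def build_topo(node):
        if node not in visited:
            visited.add(node)
            for parent in node._prev:
                build_topo(parent)
            topo.append(node)
    build_topo(self)

    self.grad = 1.0                  # seed the chain rule: d(out)/d(out) = 1
    for node in reversed(topo):      # walk back in reverse topological order
        node._backward()

Calling loss.backward() on the MSELoss example above then fills in every node’s grad; with the toy values used there, w.grad comes out to 2 * (wx - y) * x = 6.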

This is graph traversal 101: a plain old DFS. If you have a big, complex neural net, you just replace the wx node in the upper-left corner with your bigger graph, and the DFS will traverse it and compute the gradients, provided that all the operations and their _backward are defined in the Tensor class. Some of the obvious operations we need here include max, sum, matrix multiplication, transpose, etc. To use the gradients to update your weights, do:

for p in parameters:    # the list of Tensor objects holding your model's weights
    p.data -= learning_rate * p.grad

There you have it: the autograd algorithm in a few lines of code. It is the backbone of any deep learning framework. Now that the hard part is behind us, all that’s left to complete your mini deep learning framework is a model interface with a collection of layer types, loss functions, and some optimizers.
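
For example, the simplest optimizer could just wrap the update loop above. Here is a sketch (the SGD name and the step / zero_grad interface are mine, loosely mirroring PyTorch):

class SGD:
    def __init__(self, parameters, learning_rate=0.01):
        self.parameters = parameters          # the Tensor objects holding the weights
        self.learning_rate = learning_rate

    def step(self):
        for p in self.parameters:
            p.data -= self.learning_rate * p.grad

    def zero_grad(self):
        # gradients accumulate with +=, so reset them before the next backward pass
        for p in self.parameters:
            p.grad = 0.0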

Recommended readings

This post is heavily inspired by Andrej Karpathy’s awesome CS231n lectures and his beautifully written micrograd, the tiniest autograd engine. If you would like a different and more extended version of a DIY deep learning framework that closely resembles PyTorch, check out the one implemented in Grokking Deep Learning by Andrew Trask. If you prefer diving into PyTorch’s autograd directly, Elliot Waite has a great video on YouTube that will save you a lot of time digging into the source code. Keep on learning!
