
Breaking down the Lottery Ticket Hypothesis

Source: https://towardsdatascience.com/breaking-down-the-lottery-ticket-hypothesis-ca1c053b3e58?gi=eb58d83e55e1

Distilling the ideas from MIT CSAIL’s intriguing paper: “The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks”.

Nov 26 · 4 min read


Image Credit: https://www.dailymail.co.uk/news/article-3198477/Now-s-hedge-trimmer-just-look-did-health-safety-got-involved.html

One of the most interesting papers I have read recently is “The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks” ( https://arxiv.org/abs/1803.03635 ). I have distilled the contents into a short blog post which may help you quickly grasp the ideas in the paper.

Motivation

Pruning techniques for neural networks can reduce the parameter counts of trained networks by over 90% without compromising accuracy. Networks with fewer parameters have lower storage requirements and can take less time to perform inference.


Simplified example of pruning a trained network. Removing a parameter is equivalent to removing a connection between neurons.

Sounds great, sign me up! But training large networks can be quite expensive. Wouldn’t it be great if we could just train a network with the same topology (shape) as the pruned network? Unfortunately, training a network with the topology of the pruned network yields a network with lower accuracy than the original unpruned network.


Simplified example of pruning a trained network (above). Simplified example of training a network with the topology of the pruned network using a new random weight initialization (below).

The authors of this paper thought to ask why training a network with the topology of the pruned network yields worse performance.

The Lottery Ticket Hypothesis

“A randomly-initialized, dense neural network contains a subnetwork that is initialized such that — when trained in isolation — it can match the test accuracy of the original network after training for at most the same number of iterations.”

They discovered that by preserving the original weight initializations from the unpruned network, you can train a network with the topology of the pruned network and achieve the same or better test accuracy within the same number of training iterations.


Simplified example of pruning a trained network (above). Simplified example of training a network with the topology of the pruned network using the original weight initializations from the unpruned network (below).

The Lottery Analogy

Okay, seems simple enough. But what’s all this about a lottery? Suppose your goal is to find a winning lottery ticket. If you buy one ticket, there is a low chance of winning any money. But if you buy a million tickets, it is likely that some of your tickets will win some money (although this may not be profitable).

In this analogy, buying a lot of tickets is like having an overparameterized neural network for your task. It is common for neural networks to have more than a million parameters! Winning the lottery is equivalent to training a neural network that has high accuracy on your task. Finally, the winning ticket refers to the weight initializations of the pruned subnetwork that achieved high accuracy.

Identifying Winning Tickets

The paper presents two pruning strategies which can find winning tickets (subnetworks which achieve high accuracy).

One-shot pruning

  1. Randomly initialize a neural network
  2. Train the network
  3. Set p% of weights with the lowest magnitude from each layer to 0 (this is the pruning)
  4. Reset the remaining (unpruned) weights to their original random initialization values

Note that connections to the output neurons are pruned at half the rate used for the other layers.
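As a concrete illustration, here is a minimal sketch of the four steps above in PyTorch. This is not the authors' code: the model architecture, layer sizes, and `prune_fraction` are made up for the example, the training loop is only stubbed out, and the half-rate pruning of the output layer is omitted for simplicity.

```python
# Illustrative sketch of one-shot magnitude pruning with weight resetting.
import copy
import torch
import torch.nn as nn

def make_masks(model, prune_fraction=0.2):
    """For each weight matrix, zero out the prune_fraction of entries with the
    smallest magnitude (computed layer by layer)."""
    masks = {}
    for name, param in model.named_parameters():
        if name.endswith("weight"):
            threshold = torch.quantile(param.abs().flatten(), prune_fraction)
            masks[name] = (param.abs() > threshold).float()
    return masks

# 1. Randomly initialize a network and remember the initial weights.
model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
initial_state = copy.deepcopy(model.state_dict())

# 2. Train the network (training loop omitted here).
# train(model, data_loader)

# 3. Prune the p% lowest-magnitude weights in each layer.
masks = make_masks(model, prune_fraction=0.2)

# 4. Reset the surviving weights to their original random initialization
#    and apply the masks, producing the "winning ticket" subnetwork.
model.load_state_dict(initial_state)
with torch.no_grad():
    for name, param in model.named_parameters():
        if name in masks:
            param.mul_(masks[name])
```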

Iterative Pruning

Iterative pruning repeatedly applies the train-prune-reset steps of one-shot pruning, removing a further fraction of the surviving weights in each round. The authors found that iterative pruning yields smaller winning tickets than one-shot pruning does.
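A sketch of the iterative variant is below, reusing the hypothetical `model` and `initial_state` from the previous snippet; the number of rounds and the per-round pruning fraction are illustrative choices, not values from the paper.

```python
# Illustrative iterative pruning loop (not the authors' implementation).
# Each round: train, prune a further fraction of the surviving weights,
# then reset the survivors to their original random initialization.
n_rounds = 5
per_round_fraction = 0.2  # fraction of *surviving* weights pruned each round

mask = {name: torch.ones_like(p)
        for name, p in model.named_parameters() if name.endswith("weight")}

for _ in range(n_rounds):
    # train(model, data_loader)  # training loop omitted for brevity

    with torch.no_grad():
        # Prune: threshold over surviving weights only, so each round removes
        # per_round_fraction of what is still left.
        for name, param in model.named_parameters():
            if name in mask:
                surviving = param.abs()[mask[name] > 0]
                threshold = torch.quantile(surviving, per_round_fraction)
                mask[name] *= (param.abs() > threshold).float()

        # Reset the surviving weights to their original initialization.
        model.load_state_dict(initial_state)
        for name, param in model.named_parameters():
            if name in mask:
                param.mul_(mask[name])
```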

Results

The iterative pruning strategy was applied to fully connected architectures, convolutional architectures, and convolutional architectures with residual connections (ResNet). The authors studied the iterative pruning technique on the MNIST and CIFAR-10 image classification datasets. Their pruning strategy also works with different optimizers (SGD, momentum, Adam), dropout, weight decay, batchnorm, and residual connections.

Pruned subnetworks were 10–20% of the size of the original network and met or exceeded the original test accuracy in at most the same number of training iterations. These results support the lottery ticket hypothesis. Pruned subnetworks also generalized better, as there was a smaller gap between their train and test accuracies.

Discussion

Why do networks converge to using a small fraction of their total parameters to solve a task? Are the "winning" initializations already close to their fully-trained values? Nope. They actually change more during training than the other parameters! The authors surmise that winning initializations might land in a region of the loss landscape that is particularly amenable to optimization. They present the "lottery ticket conjecture," which states that stochastic gradient descent seeks out and trains a winning ticket (subnetwork) within an overparameterized network.

Limitations

This paper presents intriguing results that help us understand a little more about how deep neural networks learn. Unfortunately, the iterative pruning strategy the authors propose does not provide significant practical benefits.

Iterative pruning is computationally expensive as it involves training a network 15 times per trial. This made it hard for the authors to study larger datasets like ImageNet. As future work, the authors want to find more efficient pruning strategies.

Reducing the number of parameters by 80–90% decreases the network's storage requirements, but the weight matrices keep their original dimensions; they are simply sparser. The feedforward step for a pruned subnetwork therefore still requires dense matrix multiplications of the same computational complexity, so training and inference are not significantly more efficient. As future work, the authors suggest that a non-magnitude-based pruning strategy could find smaller pruned subnetworks.
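To make the point about matrix shapes concrete, here is a tiny illustrative snippet (the layer and batch sizes are arbitrary) showing that a magnitude-pruned weight matrix keeps its dense shape, so the forward matrix multiplication costs the same as for the unpruned layer.

```python
import torch

w = torch.randn(300, 784)
mask = (torch.rand_like(w) > 0.9).float()  # keep roughly 10% of the weights
w_pruned = w * mask                        # same shape, mostly zeros

x = torch.randn(64, 784)
out = x @ w_pruned.t()                     # same FLOPs as the unpruned layer
print(w_pruned.shape, out.shape)           # torch.Size([300, 784]) torch.Size([64, 300])
```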

Thanks for reading!

