
Learning Parameters Part 4: Tips For Adjusting Learning Rate, Line Search


Before moving on to advanced optimization algorithms let us revisit the problem of learning rate in gradient descent.

In part 3, we looked at the stochastic and mini-batch versions of the optimizers. In this post, we will look at some commonly followed heuristics for tuning the learning rate and momentum. If you are not interested in these heuristics, feel free to skip to part 5 of the Learning Parameters series.

Citation Note: Most of the content and figures in this blog are directly taken from Lecture 5 of the CS7015: Deep Learning course offered by Prof. Mitesh Khapra at IIT Madras.

One could argue that we could have solved the problem of navigating gentle slopes by setting the learning rate high (i.e., blowing up the small gradient by multiplying it by a large learning rate η). This seemingly trivial idea does sometimes work on gentle slopes of the error function, but it fails when the error surface is flat. Here’s an example:

[Figure: gradient descent with a large learning rate on gentle and flat regions of the error surface]

Clearly, in the regions that have a steep slope, the already large gradient blows up further and the large learning rate helps the cause, but as soon as the error surface flattens, it doesn’t help much. It is therefore safe to assume that it is always good to have a learning rate that can adjust to the gradient, and we will see a few such algorithms in the next post (part 5) of the Learning Parameters series.

Some Useful Tips

Tips for Initial Learning Rate

  • Tune learning rate. Try different values on a log scale: 0.0001, 0.001, 0.01, 0.1, 1.0.
  • Run a few epochs with each of these and figure out a learning rate which works best.
  • Now do a finer search around this value. For example, if the best learning rate was 0.1, try some values around it: 0.05, 0.2, 0.3 (see the sketch after this list).
  • Disclaimer: these are just heuristics; there is no clear winning strategy.
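A minimal sketch of this coarse-to-fine search follows. The train_for_epochs helper below is a toy stand-in (a simple quadratic loss), not code from the original lecture; replace it with your own training loop and validation error.

```python
def train_for_epochs(lr, num_steps=50):
    """Toy stand-in for 'run a few epochs': minimise the quadratic loss
    L(w) = (w - 3)^2 with plain gradient descent and return the final loss.
    Replace this with your own training loop and validation error."""
    w = 0.0
    for _ in range(num_steps):
        grad = 2.0 * (w - 3.0)   # dL/dw
        w -= lr * grad
    return (w - 3.0) ** 2

# Coarse search on a log scale.
coarse_grid = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]
coarse_errors = {lr: train_for_epochs(lr) for lr in coarse_grid}
best_coarse = min(coarse_errors, key=coarse_errors.get)

# Finer search around the best coarse value (e.g. around 0.1).
fine_grid = [0.5 * best_coarse, best_coarse, 2.0 * best_coarse, 3.0 * best_coarse]
fine_errors = {lr: train_for_epochs(lr) for lr in fine_grid}
best_lr = min(fine_errors, key=fine_errors.get)
print("best coarse lr:", best_coarse, "| refined lr:", best_lr)
```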

Tips for Annealing Learning Rate

Step Decay

  • Halve the learning rate after every 5 epochs
  • Halve the learning rate after an epoch if the validation error is greater than it was at the end of the previous epoch (both variants are sketched below)
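Both variants can be expressed as a small scheduling function. Here is a minimal sketch; the epoch loop and the validation errors are placeholders for your own training code.

```python
def step_decay_every_k(eta0, epoch, k=5):
    """Halve the learning rate after every k epochs."""
    return eta0 * (0.5 ** (epoch // k))

def step_decay_on_plateau(eta, val_errors):
    """Halve the learning rate if the latest validation error is worse
    than the one at the end of the previous epoch."""
    if len(val_errors) >= 2 and val_errors[-1] > val_errors[-2]:
        return eta * 0.5
    return eta

# Usage inside a hypothetical training loop:
eta0, eta = 0.1, 0.1
val_errors = []
for epoch in range(20):
    # ... train one epoch with learning rate eta, then record the
    # validation error; dummy values are used here for illustration ...
    val_errors.append(1.0 / (epoch + 1))
    eta = step_decay_on_plateau(eta, val_errors)
    # or, for the fixed schedule: eta = step_decay_every_k(eta0, epoch, k=5)
```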

Exponential Decay

  • η = η₀e⁻ᵏᵗ, where η₀ and k are hyperparameters and t is the step number

1/t Decay

  • η = (η₀)/(1+kt), where η₀ and k are hyperparameters and t is the step number (both decay schedules are sketched below).
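A minimal sketch of both annealing schedules, with η₀ and k the hyperparameters from the formulas above:

```python
import math

def exponential_decay(eta0, k, t):
    """eta = eta0 * exp(-k * t)"""
    return eta0 * math.exp(-k * t)

def one_over_t_decay(eta0, k, t):
    """eta = eta0 / (1 + k * t)"""
    return eta0 / (1.0 + k * t)

# Both start at eta0 and shrink as the step number t grows.
for t in [0, 100, 1000, 10000]:
    print(t, exponential_decay(0.1, 1e-3, t), one_over_t_decay(0.1, 1e-3, t))
```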

Tips for Momentum

The following schedule was suggested by Sutskever et al., 2013:

γ_t = min(1 − 2^(−1 − log₂(⌊t/250⌋ + 1)), γ_max)

where γ_max was chosen from {0.999, 0.995, 0.99, 0.9, 0}.
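A minimal sketch of this schedule (the constant 250 is the bucketing used in the paper; t is the update step):

```python
import math

def momentum_schedule(t, gamma_max=0.99):
    """Sutskever et al. (2013) momentum schedule:
    gamma_t = min(1 - 2^(-1 - log2(floor(t/250) + 1)), gamma_max)."""
    gamma_t = 1.0 - 2.0 ** (-1.0 - math.log2(t // 250 + 1))
    return min(gamma_t, gamma_max)

# Momentum ramps up from 0.5 toward gamma_max as training progresses.
for t in [0, 250, 1000, 10000]:
    print(t, round(momentum_schedule(t), 4))
```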

Line Search

In practice, a line search is often done to find a relatively better value of η. In line search, we update w using several different learning rates (η) at every iteration and check the updated model’s error for each. Ultimately, we retain the updated value of w that gives the lowest loss.
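Take a look at the code below: a minimal sketch of this procedure, with a toy quadratic loss standing in for the model (the helper names are illustrative, not the original lecture’s code):

```python
def line_search_gd(w, loss_fn, grad_fn, etas=(0.1, 0.5, 1.0, 5.0, 10.0), steps=100):
    """Gradient descent where, at every step, each candidate learning rate
    is tried and the update with the lowest loss is kept."""
    for _ in range(steps):
        g = grad_fn(w)
        candidates = [w - eta * g for eta in etas]   # one tentative update per eta
        w = min(candidates, key=loss_fn)             # keep the best one
    return w

# Toy example: minimise f(w) = (w - 3)^2.
loss_fn = lambda w: (w - 3.0) ** 2
grad_fn = lambda w: 2.0 * (w - 3.0)
print(line_search_gd(0.0, loss_fn, grad_fn))   # converges to w = 3.0
```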

Essentially, at each step we try to use the best η value from the available choices. This is obviously not the most efficient idea: we do many more computations in each step, but that’s the trade-off for finding a better learning rate. Today, there are cooler ways to do this.

Line Search in Action

[Figure: line search in action on the error surface]

Clearly, convergence is faster than with vanilla gradient descent (see part 1). We see some oscillations, but notice that these oscillations are quite different from what we see with momentum and NAG (see part 2).

Note: Leslie N. Smith, in his 2015 paper Cyclical Learning Rates for Training Neural Networks, proposed a smarter way than line search. I refer the reader to this Medium post by Pavel Surmenok to read more about it.
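For reference, here is a minimal sketch of the triangular cyclical schedule from that paper; base_lr, max_lr, and step_size are hyperparameters you would tune for your problem.

```python
def triangular_clr(t, base_lr=1e-4, max_lr=1e-2, step_size=2000):
    """Triangular cyclical learning rate (Smith): the learning rate sweeps
    linearly from base_lr up to max_lr and back down, completing one full
    cycle every 2 * step_size iterations."""
    cycle = t // (2 * step_size)
    x = abs(t / step_size - 2 * cycle - 1)   # position within the cycle, in [0, 1]
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

for t in [0, 1000, 2000, 3000, 4000]:
    print(t, triangular_clr(t))
```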

Conclusion

In this part of the Learning Parameters series, we looked at some heuristics that can help us tune the learning rate and momentum for better training. We also looked at line search, a once-popular method for finding the best learning rate at every step of the gradient update. In the next (final) part of the series, we will take a closer look at gradient descent with adaptive learning rates, specifically the following optimizers: AdaGrad, RMSProp, and Adam.


Acknowledgment

A lot of credit goes to Prof. Mitesh M. Khapra and the TAs of the CS7015: Deep Learning course at IIT Madras for such rich content and creative visualizations. I merely compiled the provided lecture notes and lecture videos concisely.
