Learning Parameters Part 4: Tips For Adjusting Learning Rate, Line Search
source link: https://www.tuicool.com/articles/yUrIj2a
Before moving on to advanced optimization algorithms let us revisit the problem of learning rate in gradient descent.
Sep 27 · 4 min read
In part 3, we looked at stochastic and mini-batch versions of the optimizers. In this post, we will look at some commonly followed heuristics for tuning the learning rate and momentum. If you are not interested in these heuristics, feel free to skip to part 5 of the Learning Parameters series.
Citation Note: Most of the content and figures in this blog are directly taken from Lecture 5 of CS7015: Deep Learning course offered by Prof. Mitesh Khapra at IIT-Madras.
One could argue that we could have solved the problem of navigating gentle slopes by setting the learning rate high (i.e., blow up the small gradient by multiplying it with a large learning rate η ). This seemingly trivial idea does sometimes work at gentle slopes of the error function, but it fails to work when the error surface is flat. Here’s an example:
Clearly, in regions with a steep slope, the already large gradient blows up further and the large learning rate helps the cause; but as soon as the error surface flattens, it doesn't help much. It would be safe to assume that it is always good to have a learning rate that can adjust to the gradient, and we will see a few such algorithms in the next post (part 5) of the Learning Parameters series.
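To see why a large η cannot rescue a truly flat region, consider a toy sigmoid neuron (my illustration, not a figure from the lecture). In the saturated regime the gradient is nearly zero, so multiplying it by even a very large learning rate still produces a negligible update:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(w, x, y):
    """d/dw of the squared error (sigmoid(w*x) - y)^2."""
    p = sigmoid(w * x)
    return 2 * (p - y) * p * (1 - p) * x

# Deep in the saturated (flat) region of the sigmoid:
g = gradient(w=15.0, x=1.0, y=0.0)
print(g)           # on the order of 1e-7
print(1000 * g)    # even eta = 1000 barely moves w
```

The culprit is the p * (1 - p) factor: once the sigmoid saturates, it crushes the gradient toward zero no matter how η is scaled.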
Some Useful Tips
Tips for Initial Learning Rate
- Tune learning rate. Try different values on a log scale: 0.0001, 0.001, 0.01, 0.1, 1.0.
- Run a few epochs with each of these and figure out a learning rate which works best.
- Now do a finer search around this value. For example, if the best learning rate was 0.1 then now try some values around it: 0.05, 0.2, 0.3.
- Disclaimer: these are just heuristics, no clear winner strategy.
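The steps above amount to a coarse-to-fine search. In the sketch below, a throwaway quadratic loss stands in for "run a few epochs and measure"; the `run_gd` helper and the fine grid are illustrative choices of mine, not from the original post:

```python
import numpy as np

def run_gd(lr, steps=20, w0=5.0):
    """Run a few gradient-descent steps on a toy loss f(w) = w^2 and
    return the final loss; stands in for 'train a few epochs'."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w          # gradient of w^2 is 2w
        if not np.isfinite(w):   # diverged: treat as infinitely bad
            return np.inf
    return w ** 2

# Coarse search on a log scale
coarse = [0.0001, 0.001, 0.01, 0.1, 1.0]
best = min(coarse, key=run_gd)

# Finer search around the coarse winner
fine = [best * f for f in (0.5, 0.75, 1.0, 1.5, 2.0, 3.0)]
best_fine = min(fine, key=run_gd)
print(best, best_fine)
```

On a real model you would replace `run_gd` with a short training run and compare validation losses instead.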
Tips for Annealing Learning Rate
Step Decay
- Halve the learning rate after every 5 epochs
- Halve the learning rate after an epoch if the validation error is more than what it was at the end of the previous epoch
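Both variants are easy to express as small schedule functions. The names `step_decay` and `halve_on_plateau` are mine, as a sketch:

```python
def step_decay(eta0, epoch, drop_every=5, factor=0.5):
    """Halve the learning rate after every `drop_every` epochs."""
    return eta0 * factor ** (epoch // drop_every)

def halve_on_plateau(eta, prev_val_err, val_err, factor=0.5):
    """Halve eta if this epoch's validation error exceeds last epoch's."""
    return eta * factor if val_err > prev_val_err else eta
```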
Exponential Decay
- η = η₀e⁻ᵏᵗ, where η₀ and k are hyperparameters and t is the step number.
1/t Decay
- η = (η₀)/(1+kt), where η₀ and k are hyperparameters and t is the step number.
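Both formula-based schedules can be sketched in a couple of lines (the function names are mine):

```python
import math

def exp_decay(eta0, k, t):
    """Exponential decay: eta = eta0 * e^(-k t)."""
    return eta0 * math.exp(-k * t)

def one_over_t_decay(eta0, k, t):
    """1/t decay: eta = eta0 / (1 + k t)."""
    return eta0 / (1 + k * t)
```

Both start at η₀ and shrink with the step number t, but for the same k the exponential schedule decays much faster.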
Tips for Momentum
The following schedule was suggested by Sutskever et al., 2013:

γ_t = min(1 − 2^(−1 − log₂(⌊t/250⌋ + 1)), γ_max)

where γ_max was chosen from {0.999, 0.995, 0.99, 0.9, 0}.
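A sketch of the Sutskever et al. (2013) schedule as code (the function name is mine): momentum starts at 0.5 and creeps toward γ_max as the step count grows, updating every 250 steps.

```python
import math

def momentum_schedule(t, gamma_max=0.999):
    """gamma_t = min(1 - 2^(-1 - log2(floor(t/250) + 1)), gamma_max)."""
    return min(1.0 - 2.0 ** (-1.0 - math.log2(t // 250 + 1)), gamma_max)

# gamma grows toward gamma_max as training proceeds
print([round(momentum_schedule(t), 3) for t in (0, 250, 500, 2500)])
```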
Line Search
In practice, a line search is often done to find a relatively better value of η. In line search, we update w using several different learning rates (η) and, in every iteration, check the updated model's error. Ultimately, we retain the updated w that gives the lowest loss. Take a look at the code:
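The code figure from the original post isn't reproduced here; below is a minimal, self-contained sketch of the same idea on a toy one-parameter sigmoid neuron (the toy data, η grid, and function names are my assumptions, not the lecture's exact code):

```python
import numpy as np

def loss(w, X, y):
    """Mean squared error of a one-parameter sigmoid neuron."""
    pred = 1.0 / (1.0 + np.exp(-w * X))
    return float(np.mean((pred - y) ** 2))

def grad(w, X, y):
    """Gradient of the loss above with respect to w."""
    pred = 1.0 / (1.0 + np.exp(-w * X))
    return float(np.mean(2 * (pred - y) * pred * (1 - pred) * X))

def gd_with_line_search(X, y, w=0.0,
                        etas=(0.1, 0.5, 1.0, 5.0, 10.0), steps=50):
    """At each step, try every candidate eta and retain the
    updated w that gives the lowest loss."""
    for _ in range(steps):
        g = grad(w, X, y)
        w = min((w - eta * g for eta in etas),
                key=lambda wc: loss(wc, X, y))
    return w

X = np.array([0.5, 2.5])
y = np.array([0.2, 0.9])
w_final = gd_with_line_search(X, y)
print(loss(0.0, X, y), loss(w_final, X, y))
```

Note the cost: each step now evaluates the loss once per candidate η, which is exactly the trade-off discussed next.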
Essentially, at each step, we try to use the best η value from the available choices. This is obviously not the best idea: we are doing many more computations in each step, but that is the trade-off for finding a better learning rate. Today, there are cooler ways to do this.
Line Search in Action
Clearly, convergence is faster than vanilla gradient descent (see part 1). We see some oscillations, but notice that these oscillations are quite different from what we see in momentum and NAG (see part 2).
Note: Leslie N. Smith, in his 2015 paper Cyclical Learning Rates for Training Neural Networks, proposed a smarter way than line search. I refer the reader to this Medium post by Pavel Surmenok to read more about it.
Conclusion
In this part of the Learning Parameters series, we looked at some heuristics that can help us tune the learning rate and momentum for better training. We also looked at line search, a once-popular method for finding the best learning rate at every step of the gradient update. In the next (final) part of the series, we will look closely at gradient descent with an adaptive learning rate, specifically the following optimizers: AdaGrad, RMSProp, and Adam.
Acknowledgment
A lot of credit goes to Prof. Mitesh M Khapra and the TAs of the CS7015: Deep Learning course by IIT Madras for such rich content and creative visualizations. I merely compiled the provided lecture notes and lecture videos concisely.