Don’t look backwards, LookAhead!

 3 years ago
source link: https://towardsdatascience.com/dont-look-backwards-lookahead-6bcd7ff50f93?gi=d02a5f966140

How to make your optimizer less sensitive to the choice of hyperparameters

May 30 · 5 min read

Image by rihaij from Pixabay

The task of an optimizer is to find the set of weights for which a neural network model yields the lowest possible loss. If you only had one weight and a loss function like the one depicted below, you wouldn’t have to be a genius to find the solution.

[Figure: a loss function plotted against a single weight]

Unfortunately, you normally have a multitude of weights and a loss landscape that is far from simple, and that can no longer be drawn in 2D.

The loss surface of ResNet-56 without skip connections, visualized using a method proposed in https://arxiv.org/pdf/1712.09913.pdf.

Finding a minimum of such a function is no longer a trivial task. The most common optimizers, like Adam or SGD, require very time-consuming hyperparameter tuning and can get caught in local minima. The importance of choosing a hyperparameter like the learning rate can be summarized by the following picture:

Too large a learning rate causes oscillations around the minimum, while too small a learning rate makes the learning process very slow.
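The effect is easy to reproduce on a toy problem. The snippet below is a minimal sketch, assuming a one-weight loss f(w) = w² and three illustrative learning rates that are not taken from the article:

```python
def gradient_descent(lr, steps=20, w0=1.0):
    """Minimise f(w) = w**2 (gradient 2*w) and return every visited w."""
    path = [w0]
    for _ in range(steps):
        path.append(path[-1] - lr * 2.0 * path[-1])
    return path

too_small = gradient_descent(lr=0.01)    # creeps towards the minimum at w = 0
reasonable = gradient_descent(lr=0.3)    # converges quickly
too_large = gradient_descent(lr=0.95)    # overshoots and flips sign on every step

for name, path in [("too small", too_small),
                   ("reasonable", reasonable),
                   ("too large", too_large)]:
    print(f"{name:>10}: final |w| = {abs(path[-1]):.6f}")
```

With the largest learning rate the weight jumps from one side of the minimum to the other on every step, which is the oscillation described above.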

The recently proposed LookAhead optimizer makes the optimization process less sensitive to suboptimal hyperparameters and therefore lessens the need for extensive hyperparameter tuning.

It sounds like something worth exploring!

The algorithm

Intuitively, the algorithm chooses a search direction by looking ahead at the sequence of “fast weights” generated by another optimizer.

The optimizer keeps two sets of weights: fast weights θ and slow weights ϕ . They are both initialized with the same values. A standard optimizer (e.g. Adam, SGD, …) with a certain learning rate η is used to update the fast weights θ for a defined number of steps k resulting in some new values θ’ .

Then a crucial thing happens: the slow weights ϕ are moved along the direction defined by the difference of weight vectors θ’ − ϕ. The length of this step is controlled by the parameter α, the slow weights learning rate:

ϕ’ = ϕ + α (θ’ − ϕ)

Crucial update in the LookAhead algorithm

The process is then repeated, starting by resetting the fast weights to the newly computed slow weights ϕ’. You can see the pseudocode below:

[Figure: LookAhead pseudocode]

source: https://arxiv.org/pdf/1907.08610.pdf
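The steps described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the paper's reference implementation: the function names `lookahead` and `sgd_step`, the toy loss and all default values are assumptions made for the example, and the inner optimizer is kept stateless for brevity (a stateful one such as Adam would carry its internal state across synchronizations).

```python
import numpy as np

def lookahead(grad_fn, phi0, inner_step, k=5, alpha=0.5, outer_steps=10):
    """Sketch of LookAhead: k fast steps, then one interpolation of the slow weights."""
    phi = np.asarray(phi0, dtype=float)        # slow weights
    for _ in range(outer_steps):
        theta = phi.copy()                     # fast weights restart from the slow weights
        for _ in range(k):
            theta = inner_step(theta, grad_fn(theta))   # one step of a standard optimizer
        phi = phi + alpha * (theta - phi)      # move the slow weights towards theta'
    return phi

def sgd_step(theta, grad, lr=0.1):
    """Plain gradient descent step used as the inner (fast) optimizer."""
    return theta - lr * grad

# Toy loss L(w) = ||w||^2 / 2, whose gradient is simply w (hypothetical example)
final = lookahead(lambda w: w, phi0=[1.0, -2.0], inner_step=sgd_step)
print(final)
```

Here `k` plays the role of the number of fast steps and `alpha` the slow weights learning rate from the description above.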

What’s the point of this?

To answer this question we will study a (slightly modified) picture from the LookAhead publication, but as an introduction let’s first look at another picture. If our model had only three weights, the loss function could easily be visualized as in the picture below.

Visualization of the loss function in weight space for a model that depends on only three weights. The loss is projected onto three planes (“hyperplanes”) in weight space, each obtained by holding one weight constant.

Obviously, in real-life examples we have many more than three weights, which results in a weight space of higher dimensionality. Nevertheless, we can still visualize the loss by projecting it onto a hyperplane in such a space.

That is what is presented in the LookAhead paper:

source: https://arxiv.org/pdf/1907.08610.pdf

We see a projection of the objective function (in this case it’s accuracy, but it could just as well be loss) onto a hyperplane in the weight space. Different colors correspond to different objective function values: the brighter the color, the better the value. The behavior of the LookAhead optimizer is shown in the following way: the blue dashed line represents the trajectory of the fast weights θ (with blue squares indicating ten subsequent states), while the violet line shows the direction θ’ − ϕ along which the slow weights are updated. The violet triangles indicate two subsequent slow-weight values ϕ and ϕ’. The distance between the triangles is determined by the slow weights learning rate α.

We can see that the standard optimizer (in this case SGD) traverses a sub-optimal green region, whereas the second slow-weight state is already much closer to the optimum. The paper describes it more elegantly:

When oscillating in the high curvature directions, the fast weights updates make rapid progress along the low curvature directions. The slow weights help smooth out the oscillations through the parameter interpolation. The combination of fast weights and slow weights improves learning in high curvature directions, reduces variance, and enables Lookahead to converge rapidly in practice.
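The damping of oscillations can be illustrated on a toy quadratic with one steep and one flat direction. The snippet below is a minimal sketch under assumed values (curvatures of 100 and 1, a learning rate of 0.019, k = 5 and α = 0.5); none of these numbers come from the paper.

```python
import numpy as np

curvature = np.array([100.0, 1.0])       # steep direction, flat direction
grad = lambda w: curvature * w           # gradient of 0.5 * sum(curvature * w**2)
lr, k, alpha = 0.019, 5, 0.5             # illustrative values, not taken from the paper

# Plain gradient descent for 20 steps
w_gd = np.array([1.0, 1.0])
for _ in range(20):
    w_gd = w_gd - lr * grad(w_gd)

# LookAhead around the same gradient descent: 4 rounds of k = 5 fast steps
phi = np.array([1.0, 1.0])
for _ in range(4):
    theta = phi.copy()
    for _ in range(k):
        theta = theta - lr * grad(theta)
    phi = phi + alpha * (theta - phi)

print("steep direction  GD:", abs(w_gd[0]), " LookAhead:", abs(phi[0]))
print("flat direction   GD:", abs(w_gd[1]), " LookAhead:", abs(phi[1]))
```

In the steep direction plain gradient descent overshoots on every step and shrinks the error only slowly, while the interpolated slow weights end up much closer to the minimum after the same number of gradient evaluations. In this particular toy setting the flat direction progresses somewhat more slowly with α = 0.5, so the sketch only illustrates the smoothing effect and is not a general performance claim.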

How to use it in Keras?

Now to the practical side of the method: so far there is only an unofficial Keras implementation, which can easily be wrapped around your current optimizer.

Apart from the optimizer itself, Lookahead expects two arguments:

  • sync_period, which corresponds to the previously introduced k, the number of steps after which the two sets of weights are synchronized,
  • slow_step, which corresponds to α, the learning rate of the slow weights.

To check that it works as expected you can set slow_step to 1 and compare the behavior of Lookahead with that of a regular optimizer.

For α = 1 the LookAhead update step reduces to:

ϕ’ = ϕ + 1 · (θ’ − ϕ) = θ’

which means that LookAhead reduces to its underlying standard optimizer. We can also see this in the modified weight trajectory picture:

Adapted from: https://arxiv.org/pdf/1907.08610.pdf

Now the end state of the slow weights is the same as the end state of the fast weights.

You can test it using the following code:

Test to prove that the LookAhead with Adam and slow learning rate of 1 is equivalent to pure Adam.
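A framework-free version of such a check might look as follows. This is a sketch under stated assumptions: it uses a hand-written NumPy Adam instead of the Keras optimizers, the helper `train` and the toy loss (w − 3)² are hypothetical, and only the argument names sync_period and slow_step mirror the ones described above.

```python
import numpy as np

class Adam:
    """Minimal NumPy Adam; its moment estimates persist across calls."""
    def __init__(self, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, b1, b2, eps
        self.m, self.v, self.t = 0.0, 0.0, 0

    def step(self, w, g):
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * g
        self.v = self.b2 * self.v + (1 - self.b2) * g * g
        m_hat = self.m / (1 - self.b1 ** self.t)   # bias-corrected first moment
        v_hat = self.v / (1 - self.b2 ** self.t)   # bias-corrected second moment
        return w - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)

def train(w0, grad_fn, steps, sync_period=None, slow_step=None):
    """Run Adam, optionally wrapped in LookAhead (hypothetical helper)."""
    opt = Adam()
    fast = np.asarray(w0, dtype=float)
    slow = fast.copy()
    for i in range(1, steps + 1):
        fast = opt.step(fast, grad_fn(fast))
        if sync_period and i % sync_period == 0:
            slow = slow + slow_step * (fast - slow)   # slow weights update
            fast = slow.copy()                        # reset the fast weights
    return fast

grad_fn = lambda w: 2.0 * (w - 3.0)   # gradient of the toy loss (w - 3)^2
w0 = [0.0, 10.0]
plain = train(w0, grad_fn, steps=30)
wrapped = train(w0, grad_fn, steps=30, sync_period=5, slow_step=1.0)
damped = train(w0, grad_fn, steps=30, sync_period=5, slow_step=0.5)
print(np.allclose(plain, wrapped), np.allclose(plain, damped))
```

The equivalence relies on the inner optimizer keeping its internal state (here Adam's moment estimates) across synchronizations; with slow_step below 1 the trajectories diverge, as the third run shows.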

Final word

LookAhead is an effective optimization algorithm which at a negligible computational cost makes the process of finding the minimum of a loss function more stable. What is more, less hyperparameter tuning is required.

It is said to be particularly effective when combined with the Rectified Adam optimizer. I will cover this topic in my next article.

