
Thoughts from TTIC31230: Hyperparameter Conjugacy


I have just finished teaching TTIC31230: Fundamentals of Deep Learning.  This is the first of a series of posts on thoughts arising from teaching this class.

While this post is about hyperparameter search, I want to mention in passing some issues that do not seem to rise to the level of a full post.

Teaching backprop. When we teach backprop we should stop talking about "computational graphs" and talk instead about programs defined by a sequence of in-line assignments.  I present backprop as an algorithm that runs on in-line code and I give a loop invariant for the backward loop over the assignment statements.
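Here is a minimal sketch of that view on a toy straight-line program (the representation of statements and the helper names are mine for illustration; this is not the EDF code):

```python
import numpy as np

# A "program" is a sequence of in-line assignments y = f(x1, ..., xk).
# Each statement records its output name, input names, the function,
# and a function giving the gradients with respect to the inputs.
program = [
    # u = w * x
    ("u", ("w", "x"), lambda w, x: w * x, lambda g, w, x: (g * x, g * w)),
    # v = u + b
    ("v", ("u", "b"), lambda u, b: u + b, lambda g, u, b: (g, g)),
    # loss = v ** 2
    ("loss", ("v",), lambda v: v ** 2, lambda g, v: (2 * g * v,)),
]

def forward(env):
    for out, ins, f, _ in program:
        env[out] = f(*(env[name] for name in ins))
    return env

def backward(env):
    grad = {name: 0.0 for name in env}
    grad["loss"] = 1.0
    # Loop invariant for the backward loop: when we reach an assignment,
    # grad[out] already equals d loss / d out, because every later statement
    # that reads out has already added its contribution.
    for out, ins, _, df in reversed(program):
        gs = df(grad[out], *(env[name] for name in ins))
        for name, g in zip(ins, gs):
            grad[name] += g
    return grad

env = forward({"w": 2.0, "x": 3.0, "b": 1.0})
print(backward(env))  # includes d loss/d w = 42.0, d loss/d x = 28.0, d loss/d b = 14.0
```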

Teaching Frameworks. My class provides a 150-line implementation of a framework (the educational framework EDF) written in Python/NumPy.  The idea is to teach what a framework is at a conceptual level rather than teaching the details of any actual industrial-strength framework.  There are no machine problems in my class — the class is an "algorithms course" in contrast to a "programming course".

I also have a complaint about the way that PyTorch handles parameters. In EDF, modules take "parameter packages" (Python objects) as arguments.  This simplifies parameter sharing and maintains object structure over the parameters rather than reducing them to lists of tensors.
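A tiny sketch of the parameter-package style (the class and attribute names here are illustrative, not the actual EDF API):

```python
import numpy as np

class LinearParams:
    """A parameter package: a plain Python object holding named tensors."""
    def __init__(self, n_in, n_out):
        self.W = np.random.randn(n_in, n_out) / np.sqrt(n_in)
        self.b = np.zeros(n_out)

class Linear:
    """A module takes a parameter package as an argument."""
    def __init__(self, params):
        self.params = params
    def __call__(self, x):
        return x @ self.params.W + self.params.b

shared = LinearParams(4, 4)
layer1 = Linear(shared)   # parameter sharing is just passing the same
layer2 = Linear(shared)   # package object to two modules
y = layer2(layer1(np.ones((2, 4))))
```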

Einstein Notation. When we teach deep learning we should be using Einstein notation.  This means that we write out all the indices.  This goes very well with "loop notation".  For example we can apply a convolution filter $W[\Delta x, \Delta y, i, j]$ to a layer $L_{\ell}[x, y, i]$ using the following program.

for $x$, $y$, $j$:    $L'_{\ell+1}[x,y,j] = 0$

for $x$, $y$, $\Delta x$, $\Delta y$, $i$, $j$:    $L'_{\ell+1}[x,y,j] \mathrel{+}= W[\Delta x, \Delta y, i, j]\, L_{\ell}[x{+}\Delta x,\, y{+}\Delta y,\, i]$

for $x$, $y$, $j$:    $L_{\ell+1}[x,y,j] = \sigma(L'_{\ell+1}[x,y,j])$

Of course we can also draw pictures. The initialization to zero can be left implicit — all data tensors other than the input are implicitly initialized to zero.  The body of the "tensor contraction loop" is just a product of two scalars. The back propagation on a product of two scalars is trivial.  To back-propagate to the filter we have:

for $x$, $y$, $\Delta x$, $\Delta y$, $i$, $j$:    $W.\mathrm{grad}[\Delta x,\Delta y,i,j] \mathrel{+}= L'_{\ell+1}.\mathrm{grad}[x,y,j]\; L_{\ell}[x{+}\Delta x,\, y{+}\Delta y,\, i]$

Since essentially all deep learning models consist of tensor contractions and scalar nonlinearities, we do not have to discuss Jacobian matrices.  Also, in Einstein notation we have mnemonic variable names such as $x$, $y$, $i$ and $j$ for tensor indices which, for me, greatly clarifies the notation.  Yet another point is that we can easily insert a batch index into the data tensors when explaining minibatching.

Of course NumPy is most effective when using the index-free tensor operations.  The students see that style in the EDF code.  However, when explaining models conceptually I greatly prefer “Einstein loop notation”.
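For concreteness, here is a direct Python/NumPy transcription of the loop notation above (an expository, deliberately slow version rather than the index-free style; the shapes, the valid-convolution boundaries, and the choice of ReLU for $\sigma$ are my own assumptions):

```python
import numpy as np

def sigma(z):
    # scalar nonlinearity applied elementwise; ReLU chosen for illustration
    return np.maximum(z, 0.0)

def conv_forward(L_in, W):
    # L_in[x, y, i]: input layer;  W[dx, dy, i, j]: convolution filter.
    X, Y, I = L_in.shape
    DX, DY, _, J = W.shape
    L_pre = np.zeros((X - DX + 1, Y - DY + 1, J))   # the implicit zero init
    # tensor contraction loop: the body is a product of two scalars
    for x in range(X - DX + 1):
        for y in range(Y - DY + 1):
            for dx in range(DX):
                for dy in range(DY):
                    for i in range(I):
                        for j in range(J):
                            L_pre[x, y, j] += W[dx, dy, i, j] * L_in[x + dx, y + dy, i]
    return sigma(L_pre)

def conv_backward_filter(L_in, W, L_pre_grad):
    # L_pre_grad[x, y, j] is the loss gradient at the pre-nonlinearity tensor
    X, Y, I = L_in.shape
    DX, DY, _, J = W.shape
    W_grad = np.zeros_like(W)
    for x in range(X - DX + 1):
        for y in range(Y - DY + 1):
            for dx in range(DX):
                for dy in range(DY):
                    for i in range(I):
                        for j in range(J):
                            W_grad[dx, dy, i, j] += L_pre_grad[x, y, j] * L_in[x + dx, y + dy, i]
    return W_grad
```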

Hyperparameter Conjugacy. We now come to the real topic of this post. A lot has been written about hyperparameter search.  I believe that hyperparameter search can be greatly facilitated by simple reparameterizations that greatly improve hyperparameter conjugacy.  Conjugacy means that changing the value of a parameter does not change (or only minimally influences) the optimal value of another parameter.  More formally, for a loss $\mathcal{L}(x, y)$ viewed as a function of two hyperparameters $x$ and $y$ we would like to have

$$\operatorname*{argmin}_x\; \mathcal{L}(x, y) \quad \text{(approximately) independent of } y.$$

Perhaps the simplest example is the relationship between batch size and learning rate.  It seems well established now that the learning rate should be scaled up linearly in batch size as one moves to larger batches.  See Don't Decay the Learning Rate, Increase the Batch Size by Smith et al.  But note that if we redefine the learning rate parameter to be the learning rate appropriate for a batch size of 1, and change the minibatching convention to update the model by the sum of gradients rather than the average gradient, then the learning rate and the batch size become conjugate — we can optimize the learning rate on a small machine and leave it the same when we move to a bigger machine that allows a bigger batch. We can also think of this as giving the learning rate a semantics independent of the batch size.  A very simple argument for this particular conjugacy is given in my SGD slides.
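A minimal sketch of the two minibatching conventions (the function names are illustrative; `grads` is assumed to hold the $B$ per-example gradients evaluated at the current parameters):

```python
import numpy as np

def step_mean(phi, grads, eta_B):
    # conventional: update by the average gradient; eta_B must be retuned
    # (scaled roughly linearly) when the batch size B changes
    return phi - eta_B * grads.mean(axis=0)

def step_sum(phi, grads, eta):
    # reparameterized: eta has a batch-size-1 semantics and the update is
    # the sum of the gradients in the batch
    return phi - eta * grads.sum(axis=0)
```

With the sum convention, one batch of $B$ examples produces the same total update as $B$ batch-size-1 steps taken at the same parameters, which is exactly the linear scaling rule expressed as conjugacy.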

The most serious loss of conjugacy is the standard parameterization of momentum. The standard parameterization strongly couples the momentum parameter with the learning rate. For most frameworks we have

$$v_{t+1} = \mu\, v_t + \eta\, \hat{g}_t \qquad\qquad \Phi_{t+1} = \Phi_t - v_{t+1}$$

where $\mu$ is the momentum parameter, $\eta$ is the learning rate, $\hat{g}_t$ is the gradient of a single minibatch, and $\Phi$ is the system of model parameters.  This can be rewritten as

$$v_{t+1} = \mu\, v_t + (1-\mu)\left(\frac{\eta}{1-\mu}\,\hat{g}_t\right) \qquad\qquad \Phi_{t+1} = \Phi_t - v_{t+1}$$

A recurrence of the form $y_{t+1} = \mu\, y_t + (1-\mu)\, x_t$ yields that $y$ is a running average of $x$.  The running average of $x$ is linear in $x$, so the above formulation of momentum can be rewritten as

$$\Phi_{t+1} \;=\; \Phi_t \;-\; \left(\text{running average of}\ \ \frac{\eta}{1-\mu}\,\hat{g}\right)_{t+1}$$

Now we begin to see the problem.  It can be shown that each individual gradient makes a total contribution to $\Phi$ of size $\eta/(1-\mu)$.  If the parameter vector $\Phi$ remains relatively constant on the time scale of $1/(1-\mu)$ updates (where $1/(1-\mu)$ is typically 10) then all we have done by adding momentum is to change the learning rate from $\eta$ to $\eta/(1-\mu)$.  PyTorch starts from a superficially different but actually equivalent definition of the momentum and suffers from the same coupling of the momentum term with the learning rate. But a trivial patch is to use the following more conjugate formulation of momentum.

$$v_{t+1} = \mu\, v_t + (1-\mu)\,\hat{g}_t \qquad\qquad \Phi_{t+1} = \Phi_t - \eta\, v_{t+1}$$
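A toy sketch of the two update rules (not any framework's optimizer code; the function names are mine): summing the geometric series $1 + \mu + \mu^2 + \cdots$ shows that a single gradient's total contribution to $\Phi$ is $\eta/(1-\mu)$ under the first rule but exactly $\eta$ under the second, which is what decouples the momentum parameter from the learning rate.

```python
def standard_momentum_step(phi, v, g, eta, mu):
    # standard parameterization: each gradient g eventually contributes
    # eta * (1 + mu + mu**2 + ...) = eta / (1 - mu) to phi
    v = mu * v + eta * g
    return phi - v, v

def conjugate_momentum_step(phi, v, g, eta, mu):
    # reparameterized momentum: v is a running average of the gradients,
    # so each gradient's total contribution to phi is eta, independent of mu
    v = mu * v + (1.0 - mu) * g
    return phi - eta * v, v
```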

The slides summarize a fully conjugate (or approximately so) parameterization of the learning rate, batch size and momentum parameters.

There appears to be no simple reparameterization of the adaptive algorithms RMSProp and Adam that is conjugate to batch size.  The problem is that the gradient variances are computed from batch gradients, and variance estimates computed from large batches, even if "corrected", contain less information than variances estimated at batch size 1.  A patch would be to keep track of $\hat{g}^2$ for each individual gradient within a batch.  But this involves a modification to the backpropagation calculation. This is all discussed in more detail in the slides.
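A small synthetic illustration of why the batch-level statistics differ (made-up Gaussian "gradients", not any real model): for per-example gradients with mean $m$ and variance $\sigma^2$, the second moment visible from batch-mean gradients at batch size $B$ is roughly $m^2 + \sigma^2/B$ rather than the batch-size-1 second moment $m^2 + \sigma^2$, so the estimate the adaptive method sees changes with the batch size.

```python
import numpy as np

rng = np.random.default_rng(0)
B = 64
# synthetic per-example gradients for a single scalar parameter
g = rng.normal(loc=0.1, scale=1.0, size=(10_000, B))

per_example_second_moment = np.mean(g ** 2)              # what a batch-size-1 run sees (~1.01)
batch_mean_second_moment = np.mean(g.mean(axis=1) ** 2)  # what is seen at batch size B (~0.026)

print(per_example_second_moment, batch_mean_second_moment)
```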

The next post will be on Langevin dynamics.
