# Norms, Penalties, and Multitask Learning

Source: https://www.tuicool.com/articles/FvQzi2z


### Introduction

A regularizer is commonly used in machine learning to constrain a model’s capacity to certain bounds, based either on a statistical norm or on prior hypotheses. This adds a preference for one solution over another in the model’s hypothesis space, the set of functions that the learning algorithm is allowed to select as the solution [1]. The primary aim of this method is to improve the generalizability of a model, i.e., its performance on previously unseen data. Using a regularizer improves generalizability because it reduces *overfitting* of the model to the training data.

The most common practice is to add a norm penalty to the objective function during the learning process. The regularized objective function is:

J̃(θ; X, y) = J(θ; X, y) + αΩ(θ)

The original objective function, *J*, is a function of the parameters θ, the true labels y, and the input X. The regularizer consists of the penalty norm function Ω and a weight α that controls the contribution of Ω. The next section provides an introduction to some commonly used penalty norms.
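As a minimal sketch of this structure, the code below assumes a squared-error loss for *J* and a squared L² penalty for Ω; these are illustrative choices, and any loss/penalty pair fits the same template:

```python
import numpy as np

def regularized_objective(theta, X, y, alpha):
    """Compute J(theta; X, y) + alpha * Omega(theta).

    J is mean squared error and Omega is the squared L2 norm here;
    both are stand-ins for whatever loss and penalty a model uses.
    """
    residual = X @ theta - y          # predictions minus true labels
    J = np.mean(residual ** 2)        # original objective J
    Omega = np.sum(theta ** 2)        # penalty norm Omega(theta)
    return J + alpha * Omega

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0])
theta = np.array([1.0, 2.0])
print(regularized_objective(theta, X, y, alpha=0.1))  # loss 0.0 + 0.1 * 5.0 = 0.5
```

Note that with a perfect fit (zero loss), the objective is driven entirely by the penalty term, which is exactly the pressure that pushes the optimizer toward smaller weights.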

### Commonly used Statistical Norms

Norms are a method of measuring the *length* or *magnitude* of vectors. A vector norm summarizes the distance of the vector from the origin. The most common of these measures are the L¹ norm and the L² norm.

The **L¹ norm** is calculated as the *sum of absolute values* and is often referred to as the **Manhattan norm**: ||**x**||₁ = |x₁| + |x₂| + |x₃| + … + |xₙ|, where |●| is the absolute value of a given variable. While this is the vector norm, the calculation changes slightly for matrices. The entrywise matrix L¹ norm, for example, is ||A|| = |a₁₁| + |a₁₂| + … + |aᵢⱼ|.
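A quick sketch of both calculations with NumPy (the matrix case below is the entrywise sum described above, which differs from `np.linalg.norm(A, 1)`, the induced norm that takes the maximum column sum):

```python
import numpy as np

# Vector L1 norm: sum of absolute values
x = np.array([3.0, -4.0, 1.0])
l1_vector = np.sum(np.abs(x))        # |3| + |-4| + |1| = 8

# Entrywise matrix L1 norm: sum of absolute values of all entries
A = np.array([[1.0, -2.0],
              [3.0, -4.0]])
l1_entrywise = np.sum(np.abs(A))     # 1 + 2 + 3 + 4 = 10

print(l1_vector, l1_entrywise)       # 8.0 10.0
```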

The **L² norm** is also commonly referred to as the **Euclidean norm**. This norm measures the distance from the origin to the point **x**: ||**x**||₂ = √(x₁² + x₂² + … + xₙ²).

The **L∞ norm**, or **max norm**, measures the largest absolute component of the vector as its length: ||**x**||∞ = max(|x₁|, |x₂|, |x₃|, …, |xₙ|).

These norms and their variations are special cases of the **Lᵖ** norm, or p-norm. The **p-norm** is defined as:

||**x**||ₚ = (|x₁|ᵖ + |x₂|ᵖ + … + |xₙ|ᵖ)^(1/p)

When *p* = 1, we get the L¹ norm, and when *p* = 2, we get the L² norm. As *p* approaches infinity, we get the L∞ norm.
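The limiting behavior is easy to check numerically; the sketch below implements the p-norm directly and shows that a large *p* already lands close to the max norm:

```python
import numpy as np

def p_norm(x, p):
    """||x||_p = (sum_i |x_i|^p)^(1/p)."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

x = np.array([3.0, -4.0])
print(p_norm(x, 1))           # L1 norm: 7.0
print(p_norm(x, 2))           # L2 norm: 5.0
print(p_norm(x, 50))          # large p: close to 4.0
print(np.max(np.abs(x)))      # L-infinity norm: 4.0
```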

#### How Norms are Used in Regularization

Weight decay is a method that expresses a preference for smaller weights by penalizing their L² norm, driving the weights closer to the origin (see Figure 1). The result is that the learning rule multiplicatively shrinks the weights by a constant factor at each step before performing a gradient update [1]. In other words, it constrains the weights to lie in a region bounded by the L² norm.
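The multiplicative shrinkage is visible directly in the update rule. The sketch below isolates it by using a zero gradient, so the only change to the weights is the decay factor (learning rate and decay strength are illustrative values):

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad, lr, alpha):
    """One gradient step with L2 weight decay: shrink the weights
    multiplicatively by (1 - lr * alpha), then apply the gradient
    of the original objective."""
    w = (1.0 - lr * alpha) * w   # multiplicative shrinkage toward the origin
    return w - lr * grad

w = np.array([1.0, -2.0])
grad = np.zeros(2)               # zero gradient isolates the decay effect
w = sgd_step_with_weight_decay(w, grad, lr=0.1, alpha=0.5)
print(w)                         # each weight shrunk by 0.95 -> [0.95, -1.9]
```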

Different choices of the norm used for Ω result in different preferred solutions (see Figure 2). One common difference between the behavior of the L¹ and L² penalty norms is that L¹ yields sparser solutions, meaning some parameters’ optimal value is exactly 0. This is commonly exploited for feature selection, in which features whose parameters are optimally 0 are removed.
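One way to see where the exact zeros come from is the soft-thresholding operator, the proximal operator of the L¹ penalty used by solvers such as coordinate descent for the lasso (a sketch, not any particular library's implementation):

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of t * ||w||_1: shrinks each weight toward
    zero and sets any weight with |w_i| <= t exactly to zero -- the
    mechanism behind the sparsity of L1-regularized solutions."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

w = np.array([0.8, -0.05, 0.3, -0.02])
sparse_w = soft_threshold(w, 0.1)
print(sparse_w)   # small weights become exactly 0: [0.7, 0.0, 0.2, 0.0]
```

An L² penalty, by contrast, only rescales the weights and never zeroes them out exactly.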

Multitask learning is a learning problem in which several similar tasks are learned simultaneously. For example, the tasks could be the different classes in a multi-class learning problem. For each task, a different set of parameters is learned. The idea is that the tasks share information from which they can all benefit. In other words, “among the factors that explain the variations observed in the data associated with the different tasks, some are shared across two or more tasks” [1]. The goal of this methodology is to improve generalizability overall.

It is common to use prior knowledge on how the tasks relate to each other to constrain the different weight vectors for each task (again see Figure 2). These constraints can be the same as mentioned above, e.g. L¹ norm. This is commonly done by applying the norm over columns of the matrix.

An example is a combination of the L¹ and L² norms in which an L² norm is applied to each column and an L¹ norm is applied over the columns:

Ω(W) = Σⱼ ||wⱼ||₂

where wⱼ is the j-th column of the parameter matrix W, i.e., the parameters of task j.
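This mixed penalty is a one-liner over the parameter matrix. A minimal sketch, assuming each column of `W` holds one task's weight vector:

```python
import numpy as np

def mixed_l1_l2(W):
    """Omega(W) = sum_j ||W[:, j]||_2: an L2 norm within each column
    combined with an L1 norm (a plain sum) across columns."""
    column_l2 = np.sqrt(np.sum(W ** 2, axis=0))  # L2 norm of each column
    return np.sum(column_l2)                     # L1 norm over the columns

W = np.array([[3.0, 0.0],
              [4.0, 2.0]])
print(mixed_l1_l2(W))   # ||[3,4]||_2 + ||[0,2]||_2 = 5 + 2 = 7
```

Because the L¹ part acts on whole columns rather than individual entries, this penalty encourages entire columns (tasks' shared features) to be zeroed out together.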

### Conclusion

There are other ways to regularize that do not involve statistical norms, such as adding noise, early stopping of the learning algorithm, and data augmentation. However, this article has focused on the use of statistical norms to constrain the learning algorithm as a means to improve the generalizability of a model.

### References

[1] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. *Deep Learning*. MIT Press, 2016.
