
Understanding Deep Learning Requires Rethinking Generalization

source link: https://mc.ai/understanding-deep-learning-requires-rethinking-generalization/

The results are quite interesting: the model perfectly fits the noisy Gaussian samples, and it also perfectly fits the training data with completely random labels, although that takes somewhat longer. This shows that a deep neural network with enough parameters can completely memorize random inputs. The result is counter-intuitive: the widely accepted view is that deep learning discovers a hierarchy of lower-, middle-, and higher-level features, but if a model can memorize arbitrary random inputs, what guarantees that it will learn constructive features instead of simply memorizing the training data?
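
As a concrete illustration of this randomization test, here is a toy sketch (my own minimal PyTorch setup, not the authors' original experiments): a small over-parameterized MLP is trained on pure Gaussian noise with randomly assigned labels, and its training accuracy still climbs toward 100%.

```python
# Toy sketch of the randomization test (not the authors' original setup):
# an over-parameterized MLP trained on Gaussian noise with random labels
# still drives its training accuracy toward 100%, i.e. it memorizes the data.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, num_classes = 1000, 64, 10
X = torch.randn(n, d)                       # pure noise "inputs"
y = torch.randint(0, num_classes, (n,))     # completely random labels

model = nn.Sequential(
    nn.Linear(d, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, num_classes),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(1, 2001):
    opt.zero_grad()
    logits = model(X)
    loss = loss_fn(logits, y)
    loss.backward()
    opt.step()
    if epoch % 500 == 0:
        acc = (logits.argmax(dim=1) == y).float().mean().item()
        print(f"epoch {epoch}: loss={loss.item():.4f} train_acc={acc:.2f}")
```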

Results of Regularization Tests

The first diagram shows the effect of different explicit regularizers on training and testing accuracy. The key takeaway is that there is no dramatic difference in generalization performance between training with and without explicit regularization.

The second diagram shows the effect of batch normalization (an implicit regularizer) on training and testing accuracy. Training with batch normalization is noticeably smoother, but it does not improve test accuracy.

From the experiments, the authors conclude that both explicit and implicit regularizers can help improve generalization performance. However, it is unlikely that regularizers are the fundamental reason for generalization.
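
For reference, the regularizers being compared here are ordinary training knobs. The sketch below is an illustrative PyTorch configuration (my own, not the paper's code) showing where weight decay, dropout, and batch normalization would typically be switched on or off for such a comparison.

```python
# Illustrative sketch (not the paper's code): the "knobs" compared in the
# regularization experiments are typically just hyperparameters like these.
import torch
import torch.nn as nn

def make_model(d_in=64, num_classes=10, use_batchnorm=False, dropout_p=0.0):
    layers = [nn.Linear(d_in, 512)]
    if use_batchnorm:
        layers.append(nn.BatchNorm1d(512))     # implicit regularizer
    layers.append(nn.ReLU())
    if dropout_p > 0:
        layers.append(nn.Dropout(dropout_p))   # explicit regularizer
    layers.append(nn.Linear(512, num_classes))
    return nn.Sequential(*layers)

# Explicit regularization via weight decay is just an optimizer argument;
# setting weight_decay=0.0 turns it off for the "no regularization" run.
model = make_model(use_batchnorm=True, dropout_p=0.5)
opt = torch.optim.SGD(model.parameters(), lr=0.1,
                      momentum=0.9, weight_decay=5e-4)
```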

FINITE-SAMPLE EXPRESSIVITY

The authors also proved the following theorem:

There exists a two-layer neural network with ReLU activations and 2n+d weights that can represent any function on a sample of size n in d dimensions.

which is essentially a finite-sample counterpart of the Universal Approximation Theorem. The proof is fairly involved; if interested, refer to Section C in the appendix of the paper.
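
To give a flavor of why so few weights suffice, here is a simplified numerical sketch in the spirit of the construction (my own reading, not the paper's exact proof): all n hidden ReLU units share one projection vector a (d weights), each unit has its own bias (n weights), and the output layer has n weights, for about 2n + d parameters in total. Choosing the biases to interleave the sorted projections turns the fitting problem into a triangular linear system.

```python
# Sketch of a finite-sample expressivity construction (simplified reading of
# the idea, not the paper's proof): a width-n ReLU layer with a shared
# projection vector a, n biases b, and n output weights w (about 2n + d
# parameters) can fit any n real-valued labels exactly.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5
X = rng.standard_normal((n, d))            # n arbitrary points in d dimensions
y = rng.standard_normal(n)                 # arbitrary real-valued targets

a = rng.standard_normal(d)                 # random projection direction
order = np.argsort(X @ a)
X, y = X[order], y[order]                  # sort so the projections increase
z = X @ a

# Biases interleave the sorted projections: b_1 < z_1 < b_2 < z_2 < ...
b = np.empty(n)
b[0] = z[0] - 1.0
b[1:] = (z[:-1] + z[1:]) / 2.0

# Hidden activations H[i, j] = ReLU(a.x_i - b_j) form a lower-triangular
# matrix with a positive diagonal, so the output weights solve H w = y.
H = np.maximum(z[:, None] - b[None, :], 0.0)
w = np.linalg.solve(H, y)

# Forward pass of the two-layer ReLU network on the sample:
pred = np.maximum(X @ a[:, None] - b[None, :], 0.0) @ w
print("max interpolation error:", np.abs(pred - y).max())  # ~ machine precision
```

With a continuous random projection, the projected samples are distinct with probability one, so the triangular system is invertible and the interpolation is exact.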

IMPLICIT REGULARIZATION: AN APPEAL TO LINEAR MODELS

In the final section, the authors argue that SGD-based learning itself imparts a regularization effect: for linear models, SGD converges to the solution with minimum L2 norm. However, their experiments also show that a minimum-norm solution does not necessarily generalize better.
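
The linear-model intuition is easy to reproduce: in an over-parameterized least-squares problem, gradient descent started at zero never leaves the row span of the data and therefore converges to the minimum-L2-norm interpolating solution. The sketch below uses plain full-batch gradient descent on synthetic data (a simplification of the SGD argument) and checks the result against the pseudoinverse solution.

```python
# Sketch of the linear-model argument: for an underdetermined least-squares
# problem, gradient descent started at zero stays in the row span of the data
# and converges to the minimum-L2-norm interpolating solution, i.e. the same
# one the pseudoinverse gives.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                       # more parameters than samples
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w = np.zeros(d)                      # initialization in the row span of X
lr = 0.01
for _ in range(5000):
    grad = X.T @ (X @ w - y) / n     # gradient of 0.5 * mean squared error
    w -= lr * grad

w_min_norm = np.linalg.pinv(X) @ y   # minimum-norm interpolating solution
print("distance to min-norm solution:", np.linalg.norm(w - w_min_norm))
print("norm of GD solution:", np.linalg.norm(w))
```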

Final Conclusion

  • The effective capacity of several successful neural network architectures is large enough to shatter the training data.
  • Traditional measures of model complexity are not sufficient for deep neural networks.
  • Optimization continues to be easy even when generalization is poor.
  • SGD may be performing implicit regularization by converging to solutions with minimum L2-norm.

A subsequent paper, “A Closer Look at Memorization in Deep Networks”, has challenged some of the views put forward in this paper. It convincingly demonstrates qualitative differences between learning random noise and learning actual data.

Their experiments show that a deep neural net takes significantly longer to fit random noise than to fit the actual dataset. They also show that fitting random noise results in a more complex function (more hidden units per layer are needed).

They further show that regularizers do control the speed at which DNNs memorize.

To conclude, a deep neural network first tries to discover patterns, rather than resorting to brute-force memorization, when fitting real data. However, if it does not find any patterns (as in the case of random noise), the network is capable of simply optimizing in a way that memorizes the training data. As both papers suggest, we need better tools to control the degree of generalization and memorization; regularization, batch normalization, and dropout are not perfect.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK