Endgame for “AI Winter”

A classic bedtime story: Cinderella of Neural Networks

How a competition, ImageNet, and a noisy algorithm, Stochastic Gradient Descent, changed the fate of AI

In the early 1980s, winter was coming for Artificial Intelligence (AI): a period of reduced funding and interest in AI research that would later be called the “AI Winter”!

During this cold spell, which lasted until the mid-2000s, hardly any research papers on neural nets were published, because interest in the field had evaporated. The reason was simple: no effective algorithms had been put forward that could compete with the traditional ones.

AI’s misery changed dramatically when Geoffrey Hinton, often called the godfather of Deep Learning, and his team entered a famous object recognition competition called ImageNet in 2012. “This ImageNet 2012 event was definitely what triggered the big explosion of AI today,” said Matthew Zeiler, CEO of Clarifai.

Before Hinton’s work, algorithmic object recognition had been pretty poor, with classification error rates above 25%, while the error rate of the human eye is around 5%. But because Hinton’s team implemented neural networks so beautifully, they managed to cut the error rate dramatically, by more than 40 percent! Following in their footsteps, deep learning was adopted so intensely that algorithms yielded much lower error rates than human eyes only two years later! Currently, in object and speech recognition, almost all research is based on deep learning, a subfield of AI.

I should also mention that, while moving out of the “AI Winter” and into the glorious “Great AI Awakening”, AI received substantial support from improvements in GPU (careful, not CPU) computing, thanks to NVIDIA, and from the availability of massive data sets, thanks to all of us creating more than 2.5 quintillion (2.5 × 10¹⁸) bytes of data every day.

Let’s turn our attention back to the backbone algorithm of Deep Learning. As you know, algorithms try to solve optimization problems by minimizing error functions, generally known as cost functions. Efficient algorithms finish almost instantly; others may need to run for many hours.

Machine Learning algorithms require efficient optimization techniques, whether the problem is convex or non-convex. Minimizing the error works best for convex functions, because wherever your starting point is, you will always find the global minimum! That’s why we love SVMs (Support Vector Machines) and other linear models: they have convex loss functions, so you are guaranteed to find the global minimum!
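To make the convexity point concrete, here is a minimal sketch (a hypothetical toy function, not from the article): plain gradient descent on the convex function f(w) = (w − 3)² reaches the same global minimum, w = 3, no matter where it starts.

```python
# Hypothetical toy example: gradient descent on the convex f(w) = (w - 3)^2.
# Every starting point converges to the single global minimum at w = 3.
def gradient_descent(start, lr=0.1, steps=200):
    w = start
    for _ in range(steps):
        grad = 2 * (w - 3)   # derivative of (w - 3)^2
        w -= lr * grad       # move against the gradient
    return w

for start in (-100.0, 0.0, 42.0):
    print(start, "->", round(gradient_descent(start), 4))  # all end near 3.0
```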

But in real life, optimization functions are generally uglier. They live in non-convex spaces full of local minima (“valleys”) and local maxima (“hills”). Neural networks are no exception; in fact, they are even worse!

Neural networks can have thousands of local minima, so you will almost surely get trapped in one of them and never find the global minimum! (Tell me if you disagree after peeking at a plot of even a small section of a non-convex function.) Depending on where your initialization point is, you will end up in a different local minimum (this issue is usually handled by random initialization, or perhaps by adding some Gaussian noise). The reason nobody really liked neural networks was their highly complex nature in the absence of an affordable algorithm.
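Here is a minimal sketch of that trap (again a hypothetical toy function, not from the article): the same plain gradient descent, applied to the non-convex f(w) = sin(3w) + 0.1 w², lands in whichever local minimum happens to be closest to its initialization.

```python
import math

# Hypothetical toy example: gradient descent on the non-convex
# f(w) = sin(3w) + 0.1 * w^2. Different initializations end in different valleys.
def descend(start, lr=0.01, steps=2000):
    w = start
    for _ in range(steps):
        grad = 3 * math.cos(3 * w) + 0.2 * w   # f'(w)
        w -= lr * grad
    return w

for start in (-4.0, -1.0, 2.0, 5.0):
    print(f"init {start:+.1f} -> local minimum near {descend(start):+.3f}")
```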

In the quest for a reasonable local minimum, researchers first used aggressive methods like Gradient Descent, or even faster approximate Newton or approximate Hessian methods. (As data size and dimensionality increase, these optimization methods suffer: Hessian-based approaches, which handle convex problems optimally, do not scale up, so approximations are required [Martens et al., 2012].) These methods are so good at finding a minimum that they drop into the “closest” hole almost instantly. Although this sounds perfect, it actually is not: the closest hole is unlikely to hold the ideal minimum, and you need to look somewhere further away.

Also, even if you manage to look further away, you should look for wide valleys rather than narrow, deep pits. A deep hole found on the training data may not be that deep on the test data, whereas wide valleys are likely to be there under both the training and the test sets.

Which algorithm is capable of avoiding the nearest hole and settling in a wide valley? This is exactly what SGD can do (even though it took people a long time to figure this out). Who would have thought that an algorithm once labelled inefficient and redundant would turn out to be the key optimization tool of almost all deep learning models? Fantastic!

(Picture source: https://wikidocs.net/3413)

How did this happen? How does “Stochastic” Gradient Descent (SGD) work as such an effective technique for training deep neural networks, with computational complexity linear in the size of the dataset, given its very noisy and random nature?

SGD is very noisy compared to Gradient Descent (GD): while GD uses the entire data set to compute each step (the gradient), SGD uses only a single example, a batch size of 1 chosen at random, per iteration. It therefore follows a random path, but one that ultimately heads in the right direction. This randomness means it does not dive into the nearest hole but tends to settle in large valleys, a quality that many other algorithms do not possess!
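A minimal sketch of the contrast, using hypothetical data and function names (nothing here is from the article): one GD update averages the gradient over all examples, while one SGD update uses the gradient of a single randomly drawn example, yet both estimates drift toward the same answer.

```python
import random

random.seed(0)

# Hypothetical data: y = 2*x plus a little noise; we fit y = w*x by least squares.
def make_point():
    x = random.uniform(-1.0, 1.0)
    return x, 2.0 * x + random.uniform(-0.25, 0.25)

data = [make_point() for _ in range(1000)]

def full_batch_gradient(w):
    # GD: average the gradient of (w*x - y)^2 over the *entire* dataset
    return sum(2.0 * (w * x - y) * x for x, y in data) / len(data)

def stochastic_gradient(w):
    # SGD: gradient of one randomly chosen example -- noisy but cheap, and on
    # average it still points in the right (downhill) direction
    x, y = random.choice(data)
    return 2.0 * (w * x - y) * x

w_gd = w_sgd = 0.0
lr = 0.1
for _ in range(3000):
    w_gd -= lr * full_batch_gradient(w_gd)     # touches all 1000 points per step
    w_sgd -= lr * stochastic_gradient(w_sgd)   # touches 1 point per step

print(f"GD estimate of the slope:  {w_gd:.3f}")   # ~2.0
print(f"SGD estimate of the slope: {w_sgd:.3f}")  # ~2.0, with some noise
```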

Well, in our fairy tale, before the wheel bends and the story ends, I should tell you that our main character, SGD, has two main limitations related to the learning rate (step size): if you choose it too large, the learning trajectory can lead to horrible situations; if you choose it too small, convergence takes ages. So choose the step size wisely. You can either use one of the several variants of SGD, such as Momentum, Averaging, Implicit updates (ISGD), AdaGrad, or Adam, or follow my suggestion: take large steps at first, and when the iterates start to jump back and forth, lower your learning rate, i.e. take smaller steps. That’s it!
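As one possible way to follow that suggestion, here is a minimal sketch of a step-decay schedule (the numbers are hypothetical choices, not the article’s recipe): start with a large learning rate and periodically shrink it, so early steps explore widely and later steps settle gently.

```python
# Hypothetical step-decay schedule: halve the learning rate every 10 epochs,
# e.g. 0.1, 0.05, 0.025, ...
def learning_rate(epoch, initial_lr=0.1, drop_factor=0.5, epochs_per_drop=10):
    return initial_lr * (drop_factor ** (epoch // epochs_per_drop))

for epoch in (0, 9, 10, 25, 50):
    print(f"epoch {epoch:2d}: lr = {learning_rate(epoch):.5f}")
```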

Now the bridge was mended and my story’s ended.

