
Nonlinear Activations for Neural Networks

2022-03-27

Non-linear activations are important in deep neural networks: without non-linear activation functions, stacking many linear layers is equivalent to having a single linear layer, and the approximation ability of the network is very limited1. Some of the most commonly used nonlinear activation functions are Sigmoid, ReLU and Tanh.
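
To see why stacked linear layers collapse, note that composing two weight matrices is itself a single linear map. A minimal NumPy sketch of mine (not from the original post), ignoring biases for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)         # input vector
W1 = rng.standard_normal((5, 4))   # weight of the first linear layer
W2 = rng.standard_normal((3, 5))   # weight of the second linear layer

# Two stacked linear layers with no activation in between...
y_stacked = W2 @ (W1 @ x)
# ...are equivalent to one linear layer with weight W2 @ W1.
y_single = (W2 @ W1) @ x

print(np.allclose(y_stacked, y_single))  # True
```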

Nonlinear activations and their derivatives

Sigmoid

The sigmoid function, also known as the logistic function, has the following form:

$$f(x) = \frac{1}{1 + e^{-x}}$$

The derivative of sigmoid is:

$$\frac{df}{dx} = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}}\left(1 - \frac{1}{1 + e^{-x}}\right) = f(x)\bigl(1 - f(x)\bigr)$$
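
As a quick sanity check of the closed form f(x)(1 − f(x)), here is a small NumPy sketch of mine (not from the original post) that compares it against a central finite difference:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-5, 5, 11)
eps = 1e-6
# Central finite difference as a numerical check of the analytic derivative.
numerical = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(np.allclose(sigmoid_grad(x), numerical, atol=1e-8))  # True
```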

Tanh function

$$f(x) = \frac{e^{2x} - 1}{e^{2x} + 1}$$

The derivative of Tanh is:

$$\frac{df}{dx} = \frac{4e^{2x}}{(e^{2x} + 1)^2} = 1 - f(x)^2$$
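
The identity 1 − f(x)² can be checked numerically in the same way; a minimal sketch of mine using NumPy's built-in tanh:

```python
import numpy as np

def tanh_grad(x):
    # Analytic derivative of tanh: 1 - tanh(x)^2.
    return 1.0 - np.tanh(x) ** 2

x = np.linspace(-5, 5, 11)
eps = 1e-6
numerical = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)
print(np.allclose(tanh_grad(x), numerical, atol=1e-8))  # True
```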

ReLU

ReLU, short for rectified linear unit, has the following form:

$$f(x) = \max(0, x)$$

We can also write ReLU as:

$$f(x) = \begin{cases} x & x \ge 0 \\ 0 & x < 0 \end{cases}$$

The derivative of ReLU is quite simple: it is 1 for x > 0 and 0 otherwise.
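
A minimal NumPy sketch of mine (not from the original post) of ReLU and its derivative:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # 1 for x > 0, 0 otherwise (the gradient at x = 0 is taken as 0 here,
    # which is a common convention).
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # 0, 0, 0, 0.5, 2
print(relu_grad(x))  # 0, 0, 0, 1, 1
```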

There are also variants of ReLU, such as Leaky ReLU, PReLU (parametric ReLU), and RReLU (randomized ReLU). In Empirical Evaluation of Rectified Activations in Convolutional Network, the authors claim that PReLU and RReLU work better than ReLU on small-scale datasets such as CIFAR10, CIFAR100 and Kaggle NDSB. A sketch of Leaky ReLU is shown below.
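
For reference, Leaky ReLU keeps a small slope for negative inputs instead of zeroing them out, and PReLU learns that slope as a parameter. A minimal sketch of mine (the slope 0.01 is a common default, not a value from the paper):

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    # x for x >= 0, negative_slope * x for x < 0.
    # PReLU has the same form but treats negative_slope as a learnable parameter.
    return np.where(x >= 0, x, negative_slope * x)

print(leaky_relu(np.array([-2.0, -0.5, 0.0, 1.5])))
# -0.02, -0.005, 0.0, 1.5
```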

Vanishing gradient

I show the plot of different activation functions and their derivatives in the title image.

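The code for the visualization is collapsed in the original post and not reproduced here; a minimal sketch of mine (assuming NumPy and matplotlib) that produces a similar plot:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 500)

sigmoid = 1.0 / (1.0 + np.exp(-x))
# Each entry maps an activation name to (function values, derivative values).
activations = {
    "sigmoid": (sigmoid, sigmoid * (1 - sigmoid)),
    "tanh": (np.tanh(x), 1 - np.tanh(x) ** 2),
    "ReLU": (np.maximum(0, x), (x > 0).astype(float)),
}

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
for name, (f, df) in activations.items():
    ax1.plot(x, f, label=name)
    ax2.plot(x, df, label=name)
ax1.set_title("Activation functions")
ax2.set_title("Derivatives")
ax1.legend()
ax2.legend()
plt.tight_layout()
plt.show()
```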

The derivative of sigmoid is relatively small: its largest value is only 0.25 (at x = 0), and when |x| is large, the derivative is near zero. Tanh has a similar saturation issue: its maximum gradient is only 1 (at x = 0), and the gradient also decays quickly as |x| grows.

This causes the vanishing gradient problem: to calculate the derivative of the loss w.r.t. the weights of earlier layers in the network, we need to multiply together the gradients of the later layers. When you multiply several values that are at most 0.25, the result approaches zero quickly, so the weights in earlier layers get updated slowly. In other words, the learning process converges much more slowly than with ReLU, and we might need many more epochs to get a satisfactory result.
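
To put a number on it: even the best-case sigmoid gradient of 0.25, chained across ten layers, already shrinks the upstream gradient to below one in a million. A quick check:

```python
# Best-case sigmoid gradient is 0.25; chaining it over 10 layers
# multiplies the upstream gradient by 0.25 ** 10.
print(0.25 ** 10)  # 9.5367431640625e-07
```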

Another advantage of ReLU is that it is computationally cheap compared to sigmoid, in both the forward and backward passes.

Try it yourself interactively

To gain more insight into this, we can use the MNIST demo on ConvNetJS and change the activation function to see how training goes. We can see that the training process with tanh and sigmoid activations is much slower than with ReLU, and sigmoid is the slowest of the three.

We can also play with different activation functions quickly in the TensorFlow playground.

References

  1. See this post and also this one for more detailed discussions.↩︎

Author: jdhao

LastMod: 2022-03-27

License: CC BY-NC-ND 4.0
