EfficientNet: Scaling of Convolutional Neural Networks done right

How to intelligently scale a CNN for achieving accuracy gains

Jun 16 ·7min read

Photo by Lidya Nada on Unsplash

Ever since AlexNet won the 2012 ImageNet Challenge, Convolutional Neural Networks (CNNs) have become ubiquitous in computer vision. They have even found applications in natural language processing, where state-of-the-art models use convolution operations to retain context and provide better predictions. However, one of the key issues in designing CNNs, as with all other neural networks, is model scaling, i.e., deciding how to increase the model size so as to improve accuracy.

This is a tedious process that requires manual trial and error until a sufficiently accurate model is found that satisfies the resource constraints. It consumes time and compute, and it often yields models with sub-optimal accuracy and efficiency.

To address this issue, Google released a paper in 2019 introducing a new family of CNNs, EfficientNet. These CNNs not only provide better accuracy but also improve efficiency, reducing the number of parameters and FLOPS (floating-point operations, a measure of compute cost rather than of speed) several-fold in comparison to state-of-the-art models such as GPipe. The main contributions of this paper are:

Designing a simple mobile-size baseline architecture: EfficientNet-B0
Providing an effective compound scaling method for increasing the model size to achieve maximum accuracy gains.

EfficientNet-B0 Architecture

Table 1. Architecture Details for the baseline network

The compound scaling method can be generalized to existing CNN architectures such as MobileNet and ResNet. However, choosing a good baseline network is critical for achieving the best results, since the compound scaling method only enhances the predictive capacity of a network by replicating the base network's underlying convolutional operations and structure.

To this end, the authors use Neural Architecture Search to build an efficient baseline architecture, EfficientNet-B0. It achieves 77.3% accuracy on ImageNet with only 5.3M parameters and 0.39B FLOPS. (ResNet-50 provides 76% accuracy with 26M parameters and 4.1B FLOPS.)
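To put these figures side by side, the short calculation below uses only the numbers quoted above. The dictionary layout and variable names are just for illustration.

```python
# Figures quoted in the text above (parameters in millions, FLOPS in billions).
models = {
    "EfficientNet-B0": {"acc": 77.3, "params_m": 5.3, "flops_b": 0.39},
    "ResNet-50": {"acc": 76.0, "params_m": 26.0, "flops_b": 4.1},
}

b0, r50 = models["EfficientNet-B0"], models["ResNet-50"]
param_ratio = r50["params_m"] / b0["params_m"]   # how many times fewer parameters
flops_ratio = r50["flops_b"] / b0["flops_b"]     # how many times fewer FLOPS
acc_gain = b0["acc"] - r50["acc"]                # accuracy difference in points

print(f"{param_ratio:.1f}x fewer parameters, {flops_ratio:.1f}x fewer FLOPS, "
      f"{acc_gain:+.1f} accuracy points")
```

In other words, the baseline is already roughly five times smaller and ten times cheaper than ResNet-50 before any scaling is applied.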

The main building block of this network is MBConv, to which squeeze-and-excitation optimization is added. MBConv is similar to the inverted residual block used in MobileNet v2, which forms a shortcut connection between the beginning and end of a convolutional block. The input activation maps are first expanded using 1x1 convolutions to increase the depth of the feature maps. This is followed by 3x3 depthwise convolutions and then 1x1 pointwise convolutions that reduce the number of channels in the output feature map. The shortcut connections link the narrow layers, while the wider layers sit between the skip connections. This structure helps decrease both the overall number of operations and the model size.

Figure 1. Inverted residual block

The code for this block can be summarized as:

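from keras.layers import Conv2D, DepthwiseConv2D, Add

def inverted_residual_block(x, expand=64, squeeze=16):
    # x must already have `squeeze` channels so that the shortcut shapes match
    block = Conv2D(expand, (1, 1), activation='relu')(x)  # 1x1 expansion to the wide layer
    block = DepthwiseConv2D((3, 3), padding='same', activation='relu')(block)  # 3x3 depthwise, spatial size preserved
    block = Conv2D(squeeze, (1, 1))(block)  # 1x1 linear projection back to the narrow layer, as in MobileNet v2
    return Add()([block, x])  # shortcut between the narrow layers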
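
To see why this structure keeps the model small, the sketch below counts the weights in the three convolutions of the block and compares them with a single standard 3x3 convolution operating at the expanded width. Biases are ignored, the channel sizes (16 and 64) are simply the defaults from the snippet above, and the function names are illustrative. The standard convolution is only one possible point of comparison.

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_params(k, channels):
    """Weights in a k x k depthwise convolution: one filter per channel."""
    return k * k * channels

def inverted_residual_params(narrow=16, wide=64, k=3):
    expand = conv_params(1, narrow, wide)      # 1x1 expansion
    spatial = depthwise_params(k, wide)        # 3x3 depthwise
    project = conv_params(1, wide, narrow)     # 1x1 projection
    return expand + spatial + project

block = inverted_residual_params()             # 1024 + 576 + 1024
standard = conv_params(3, 64, 64)              # one plain 3x3 conv, 64 -> 64 channels
print(block, standard)                         # 2624 36864
```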

Compound Scaling

Figure 2. Model Scaling. (a) is a baseline network example; (b)-(d) are conventional scaling that only increases one dimension of network width, depth, or resolution. (e) is our proposed compound scaling method that uniformly scales all three dimensions with a fixed ratio.

A convolutional neural network can be scaled in three dimensions: depth, width, and resolution. The depth of the network corresponds to the number of layers in the network. The width is associated with the number of neurons in a layer or, more pertinently, the number of filters in a convolutional layer. The resolution is simply the height and width of the input image. Figure 2 above gives a clearer picture of scaling across these three dimensions.

Increasing the depth, by stacking more convolutional layers, allows the network to learn more complex features. However, deeper networks tend to suffer from vanishing gradients and become difficult to train. Although techniques such as batch normalization and skip connections alleviate this problem, empirical studies suggest that the accuracy gains from increasing depth alone quickly saturate. For instance, ResNet-1000 has similar accuracy to ResNet-101 despite all the extra layers.

Scaling the width of a network allows its layers to learn more fine-grained features. This approach has been used extensively in works such as Wide ResNet and MobileNet. However, as with depth, the gains diminish: very wide but shallow networks have difficulty learning complex, higher-level features, and accuracy saturates quickly as the width grows.

Higher input resolution provides greater detail about the image and hence enhances the model's ability to reason about smaller objects and extract finer patterns. But like the other scaling dimensions, this too provides limited accuracy gains on its own.

Figure 3. Scaling Up a Baseline Model with Different Network Width (w), Depth (d), and Resolution (r) Coefficients.

This leads to an important observation:

Observation 1 : Scaling up any dimension of network width, depth, or resolution improves accuracy, but the accuracy gain diminishes for bigger models.

Figure 4. Scaling Network Width for Different Baseline Networks.

This implies that, to keep improving accuracy, a network should be scaled along a combination of the three dimensions rather than along just one. This is corroborated by the empirical evidence in Figure 4, where the network's accuracy is plotted against increasing width for various depth and resolution settings.

The results show that scaling only one dimension (width) quickly stagnates the accuracy gains. However, coupling this with an increase in the number of layers (depth) or in the input resolution enhances the model's predictive capabilities.

These observations are somewhat expected and can be explained by intuition. For instance, if the spatial resolution of the input image is increased, the number of convolutional layers should also be increased so that the receptive field is large enough to span the entire image, which now contains more pixels. This leads to the second observation:

Observation 2: In order to pursue better accuracy and efficiency, it is critical to balance all dimensions of network width, depth, and resolution during ConvNet scaling.
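
The receptive-field intuition behind Observation 2 can be made concrete with a small calculation. This is a simplified sketch: it assumes plain stacked convolutions described only by kernel size and stride, and the helper names are illustrative. Real networks use strides and pooling, so they need far fewer layers than the stride-1 figures printed here; the point is only that the required depth grows with the resolution.

```python
import math

def receptive_field(layers):
    """Receptive field of a stack of conv layers given (kernel, stride) pairs."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the view by (k - 1) input steps
        jump *= stride              # strides make later layers step further in the input
    return rf

def layers_to_cover(resolution, kernel=3):
    """Number of stride-1 k x k layers needed for the receptive field to span the image."""
    return math.ceil((resolution - 1) / (kernel - 1))

print(receptive_field([(3, 1)] * 4))   # 9
print(layers_to_cover(224))            # 112
print(layers_to_cover(448))            # 224
```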

The proposed scaling method

A convolutional neural network can be thought of as a stack, or composition, of convolutional layers. Furthermore, these layers can be partitioned into stages, e.g., ResNet has five stages, and all layers in each stage have the same convolutional type. Therefore, a CNN can be represented mathematically as:

$$\mathcal{N} = \bigodot_{i=1 \ldots s} \mathcal{F}_i^{L_i}\left(X_{\langle H_i, W_i, C_i \rangle}\right)$$

Equation 1

where N denotes the network, s is the number of stages, i is the stage index, Fᵢ is the convolution operation of the i-th stage, and Lᵢ is the number of times Fᵢ is repeated in stage i. Hᵢ, Wᵢ and Cᵢ denote the shape (height, width, channels) of the input tensor X of stage i.

As can be deduced from Equation 1, Lᵢ controls the depth of the network, Cᵢ is responsible for its width, and Hᵢ and Wᵢ determine the input resolution. Searching for a good set of coefficients to scale these dimensions separately for each layer is impractical, since the search space is huge. So, in order to restrict the search space, the authors lay down a set of ground rules:

All the layers/stages in the scaled models will use the same convolution operations as the baseline network
All layers must be scaled uniformly with constant ratio

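With these rules established, Equation 1 can be parameterized as the following optimization problem:

$$
\begin{aligned}
\max_{d, w, r} \quad & \mathrm{Accuracy}\big(\mathcal{N}(d, w, r)\big) \\
\text{s.t.} \quad & \mathcal{N}(d, w, r) = \bigodot_{i=1 \ldots s} \hat{\mathcal{F}}_i^{\,d \cdot \hat{L}_i}\left(X_{\langle r \cdot \hat{H}_i,\; r \cdot \hat{W}_i,\; w \cdot \hat{C}_i \rangle}\right) \\
& \mathrm{Memory}(\mathcal{N}) \le \text{target memory} \\
& \mathrm{FLOPS}(\mathcal{N}) \le \text{target FLOPS}
\end{aligned}
$$

Equation 2

where w, d, r are coefficients for scaling network width, depth, and resolution; F̂ᵢ, L̂ᵢ, Ĥᵢ, Ŵᵢ, Ĉᵢ are predefined parameters of the baseline network, and i runs over its s stages.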
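
The parameterization in Equation 2 can be sketched in a few lines of Python. This is a minimal illustration, not the official implementation: the `Stage` class, the rounding choices, the two-stage baseline and the example coefficients are all assumptions made here for demonstration.

```python
import math
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Stage:
    op: str        # F_i: the convolution operation, kept fixed when scaling
    layers: int    # L_i: how many times the operation is repeated
    height: int    # H_i
    width: int     # W_i
    channels: int  # C_i

def scale_network(baseline, d=1.0, w=1.0, r=1.0):
    """Apply one (d, w, r) triple uniformly to every stage, as in Equation 2."""
    return [
        replace(
            s,
            layers=math.ceil(d * s.layers),   # rounding up is an assumption
            height=round(r * s.height),
            width=round(r * s.width),
            channels=round(w * s.channels),
        )
        for s in baseline
    ]

# Hypothetical two-stage baseline, not the real EfficientNet-B0 table.
baseline = [Stage("conv3x3", 1, 224, 224, 32), Stage("mbconv3x3", 2, 112, 112, 16)]
for stage in scale_network(baseline, d=1.2, w=1.1, r=1.15):
    print(stage)
```

Both ground rules are visible in the code: the `op` field is never modified, and the same three multipliers are applied to every stage.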

The authors propose a simple yet effective scaling technique that uses a compound coefficient ɸ to uniformly scale network width, depth, and resolution in a principled way:

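$$
\begin{aligned}
\text{depth:} \quad & d = \alpha^{\phi} \\
\text{width:} \quad & w = \beta^{\phi} \\
\text{resolution:} \quad & r = \gamma^{\phi} \\
\text{s.t.} \quad & \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \qquad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1
\end{aligned}
$$

Equation 3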

ɸ is a user-specified global scaling coefficient that controls how many more resources are available for model scaling, whereas α, β, and γ determine how to assign these extra resources to network depth, width, and resolution respectively. The FLOPS of a regular convolution are proportional to d, w², and r²: doubling the depth doubles the FLOPS, while doubling the width or the resolution roughly quadruples them. So, scaling the network using Equation 3 increases the total FLOPS by approximately (α * β² * γ²)^ɸ. To make the total FLOPS grow by roughly 2^ɸ for any ɸ, the constraint (α * β² * γ²) ≈ 2 is applied. This means that if we have twice the resources available, we can simply use a compound coefficient of 1 to scale the FLOPS by 2¹.
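
The FLOPS argument can be checked numerically. The sketch below uses a deliberately simple cost model (stacked square convolutions with equal input and output channels), and the function names are illustrative. The values α = 1.2, β = 1.1, γ = 1.15 are the ones reported in the EfficientNet paper for the B0 baseline; they are not stated in this article and are included here as an assumption.

```python
def conv_flops(layers, channels, resolution, kernel=3):
    """Approximate multiply-adds of `layers` stacked k x k convolutions with
    `channels` input and output channels on a square feature map."""
    return layers * (kernel * kernel * channels * channels) * (resolution * resolution)

base = conv_flops(layers=4, channels=32, resolution=56)
print(conv_flops(8, 32, 56) / base)    # depth x2      -> 2.0
print(conv_flops(4, 64, 56) / base)    # width x2      -> 4.0
print(conv_flops(4, 32, 112) / base)   # resolution x2 -> 4.0

def compound_flops_factor(alpha, beta, gamma, phi):
    """Growth in total FLOPS when scaling with Equation 3."""
    return (alpha * beta**2 * gamma**2) ** phi

# alpha, beta, gamma as reported in the EfficientNet paper (assumption, see above).
print(round(compound_flops_factor(1.2, 1.1, 1.15, phi=1), 2))   # 1.92
```

With these coefficients each unit increase of ɸ multiplies the FLOPS by about 1.92, which is the "≈ 2" in the constraint.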

The parameters α, β, and γ can be determined using a small grid search by setting ɸ=1 and finding the values that result in the best accuracy. Once found, these parameters are fixed, and the compound coefficient ɸ is increased to obtain larger but more accurate models. This is how EfficientNet-B1 to EfficientNet-B7 are constructed, with a larger number at the end of the name indicating a larger compound coefficient.
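
A rough sketch of the two steps described above follows. The first function only enumerates grid points that satisfy the constraint; in the actual search each candidate would be used to build and train a scaled model, which is omitted here. The grid step, upper bound and tolerance are assumptions chosen for illustration, and the values 1.2, 1.1 and 1.15 are those reported in the EfficientNet paper rather than in this article.

```python
from itertools import product

def candidate_coefficients(step=0.05, upper=1.5, target=2.0, tolerance=0.1):
    """Enumerate (alpha, beta, gamma) >= 1 with alpha * beta^2 * gamma^2 close to target."""
    n = int(round((upper - 1.0) / step)) + 1
    grid = [round(1.0 + step * i, 2) for i in range(n)]
    return [
        (a, b, g)
        for a, b, g in product(grid, repeat=3)
        if abs(a * b**2 * g**2 - target) <= tolerance
    ]

def scaling_coefficients(alpha, beta, gamma, phi):
    """Depth, width and resolution multipliers for a given compound coefficient."""
    return alpha**phi, beta**phi, gamma**phi

candidates = candidate_coefficients()
print(len(candidates), (1.2, 1.1, 1.15) in candidates)

# Step 2: keep alpha, beta, gamma fixed and grow phi.
for phi in range(1, 4):
    d, w, r = scaling_coefficients(1.2, 1.1, 1.15, phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```

Restricting the expensive search to ɸ=1 on the small baseline is what makes the method practical: the larger models reuse the same three constants.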

Results

This technique allowed the authors to produce models that are more accurate than existing ConvNets while requiring far fewer FLOPS and a much smaller model size.

Table 2. Comparison of EfficientNet with existing networks for ImageNet Challenge

This scaling method is generic and can be used with other architectures to effectively scale Convolutional Neural Networks and provide better accuracy.

Table 3. Scaling Up MobileNets and ResNet.

References:

EfficientNet, ICML 2019
MobileNet v2, CVPR 2018
GPipe, NeurIPS 2019
Official released code: https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet
