
Everything you need to know about MobileNetV3 and its comparison with previous versions

Source: https://towardsdatascience.com/everything-you-need-to-know-about-mobilenetv3-and-its-comparison-with-previous-versions-a5d5e5a6eeaa?gi=3371fe85c70b

Efficient Mobile Building Blocks

MobileNetV1 introduced the depth-wise convolution to reduce the number of parameters. The second version added an expansion layer to the block, giving a three-layer system of expansion, filtering, and compression (see the figure below [1]). This system, coined the Inverted Residual Block, further improved performance.

[Figure: the MobileNetV2 inverted residual block — expansion, depth-wise filtering, compression [1]]
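To make the expansion-filtering-compression pattern concrete, here is a minimal PyTorch sketch of such a block; the expansion factor of 6 and the exact layer settings are illustrative assumptions, not the paper's precise configuration.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Sketch of the MobileNetV2-style expansion-filtering-compression block.
    Expansion factor and hyper-parameters are illustrative assumptions."""
    def __init__(self, in_ch, out_ch, stride=1, expand=6):
        super().__init__()
        hidden = in_ch * expand
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            # 1x1 expansion: widen the representation
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3x3 depth-wise filtering: one filter per channel
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 1x1 projection (compression), linear: no activation
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out
```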

The latest version adds squeeze-and-excitation layers [2] to the initial building block taken from V2; this block then undergoes further treatment (NAS and NetAdapt) later on. See the figure below for a comparison with V2.

[Figure: the MobileNetV3 building block compared with the MobileNetV2 block]

As mentioned in the figure above, the h-swish non-linearity is used; it will be discussed later in the story. The squeeze-and-excitation module helps by giving unequal weights to the different input channels when creating the output feature maps, as opposed to the equal weight a normal CNN gives them. Squeeze-and-excitation is generally added as a separate block after ResNet/Inception blocks; in this model, however, it is applied in parallel with the layers of the block. The squeeze-and-excitation path is as follows (small arrows at the bottom of the figure above):

Pool -> Dense -> ReLU -> Dense -> hard sigmoid -> scale. (The paper replaces the usual sigmoid gate with its hard, piece-wise linear analogue to keep the block cheap.)
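A minimal sketch of that path in PyTorch, assuming a reduction ratio of 4 between the two dense layers (the exact ratio here is an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SqueezeExcite(nn.Module):
    """Sketch of the squeeze-and-excitation path: pool -> dense -> ReLU ->
    dense -> hard sigmoid -> channel-wise scaling."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):
        # Squeeze: global average pool to one value per channel
        s = x.mean(dim=(2, 3))
        # Excite: two dense layers produce per-channel gates in [0, 1]
        s = F.relu(self.fc1(s))
        s = F.hardsigmoid(self.fc2(s))
        # Scale: reweight each input channel by its gate
        return x * s[:, :, None, None]
```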

Neural Architecture Search for light models

Neural Architecture Search (NAS) [3], in layman's terms, is the process of training a model (generally an RNN, also called a controller) to output a sequence of modules that can be put together to form the model with the best possible accuracy, searching among all possible combinations. The basic algorithm can be summarised using this figure [4].

[Figure: the basic NAS loop between the controller and child-network training [4]]

This is similar to the standard process by which reinforcement learning works: there is a reward function according to which the controller is updated, in pursuit of a state where the reward is maximal. In NAS terms, the model moves from the current state to a state where the reward, here accuracy, increases.

In most papers, NAS is used to find the structure of an efficient sub-module that is then stacked repeatedly to form the whole model. Here, NAS is used to optimise each block of the network, in combination with the NetAdapt algorithm (discussed later), which decides the number of filters for every layer.

Further, because we want a light model, NAS is tuned accordingly with a new reward function: ACC(m) × [LAT(m)/TAR]^w, which considers both the accuracy and the latency (total inference time) of the model. ACC is accuracy, LAT is latency, TAR is the target latency, and m is the model that resulted from the search; w is a constant. The authors observe that for the smaller models we are searching for, w needs to be -0.15 (vs. the original w = -0.07).
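As a sanity check on how the exponent shapes the trade-off, here is the reward computed directly from the formula above (the example numbers are made up):

```python
def nas_reward(acc, latency_ms, target_ms, w=-0.15):
    """Multi-objective NAS reward: ACC(m) * (LAT(m) / TAR) ** w.
    With w < 0, models slower than the target are penalised and faster
    ones rewarded; w = -0.15 is the value the authors found works better
    for small models (vs. the original -0.07)."""
    return acc * (latency_ms / target_ms) ** w

# Example: a model at 80 ms against a 100 ms target earns a small bonus
print(nas_reward(acc=0.75, latency_ms=80, target_ms=100))  # ~0.776
```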

With all these changes, NAS results in a network architecture that can be further refined layer-wise using NetAdapt.

NetAdapt for Layer-wise Search

The original NetAdapt [5] algorithm works on the number of filters of every conv layer, as shown in the algorithm below:

[Figure: the original NetAdapt algorithm [5]]

It optimises the number of filters for every conv layer under consideration: at each step it produces K candidate models, each obtained by adapting one of K conv layers, and keeps the candidate with the highest accuracy.

A diluted version of this algorithm, used in MobileNetV3, works as follows:

  1. Start with the architecture found by NAS.
  2. Generate a set of proposals; every proposal must reduce latency by at least delta compared to the model from the previous step.
  3. Initialise each proposal with the weights of the previous network, randomly initialising any new filters, and fine-tune briefly to estimate its accuracy.
  4. Select the best proposal and repeat from step 2 until the target latency is achieved, then fully fine-tune the final model.

The selection in each round is made with the metric ΔAcc/Δlatency, chosen with the intuition that it keeps both accuracy and latency in the mix and prefers proposals that maximise the slope of the accuracy-latency trade-off.
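Here is a sketch of this simplified loop in Python; every helper (proposals, short_finetune, accuracy, latency) is a hypothetical stand-in for the corresponding step, not a real API:

```python
def netadapt(seed_model, target_latency, delta,
             proposals, short_finetune, accuracy, latency):
    """Sketch of the simplified NetAdapt loop described above."""
    model = seed_model
    while latency(model) > target_latency:
        base_acc, base_lat = accuracy(model), latency(model)
        best, best_score = None, float("-inf")
        # Each proposal reduces latency by at least `delta`, with weights
        # copied from the parent and new filters randomly initialised.
        for candidate in proposals(model, delta):
            short_finetune(candidate)  # a few steps, enough to rank candidates
            d_acc = accuracy(candidate) - base_acc
            d_lat = base_lat - latency(candidate)  # positive: latency saved
            score = d_acc / d_lat  # slope of the accuracy-latency trade-off
            if score > best_score:
                best, best_score = candidate, score
        model = best
    return model  # fully fine-tuned once the target latency is reached
```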

Network Improvements

Network improvements have been made in two ways:

  1. Layer removal
  2. The h-swish non-linearity

Layer Removal

Here we talk about some changes made manually, without the aid of the search, mainly in the first few and the last few layers. The paper makes the following tweaks:

  1. In the last block, the 1x1 expansion layer taken from MobileNetV2's Inverted Residual Unit is moved past the pooling layer. The 1x1 layer therefore works on feature maps of size 1x1 instead of 7x7, making it cheaper in computation and latency.
  2. The expansion layer takes a lot of computation, but now that it sits behind a pooling layer, the compression performed by the projection layer of the previous block is no longer needed. That projection layer, along with the filtering layer of the previous bottleneck block, can therefore be removed. Both changes are illustrated in the figure below.

[Figure: the original last stage vs. the efficient last stage, with the 1x1 expansion layer moved past the pooling layer]

3. An experimental change is to use 16 filters in the initial 3x3 layer instead of 32, the default in earlier mobile models.

These changes add up to a saving of nine milliseconds of inference time.
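Here is a sketch of the rearranged last stage in PyTorch; the channel sizes (160 -> 960 -> 1280 -> 1000) follow the MobileNetV3-Large specification, but normalisation placement and other details are simplified assumptions:

```python
import torch
import torch.nn as nn

# Efficient last stage: the wide 1x1 layers now run on a 1x1 feature map
# because they sit after global pooling.
last_stage = nn.Sequential(
    nn.Conv2d(160, 960, 1, bias=False),  # expansion, still on 7x7 maps
    nn.BatchNorm2d(960),
    nn.Hardswish(),
    nn.AdaptiveAvgPool2d(1),             # 7x7 -> 1x1 before the wide layers
    nn.Conv2d(960, 1280, 1),             # moved past pooling (was before it)
    nn.Hardswish(),
    nn.Conv2d(1280, 1000, 1),            # classifier
)

x = torch.randn(1, 160, 7, 7)
print(last_stage(x).flatten(1).shape)  # torch.Size([1, 1000])
```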

Non-Linearity

The swish non-linearity is defined as:

swish(x) = x · σ(x)

where σ is the sigmoid function.

It has been shown experimentally to improve accuracy. However, the sigmoid function is computationally expensive, and computational cost matters a lot in this model, so the authors replace it with what is called hard swish, or h-swish:

h-swish(x) = x · ReLU6(x + 3) / 6

Following is the graph comparing all the non-linearities under discussion:

[Figure: sigmoid vs. hard sigmoid and swish vs. h-swish curves]

Clearly, h-swish differs little from swish as far as the curve is concerned, while being much less computationally expensive.
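Both functions are a few lines of NumPy, which makes the closeness easy to verify numerically (this quick check is a sketch, not from the paper):

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))                  # x * sigmoid(x)

def h_swish(x):
    return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0    # x * ReLU6(x + 3) / 6

# The two curves stay close over a typical activation range
x = np.linspace(-6, 6, 1000)
print(np.max(np.abs(swish(x) - h_swish(x))))  # ~0.14, peaking near x = ±3
```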

Overall Structure

This brings us to the overall structure. The paper defines two variants: MobileNetV3-Large and MobileNetV3-Small. Their structures are as follows:

[Table: layer-by-layer specification of MobileNetV3-Large]

[Table: layer-by-layer specification of MobileNetV3-Small]

The changes described in the previous section can be seen in the first few and last few layers, especially the pooling layer placed on the 7x7 input before the final 1x1 conv2d layers.
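For readers who just want to use the models, both variants ship with torchvision; the snippet below assumes a recent torchvision (0.13+ for the string weights argument):

```python
import torch
from torchvision import models

large = models.mobilenet_v3_large(weights="IMAGENET1K_V1")
small = models.mobilenet_v3_small(weights="IMAGENET1K_V1")

large.eval()
with torch.no_grad():
    logits = large(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 1000])
```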

Experiments and Results

The authors have run quite a lot of experiments, spanning different computer-vision problems, to show what this model is actually worth. We will go through them one by one.

Since this paper is by a bunch of folks from Google (tbh a lot of folks, sooo many people are authors of this paper :D), any embedded device used had to be a Pixel, obviously. They use Pixel 1/2/3 and denote them as P-n. One more thing to note: all these results were measured on a single core, not the multi-core hardware these phones have at their disposal.

Classification

[Table: ImageNet classification results — accuracy and latency on Pixel phones]

[Table: additional classification results]

The first table clearly shows that latency decreases from V2 to V3 even as accuracy increases for classification on ImageNet.

Object Detection

For the detection experiments, the authors use MobileNetV3 as the backbone in SSDLite, with the following results:

[Table: COCO object detection results with SSDLite]

It turns out MobileNetV3-Large is 27% faster than MobileNetV2 while maintaining similar mAP.
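This same pairing is available in torchvision as ssdlite320_mobilenet_v3_large; the snippet below again assumes a recent version (0.13+ for the weights argument):

```python
import torch
from torchvision import models

detector = models.detection.ssdlite320_mobilenet_v3_large(weights="DEFAULT")
detector.eval()
with torch.no_grad():
    # In eval mode the detector takes a list of images and returns a list
    # of dicts with boxes, labels, and scores.
    predictions = detector([torch.rand(3, 320, 320)])
print(predictions[0]["boxes"].shape, predictions[0]["scores"].shape)
```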

Segmentation

For semantic segmentation, the authors propose a new segmentation head derived from R-ASPP [6], named Lite R-ASPP, or LR-ASPP. It is based on the same idea of pooling as used in Squeeze-and-Excitation.

[Figure: the LR-ASPP segmentation head]

The authors report that LR-ASPP is faster than its predecessor R-ASPP, which was proposed along with MobileNetV2.

Also, MobileNetV3 backbones are slightly faster than their V2 counterparts.
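torchvision also ships the LR-ASPP head on a MobileNetV3-Large backbone; again assuming a recent version:

```python
import torch
from torchvision import models

seg = models.segmentation.lraspp_mobilenet_v3_large(weights="DEFAULT")
seg.eval()
with torch.no_grad():
    # Returns a dict; "out" holds per-pixel class logits (21 classes here)
    out = seg(torch.rand(1, 3, 520, 520))["out"]
print(out.shape)  # torch.Size([1, 21, 520, 520])
```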

Conclusion

Google achieving state-of-the-art results with Neural Architecture Search is great news for the future of deep learning and computer vision, since it is a strong motivation to use and further improve such search algorithms. Improvements like these may lead to a stage where we depend entirely on search to decide model architectures. MobileNetV3 gives SOTA results among lightweight models on the major computer-vision problems.

References

[1] https://machinethink.net/blog/mobilenet-v2/

[2] J. Hu, L. Shen, and G. Sun. Squeeze-and-Excitation Networks. ArXiv e-prints, Sept. 2017

[3] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. CoRR, abs/1611.01578, 2016

[4] https://towardsdatascience.com/everything-you-need-to-know-about-automl-and-neural-architecture-search-8db1863682bf

[5] Tien-Ju Yang, Andrew G. Howard, Bo Chen, Xiao Zhang, Alec Go, Mark Sandler, Vivienne Sze, and Hartwig Adam. Netadapt: Platform-aware neural network adaptation for mobile applications. In ECCV, 2018

[6] Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted Residuals and Linear Bottlenecks. CoRR, abs/1801.04381, 2018

