
FickleNet

Weakly and Semi Supervised Semantic Image Segmentation Using Stochastic Inference

May 13 · 10 min read

Preface

This brief summary of FickleNet [1] originates from my Master's seminar at the Chair for Computer Aided Medical Procedures & Augmented Reality at TUM. I want to thank my project supervisor Tariq Bdair as well as the course supervisors Magda Paschali and Dr. Shadi Albarqouni.

Introduction

For segmentation tasks, including medical ones such as tumour segmentation, pixel-level accuracy determines the quality of the result. Obtaining it, however, is a persistent problem: training is computationally expensive, and the fully annotated data it requires is rare. FickleNet proposes a solution based on weakly and semi-supervised learning. Its contribution, stochastic hidden unit selection, enables the exploration of diverse locations on a single image and thereby tackles both the computational cost and the scarcity of annotated data.


Image 1: Visualization of FickleNet [1]

Related Work

Weakly supervised segmentation methods typically start from a class activation map (CAM) [2], which measures the contribution of each hidden unit in a neural net to the classification score. However, the naïve approach does not represent the semantic structure of the target object; it only highlights its most discriminative parts, such as the dog's face.


Image 2: Visualization of vanilla CAM [2]

The goal of novel methods is thus to produce better localization maps, in the sense of preserving the semantics and size of the object. To obtain a better CAM, three main directions exist:

Image-Level [3–5]

The idea of image-level processing is to hide discriminative parts of the image, forcing the network to classify an object based on its less discriminative parts. In the dog example, this means hiding the dog's face; the network is then forced to use the rest of the dog's body for classification. However, existing models either confuse the network too much for further segmentation, lose the semantics and size of the objects, or are computationally too expensive.


Image 3: Visualization of GAIN, an image-level processing segmentation network [3]

Feature-Level [6, 7]

Analogous to image-level processing, feature-level processing aims to hide discriminative parts from the network. Instead of parts of the image, parts of the feature maps are hidden, forcing the network to classify the object using less discriminative features. The drawback of existing feature-level methods is their computational expense.

Image 4: ACoL, a feature-level processing segmentation network [6]

Region Growing [8–10]

For region growing, a CAM initially provides small discriminative parts of an object, so-called seeds. Using a random walk, the network identifies which neighbouring pixels still belong to the classified object, and the seeds are enlarged until they fit the object's semantics and size. As with feature-level methods, the main drawback is currently computational expense.


Image 5: Simplified from [10], DSRG, a region growing based segmentation network

Methodology

The goal of FickleNet is to obtain more information from a single image while limiting itself to a single optimization step. In a nutshell, FickleNet provides random parts of the classifier to the region-growing method DSRG: the network iterates over a single image and selects random combinations of locations on the feature map, which are then used as seeds for segmentation. As a result, a single image produces diverse localization cues from the feature map, while only one network needs to be trained. How this works in detail is presented in the following.


Image 6: Architecture of stochastic hidden unit selection [1]

Preprocessing

First, a single image should produce different classification scores on each iteration. For this, the classification scores of randomly selected combinations of hidden units are computed. These scores are obtained with the following steps:

a) Feature Map Expansion

As a comparison of (a) and (b) in Image 6 shows, the feature map is expanded so that the sliding windows do not overlap when dropout is applied. The method also works without expansion, but expansion makes computation less expensive, as the results will show.
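
In PyTorch terms, the expansion can be sketched with `unfold`: every k × k sliding window is laid out as a separate non-overlapping tile, so that a single dropout mask can later treat all windows independently. This is a minimal sketch of the idea, not the authors' code; the kernel size is an assumption:

```python
import torch
import torch.nn.functional as F

def expand_feature_map(x, kernel_size=9, stride=1):
    """Expand a feature map so that the sliding windows of the following
    convolution no longer overlap; one dropout mask then suffices.

    x: (N, C, H, W) feature map; kernel_size is an assumed value.
    Returns: (N, C, H_out*k, W_out*k), whose non-overlapping k x k
    tiles are the original sliding windows.
    """
    n, c, h, w = x.shape
    k = kernel_size
    # Extract all k x k windows: (N, C*k*k, L) with L = H_out * W_out
    patches = F.unfold(x, kernel_size=k, stride=stride)
    h_out = (h - k) // stride + 1
    w_out = (w - k) // stride + 1
    patches = patches.view(n, c, k, k, h_out, w_out)
    # Rearrange so each window becomes a non-overlapping k x k tile
    patches = patches.permute(0, 1, 4, 2, 5, 3).contiguous()
    return patches.view(n, c, h_out * k, w_out * k)
```

A subsequent convolution with kernel size k and stride k then sees each original window exactly once, without overlap.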

b) Center-Preserving Dropout [11]

At each sliding window position, dropout with a random mask is performed, except at the center of the window, which is always preserved. Dropout thus selects hidden units at random, differently at each sliding window position, and these random parts are provided to DSRG. Preserving the center preserves the correlation between sliding window positions; the effect is illustrated in the results section.
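
On the expanded map, center-preserving dropout can be sketched as follows (my own illustration; it assumes the tile layout produced by the expansion sketch above):

```python
import torch

def center_preserving_dropout(x, kernel_size=9, p=0.9, training=True):
    """Randomly drop hidden units in each k x k tile of the expanded
    map while always keeping the tile center active.

    p: dropout rate; the ablation discussed below shows that higher
    rates cover larger regions of the object.
    """
    if not training:
        return x
    k = kernel_size
    keep = (torch.rand_like(x) > p).float()
    # Force the center position of every tile to stay active, which
    # preserves correlation between sliding window positions
    c = k // 2
    keep[:, :, c::k, c::k] = 1.0
    # Usual inverted-dropout rescaling (approximate here, since the
    # centers are always kept)
    return x * keep / (1.0 - p)
```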

c) Obtain Activation Score s

The preceding steps produce receptive fields of different shapes and sizes, so each iteration through the image yields a different receptive field and, in consequence, a different classification score. The score for each iteration is obtained in the common way: a convolutional layer, global average pooling, and a sigmoid layer produce the activation score S.
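
The scoring step can be sketched as a small head on top of the expanded, dropped-out map (a minimal stand-in; the channel count and kernel size are assumptions, not taken from [1]):

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Convolution over the expanded, dropped-out map, then global
    average pooling and a sigmoid, yielding activation scores S."""

    def __init__(self, in_channels=512, num_classes=20, kernel_size=9):
        super().__init__()
        # stride = kernel size: each non-overlapping tile is one window
        self.conv = nn.Conv2d(in_channels, num_classes,
                              kernel_size=kernel_size, stride=kernel_size)

    def forward(self, x):
        logits = self.conv(x)                       # (N, classes, H_out, W_out)
        s = torch.sigmoid(logits.mean(dim=(2, 3)))  # global average pooling
        return s, logits
```

Because the convolution's stride equals the tile size, each tile is evaluated exactly once, matching the expanded layout; the sigmoid fits the multi-label image-level classification setting of PASCAL VOC.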

Class Activation Map

Second, for each image the activation scores from the iterations need to be aggregated into a single localization map. To obtain a class activation map in the first place, Grad-CAM [12] is used:

Formula 1: Grad-CAM [12]:

$$\mathrm{CAM}_c = \mathrm{ReLU}\Big(\sum_k \frac{\partial S_c}{\partial x_k}\, x_k\Big)$$

which weights each feature map x_k by the gradient ∂S_c/∂x_k of the classification score S_c with respect to the last layer. This results in a better class activation map, comparable to those of more expensive methods.


Image 7: Comparison of Grad-CAM to computationally more expensive methods (e,k) [14]
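
Formula 1 translates directly into autograd code. A minimal PyTorch sketch (my own illustration, not the authors' implementation) for one class and one stochastic forward pass:

```python
import torch

def grad_cam_map(feature_map, score_c):
    """Localization map for one class per Formula 1:
    ReLU( sum_k dS_c/dx_k * x_k ).

    feature_map: (C, H, W) activations x_k from the forward pass,
                 still attached to the autograd graph.
    score_c:     scalar classification score S_c for class c.
    """
    # Gradient of the class score w.r.t. the feature map
    grads = torch.autograd.grad(score_c, feature_map, retain_graph=True)[0]
    # Gradient-weighted sum over channels k, clipped at zero
    return torch.relu((grads * feature_map).sum(dim=0))
```

Note that the original Grad-CAM formulation first global-average-pools the gradients into channel weights; the element-wise form above follows Formula 1 as stated here.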

Having a Grad-CAM for each class and each iteration through the image, the maps for an image are aggregated: a pixel u is assigned to class c if the Grad-CAM score for c at u exceeds a threshold in any of the image's localization maps.
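
The aggregation itself reduces to a per-pixel thresholded maximum over all iterations. A sketch (the threshold value is an assumption, not taken from [1]):

```python
import torch

def aggregate_maps(cams, threshold=0.3):
    """Merge per-iteration localization maps into one seed mask.

    cams: (T, num_classes, H, W) Grad-CAM maps from T stochastic passes.
    A pixel u is assigned class c if its score for c exceeds the
    threshold in any of the T maps.
    Returns a boolean (num_classes, H, W) seed mask.
    """
    return (cams > threshold).any(dim=0)
```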

Losses

Third, and last, the segmentation network is trained. It is important to note that this is the only training procedure required by FickleNet. The following loss is minimized using Adam:

$$L = L_{\text{seed}} + L_{\text{boundary}} \,(+\, L_{\text{full}} \text{ in the semi-supervised setting})$$

where the individual terms are defined as follows [10]:

$$L_{\text{seed}} = -\frac{1}{\sum_{c} |S_c|} \sum_{c} \sum_{u \in S_c} \log H_{u,c}$$

where H_{u,c} is the probability of class c at position u of the segmentation map H and S_c is the set of locations classified to class c. This cross-entropy loss matches only the seed cues given by the classification network and ignores the rest.

$$L_{\text{boundary}} = \frac{1}{n} \sum_{u} \sum_{c} Q_{u,c}(X, f(X)) \log \frac{Q_{u,c}(X, f(X))}{f_{u,c}(X)}$$

where f_{u,c}(X) is the network output, Q_{u,c}(X, f(X)) is the output of a Conditional Random Field (CRF), and n is the number of positions u. This is a KL divergence, minimized in order to penalize segmentations that are discontinuous with regard to spatial and colour information [10].

$$L_{\text{full}} = -\frac{1}{\sum_{c} |F_c|} \sum_{c} \sum_{u \in F_c} \log H_{u,c}$$

where H_{u,c} is again the probability of class c at position u of the segmentation map H and F_c is the set of locations labelled c in the ground-truth mask. This cross-entropy loss penalizes deviations from fully annotated images (annotated by a human expert) and is only used in the semi-supervised setting.
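
A compact PyTorch sketch of the three terms (my own translation of the formulas above, not the authors' code; the CRF output Q is assumed to come from an external dense-CRF implementation such as pydensecrf):

```python
import torch
import torch.nn.functional as F

EPS = 1e-8

def seeding_loss(H, seeds):
    """L_seed: cross entropy restricted to seed locations.
    H: (num_classes, h, w) softmax output of the segmentation network.
    seeds: boolean (num_classes, h, w) mask from the aggregation step."""
    return -torch.log(H + EPS)[seeds].sum() / seeds.sum().clamp(min=1)

def boundary_loss(f, Q):
    """L_boundary: mean KL divergence KL(Q || f) over all positions.
    f: (num_classes, h, w) network output; Q: CRF output, same shape."""
    kl = (Q * torch.log((Q + EPS) / (f + EPS))).sum(dim=0)  # per-pixel KL
    return kl.mean()

def full_supervision_loss(H, gt):
    """L_full: pixel-wise cross entropy against a ground-truth mask
    (semi-supervised setting only). gt: (h, w) integer class labels."""
    return F.nll_loss(torch.log(H + EPS).unsqueeze(0), gt.unsqueeze(0))
```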

Summary of Methodology

FickleNet contributes stochastic hidden unit selection as a preprocessing step before applying DSRG for segmentation. The CAM is thus improved not merely by hiding discriminative parts, but by not knowing about the discriminative parts in the first place. Note that this pixel-accurate segmentation is trained in a single training run with one combined loss function.

Experiment

Setup

  • Dataset: PASCAL VOC 2012 Image Segmentation Benchmark [13] (20 object classes, roughly 10K labelled training images)
  • Network: VGG-16 [14] pretrained on ImageNet [15]; segmentation performed by DSRG [10]
  • Setup: mini-batch size of 10; images cropped to 321×321 pixels at a random location; learning rate of 0.001, halved every 10 epochs; Adam optimizer [16]
  • Framework: PyTorch [17] for the localization maps, Caffe [18] for segmentation, run on an NVIDIA TITAN Xp GPU
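
Putting the reported hyper-parameters together, here is a minimal training-loop sketch; the backbone head, loss, and data pipeline are stand-ins, and only the hyper-parameters come from the list above:

```python
import torch
import torch.nn as nn
import torchvision

# ImageNet-pretrained VGG-16 with a 20-class multi-label head (stand-in)
model = torchvision.models.vgg16(pretrained=True)
model.classifier[-1] = nn.Linear(4096, 20)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Learning rate halved every 10 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
criterion = nn.BCEWithLogitsLoss()  # image-level multi-label classification

# Dummy stand-in for a PASCAL VOC loader: batches of 10 random 321x321 crops
loader = [(torch.randn(10, 3, 321, 321),
           torch.randint(0, 2, (10, 20)).float())]

for epoch in range(20):  # epoch count is an assumption
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```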

Evaluation

Two weakly supervised settings and a semi-supervised setting were used as evaluation baselines, with different architectures for segmentation. In every case, the PASCAL VOC 2012 Image Segmentation Benchmark [13] was used as the data source.

  • Weakly Supervised Setting, Segmentation with DeepLab-VGG16


Image 8: Own Evaluation of the Results given in [1]

In this setting FickleNet outperforms comparable networks. It is remarkable that the accuracy comes close to that of networks utilizing additional annotations.

  • Weakly Supervised Setting, Segmentation with ResNet


Image 9: Own Evaluation of the Results given in [1]

Here, FickleNet also outperforms comparable networks. It is remarkable that even AffinityNet is outperformed, which is based on ResNet-38, while FickleNet only uses the less complex ResNet-101.

  • Semi Supervised Setting, Segmentation with DeepLab-VGG16


Image 10: Own Evaluation of the Results given in [1]

FickleNet achieves a slightly better mean Intersection over Union (mIoU) score, an accuracy measure for segmentation tasks. In this setting, the networks are compared to fully supervised methods trained with 1.4K and 10.6K strongly annotated images; the semi-supervised networks are thus provided more information than DeepLab 1.4K, but less than DeepLab 10.6K. The modest improvement can be explained by FickleNet's focus on weakly supervised learning: application to semi-supervised learning is possible, but the contribution is not primarily a semi-supervised solution.
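
For reference, mIoU averages the per-class intersection over union between predicted and ground-truth label maps. A small NumPy sketch:

```python
import numpy as np

def mean_iou(pred, gt, num_classes=21):
    """Mean Intersection over Union of two integer label maps
    (21 = 20 PASCAL VOC classes + background)."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:              # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))
```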

Results


Image 11: Visual Results of FickleNet [1]

There are five takeaways from the ablation studies contributed by FickleNet:

  • Map expansion

Map expansion makes stochastic hidden unit selection at each sliding window position easier. This results in faster computation at the cost of slightly more GPU memory usage.

Image 12: Result of Map Expansion [1]
  • More random selections of a single image

The more often a single image is iterated over, the larger the represented area of the object. This supports the main idea of stochastically choosing hidden units, which is what makes repeated iteration through a single image worthwhile.


Image 13: Result of Random Selections [1]
  • Dropout Rate

A higher dropout rate results in a larger covered region of the target object. If the dropout rate is low, the probability is high that a highly discriminative part of the object is still active, which counteracts the idea of hiding these regions: less discriminative parts are then not used as seeds, fewer seeds are provided, and the covered region is smaller.


Image 14: Result of Dropout Rate [1]
  • Center-Preserving Dropout

To preserve correlation between sliding window positions, center-preserving dropout is applied. As visible in Image 14, general dropout results in noisy activations by comparison. If the correlation is not preserved and provided to the random walk, the probability that a pixel still belongs to the same object is lower; without discriminative features, or correlation to neighbouring discriminative parts, some small parts of the object are not classified as such.

  • Effect of stochastic methods

The underlying concept is a stochastic perspective on segmentation, specifically on which parts of the image are provided to the segmentation network. This is supported by the result that stochastic selection in both training and inference yields the highest mIoU (mean Intersection over Union) score.

Image 15: Result of Stochastic Methods [1]

Conclusion

FickleNet contributes a novel pre-processing approach for weakly and semi-supervised semantic image segmentation. Stochastically selecting hidden units has been shown to improve the accuracy of the segmentation task, and the idea would be interesting to apply in other domains. However, the paper does not propose possible directions for improving FickleNet or other applications of the idea; these are left to the interested reader.

For further visualizations, feel free to have a look at my presentation! If you have any questions or ideas, feel free to contact me on LinkedIn.

Thanks for reading!

Bibliography

[1]: Lee, Jungbeom, et al. "FickleNet: Weakly and Semi-Supervised Semantic Image Segmentation Using Stochastic Inference." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[2]: B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization

[3]: K. Li, Z. Wu, K.-C. Peng, J. Ernst, and Y. Fu. Tell me where to look: Guided attention inference network

[4]: K. K. Singh and Y. J. Lee. Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization

[5]: Y. Wei, J. Feng, X. Liang, M.-M. Cheng, Y. Zhao, and S. Yan. Object region mining with adversarial erasing: A simple classification to semantic segmentation approach

[6]: X. Zhang, Y. Wei, J. Feng, Y. Yang, and T. Huang. Adversarial complementary learning for weakly supervised object Localization

[7]: D. Kim, D. Cho, D. Yoo, and I. S. Kweon. Two-phase learning for weakly supervised object localization

[8]: R. Fan, Q. Hou, M.-M. Cheng, T.-J. Mu, and S.-M. Hu. S4Net: Single stage salient-instance segmentation

[9]: A. Kolesnikov and C. H. Lampert. Seed, expand and constrain: Three principles for weakly-supervised image segmentation

[10]: Z. Huang, X. Wang, J. Wang, W. Liu, and J. Wang. Weakly supervised semantic segmentation network with deep seeded region growing

[11]: N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting

[12]: R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization

[13]: M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes (VOC) challenge

[14]: K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition

[15]: J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database

[16]: D. P. Kingma and J. Ba. Adam: A method for stochastic optimization

[17]: A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch

[18]: Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding

