Simple Introduction about Hourglass-like Model

source: https://medium.com/@sunnerli/simple-introduction-about-hourglass-like-model-11ee7c30138

In the previous article, we considered only straightforward feed-forward networks. Such models can solve common problems, including object classification and object recognition. Here I want to discuss another important problem: object segmentation, and further use this concept to approach the object recognition problem.

The legend of these models

The above image shows the legend that will be used throughout this article. I will introduce some of the tricky ideas behind these models.

Object segmentation is another popular kind of problem. Unlike object recognition, we must recognize the object at the pixel level. In the object recognition task, we can simply use a bounding box to mark the region of interest: what we care about is the location of the object, not its edge or shape. In segmentation, however, we must check pixel by pixel whether each pixel belongs to the object. Compared with object recognition, object segmentation is a harder problem.

How do we deal with this problem? The first straightforward idea is: feed each pixel into a network and predict them one by one! However, this is not practical, since a pixel's category is not determined by its intensity alone; the relationship with its neighbors also influences the result. Another idea is: why not revise the original CNN structure? As a result, FCN was born.

The structure of the fully convolutional network

FCN stands for fully convolutional network [1]. In the original CNN structure, the last layer may be a softmax layer that predicts the probability of each category, but it produces only a single result for the whole image. Long et al. therefore treated the model with another concept: the fully connected layer in the usual structure can be regarded as a convolution layer whose kernel size is the size of the whole feature map! Is it then possible to predict the category of each pixel purely by convolution?

The answer is yes. In FCN, the image is processed through the network, and a "coarse" feature response map is produced at the end. This feature response map roughly represents the category of the original image at the pixel level. However, its size has shrunk to 1/32 of the input. To reconstruct the original image size, the first idea is bilinear interpolation. However, such a fixed interpolation is not well suited to realistic situations.
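To see why a fixed interpolation is limited, here is a minimal 1-D sketch in NumPy (the values are invented for the example): the weights that fill in the new samples are fixed by the grid geometry, so nothing adapts to the image content.

```python
import numpy as np

# A coarse 1-D feature response, standing in for one row of the 1/32 map.
coarse = np.array([0.0, 4.0, 2.0], dtype=np.float32)

# Fixed linear interpolation back to a finer grid: the in-between samples
# are weighted averages of their neighbours, with weights fixed by position.
fine_x = np.linspace(0.0, 2.0, 5)                 # 5 samples over the same span
fine = np.interp(fine_x, [0.0, 1.0, 2.0], coarse)
print(fine)   # [0. 2. 4. 3. 2.]
```

Every new sample is a blend of its two coarse neighbors with position-fixed weights; a learnable up-sampling layer replaces those fixed weights with trained ones.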

Another idea is a learnable up-sampling method, which can more reasonably adapt to the problem case by case. In the up-sampling process of FCN, the coarse feature map is passed through a learnable up-sampling layer, and then element-wise addition with an earlier feature map is applied. Through this earlier feature map, the result recovers the location and detail information that was destroyed by the max pooling layers.
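The fusion step itself can be sketched in a few lines. In the real model the 2x up-sampling is a learnable (transposed-convolution) layer; here a plain nearest-neighbour repeat stands in for it, and the shapes are invented for the example:

```python
import numpy as np

# A coarse 2x2 score map and a finer 4x4 feature map from an earlier
# pooling stage (channels omitted to keep the sketch short).
coarse = np.arange(4, dtype=np.float32).reshape(2, 2)
skip = np.full((4, 4), 0.5, dtype=np.float32)

# Stand-in for the learnable 2x up-sampling layer (a transposed
# convolution in the real FCN): nearest-neighbour repeat on both axes.
up = coarse.repeat(2, axis=0).repeat(2, axis=1)

# FCN-style fusion: element-wise addition with the earlier feature map,
# which injects back location detail lost by pooling.
fused = up + skip
print(fused.shape)   # (4, 4)
```

FCN-16s applies this up-sample-and-add step once before the final prediction, and FCN-8s applies it twice.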

The concept of up-sampling in FCN

The authors proposed three kinds of result: FCN-32s, FCN-16s and FCN-8s. The trailing number is the shrinking factor of the score map. For example, the score map of FCN-32s is 1/32 the size of the original image. By contrast, the result of FCN-8s passes through two learnable up-sampling layers and element-wise additions.

The results of the different output scales of FCN

In the authors' experiments, we can see that FCN-8s performs best. As you can see, the detailed boundaries of the human and the bicycle are clearer than in the other two results. Moreover, as the authors mention, FCN-4s and FCN-2s do not perform better than FCN-8s, so it is not certain that more fusion leads to higher accuracy.

U-Net

FCN brought a big bang to this territory. It used a straightforward concept and gave a hint about how to solve this kind of problem. After FCN, lots of models were launched. Since the shapes of these models look like a horizontal hourglass, I call them hourglass-like models. These models do a good job in many different tasks, including pixel segmentation, object recognition, denoising, super-resolution, etc.

The next one is U-Net [2]. The specialty of this model is that it was proposed to solve medical imaging problems in the first place! The shape of the whole model looks like the English letter "U"; as a result, the name of this model is U-Net. You can see the shape drawn in the original paper below.

The structure of U-Net in original paper

The authors of U-Net were trying to solve biomedical image segmentation. To speed up computation, U-Net drops the last two layers of the VGG-like encoder; this is the first advantage of U-Net. Second, rather than using element-wise addition to fuse the earlier tensor with the up-sampled tensor, U-Net instead concatenates the two along the channel dimension.
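The difference between the two fusion styles is easy to see on shapes alone (NHWC layout; the sizes below are illustrative, not taken from the paper):

```python
import numpy as np

# Up-sampled decoder tensor and the matching encoder tensor.
decoder = np.zeros((1, 8, 8, 64), dtype=np.float32)
encoder = np.ones((1, 8, 8, 64), dtype=np.float32)

# FCN-style fusion keeps the channel count unchanged.
added = decoder + encoder                                    # (1, 8, 8, 64)

# U-Net-style fusion stacks the channels instead, so the following
# convolutions see both sources side by side.
concatenated = np.concatenate([decoder, encoder], axis=-1)   # (1, 8, 8, 128)
print(added.shape, concatenated.shape)
```

Concatenation doubles the channel count, so the convolutions after it can learn how to mix the two sources rather than being forced into a plain sum.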

The structure of U-Net, sketched by myself

This is another structure image of U-Net. As you can see, after each up-sampling operation the tensor also passes through two convolution layers to refine the features. In the original paper, the authors also introduced a weighting scheme to emphasize the margins between different instances; we don't consider the loss function details in this article. Overall, U-Net is a great model that fully exploits the details of the earlier layers.

SegNet

The idea of a learnable up-sampling layer is great. However, Badrinarayanan et al. [3] thought that the structure of U-Net wasn't perfect enough. The major problem is max pooling.

The idea of max pooling

The above image shows the process of max pooling. In each filter region, we choose the maximum value as the result, which represents the strongest response to the feature. However, the location information is lost after this operation. Is there a method to solve this location-missing problem?

The alternative is pooling indices: during the max pooling operation, we record the location of the maximum value in each filter region. From this recording we obtain a max pooling mask. During up-sampling, we can then use this mask to fill each max value back into its original position. By consulting the mask, the location information is not lost during down-sampling.
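A minimal NumPy sketch of the mechanism (2x2 windows, single channel; the real SegNet applies this per channel inside the framework, e.g. via TensorFlow's `max_pool_with_argmax`):

```python
import numpy as np

def max_pool_with_indices(x):
    """2x2 max pooling that also records where each maximum came from."""
    h, w = x.shape
    pooled = np.zeros((h // 2, w // 2), dtype=x.dtype)
    mask = np.zeros_like(x, dtype=bool)          # the pooling-indices mask
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            window = x[i:i + 2, j:j + 2]
            r, c = np.unravel_index(window.argmax(), (2, 2))
            pooled[i // 2, j // 2] = window[r, c]
            mask[i + r, j + c] = True
    return pooled, mask

def unpool_with_indices(pooled, mask):
    """Fill each max value back into its recorded position; zeros elsewhere."""
    out = np.zeros(mask.shape, dtype=pooled.dtype)
    h, w = mask.shape
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            out[i:i + 2, j:j + 2][mask[i:i + 2, j:j + 2]] = pooled[i // 2, j // 2]
    return out

x = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 0, 1, 2],
              [3, 4, 5, 6]], dtype=np.float32)
pooled, mask = max_pool_with_indices(x)
restored = unpool_with_indices(pooled, mask)
```

After unpooling, each maximum returns to exactly its original pixel and everything else is zero; the following convolutions then densify the sparse map.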

The structure of SegNet

The above image illustrates the structure of SegNet. The curved arrows at the bottom represent the pooling-indices technique. At each up-sampling stage, we consult the mask and fill the max values back into their original positions; next, we apply convolution layers just as U-Net does. At the end, a softmax layer predicts the result.

DeconvNet

With the pooling-indices mechanism, the location detail can be preserved. Noh et al. [4] raised another creative idea. As we know, a convolution layer learns the features of the object: we can regard the kernel as the receptive field for a specific feature. By sliding over the image, it produces the response to that specific feature. In other words, the process of convolution is just like "extracting" features.

However, can we do this process in reverse order? We can regard the reverse as "rendering" the feature back onto the feature map. If convolution maps the image into a low-dimensional feature response space, deconvolution maps it back.
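A minimal 1-D sketch makes the "rendering" view concrete: a transposed convolution stamps a scaled copy of the kernel onto the output for every input value (toy numbers, not from the paper):

```python
import numpy as np

def transposed_conv1d(x, kernel, stride=2):
    """Each input value 'renders' a scaled copy of the kernel onto the
    output, stride steps apart -- the reverse of sliding-window extraction."""
    k = len(kernel)
    out = np.zeros(stride * (len(x) - 1) + k, dtype=np.float32)
    for i, v in enumerate(x):
        out[i * stride : i * stride + k] += v * kernel
    return out

signal = np.array([1.0, 2.0], dtype=np.float32)
kernel = np.array([1.0, 2.0, 1.0], dtype=np.float32)  # a small stencil
rendered = transposed_conv1d(signal, kernel)
print(rendered)   # [1. 2. 3. 4. 2.]
```

Where the stamps overlap, their contributions add up, and the output is longer than the input: the same operation that shrank the image during extraction now grows it back.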

The structure of DeconvNet

In DeconvNet, the image passes through VGG and produces a low-dimensional feature response map. This feature map retains the rough structure of the original image. Next, we render this coarse feature map back into the category space. Through layers of deconvolution and pooling indices, the location and category details are recovered at the end.

You may be curious about one question: if DeconvNet was the first to use deconvolution as the up-sampling method, why are there yellow parts in the FCN image? As far as I know, the original FCN authors didn't release their source code, but third-party re-implementations can be found. To simplify the "learnable" up-sampling mechanism, they just use deconvolution as the alternative. As a result, I use yellow blocks to represent that design.

To conclude, most of the structure of DeconvNet is similar to SegNet. The only difference is that the convolution layers during up-sampling are replaced by deconvolution layers.

RedNet

The full name of RedNet is residual encoder-decoder network [5]. After the previous ideas, Mao et al. thought of two points:

  1. Pooling indices aren't perfect! With this mechanism, we must record the location information in an extra mask. This not only wastes memory but also costs time to compute the max value with a sliding window.
  2. The idea of residual connections is more and more popular. Is it possible to use the residual idea to enhance performance on this task?
The structure of RedNet

RedNet solves the previous two problems. To my surprise, the idea and structure of RedNet are quite simple and clear! It is composed only of convolution layers, deconvolution layers and additions. Rather than losing location information and spending extra memory, Mao et al. got rid of max pooling entirely. In the whole process, the size of the feature map doesn't change at all: the image just passes through the layers of convolution, and then through deconvolutions with element-wise additions from the earlier tensors.
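The wiring can be sketched with identity stand-ins for the learned layers (`conv_like` and `deconv_like` are hypothetical placeholders, and the depth and sizes are invented); the point is only that nothing changes the spatial size, and every decoder stage adds a mirrored encoder tensor:

```python
import numpy as np

def conv_like(x):
    # Placeholder for a same-size convolution layer (identity here,
    # since the point is the wiring, not the learned weights).
    return x

def deconv_like(x):
    # Placeholder for a same-size deconvolution layer.
    return x

x = np.random.rand(8, 8).astype(np.float32)

# Encoder: a chain of convolutions; the feature map size never changes.
enc = [x]
for _ in range(3):
    enc.append(conv_like(enc[-1]))

# Decoder: deconvolutions, each followed by element-wise addition with
# the symmetrically mirrored encoder output.
d = enc[-1]
for skip in reversed(enc[:-1]):
    d = deconv_like(d) + skip

print(d.shape)   # (8, 8) -- same size as the input throughout
```

In the actual paper the skips jump over a few layers at a time rather than every layer, but the symmetric encoder-to-decoder pairing is the same.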

The two different skip connection designs

But there is one question: where is the residual concept? In our previous experience, skip connections are ordered, like the upper part of the image above. Here, on the other hand, the skip connections are designed like the lower part. Why did the authors make this change?

The two experiments related to skip connections

In fact, the authors did two experiments. The first examines the performance with and without skip connections; the left chart of the above image shows the result. As you can see, the red line reaches a higher value, which shows that skip connections are essential for enhancing performance.

The right side shows the result of the two different skip connection designs. The blue and green lines are the ordered residual connections (in the style of He et al.'s ResNet), and the other two lines are the symmetric residual connections. From the chart, the symmetric assignment indeed reaches higher values. As a result, the authors adopted the symmetric skip connections.

Experiment by myself

How do these models perform on a real-world problem rather than on a standard dataset? I tried to find out myself. First, I made a simple dataset: I chose the top of my refrigerator and placed my red pen and green earphone on it. I took 20 photos for training and 2 photos for testing. I also recorded a short extra video to check the performance on continuous frames. You can find the whole data here.

Next, I implemented FCN, U-Net and RedNet myself, using TensorFlow as the framework. Since the support for pooling indices is insufficient, I only implemented these three models. You can see the whole project here.

The testing results of the three different models

In my experiment, I shrink the whole image to 104×78, a shrinking factor of about 10. The base number of filters is 32, so I don't use the full standard VGG structure to train the models. I trained for only 500 epochs and recorded the loss every 20 epochs.

The above image illustrates the results on the testing data. There are 2 testing images, so the figure has 2 rows. For each model, the figure is divided into three sub-regions: the leftmost is the original image, the middle is the ground-truth annotation, and the rightmost is the prediction result.

As you can see, all three models can capture the rough shape of the earphone. However, only FCN-8s and RedNet can describe the distribution of the red pen; in U-Net, the information of the red pen is simply gone.

The video predictions of the three different models

The above GIF shows the video testing results. The right one is FCN, and it captures some response to the red pen. RedNet can show the red pen completely; however, it is more sensitive to the environment, so a bunch of spurious regions are highlighted. Maybe more training iterations would improve it a little.

Conclusion

The structure of five models with legend (in detail)

In this article, I described the progression of hourglass-like models, including FCN, U-Net, SegNet, DeconvNet and RedNet. After the introduction, I showed my simple implementations of these models.

Reference

[1] J. Long, E. Shelhamer, and T. Darrell, “Fully Convolutional Networks for Semantic Segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 4, pp. 640–651, Nov. 2014.

[2] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” Miccai, vol. 9351, no. Pt 1, pp. 234–241, May 2015.

[3] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation.”

[4] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, vol. 2015 Inter, pp. 1520–1528.

[5] X.-J. Mao, C. Shen, and Y.-B. Yang, "Image Restoration Using Convolutional Auto-encoders with Symmetric Skip Connections."
