
CBNet: A Novel Composite Backbone Network Architecture for Object Detection Revi...

 4 years ago
source link: https://towardsdatascience.com/cbnet-a-novel-composite-backbone-network-architecture-for-object-detection-review-ec98e8b7bc9b?gi=a32a47984d82

At the time of writing, CBNet is the best-performing object detection network on the COCO dataset, with an average precision (AP) of 53.3 on the COCO test dataset.

The authors claim that a more powerful backbone improves the performance of an object detector. To build one, they propose a novel strategy for assembling multiple identical backbones through composite connections between adjacent backbones. The result is a more powerful backbone called the Composite Backbone Network (CBNet).

[Figure: CBNet architecture]

As shown in the figure above, CBNet is composed of multiple identical backbone networks with composite connections between neighboring backbones. From left to right, the output of each stage in an Assistant Backbone, which can be seen as a higher-level feature, flows through a composite connection into the parallel stage of the succeeding backbone as part of its input. In this way, high-level and low-level features are fused into a richer feature representation.

The paper introduces two architectures: Dual-Backbone (DB) and Triple-Backbone (TB). As the names suggest, DB consists of two identical backbones and TB of three. The performance difference is discussed later in this post.

To combine the outputs of the backbones, the paper introduces a Composite Connection block. The block consists of a 1x1 convolution followed by a batch normalization layer, which reduce the number of channels, and an upsample operation that brings the feature map to the resolution of the stage it is added to.
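As a rough illustration of what the block has to do, the sketch below tracks tensor shapes only. It assumes ResNet-style stage outputs (256, 512, 1024, 2048 channels at strides 4, 8, 16, 32), a bias-free 1x1 convolution, and a batch normalization layer with a learnable scale and shift. The names FeatureShape, resnet_stage_shapes and composite_connection are made up for this example and do not come from the paper's code.

```python
from typing import Dict, List, NamedTuple


class FeatureShape(NamedTuple):
    channels: int
    height: int
    width: int


def resnet_stage_shapes(image_size: int) -> List[FeatureShape]:
    """Output shapes of the four residual stages for a square input (assumed ResNet layout)."""
    channels = [256, 512, 1024, 2048]
    strides = [4, 8, 16, 32]
    return [FeatureShape(c, image_size // s, image_size // s)
            for c, s in zip(channels, strides)]


def composite_connection(src: FeatureShape, dst: FeatureShape) -> Dict[str, int]:
    """Describe the 1x1 conv + BN + upsample that maps `src` onto the shape of `dst`."""
    if dst.height % src.height or dst.width % src.width:
        raise ValueError("dst must be an integer multiple of src in height and width")
    scale = dst.height // src.height
    conv_weights = src.channels * dst.channels  # 1x1 kernel, no bias (assumption)
    bn_params = 2 * dst.channels                # learnable scale and shift per channel
    return {
        "out_channels": dst.channels,
        "upsample": scale,
        "params": conv_weights + bn_params,
    }


stages = resnet_stage_shapes(256)
# Deeper feature of the previous backbone -> shallower feature of the next one.
link = composite_connection(src=stages[1], dst=stages[0])
print(link)
```

Under these assumptions, each connection between neighboring stages halves the channel count and doubles the spatial resolution, so the result can be added element-wise to the shallower feature.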

The final backbone (rightmost in the figure), called the Lead Backbone, is used for object detection. Its output features are fed into the RPN/detection head, while the output of each Assistant Backbone is fed into its adjacent backbone.
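The wiring described so far can be sketched with toy stages that operate on plain numbers instead of feature maps. This is a minimal illustration, not the authors' implementation: cbnet_forward, inc and half are hypothetical names, every backbone is assumed to start from the same input, and the first stage is assumed to receive no composite input.

```python
from typing import Callable, List, Sequence

Stage = Callable[[float], float]


def cbnet_forward(image: float,
                  backbones: Sequence[Sequence[Stage]],
                  composite: Callable[[float], float]) -> List[float]:
    """Run the backbones left to right; return the stage outputs of the last (Lead) backbone."""
    previous: List[float] = []  # stage outputs of the preceding backbone
    for stages in backbones:
        outputs: List[float] = []
        x = image  # assumption: every backbone sees the same input
        for level, stage in enumerate(stages):
            if previous and level >= 1:
                # Add the composited output of the parallel stage of the
                # preceding backbone to this stage's input.
                x = x + composite(previous[level])
            x = stage(x)
            outputs.append(x)
        previous = outputs
    return previous


def inc(v: float) -> float:
    """Toy stand-in for a backbone stage."""
    return v + 1.0


def half(v: float) -> float:
    """Toy stand-in for the Composite Connection block."""
    return 0.5 * v


single = cbnet_forward(1.0, [[inc, inc, inc]], half)
dual = cbnet_forward(1.0, [[inc, inc, inc], [inc, inc, inc]], half)
print(single, dual)
```

Only the list returned for the last backbone would go to the detection head; the outputs of the earlier backbones are consumed solely by their right-hand neighbor.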

Composite Styles

[Figure: the four composite styles]

The paper compares four composite styles.

  • Adjacent Higher-Level Composition (AHLC) is the style explained in the earlier section. The output of each stage of the Assistant Backbone is fed into the adjacent backbone through the Composite Connection block.
  • Same Level Composition (SLC) is another simple style, which fuses the output features from the same stage of adjacent backbones. As shown in the figure, this style does not use the Composite Connection block: the feature from the previous backbone is added directly to the succeeding backbone.
  • Adjacent Lower-Level Composition (ALLC) is very similar to AHLC. The only difference is that the feature from the adjacent lower-level stage of the previous backbone is passed on to the succeeding backbone.
  • Dense Higher-Level Composition (DHLC) is inspired by the DenseNet paper, where each layer is connected to all subsequent layers to build dense connections within a stage. CBNet applies the same idea across backbones, feeding all higher-level stages of the previous backbone into the succeeding one.

[Table: comparison of composite styles]

The table above compares the composition styles. AHLC outperforms the other styles, and the paper explains why. The authors claim that directly adding lower-level features of the previous backbone to higher-level features of the succeeding backbone harms the semantic information of the latter. Conversely, adding deeper features of the previous backbone to shallower features of the succeeding backbone enhances their semantic information.
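One way to keep the four styles apart is to ask, for a feature at a given level of the succeeding backbone, which levels of the previous backbone are fused into it. The helper below encodes one reading of the style names; source_levels and its arguments are hypothetical, and the exact stage indexing in the paper may differ at the boundaries.

```python
from typing import List


def source_levels(style: str, target_level: int, num_levels: int) -> List[int]:
    """Levels of the previous backbone fused into `target_level` of the next one (1-based)."""
    if style == "AHLC":    # adjacent higher level
        levels = [target_level + 1]
    elif style == "SLC":   # same level
        levels = [target_level]
    elif style == "ALLC":  # adjacent lower level
        levels = [target_level - 1]
    elif style == "DHLC":  # all higher levels, DenseNet-like
        levels = list(range(target_level + 1, num_levels + 1))
    else:
        raise ValueError(f"unknown composite style: {style}")
    # Drop levels that do not exist (an assumption about the boundary cases).
    return [lv for lv in levels if 1 <= lv <= num_levels]


for name in ("AHLC", "SLC", "ALLC", "DHLC"):
    print(name, source_levels(name, target_level=2, num_levels=4))
```

Seen this way, AHLC and DHLC only ever add deeper features to shallower ones, which is the direction the authors argue enhances semantic information.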

Results

[Table: detection and instance segmentation results on MS-COCO]

The table above shows the detection results on the MS-COCO test dataset. Columns 5–7 show the object detection results, while columns 8–10 show the instance segmentation results. The results clearly show that using more backbones improves the performance of the network.

Conclusion

The paper presents a novel architecture called CBNet. By composing multiple backbones, the proposed network improves the accuracy of the detection network by about 1.5 to 3 percent.

It would be worth looking further into the increased parameter count and training time.

