Single Stage Instance Segmentation — A Review

Source: https://towardsdatascience.com/single-stage-instance-segmentation-a-review-1eeb66e0cc49

A glimpse into the future of real-time instance segmentation

Instance segmentation is a challenging computer vision task that requires the prediction of object instances and their per-pixel segmentation masks. This makes it a hybrid of semantic segmentation and object detection.

Ever since Mask R-CNN was invented, the state of the art for instance segmentation has largely been Mask R-CNN and its variants (PANet, Mask Scoring R-CNN, etc). It adopts the detect-then-segment approach: first perform object detection to extract a bounding box around each object instance, then perform binary segmentation inside each bounding box to separate the foreground (object) from the background.

However, Mask R-CNN is quite slow, which precludes its use in many real-time applications. In addition, masks predicted by Mask R-CNN have a fixed resolution and thus are not refined enough for large objects with complex shapes. There has been a wave of studies on single-stage instance segmentation, fueled by advances in anchor-free object detection (such as CenterNet and FCOS; see my slides for a quick intro to anchor-free object detection). Many of these methods are faster and more accurate than Mask R-CNN, as shown in the image below.

Inference time of recent one-stage methods tested on a Tesla V100 GPU (source)

This blog will review the recent advances in single-stage instance segmentation, with a focus on mask representation — one key aspect of instance segmentation.

Local Mask and Global Mask

One core question to ask in instance segmentation is the representation or parameterization of instance masks — 1) whether to use local masks or global masks and 2) how to represent/parameterize the mask.

Mask representation: Local Masks and Global Masks

There are largely two ways to represent an instance mask: local masks and global masks. A global mask is what we ultimately want; it has the same spatial extent as the input image, although the resolution may be smaller, such as 1/4 or 1/8 of the original image. It has the natural advantage of a uniform resolution (and thus fixed-length features) for big and small objects: it does not sacrifice resolution for bigger objects, and the fixed resolution lends itself to batching for optimization. A local mask is usually more compact in the sense that it does not carry the excessive boundary of a global mask. It has to be paired with a mask location to be recovered into a global mask, and the local mask size depends on the object size. But to perform effective batching, instance masks require a fixed-length parameterization. The simplest solution is to resize instance masks to a fixed resolution, as adopted by Mask R-CNN. As we will see below, there are more effective ways to parameterize local masks as well.

Based on whether local or global masks are used, single-stage instance segmentation can be largely categorized into local-mask-based and global-mask-based approaches.

Local-mask-based Methods

Local-mask-based methods output instance masks on each local region directly.

Contours with Explicit Encoding

A bounding box is, in a sense, a rough mask that approximates the contour of the mask with the minimum bounding rectangle. ExtremeNet (Bottom-up Object Detection by Grouping Extreme and Center Points, CVPR 2019) performs detection by using four extreme points (thus a bounding box with 8 degrees of freedom rather than the conventional 4 DoF), and this richer parameterization can be naturally extended to an octagonal mask by extending each extreme point in both directions along its corresponding edge into a segment of 1/4 of the entire edge length.

Since then, a series of works have tried to encode/parameterize the contours of an instance mask into fixed-length coefficients, given different decomposition bases. These methods regress the center of each instance (not necessarily the bbox center) and the contour with respect to that center. ESE-Seg (Explicit Shape Encoding for Real-Time Instance Segmentation, ICCV 2019) designs an inner center radius shape signature for each instance and fits it with Chebyshev polynomials. PolarMask (PolarMask: Single Shot Instance Segmentation with Polar Representation, CVPR 2020) uses rays at constant angle intervals from the center to describe the contour. FourierNet (FourierNet: Compact mask representation for instance segmentation using differentiable shape decoders) introduces a contour shape decoder using the Fourier transform and achieves smoother boundaries than PolarMask.
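To make the contour encoding concrete, below is a minimal sketch of a PolarMask-style parameterization, assuming a binary mask and a known instance center. The function names and the angular-binning scheme are illustrative, not taken from the paper's code.

```python
import numpy as np

def encode_polar(mask, center, num_rays=36):
    """Encode a binary mask as num_rays ray lengths at uniform angles."""
    cy, cx = center
    ys, xs = np.nonzero(mask)                      # foreground pixel coordinates
    angles = np.arctan2(ys - cy, xs - cx)          # angle of each pixel w.r.t. the center
    dists = np.hypot(ys - cy, xs - cx)             # distance of each pixel to the center
    bins = ((angles + np.pi) / (2 * np.pi) * num_rays).astype(int) % num_rays
    rays = np.zeros(num_rays)
    for b, d in zip(bins, dists):                  # keep the farthest pixel per angular bin
        rays[b] = max(rays[b], d)
    return rays                                    # fixed-length contour descriptor

def decode_polar(rays, center):
    """Decode ray lengths back into contour polygon vertices (x, y)."""
    cy, cx = center
    thetas = np.linspace(-np.pi, np.pi, len(rays), endpoint=False)
    return np.stack([cx + rays * np.cos(thetas), cy + rays * np.sin(thetas)], axis=1)
```

With 36 rays this stays within the 20 to 40 coefficient budget mentioned below, and it also makes the limitations visible: the decoded polygon is star-shaped around the center and cannot represent holes.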

Various contour-based methods

These methods typically use 20 to 40 coefficients to parameterize the mask contours. They are fast at inference and easy to optimize. However, their drawbacks are also obvious. First of all, visually they all look — let’s be honest — quite awful. They cannot depict the mask precisely and cannot describe objects that have holes in the center.

Personally, I think this line of work is cute but has little future. Explicit encoding of the complex topology of instance masks or their contours is intractable.

Structured 4D Tensor

TensorMask (TensorMask: A Foundation for Dense Object Segmentation, ICCV 2019) is one of the first works to demonstrate the idea of dense mask prediction, predicting a mask at each feature map location. TensorMask still predicts a mask with respect to a region of interest instead of a global mask, but it is able to run instance segmentation without running object detection.

TensorMask utilizes structured 4D tensors to represent masks over a spatial domain (two dimensions iterate over all possible locations in the input image, and two dimensions represent the mask at each location). It also introduces an aligned representation and a tensor bipyramid to recover spatial details, but these aligned operations make the network even slower than the two-stage Mask R-CNN. In addition, in order to get good performance, it needs to be trained with a schedule six times longer than the standard COCO object detection pipeline (6x schedule).

Compact Mask Encoding

Natural object masks are not random; akin to natural images, instance masks reside in a much lower intrinsic dimension than that of the pixel space. MEInst (Mask Encoding for Single Shot Instance Segmentation, CVPR 2020) distills the mask into a compact, fixed-dimensional representation. With a simple linear transformation via PCA, MEInst is able to compress a 28x28 local mask into a 60-dim feature vector. The paper also tried to directly regress the 28x28=784-dim feature vector on a one-stage object detector (FCOS) and got reasonable results with a drop of 1 to 2 AP points. This means that directly predicting high-dimensional masks (in the natural representation, as per TensorMask) is not entirely impossible, but it is hard to optimize. The compact representation makes the masks easier to optimize and faster to run at inference time. MEInst is most similar to Mask R-CNN and can be used directly with most other object detection algorithms.
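Here is a minimal sketch of the PCA step, with random masks standing in for real ground-truth 28x28 local masks and scikit-learn doing the transform; MEInst's actual pipeline regresses the resulting codes from the detector head.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: in practice these are ground-truth 28x28 local masks.
masks = np.random.rand(10000, 28, 28) > 0.5
X = masks.reshape(len(masks), -1).astype(np.float32)  # (N, 784) flattened masks

pca = PCA(n_components=60)
codes = pca.fit_transform(X)            # (N, 60) compact codes: the regression targets

recon = pca.inverse_transform(codes)    # (N, 784) linear reconstruction from codes
recon_masks = recon.reshape(-1, 28, 28) > 0.5   # binarize to recover the local masks
```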

Global-mask-based Methods

Global-mask-based methods first generate intermediate and shared feature maps based on the whole image, then assemble the extracted features to form the final masks for each instance. This is the mainstream approach among recent one-stage instance segmentation methods.

Prototypes and Coefficients

YOLACT (YOLACT: Real-time Instance Segmentation, ICCV 2019) is one of the first methods attempting real-time instance segmentation. YOLACT breaks instance segmentation into two parallel tasks: generating a set of prototype masks and predicting per-instance mask coefficients. The prototype masks are generated with an FCN and can directly benefit from advances in semantic segmentation. The coefficients are predicted as extra features of the bounding box. These two parallel steps are followed by an assembly step: a simple linear combination realized by matrix multiplication, plus a cropping operation with the predicted bounding box for each instance. The cropping operation reduces the network’s burden to suppress noise outside of the bounding box, but some leakage remains if the bounding box includes part of another instance of the same class.
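The assembly step is essentially one matrix multiplication plus a crop. Below is a minimal PyTorch sketch under assumed shapes; the names are illustrative and not from the official implementation.

```python
import torch

def assemble_masks(protos, coeffs, boxes):
    """
    protos: (H, W, K) prototype masks from the FCN branch
    coeffs: (N, K) per-instance mask coefficients
    boxes:  (N, 4) boxes as (x1, y1, x2, y2) in mask coordinates
    returns (N, H, W) instance masks
    """
    H, W, K = protos.shape
    # Linear combination realized as a matrix multiplication.
    masks = torch.sigmoid(protos.reshape(-1, K) @ coeffs.t())  # (H*W, N)
    masks = masks.t().view(-1, H, W)                           # (N, H, W)
    # Zero out everything outside each predicted box.
    ys = torch.arange(H).view(1, H, 1)
    xs = torch.arange(W).view(1, 1, W)
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    inside = (xs >= x1.view(-1, 1, 1)) & (xs < x2.view(-1, 1, 1)) \
           & (ys >= y1.view(-1, 1, 1)) & (ys < y2.view(-1, 1, 1))
    return masks * inside
```

Because the prototypes are shared across instances, the per-instance cost is just one (H*W, K) x (K, N) product, which is what keeps the mask computation roughly constant.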

The prediction of prototype masks is critical to ensure the high resolution of the final instance masks, comparable with semantic segmentation. The prototype masks depend only on the input image and are independent of categories and specific instances. This distributed representation is compact, as the number of prototype masks is independent of the number of instances, which makes YOLACT’s mask computation cost constant (unlike Mask R-CNN, whose computation cost is linear in the number of instances).

Looking back at InstanceFCN (Instance-sensitive Fully Convolutional Networks, ECCV 2016) and the follow-up study FCIS (Fully Convolutional Instance-aware Semantic Segmentation, CVPR 2017) by MSRA, they seem to be a special case of YOLACT. Both InstanceFCN and FCIS utilize an FCN to generate multiple instance-sensitive score maps that encode the relative positions to object instances, then apply an assembling module to output object instances. The position-sensitive score maps can be seen as prototype masks, but instead of learned linear coefficients, InstanceFCN and FCIS use a fixed set of spatial pooling operations to combine the position-sensitive prototype masks.

InstanceFCN [b] and FCIS [c] use fixed pooling operations for instance segmentation (source)

BlendMask (BlendMask: Top-Down Meets Bottom-Up for Instance Segmentation, CVPR 2020) builds on YOLACT, but instead of predicting one scalar coefficient per prototype mask, BlendMask predicts a low-res (7x7) attention map to blend the masks within the bounding box. This attention map is predicted as a high-dimensional feature (7x7=49-d) attached to each bounding box. Interestingly, BlendMask uses 4 prototype masks, but it works even with only 1 prototype mask. CenterMask (CenterMask: single shot instance segmentation with point representation, CVPR 2020) works in almost exactly the same way and explicitly uses 1 prototype mask (named the global saliency map).
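A minimal sketch of the blending step for a single instance, assuming the prototypes have already been cropped to the box; the softmax normalization across prototypes follows the paper's description, and everything else is illustrative.

```python
import torch
import torch.nn.functional as F

def blend(protos_crop, attn):
    """
    protos_crop: (K, h, w) prototype masks cropped to the instance box
    attn:        (K, 7, 7) per-instance low-res attention maps
    returns (h, w) instance mask logits
    """
    K, h, w = protos_crop.shape
    # Upsample the 7x7 attention maps to the box size.
    attn_up = F.interpolate(attn.unsqueeze(0), size=(h, w),
                            mode="bilinear", align_corners=False)[0]
    attn_up = torch.softmax(attn_up, dim=0)      # normalize across the K prototypes
    return (protos_crop * attn_up).sum(dim=0)    # weighted blend -> (h, w)
```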

Architecture of CenterMask. BlendMask has an extremely similar pipeline.

Note that both BlendMask and CenterMask developed a further dependency on the detected bounding box: the attention map or shape mask has to be scaled to the same size as the bounding box before blending with the cropped prototype mask.

CondInst (Conditional Convolutions for Instance Segmentation) goes one step further and completely removes any dependency on bounding boxes. Instead of assembling cropped prototype masks, it borrows the idea of dynamic filters and predicts the parameters of a lightweight FCN head. The FCN head has three layers and 169 parameters in total. What is amazing is that the authors showed that even when the input to this head is a 2-channel CoordConv alone, with no prototype masks, the network still predicts good results with 31 AP on COCO. We will discuss this in the implicit representation section below.
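A minimal sketch of how 169 parameters unfold into a three-layer 1x1-conv head, following the channel sizes reported in the paper (8 mask feature channels plus a 2-channel CoordConv input); the slicing logic here is illustrative.

```python
import math
import torch
import torch.nn.functional as F

def dynamic_mask_head(feat, params):
    """
    feat:   (1, 10, H, W) shared mask features (8 ch) + CoordConv (2 ch)
    params: (169,) dynamically predicted weights and biases for one instance
    returns (1, 1, H, W) mask logits for this instance
    """
    # (out_ch, in_ch, 1, 1) weight shapes and bias shapes of the 3-layer head
    shapes = [(8, 10, 1, 1), (8,), (8, 8, 1, 1), (8,), (1, 8, 1, 1), (1,)]
    tensors, i = [], 0
    for s in shapes:
        n = math.prod(s)
        tensors.append(params[i:i + n].view(*s))
        i += n
    x = feat
    for j in range(0, 6, 2):                   # three conv layers
        x = F.conv2d(x, tensors[j], bias=tensors[j + 1])
        if j < 4:
            x = F.relu(x)                      # ReLU after the first two layers
    return x
```

Counting parameters: (10*8 + 8) + (8*8 + 8) + (8*1 + 1) = 169, which is why the head is cheap enough to predict per instance.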

Both BlendMask/CenterMask and CondInst are extensions of YOLACT.

  • BlendMask/CenterMask blends cropped prototype masks with a finer-grained mask within each bbox. YOLACT is a special case of BlendMask or CenterMask where the resolution of the attention map is 1x1.
  • CondInst blends cropped prototype masks with deeper convolutions consisting of dynamically predicted filters. YOLACT is a special case of CondInst where the FCN is a single 1x1 conv layer.

The use of a branch to predict prototype masks allows these methods to benefit from an auxiliary semantic segmentation task (usually worth a 1 to 2 point boost in AP). It can also be naturally extended to perform panoptic segmentation.

Regarding the parameters needed to represent each instance mask, some technical details are listed below. These global-mask-plus-coefficient methods use 32, 196, and 169 parameters per instance mask, respectively:

  • YOLACT uses 32 prototype masks + a 32-dim mask coefficient vector + box crop;
  • BlendMask uses 4 prototype masks + 4 7x7 attention maps (196 parameters) + box crop;
  • CondInst uses CoordConv + 3 dynamically predicted 1x1 conv layers (169 parameters).

SOLO and SOLOv2: segmenting objects by locations

SOLO is one of a kind and merits its own section. The papers are deeply insightful and very well written. They are pieces of art to me (like another of my favorites, CenterNet).

Architecture of SOLOv1

The first author of the paper posted his reply on the motivation of SOLO on Zhihu (知乎), which I quote below:

Semantic segmentation predicts the semantic category of each pixel in the image. Analogously, for instance segmentation, we propose to predict the “instance category” of each pixel. Now the key question is: how do we define instance category?

If two object instances in the input image have exactly the same shape and position, they are the same instance. Any two different instances have either different positions or different shapes. And as shape is hard to describe in general, we approximate shape with size.

Thus “instance category” is defined by location and size. Location is classified by center position: SOLO approximates the center position by dividing the input image into a grid of S x S cells, and thus S² classes. Size is handled by assigning objects of different sizes to different levels of a feature pyramid (FPN). Thus for each pixel, SOLO only needs to decide which of the S x S grid cells and which FPN level to assign the pixel (and the corresponding instance category) to. So SOLO only needs to perform two pixel-level classification problems, analogous to semantic segmentation. Now another key question is: how are the masks represented?

The instance masks are represented directly by global masks stacked into S² channels. This is an ingenious design that solves many problems simultaneously. First, many previous studies store 2D masks as flattened vectors, which quickly becomes intractable as the mask resolution increases, leading to an explosion in the number of channels; a global mask naturally preserves the spatial relationships within the pixels of the mask. Second, global mask generation can preserve a high mask resolution. Third, the number of predicted masks is fixed, regardless of the number of objects in the image. This is similar to the line of work on prototype masks, and we will see how these two streams merge in SOLOv2.
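A minimal sketch of this target assignment, assuming one grid cell per instance (the actual SOLO assigns a small center region of cells and class labels rather than binary occupancy):

```python
import numpy as np

def solo_targets(masks, centers, S):
    """
    masks:   list of (H, W) binary global masks, one per instance
    centers: list of (cy, cx) mass centers in image coordinates
    returns: (S, S) cell occupancy and (S*S, H, W) per-cell mask targets
    """
    H, W = masks[0].shape
    occupancy = np.zeros((S, S), dtype=bool)
    mask_targets = np.zeros((S * S, H, W), dtype=np.float32)
    for m, (cy, cx) in zip(masks, centers):
        i, j = int(cy / H * S), int(cx / W * S)   # grid cell containing the center
        occupancy[i, j] = True                    # this cell is responsible for the instance
        mask_targets[i * S + j] = m               # full-resolution global mask in channel i*S+j
    return occupancy, mask_targets
```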

SOLO formulates instance segmentation as a classification-only problem and removes any dependence on regression. This makes SOLO naturally independent of object detection. SOLO and CondInst are the two works that directly operate on global masks and are truly bounding-box-free methods.

Global masks predicted by SOLO. The masks are redundant, sparse and robust to object localization error.

Resolution tradeoff

From the global masks predicted by SOLO, we can see that the masks are relatively insensitive to localization error as masks predicted by neighboring channels are quite similar. This brings up the tradeoff between the resolution (and thus precision) of object localization and instance masks.

TensorMask’s idea of a structured 4D tensor makes perfect sense in theory but is hard to realize in practice within the current NHWC tensor format. Flattening a 2D tensor with spatial semantics into a 1D vector inevitably loses some spatial detail (similar to doing semantic segmentation with fully connected networks) and struggles to represent even a low-resolution 128x128 image. Either the 2D of location or the 2D of mask has to sacrifice resolution. Most previous studies took for granted that location resolution is more important and downsampled/compressed the mask dimensions, hurting the expressiveness and quality of the masks. TensorMask tried to strike a balance, but the tedious operations led to slow training and inference. SOLO realizes that we do not need high-resolution location information and borrows from YOLO by compressing location into a coarse S² grid. In this way, SOLO keeps the high resolution of the global masks.

I naively thought SOLO could perhaps work by predicting the S² x W x H global masks as an additional flattened WH-dimensional feature attached to each of the S² grids in YOLO. I was wrong — the formulation of global masks at full resolution, instead of as a flattened vector, is actually the key to SOLO’s success.

Decoupled SOLO and Dynamic SOLO

As mentioned above, the global masks predicted by SOLO in the S² channels are quite redundant and sparse. Even at a coarse resolution of S=20, there are 400 channels and it is unlikely that there are so many objects in the picture that each of the channels contains a valid instance mask.

In Decoupled SOLO, the original mask tensor M of shape H x W x S² is replaced by two tensors X and Y, each of shape H x W x S. For an object located at grid cell (i, j), M_ij is approximated by the element-wise multiplication X_i ⊗ Y_j. This reduces 400 channels to 40, and experiments show no degradation in performance.
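A minimal sketch of the decoupled recovery, with illustrative branch outputs:

```python
import torch

S, H, W = 20, 128, 128
X = torch.rand(S, H, W)    # first branch: S mask channels
Y = torch.rand(S, H, W)    # second branch: S mask channels

def mask_at(i, j):
    """Recover the global mask of grid cell (i, j) as X_i * Y_j."""
    return torch.sigmoid(X[i]) * torch.sigmoid(Y[j])   # element-wise product, (H, W)
```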

SOLO vs Decoupled SOLO vs SOLOv2

Now it is natural to ask: can we borrow YOLACT’s prototype mask idea by predicting even fewer masks and predicting coefficients at each grid cell to combine them? SOLOv2 does exactly that.

In SOLOv2, there are two branches: a feature branch and a kernel branch. The feature branch predicts E prototype masks, and the kernel branch predicts a kernel of size D at each of the S² grid cell locations. This dynamic filter approach is the most flexible, as we saw in the YOLACT section above. When D=E, it is a simple linear combination of prototype masks (i.e., a 1x1 conv), the same as YOLACT. The paper also tried 3x3 conv kernels (D=9E). This can be taken a step further by predicting the weights and biases of a lightweight multi-layer FCN, as in CondInst.
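For the D=E case, the dynamic kernels are just per-cell 1x1 convs over the prototypes, so all S² masks come out of a single conv2d call. A minimal sketch with illustrative shapes:

```python
import torch
import torch.nn.functional as F

E, S, H, W = 32, 20, 128, 128
protos = torch.rand(1, E, H, W)     # feature branch: E prototype masks
kernels = torch.rand(S * S, E)      # kernel branch: one D=E kernel per grid cell

# Each predicted kernel acts as a 1x1 conv over the prototypes.
masks = F.conv2d(protos, kernels.view(S * S, E, 1, 1))   # (1, S*S, H, W)
```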

Now that the global mask branch is decoupled from dedicated locations, we can observe that the emerging prototype masks exhibit more complex patterns than those in SOLO. They are still position-sensitive and are more similar to those of YOLACT.

Implicit Representation of Masks

The idea of dynamic filters used in CondInst and SOLOv2 sounds glorious at first but is actually quite simple if you think of it as a natural extension of the list of coefficients used for a linear combination.

You can think of the mask as being parameterized by coefficients, by attention maps, or eventually by the dynamic filters of a small neural network head. The idea of using a neural network to dynamically encode a geometric entity has also been explored in 3D learning recently. Traditionally, a 3D shape is encoded with voxels, point clouds, or a mesh. Occupancy Networks (Occupancy Networks: Learning 3D Reconstruction in Function Space, CVPR 2019) proposed to encode the shape into a neural network by considering the continuous decision boundary of a deep neural network as a 3D surface. The network takes in a point in 3D and tells whether it is on the boundary of the encoded 3D shape. This approach allows extracting 3D meshes at any resolution during inference.

Implicit representation proposed in Occupancy Networks

Can we learn a neural network consisting of dynamic filters per object instance, so that the network takes in a point in 2D and outputs whether the point belongs to that object’s mask? This naturally outputs a global mask and can have any desired resolution. Looking back at the ablation study of CondInst, it is demonstrated that the network still works even without the prototype masks, with only the CoordConv input (which effectively performs uniform spatial sampling). As this operation is detached from the resolution of the prototype masks, it would be interesting to feed CoordConv alone at a higher resolution to get higher-resolution global masks and see if this improves performance. I strongly believe that the implicit encoding of instance masks is the future.
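To illustrate what such an implicit mask could look like, here is a minimal, entirely illustrative sketch: a tiny per-instance MLP (whose weights would be the dynamically predicted filters) maps a normalized 2D coordinate to a foreground probability and can be decoded at any resolution.

```python
import torch
import torch.nn as nn

class ImplicitMask(nn.Module):
    """A tiny MLP mapping a normalized (x, y) coordinate to mask occupancy."""
    def __init__(self, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def decode(self, H, W):
        """Sample the mask on an H x W grid; any resolution works."""
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                                torch.linspace(-1, 1, W), indexing="ij")
        pts = torch.stack([xs, ys], dim=-1).reshape(-1, 2)   # (H*W, 2) coordinates
        return torch.sigmoid(self.net(pts)).view(H, W)
```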

