Mask R-CNN Deep Dive

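Mask R-CNN, developed by He et al., extends Faster R-CNN by adding a branch that predicts a pixel-level segmentation mask for each region of interest (RoI), in parallel with the existing classification and box regression branches. It also replaces RoIPool with RoIAlign, which avoids RoIPool's spatial quantization and so preserves pixel-level alignment between the input and the extracted features.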

RoIAlign: Precise Spatial Alignment

RoIPool rounds coordinates to discrete values, causing spatial misalignments that degrade pixel-level masks. RoIAlign resolves this by avoiding coordinate quantization.

The Quantization Problem

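When projecting bounding box coordinates from input space to a downsampled feature map, RoIPool quantizes twice: once when it rounds the projected RoI boundaries to the feature-map grid, and again when it divides the rounded RoI into bins. Each rounding step can shift a boundary by up to half a feature cell; on a feature map with stride 16, for example, that is up to 8 input pixels. Shifts of this size matter little for classification but noticeably degrade pixel-accurate masks. RoIAlign instead keeps the coordinates continuous and uses bilinear interpolation to compute feature values at those locations.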
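The effect of the two rounding steps can be illustrated with a one-dimensional sketch. The function names, the stride of 16, and the example coordinates are assumptions chosen for illustration, and the RoIPool variant is simplified (real implementations typically use floor and ceil for bin starts and ends rather than a single floor).

```python
import math

def roipool_bin_edges(x1, x2, stride, num_bins):
    """Simplified RoIPool-style bin edges along one axis (hypothetical helper)."""
    # First quantization: snap the projected RoI onto the feature grid.
    f1 = round(x1 / stride)
    f2 = round(x2 / stride)
    width = max(f2 - f1, 1)
    # Second quantization: bin boundaries are floored to integer cells.
    return [f1 + math.floor(i * width / num_bins) for i in range(num_bins + 1)]

def roialign_bin_edges(x1, x2, stride, num_bins):
    """RoIAlign-style bin edges: coordinates stay continuous throughout."""
    f1 = x1 / stride
    f2 = x2 / stride
    bin_w = (f2 - f1) / num_bins
    return [f1 + i * bin_w for i in range(num_bins + 1)]

# An RoI spanning input pixels 37..203, projected onto a stride-16 feature map.
print(roipool_bin_edges(37, 203, 16, 4))   # [2, 4, 7, 10, 13]
print(roialign_bin_edges(37, 203, 16, 4))  # [2.3125, 4.90625, 7.5, 10.09375, 12.6875]
```

In this example the quantized left edge sits at cell 2 instead of 2.3125, which corresponds to an offset of 5 input pixels, and the quantized bins have unequal widths (2, 3, 3, 3) while the continuous bins are all 2.59375 cells wide.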

Bilinear Interpolation in RoIAlign

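RoIAlign divides each RoI into bins without rounding the bin boundaries, samples a small number of regularly spaced points within each bin (four in the original paper), computes the feature value at each point by bilinear interpolation from the neighboring grid cells, and aggregates the sampled values using max or average pooling.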
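The sampling procedure can be sketched in plain Python on a single-channel feature map. This is a minimal illustration rather than a reference implementation: the function names and parameters are hypothetical, integer coordinates are assumed to lie at cell centers, and library implementations may use a different half-pixel convention.

```python
import math

def bilinear(feat, y, x):
    """Bilinearly interpolate a 2D list `feat` at continuous coordinates (y, x)."""
    h, w = len(feat), len(feat[0])
    # Clamp to the valid range so samples near the border stay defined.
    y = min(max(y, 0.0), h - 1)
    x = min(max(x, 0.0), w - 1)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align(feat, roi, out_size=2, samples=2, reduce="avg"):
    """Pool `roi` = (y1, x1, y2, x2), given in feature-map coordinates,
    into an out_size x out_size grid using samples x samples points per bin."""
    y1, x1, y2, x2 = roi
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = []
    for by in range(out_size):
        row = []
        for bx in range(out_size):
            vals = []
            for sy in range(samples):
                for sx in range(samples):
                    # Regularly spaced sample points inside the bin; no rounding.
                    yy = y1 + (by + (sy + 0.5) / samples) * bin_h
                    xx = x1 + (bx + (sx + 0.5) / samples) * bin_w
                    vals.append(bilinear(feat, yy, xx))
            row.append(max(vals) if reduce == "max" else sum(vals) / len(vals))
        out.append(row)
    return out

# A 4x4 feature map whose value is x + 10*y, so interpolation is easy to check.
feat = [[x + 10 * y for x in range(4)] for y in range(4)]
pooled = roi_align(feat, (0.5, 0.5, 2.5, 2.5))
```

With the default `samples=2`, each bin is evaluated at four points, matching the four samples per bin described above. Because the example feature map is linear in x and y, average pooling returns the value at each bin center.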

Parallel Mask Branch & Loss

In Mask R-CNN, a small fully convolutional network (FCN) head runs in parallel with the classification and box regression heads and generates a binary mask for each RoI.

The Mask Branch

For each RoI, the mask branch outputs a tensor of shape [K, m, m], where m x m is the mask resolution (e.g., 28x28) and K is the number of classes. Each of the K channels is an independent binary mask produced with a per-pixel sigmoid, so classes do not compete for pixels as they would under a per-pixel softmax. The classification branch determines which channel is used.
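As a rough sketch of how one channel of the [K, m, m] output is used at inference, the snippet below picks the channel of the predicted class, applies a per-pixel sigmoid, and binarizes at 0.5. The function names and the toy logits are assumptions for illustration, and resizing the m x m mask to the RoI's size in the image is omitted.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def select_mask(mask_logits, class_id, threshold=0.5):
    """Return a binary m x m mask from a [K][m][m] nested list of logits.

    `class_id` is assumed to come from the classification branch."""
    channel = mask_logits[class_id]  # only this class's channel is used
    return [[1 if sigmoid(v) >= threshold else 0 for v in row] for row in channel]

# Toy example with K = 2 classes and m = 2.
logits = [
    [[2.0, -1.0], [0.5, -3.0]],   # channel for class 0
    [[-2.0, 1.0], [-0.5, 3.0]],   # channel for class 1
]
mask_for_class_1 = select_mask(logits, class_id=1)  # [[0, 1], [0, 1]]
```

Since each channel passes through its own sigmoid, the values in the other channels have no influence on the selected mask.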

Multi-Task Loss

The training loss on each sampled RoI is defined as: L = L_{cls} + L_{box} + L_{mask}. The mask loss L_{mask} is the average per-pixel binary cross-entropy between the predicted and ground-truth masks, computed only on the channel of the RoI's ground-truth class. The other K - 1 channels do not contribute to the loss.
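A minimal sketch of the mask loss for a single RoI follows, assuming the mask branch emits raw logits as a [K][m][m] nested list and the ground-truth mask is an m x m grid of 0/1 values. The function name and the numerically stable form of the cross-entropy are choices made here for illustration.

```python
import math

def mask_loss(mask_logits, gt_mask, gt_class):
    """Average per-pixel binary cross-entropy on the ground-truth class channel."""
    logits = mask_logits[gt_class]  # other channels are ignored entirely
    total, count = 0.0, 0
    for logit_row, target_row in zip(logits, gt_mask):
        for z, t in zip(logit_row, target_row):
            # Stable BCE with logits: max(z, 0) - z*t + log(1 + exp(-|z|))
            total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count

# With all-zero logits every pixel predicts 0.5, so the loss equals log(2).
zero_logits = [[[0.0, 0.0], [0.0, 0.0]], [[5.0, -5.0], [5.0, -5.0]]]
gt = [[1, 0], [0, 1]]
loss = mask_loss(zero_logits, gt, gt_class=0)
```

Changing the logits in channel 1 leaves this loss unchanged, which is what restricting the loss to the ground-truth class channel means in practice.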