Instance Segmentation vs. Semantic Segmentation

While semantic segmentation assigns every pixel a class label and does not distinguish between objects of the same class, instance segmentation detects and delineates each object separately. This difference matters for tasks such as crowd counting or tracking individual vehicles in traffic.


Class-Level vs. Instance-Level Labeling

Semantic segmentation groups pixels by class, whereas instance segmentation distinguishes individual objects, separating adjacent objects of the same class into distinct masks.

Delineating Adjacent Objects

If five people are standing next to each other, semantic segmentation outputs a single merged region labeled 'person', with no indication of where one person ends and the next begins. Instance segmentation outputs five distinct masks, each identified as a separate person.
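The contrast can be illustrated with a toy example. This is a minimal sketch using NumPy: the array `instance_ids`, the class index `PERSON`, and the scene layout are all made up for illustration and do not come from any particular dataset or model.

```python
import numpy as np

# Toy scene: five "people" standing side by side, each 2 pixels wide and 3 tall.
# 0 = background, 1..5 = hypothetical instance IDs.
row = [0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 0]
instance_ids = np.array([row, row, row])

PERSON = 1  # hypothetical class index for 'person'

# Semantic view: every person pixel receives the same class label,
# so the five touching people merge into one region.
semantic_map = np.where(instance_ids > 0, PERSON, 0)

# Instance view: one binary mask per object.
instance_masks = [instance_ids == i for i in np.unique(instance_ids) if i != 0]

num_person_pixels = int((semantic_map == PERSON).sum())
num_instances = len(instance_masks)
print(num_person_pixels, num_instances)
```

The semantic map only says which pixels are 'person'; counting people requires the per-object masks.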

Output Representations

Semantic output is a single 2D label map with one class index per pixel. Instance output typically contains, for each detected object, a bounding box, a confidence score, a class label, and a binary mask.
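The two output shapes can be sketched as plain arrays. The field names (`boxes`, `labels`, `scores`, `masks`), the helper `box_from_mask`, and the score values below are assumptions chosen for illustration, not the format of any specific library.

```python
import numpy as np

H, W = 4, 6

# Semantic output: one class index per pixel, shape (H, W).
semantic_map = np.zeros((H, W), dtype=np.int64)
semantic_map[1:3, 1:5] = 1  # class 1 covers a 2x4 region

# Instance output: N objects, each with its own binary mask, shape (N, H, W).
masks = np.zeros((2, H, W), dtype=bool)
masks[0, 1:3, 1:3] = True  # first object
masks[1, 1:3, 3:5] = True  # second object, touching the first

def box_from_mask(mask):
    """Tight box (x_min, y_min, x_max, y_max) in inclusive pixel coordinates."""
    ys, xs = np.nonzero(mask)
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))

instance_output = {
    "boxes": np.array([box_from_mask(m) for m in masks]),  # (N, 4)
    "labels": np.array([1, 1]),                            # (N,) class per object
    "scores": np.array([0.97, 0.91]),                      # (N,) made-up confidences
    "masks": masks,                                        # (N, H, W)
}

# Collapsing the instance masks recovers the semantic region, but not vice versa.
collapsed = np.any(instance_output["masks"], axis=0)
```

Going from instance output to a semantic map is a simple union over masks of the same class; the reverse direction loses the object boundaries.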

Methodology and Evaluation

Instance segmentation models generally combine object detection techniques with segmentation heads, and they are evaluated with detection-style metrics computed on masks rather than with per-pixel class accuracy.

Top-Down vs. Bottom-Up

Top-down models (e.g., Mask R-CNN) detect bounding boxes first and then segment the pixels inside each box. Bottom-up models label pixels first and then cluster them into separate object instances, for example by grouping learned per-pixel embeddings or predicted offsets toward instance centers.
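The bottom-up grouping step can be sketched in a few lines. This is a deliberately simplified illustration, assuming each foreground pixel predicts an offset pointing at its object's center and that pixels voting for the same cell belong to one instance. The function `group_by_center_votes` and the hand-built "perfect" offsets are hypothetical; a real model would predict noisy offsets and need a more tolerant clustering step.

```python
import numpy as np

def group_by_center_votes(foreground, offsets):
    """Group foreground pixels by the center each one votes for.

    foreground: (H, W) bool mask of object pixels.
    offsets:    (H, W, 2) float (dy, dx) from each pixel to its instance center.
    Returns an (H, W) int map of instance IDs, with 0 as background.
    """
    ys, xs = np.nonzero(foreground)
    voted = np.stack([ys, xs], axis=1) + offsets[ys, xs]
    voted = np.rint(voted).astype(int)
    # Pixels whose votes land on the same cell are treated as one instance.
    _, inverse = np.unique(voted, axis=0, return_inverse=True)
    ids = np.zeros(foreground.shape, dtype=int)
    ids[ys, xs] = inverse.reshape(-1) + 1
    return ids

# Toy scene: two touching 3x3 objects with centers at (2, 2) and (2, 5).
H, W = 5, 8
foreground = np.zeros((H, W), dtype=bool)
foreground[1:4, 1:7] = True

yy, xx = np.mgrid[0:H, 0:W]
center_x = np.where(xx <= 3, 2, 5)  # left object votes for x=2, right for x=5
offsets = np.zeros((H, W, 2))
offsets[..., 0] = 2 - yy
offsets[..., 1] = center_x - xx

ids = group_by_center_votes(foreground, offsets)
```

Although the foreground mask is one connected block, the votes split it into two instances, which is the step a purely semantic model lacks.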

Evaluation: Mask mAP

Instance segmentation is evaluated using Mask mAP, which is computed in the same way as bounding box mAP but replaces the bounding box IoU calculation with pixel-level mask IoU: IoU_{mask} = \frac{|M_{pred} \cap M_{gt}|}{|M_{pred} \cup M_{gt}|}, where M_{pred} and M_{gt} are the sets of pixels in the predicted and ground-truth masks.
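The mask IoU formula translates directly into array operations. The sketch below assumes both masks are binary arrays of the same shape; the function name `mask_iou`, the handling of two empty masks, and the 0.5 matching threshold are choices made here for illustration.

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection over union of two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0  # convention chosen here when both masks are empty
    return float(np.logical_and(pred, gt).sum() / union)

# Predicted and ground-truth masks that overlap on 4 of 8 covered pixels.
pred = np.zeros((4, 6), dtype=bool)
gt = np.zeros((4, 6), dtype=bool)
pred[0:2, 0:3] = True  # 6 pixels
gt[0:2, 1:4] = True    # 6 pixels, shifted one column right

iou = mask_iou(pred, gt)
is_match = iou >= 0.5  # example threshold; mAP averages over classes and often over several thresholds
print(iou, is_match)
```

As with box mAP, a prediction counts as a true positive only if its IoU with an unmatched ground-truth object reaches the chosen threshold.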