Instance Segmentation vs. Semantic Segmentation
While semantic segmentation assigns every pixel a class label and does not distinguish between objects of the same class, instance segmentation detects and delineates each individual object separately. This difference is critical for tasks like crowd counting or tracking individual vehicles in traffic.
Class-Level vs. Instance-Level Labeling
Semantic segmentation groups pixels by class, whereas instance segmentation distinguishes individual instances, separating adjacent objects of the same class into distinct masks.
Delineating Adjacent Objects
If five people are standing next to each other, semantic segmentation outputs a single merged region labeled 'person'. Instance segmentation outputs five distinct masks, each assigned to a different person instance.
Output Representations
Semantic output is a single 2D label map with one class ID per pixel. Instance output contains, for each detected object, a bounding box, a confidence score, a class label, and an individual binary mask.
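The two output formats can be illustrated with a small NumPy sketch. The class ID, scores, box convention, and the helper instances_to_semantic are all hypothetical choices made for this example, not the format of any particular library.

```python
import numpy as np

H, W = 4, 6
PERSON = 1  # hypothetical class ID; 0 is reserved for background here

# Instance output: one binary mask per detected object plus per-object metadata.
masks = np.zeros((2, H, W), dtype=bool)
masks[0, 1:3, 0:3] = True  # first person
masks[1, 1:3, 3:6] = True  # second person, directly adjacent to the first
labels = np.array([PERSON, PERSON])
scores = np.array([0.97, 0.91])  # made-up confidence values
# Assumed box convention: (x_min, y_min, x_max, y_max) with exclusive maxima.
boxes = np.array([[0, 1, 3, 3],
                  [3, 1, 6, 3]])


def instances_to_semantic(masks, labels, scores, height, width):
    """Collapse instance masks into a single 2D label map.

    Where masks overlap, the highest-scoring instance wins.
    """
    semantic = np.zeros((height, width), dtype=np.int64)
    # Paint low-scoring instances first so higher-scoring ones overwrite them.
    for i in np.argsort(scores):
        semantic[masks[i]] = labels[i]
    return semantic


# Semantic output: one class ID per pixel, with instance identity discarded.
semantic = instances_to_semantic(masks, labels, scores, H, W)
```

After the collapse, the two adjacent people form one region of PERSON pixels in semantic, while masks still holds them as two separate layers. The conversion only works in this direction: the label map alone cannot recover how many objects produced it.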
Methodology and Evaluation
Many instance segmentation models combine object detection techniques with segmentation heads, and they are evaluated per detected object rather than per pixel.
Top-Down vs. Bottom-Up
Top-down models (e.g., Mask R-CNN) detect bounding boxes first and then segment pixels inside each box. Bottom-up models segment pixels first and then cluster them into separate object instances, for example by grouping learned per-pixel embeddings or predicted offsets to an object center.
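The bottom-up grouping step can be sketched as follows, assuming each foreground pixel predicts an offset pointing to the center of its object. The function group_by_center_votes and the idealized, noise-free offsets are assumptions for illustration. Real models predict noisy offsets or embeddings and need a more robust clustering step.

```python
import numpy as np


def group_by_center_votes(foreground, offsets):
    """Group foreground pixels by the object center they vote for.

    foreground: (H, W) bool array from a semantic/foreground prediction.
    offsets:    (H, W, 2) float array of (dy, dx) offsets to the object center.
    Returns an (H, W) int map: 0 = background, 1..K = instance IDs.
    """
    ys, xs = np.nonzero(foreground)
    # Each pixel votes for a center: its own position plus its predicted offset.
    votes = np.stack([ys, xs], axis=1) + offsets[ys, xs]
    votes = np.rint(votes).astype(np.int64)
    # Pixels that vote for the same rounded center share one instance ID.
    _, inverse = np.unique(votes, axis=0, return_inverse=True)
    instance_map = np.zeros(foreground.shape, dtype=np.int64)
    instance_map[ys, xs] = inverse.reshape(-1) + 1
    return instance_map


# Toy scene: two touching objects that fill a 3x6 image side by side.
H, W = 3, 6
foreground = np.ones((H, W), dtype=bool)
yy, xx = np.mgrid[0:H, 0:W]
center_y = np.full((H, W), 1.0)
center_x = np.where(xx < 3, 1.0, 4.0)  # left object center x=1, right x=4
offsets = np.stack([center_y - yy, center_x - xx], axis=-1)

instance_map = group_by_center_votes(foreground, offsets)
```

In this toy scene the foreground mask is one connected region, yet the center votes split it into two instance IDs. Exact matching of rounded centers is the simplest possible grouping rule and is used here only to keep the sketch short.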
Evaluation: Mask mAP
Instance segmentation is evaluated using Mask mAP, which is computed similarly to bounding box mAP, but replaces the bounding box IoU calculation with pixel-level mask IoU: IoU_{mask} = \frac{|M_{pred} \cap M_{gt}|}{|M_{pred} \cup M_{gt}|}.
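The mask IoU formula translates directly into a few lines of NumPy. This is a minimal sketch. The function name mask_iou and the choice to return 0.0 when both masks are empty are assumptions, and evaluation toolkits may handle that edge case differently.

```python
import numpy as np


def mask_iou(pred, gt):
    """Pixel-level IoU between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Assumed convention: two empty masks have IoU 0.
        return 0.0
    intersection = np.logical_and(pred, gt).sum()
    return float(intersection / union)


# Two 8-pixel masks that overlap on one row of 4 pixels.
pred = np.zeros((4, 4), dtype=bool)
pred[0:2, :] = True
gt = np.zeros((4, 4), dtype=bool)
gt[1:3, :] = True

iou = mask_iou(pred, gt)  # intersection = 4, union = 12
```

Here the IoU is 4/12, or about 0.33. When computing mAP, a prediction counts as a true positive only if its IoU with a matched ground-truth mask reaches a chosen threshold, for example 0.5, so this prediction would be a false positive at that threshold.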