Semantic Segmentation Concepts

Semantic segmentation goes beyond bounding boxes to perform pixel-level classification. The goal is to assign a semantic class label (such as sky, road, or car) to every single pixel in the image, resulting in a dense category map.

Semantic Segmentation vs. Classification

Unlike standard classification which collapses spatial coordinates into a single label, semantic segmentation preserves spatial dimensionality throughout the network to output a mask matching the input dimensions.

Pixel-Level Mapping

Given an input image of shape [C, H, W], a semantic segmentation model outputs a tensor of shape [K, H, W], where K is the number of target classes. Applying argmax along the class dimension yields an [H, W] mask of class IDs.

Pixel-Wise Loss

Training utilizes pixel-wise Cross-Entropy loss. The loss is computed independently for each pixel coordinate and averaged over the entire spatial grid.

Architectural Foundation: Fully Convolutional Networks

Fully Convolutional Networks (FCNs) replaced fully connected layers with convolutions, allowing networks to accept variable input sizes and output spatial masks.

Spatial Downsampling & Upsampling

An FCN downsamples the image using pooling layers to extract abstract semantic features, then upsamples those features to restore the original resolution using bilinear interpolation or transposed convolutions.

Transposed Convolutions

A transposed convolution (fractionally strided convolution) is a learnable upsampling layer. It expands spatial dimensions by inserting padding and applying standard convolution filters.

<pre><code class="language-python">import torch import torch.nn as nn # Downsampled features: [batch, channels, height, width] x = torch.randn(1, 128, 16, 16) # Upsample by a factor of 2 upsample = nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1) output = upsample(x) print(output.shape) # torch.Size([1, 64, 32, 32])</pre>