SSD (Single Shot MultiBox Detector)

While YOLOv1 struggled with small objects due to its coarse grid, SSD (Single Shot MultiBox Detector) introduced multi-scale feature map predictions. By attaching detector heads to multiple layers of varying resolution, SSD handles objects of different sizes in a single pass.

Multi-Scale Feature Maps

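SSD uses a base network (VGG-16 in the original paper; ResNet and other backbones in later variants) followed by additional convolutional layers whose feature maps progressively decrease in spatial resolution. Predictions are made on several of these maps, which is what enables detection at multiple scales.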

Intermediate Layer Predictions

Earlier convolutional layers have high spatial resolution and are used to detect small objects. Later layers have lower spatial resolution but larger receptive fields, so they are used to detect larger objects.
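To make the resolution argument concrete, the short sketch below computes how many input pixels one grid cell spans on each prediction layer. The 300-pixel input and the grid sizes 38, 19, 10, 5, 3 and 1 are the values commonly cited for the SSD300 configuration and are assumed here for illustration only.

```python
# Assumed SSD300-style configuration (illustrative, not taken from the text above).
INPUT_SIZE = 300
FEATURE_MAP_SIZES = [38, 19, 10, 5, 3, 1]


def cell_size_in_pixels(input_size, fmap_size):
    """Approximate width, in input pixels, of the region one grid cell is centered on."""
    return input_size / fmap_size


cell_sizes = [cell_size_in_pixels(INPUT_SIZE, f) for f in FEATURE_MAP_SIZES]

for f, c in zip(FEATURE_MAP_SIZES, cell_sizes):
    print(f"{f:>2}x{f:<2} grid -> each cell spans about {c:.1f} px")
```

The finest grid places a cell roughly every 8 pixels, which suits small objects, while the final 1x1 map has a single cell covering the whole image.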

Default Boxes (Priors)

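Similar to Faster R-CNN's anchors, SSD places a set of default boxes (priors) with different aspect ratios at every cell of each feature map. Each feature map is assigned its own box scale, so together the boxes cover a broad range of shapes and sizes. For every default box, the network predicts class scores and four offsets that adjust the box toward the object.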
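The following is a minimal sketch of how such default boxes could be generated. It assumes linearly spaced scales between 0.2 and 0.9, box shapes of width s*sqrt(a) and height s/sqrt(a) for aspect ratio a, one extra square box per cell at an intermediate scale, and the SSD300-style grid sizes and ratio sets listed in the code. The helper names `layer_scales` and `default_boxes` are hypothetical, and released implementations differ in details such as the scale of the first layer.

```python
import math


def layer_scales(num_layers, s_min=0.2, s_max=0.9):
    """Linearly spaced scales, one per layer, plus one extra used by the last layer."""
    step = (s_max - s_min) / (num_layers - 1)
    return [s_min + step * k for k in range(num_layers + 1)]


def default_boxes(fmap_size, scale, next_scale, aspect_ratios):
    """Return (cx, cy, w, h) boxes, normalized to [0, 1], for one square feature map."""
    shapes = [(scale * math.sqrt(ar), scale / math.sqrt(ar)) for ar in aspect_ratios]
    # One extra square box whose scale lies between this layer and the next.
    extra = math.sqrt(scale * next_scale)
    shapes.append((extra, extra))

    boxes = []
    for row in range(fmap_size):
        for col in range(fmap_size):
            # Boxes are centered on the middle of each grid cell.
            cx = (col + 0.5) / fmap_size
            cy = (row + 0.5) / fmap_size
            boxes.extend((cx, cy, w, h) for w, h in shapes)
    return boxes


# Assumed SSD300-style configuration (illustrative).
FEATURE_MAP_SIZES = [38, 19, 10, 5, 3, 1]
FEW = [1.0, 2.0, 0.5]
MANY = [1.0, 2.0, 0.5, 3.0, 1.0 / 3.0]
ASPECT_RATIOS = [FEW, MANY, MANY, MANY, FEW, FEW]

scales = layer_scales(len(FEATURE_MAP_SIZES))
all_boxes = []
for k, (f, ratios) in enumerate(zip(FEATURE_MAP_SIZES, ASPECT_RATIOS)):
    all_boxes.extend(default_boxes(f, scales[k], scales[k + 1], ratios))

print(len(all_boxes))  # 8732
```

Under these assumptions the six layers contribute 8732 default boxes in total, most of them from the finest 38x38 grid.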

SSD Training Dynamics

Training SSD involves matching default boxes to ground truth boxes and balancing the loss function so that the overwhelming number of background (negative) boxes does not dominate it.
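A minimal sketch of the matching step, assuming the overlap measure is intersection over union (IoU) with a threshold of 0.5 and boxes given in corner form (x_min, y_min, x_max, y_max). The function names `iou` and `match` are hypothetical. Every default box whose best overlap exceeds the threshold becomes a positive, and each ground truth is additionally guaranteed its single best default box, so no object is left unmatched.

```python
def iou(a, b):
    """Intersection over union of two boxes in (x_min, y_min, x_max, y_max) form."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match(default_boxes, gt_boxes, threshold=0.5):
    """For each default box, return the index of its ground truth, or -1 for background."""
    matches = [-1] * len(default_boxes)
    # Threshold rule: a default box is positive if its best overlap exceeds the threshold.
    for d, dbox in enumerate(default_boxes):
        best_iou, best_g = 0.0, -1
        for g, gbox in enumerate(gt_boxes):
            overlap = iou(dbox, gbox)
            if overlap > best_iou:
                best_iou, best_g = overlap, g
        if best_iou > threshold:
            matches[d] = best_g
    # Best-box rule: every ground truth claims its highest-overlap default box,
    # even when that overlap is below the threshold. Applied last so it takes priority.
    for g, gbox in enumerate(gt_boxes):
        best_d = max(range(len(default_boxes)), key=lambda d: iou(default_boxes[d], gbox))
        matches[best_d] = g
    return matches


priors = [(0.0, 0.0, 0.4, 0.4), (0.1, 0.1, 0.5, 0.5), (0.6, 0.6, 1.0, 1.0), (0.5, 0.5, 0.7, 0.7)]
ground_truth = [(0.05, 0.05, 0.45, 0.45), (0.55, 0.55, 0.65, 0.65)]
matches = match(priors, ground_truth)
print(matches)  # [0, 0, -1, 1]
```

In this toy example the small second object overlaps its best prior with an IoU of only 0.25, so it is matched through the best-box rule alone, while the third prior stays background.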

Hard Negative Mining

Since most default boxes do not contain objects, there is an extreme class imbalance between positive and negative samples. SSD addresses this by sorting the negative boxes by their confidence loss and keeping only the highest-loss ones, so that the ratio of negatives to positives is at most 3:1 during training.
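The selection can be sketched as follows, assuming a per-box confidence loss has already been computed and that positives are marked by the matching step. The function name `hard_negative_mask` and the sample loss values are hypothetical.

```python
def hard_negative_mask(conf_losses, is_positive, neg_pos_ratio=3):
    """Return a boolean list marking which negative boxes are kept for the loss."""
    num_pos = sum(is_positive)
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # Sort negatives so the ones the model gets most wrong come first.
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    # Keep at most neg_pos_ratio negatives per positive.
    keep = set(negatives[: neg_pos_ratio * num_pos])
    return [i in keep for i in range(len(conf_losses))]


conf_losses = [0.9, 0.1, 2.3, 0.05, 1.7, 0.4, 0.02, 3.1]
is_positive = [True, False, False, False, False, False, False, False]
kept = hard_negative_mask(conf_losses, is_positive)
print(kept)  # [False, False, True, False, True, False, False, True]
```

With one positive, only the three negatives with the largest losses (indices 7, 2 and 4) contribute to the classification loss; the remaining easy negatives are ignored for this training step.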

Inference and Multi-Scale Loss

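During inference, predictions from all feature maps are combined, low-confidence boxes are discarded, and redundant overlapping boxes are removed with non-maximum suppression (NMS). The model is trained with a weighted sum of a Smooth L1 localization loss, computed over the matched (positive) boxes, and a softmax cross-entropy classification loss.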
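Two of the pieces named above can be sketched briefly: removal of redundant boxes and the Smooth L1 term. The sketch assumes the redundancy removal is greedy non-maximum suppression with an IoU threshold of 0.45 and applies it to boxes of a single class; in practice it is typically run per class. It uses the common Smooth L1 form with the transition at an absolute error of 1. The function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes in (x_min, y_min, x_max, y_max) form."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


def smooth_l1(x):
    """Smooth L1: quadratic for small errors, linear for large ones."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


boxes = [(0.10, 0.10, 0.50, 0.50), (0.12, 0.11, 0.52, 0.49), (0.60, 0.60, 0.90, 0.90)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]
print(smooth_l1(0.5), smooth_l1(2.0))  # 0.125 1.5
```

The second box overlaps the first almost entirely and has a lower score, so it is suppressed. The linear tail of Smooth L1 keeps a single badly localized box from dominating the regression loss.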