Image Classification vs. Object Detection

Image classification and object detection are two of the fundamental tasks in computer vision. Classification identifies what is in an image; object detection identifies both what the objects are and where they are, extending the prediction task from assigning a single label to localizing each object in space.

Classification vs Detection Paradigm

The core difference lies in the format and complexity of the output. Image classification maps an input image to a single probability distribution over classes, whereas object detection maps an image to a variable number of bounding boxes, each with its own class label and confidence score.
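The contrast in output shape can be sketched in plain Python. This is only an illustration: the `softmax` helper, the three-class logits, and the dictionary keys are hypothetical and not tied to any particular library.

```python
import math

def softmax(logits):
    """Convert raw class scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Classification: one image -> one fixed-length distribution over classes.
logits = [2.0, 0.5, -1.0]  # hypothetical scores for 3 classes
probs = softmax(logits)
predicted_class = max(range(len(probs)), key=probs.__getitem__)

# Detection: one image -> a list whose length depends on the image content.
detections = [
    {"box": [10.0, 20.0, 150.0, 200.0], "label": 3, "score": 0.95},
    {"box": [30.0, 40.0, 90.0, 120.0], "label": 1, "score": 0.80},
]
```

The classification result always has one entry per class, while the detection list grows or shrinks with the number of objects found.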

Task Definition & Outputs

In classification, the output is a vector of logits, one score per class. In detection, the network must output, for each object, a class label and four regression coordinates describing the bounding box, either as corners [x_{min}, y_{min}, x_{max}, y_{max}] or as center and size [x_{center}, y_{center}, w, h].
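The two box encodings carry the same information, so converting between them is simple arithmetic. A minimal sketch, with hypothetical function names `xyxy_to_cxcywh` and `cxcywh_to_xyxy`:

```python
def xyxy_to_cxcywh(box):
    """[x_min, y_min, x_max, y_max] -> [x_center, y_center, w, h]."""
    x_min, y_min, x_max, y_max = box
    w = x_max - x_min
    h = y_max - y_min
    return [x_min + w / 2.0, y_min + h / 2.0, w, h]

def cxcywh_to_xyxy(box):
    """[x_center, y_center, w, h] -> [x_min, y_min, x_max, y_max]."""
    cx, cy, w, h = box
    return [cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0]

corner_box = [10.0, 20.0, 150.0, 200.0]
center_box = xyxy_to_cxcywh(corner_box)   # [80.0, 110.0, 140.0, 180.0]
round_trip = cxcywh_to_xyxy(center_box)   # back to the corner format
```

For tensor inputs, torchvision provides `torchvision.ops.box_convert`, which performs the same conversions in batch.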

PyTorch Output Representation

For detection models, PyTorch's torchvision library returns a list of dicts, one per image, rather than a single tensor of class scores.

<pre><code class="language-python">import torch

# Standard classification output
class_output = torch.randn(2, 10)  # batch_size=2, num_classes=10

# Detection output: list of dictionaries per image
detection_output = [
    {
        'boxes': torch.tensor([[10.0, 20.0, 150.0, 200.0]], dtype=torch.float32),  # [x1, y1, x2, y2]
        'labels': torch.tensor([3], dtype=torch.int64),  # Class index
        'scores': torch.tensor([0.95], dtype=torch.float32)  # Confidence score
    }
]</code></pre>

Localization Challenges

Adding spatial localization introduces two primary challenges: handling a variable number of objects in an image and optimizing a multi-task loss function.

Variable Output Lengths

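Unlike classification, where the output size is fixed (equal to the number of classes), the number of objects in an image is unknown in advance. Object detection models handle this with techniques such as anchor grids, region proposals, or dense sliding-window predictions. These generate a fixed set of candidate boxes that is then reduced to the final detections, typically by score thresholding and non-maximum suppression.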
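One common way to turn a fixed set of candidates into a variable-length result is to drop low-scoring boxes and then apply non-maximum suppression (NMS). The sketch below assumes that approach; the function names, thresholds, and candidate boxes are illustrative, and the suppression is class-agnostic for brevity.

```python
def iou(a, b):
    """Intersection over union of two [x_min, y_min, x_max, y_max] boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def filter_detections(boxes, scores, score_thresh=0.5, iou_thresh=0.5):
    """Return indices of boxes kept after thresholding and greedy NMS."""
    # Keep only confident candidates, highest score first.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thresh),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept = []
    for i in order:
        # Keep a box only if it does not heavily overlap a box already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in kept):
            kept.append(i)
    return kept

candidates = [
    [10.0, 20.0, 150.0, 200.0],    # strong detection
    [12.0, 22.0, 148.0, 198.0],    # near-duplicate of the first box
    [300.0, 300.0, 400.0, 400.0],  # separate object
    [0.0, 0.0, 50.0, 50.0],        # low-confidence candidate
]
scores = [0.95, 0.90, 0.80, 0.10]
kept = filter_detections(candidates, scores)  # [0, 2]
```

Four candidates go in and two detections come out, so the output length is determined by the image content rather than by the network architecture.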

Multi-Task Loss Functions

To train a detector, we combine classification loss (e.g., cross-entropy) and regression loss (e.g., Smooth L1 or GIoU loss) into a single multi-task loss function: L = L_{cls} + \lambda L_{box}, where \lambda balances the two objectives.
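As a rough sketch of the combined objective for a single matched box, the code below pairs cross-entropy with Smooth L1 in plain Python. The function names, the `beta` parameter, and the sample values are assumptions for illustration; real detectors also handle matching predictions to ground truth, which is omitted here.

```python
import math

def cross_entropy(logits, target):
    """Negative log-likelihood of the target class under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 averaged over coordinates: quadratic near 0, linear beyond beta."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)

def detection_loss(cls_logits, cls_target, box_pred, box_target, lam=1.0):
    """L = L_cls + lam * L_box for one predicted box."""
    l_cls = cross_entropy(cls_logits, cls_target)
    l_box = smooth_l1(box_pred, box_target)
    return l_cls + lam * l_box

# Hypothetical prediction for one object, with boxes in normalized coordinates.
cls_logits = [2.0, 0.5, -1.0]
box_pred = [0.50, 0.50, 0.40, 0.40]
box_true = [0.50, 0.50, 0.50, 0.50]
loss = detection_loss(cls_logits, 0, box_pred, box_true, lam=2.0)
```

Raising `lam` makes box errors count more heavily relative to classification errors; setting it to zero reduces the objective to pure classification.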