Two-Stage Detectors: R-CNN and Fast R-CNN

Two-stage detectors split object detection into two steps: region proposal extraction, followed by classification of each proposed region. R-CNN pioneered this approach. Fast R-CNN later made it much faster by computing a single shared feature map per image and extracting per-region features with Region of Interest (RoI) pooling.

R-CNN: Region-based CNN

R-CNN uses a standalone proposal algorithm (Selective Search) to identify candidate regions, warps each region to a fixed size, and processes each independently through a CNN.

The Proposal-First Pipeline

Selective Search proposes ~2000 regions per image. R-CNN crops each region, warps it to the CNN's fixed input size, and runs it through the CNN to extract a feature vector. Finally, per-class SVMs classify the features, and a linear regressor refines the box boundaries.
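The crop, warp, and extract steps can be sketched as follows. This is a minimal illustration rather than the original implementation: the tiny `backbone`, the `rcnn_features` helper, the 64x64 warp size, and the hard-coded proposals are all assumptions made for the example, and Selective Search, the SVMs, and the box regressor are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-in for the feature-extraction CNN (not the original network).
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)


def rcnn_features(image, proposals, warp_size=(64, 64)):
    """Crop each proposal, warp it to warp_size, and run one CNN pass per crop."""
    feats = []
    for x1, y1, x2, y2 in proposals:
        crop = image[:, :, y1:y2, x1:x2]            # crop in pixel coordinates
        warped = F.interpolate(crop, size=warp_size,
                               mode="bilinear", align_corners=False)
        feats.append(backbone(warped))              # one forward pass per region
    return torch.cat(feats, dim=0)


# Random image and made-up proposals standing in for Selective Search output.
image = torch.randn(1, 3, 200, 300)
proposals = [(10, 20, 110, 160), (50, 50, 90, 120), (0, 0, 300, 200)]

with torch.no_grad():
    features = rcnn_features(image, proposals)

print(features.shape)  # one 8-dimensional feature vector per proposal
```

Note that the loop calls the backbone once per proposal, which is the cost the next section discusses.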

Computational Bottlenecks

Processing 2000 regions separately requires roughly 2000 CNN forward passes per image. Because the proposals overlap heavily, much of this computation is redundant. With a deep backbone such as VGG16, inference took nearly 47 seconds per image on a GPU, far too slow for real-time use.

Fast R-CNN: Shared Features

Fast R-CNN resolved R-CNN's main bottleneck by passing the entire image through the CNN once and using RoI Pooling to extract region-specific features directly from the shared feature map.
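A rough sketch of the shared-feature idea is shown below. The two-layer `backbone` and its total stride of 4 are assumptions chosen for the example, not the network used by Fast R-CNN. The image passes through the CNN once, and each proposal is then mapped from pixel coordinates to feature-map coordinates by dividing by the stride.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical convolutional backbone: two stride-2 convolutions, total stride 4.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
)
stride = 4

image = torch.randn(1, 3, 200, 300)
# Made-up proposals in image pixel coordinates: [x1, y1, x2, y2]
proposals = torch.tensor([[10.0, 20.0, 110.0, 160.0],
                          [50.0, 50.0, 90.0, 120.0]])

with torch.no_grad():
    feature_map = backbone(image)   # a single forward pass for the whole image

# Project every proposal onto the shared feature map.
fm_boxes = proposals / stride

print(feature_map.shape)  # torch.Size([1, 32, 50, 75])
print(fm_boxes)
```

However many proposals there are, the backbone runs only once; the per-region work is reduced to reading from `feature_map`.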

RoI Pooling Concept

RoI Pooling takes a region of arbitrary size on the feature map, divides it into a fixed grid of bins (e.g., 7x7), and max-pools the values inside each bin. The output therefore has the same shape regardless of the region's size, which lets the subsequent fully connected layers receive uniform inputs.
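To make the idea concrete, here is a simplified sketch that uses `torch.nn.functional.adaptive_max_pool2d` to do the binning. The helper name `simple_roi_pool` is hypothetical, boxes are assumed to be integer, inclusive feature-map coordinates, and the bin boundaries may differ from those of a production RoI pooling implementation.

```python
import torch
import torch.nn.functional as F


def simple_roi_pool(feature_map, box, output_size=(7, 7)):
    """Max-pool one feature-map region down to a fixed output_size.

    feature_map: tensor of shape [channels, height, width]
    box: integer (x1, y1, x2, y2) in feature-map cells, inclusive
    """
    x1, y1, x2, y2 = box
    region = feature_map[:, y1:y2 + 1, x1:x2 + 1]
    # adaptive_max_pool2d splits the region into output_size bins
    # and takes the maximum inside each bin.
    pooled = F.adaptive_max_pool2d(region.unsqueeze(0), output_size)
    return pooled.squeeze(0)


# A feature map with distinct values so the pooling is easy to inspect.
feature_map = torch.arange(2 * 20 * 20, dtype=torch.float32).reshape(2, 20, 20)

small = simple_roi_pool(feature_map, (2, 3, 9, 12))    # 8 wide, 10 tall
large = simple_roi_pool(feature_map, (0, 0, 19, 19))   # the full 20x20 map

print(small.shape, large.shape)  # both are [2, 7, 7]
```

Regions of different sizes come out with identical shapes, which is the property the fully connected layers rely on.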

PyTorch RoIPool

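Torchvision provides this operation as torchvision.ops.roi_pool. It takes the shared feature map, the regions as rows of [image_idx, x1, y1, x2, y2], the output grid size, and a spatial_scale factor that maps box coordinates onto feature-map coordinates.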

<pre><code class="language-python">import torch
import torchvision.ops as ops

# Simulated shared feature map: [batch, channels, height, width]
features = torch.randn(1, 64, 50, 50)

# Simulated regions: [image_idx, x1, y1, x2, y2]
rois = torch.tensor([[0.0, 10.0, 10.0, 40.0, 40.0]], dtype=torch.float32)

# Pool features to a fixed 7x7 size
pooled_feats = ops.roi_pool(features, rois, output_size=(7, 7), spatial_scale=1.0)
print(pooled_feats.shape)  # torch.Size([1, 64, 7, 7])
</code></pre>