One-Stage Detectors: YOLO (You Only Look Once)
Unlike two-stage models, one-stage detectors skip the region proposal step entirely. YOLO (You Only Look Once) pioneered this class of detectors, framing object detection as a single regression problem that maps pixels directly to bounding boxes and class probabilities.
The YOLO Grid Concept
YOLO divides the input image into an $S \times S$ grid. If an object's center falls into a grid cell, that cell is responsible for predicting the object's presence, location, and class.
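The cell assignment described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `responsible_cell`, the corner-format box `(x_min, y_min, x_max, y_max)`, and the 448-pixel image size are assumptions made for the example.

```python
def responsible_cell(box_xyxy, img_w, img_h, S=7):
    """Return the (row, col) of the grid cell that contains the box center.

    box_xyxy is (x_min, y_min, x_max, y_max) in pixels (assumed format).
    """
    x_min, y_min, x_max, y_max = box_xyxy
    cx = (x_min + x_max) / 2.0
    cy = (y_min + y_max) / 2.0
    # Scale the center into grid units, then clamp so a center lying
    # exactly on the right or bottom edge still maps to a valid cell.
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col


# A box centered at (200, 300) in a 448 x 448 image with a 7 x 7 grid.
print(responsible_cell((100, 200, 300, 400), img_w=448, img_h=448))  # (4, 3)
```

Only this one cell is trained to predict the object; its neighbors are not, even if the box overlaps them.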
Single Forward Pass
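Each grid cell predicts $B$ bounding boxes, each with four coordinates $[x, y, w, h]$ and a confidence score, plus $C$ conditional class probabilities that are shared by all of the cell's boxes. This yields an output tensor of shape $[S, S, B \times 5 + C]$, computed in a single forward pass. With $S = 7$, $B = 2$, and $C = 20$, as in the original PASCAL VOC configuration, the output is $7 \times 7 \times 30$.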
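To make the tensor layout concrete, the sketch below splits a stand-in output into its box and class parts using NumPy. It assumes the $B \times 5$ box values come first along the last axis, followed by the $C$ class probabilities; actual implementations may order the channels differently. The random tensor only stands in for a real network output.

```python
import numpy as np

S, B, C = 7, 2, 20          # grid size, boxes per cell, number of classes
depth = B * 5 + C           # 2 * 5 + 20 = 30 values per cell

rng = np.random.default_rng(0)
pred = rng.random((S, S, depth))   # placeholder for the network output

# Assumed layout: [box_1 (5), ..., box_B (5), class probabilities (C)].
boxes = pred[..., : B * 5].reshape(S, S, B, 5)   # [x, y, w, h, confidence]
class_probs = pred[..., B * 5:]                  # shape (S, S, C)

# Class-specific score per box: box confidence times the cell's
# conditional class probabilities (broadcast over the B boxes).
scores = boxes[..., 4:5] * class_probs[:, :, None, :]

print(pred.shape, boxes.shape, class_probs.shape, scores.shape)
```

Because the class probabilities are stored once per cell rather than once per box, every box in a cell is scored against the same class distribution.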
The Unified Loss Function
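YOLO uses a multi-part loss function that combines coordinate regression (sum-squared error), confidence loss (with separate terms for cells that contain objects and cells that do not), and class probability loss: $L_{YOLO} = \lambda_{coord} L_{coord} + L_{obj} + \lambda_{noobj} L_{noobj} + L_{class}$. The weight $\lambda_{coord}$ raises the importance of localization error, while $\lambda_{noobj}$ scales down the confidence loss from the many cells that contain no object, which would otherwise dominate the gradient.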
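The four terms can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the training code of any YOLO release. The function name `yolo_loss` and the `resp` mask (1.0 where a box is responsible for an object) are hypothetical; a real implementation would derive that mask by picking the predicted box with the highest IoU against the ground truth. The default weights of 5.0 and 0.5 and the square root applied to widths and heights follow the original YOLO paper rather than the text above.

```python
import numpy as np


def yolo_loss(pred_boxes, pred_cls, tgt_boxes, tgt_cls, resp,
              lambda_coord=5.0, lambda_noobj=0.5):
    """Sum-squared-error sketch of the multi-part YOLO loss.

    pred_boxes, tgt_boxes: (S, S, B, 5) as [x, y, w, h, confidence]
    pred_cls, tgt_cls:     (S, S, C) conditional class probabilities
    resp:                  (S, S, B), 1.0 where the box is responsible
                           for an object, else 0.0 (assumed given)
    """
    noobj = 1.0 - resp
    cell_has_obj = resp.max(axis=-1)              # (S, S)

    # Coordinate term: squared error on centers and on sqrt(w), sqrt(h).
    xy_err = ((pred_boxes[..., 0:2] - tgt_boxes[..., 0:2]) ** 2).sum(-1)
    pred_wh = np.sqrt(np.clip(pred_boxes[..., 2:4], 0.0, None))
    tgt_wh = np.sqrt(np.clip(tgt_boxes[..., 2:4], 0.0, None))
    wh_err = ((pred_wh - tgt_wh) ** 2).sum(-1)
    l_coord = (resp * (xy_err + wh_err)).sum()

    # Confidence terms, split by whether a box is responsible for an object.
    conf_err = (pred_boxes[..., 4] - tgt_boxes[..., 4]) ** 2
    l_obj = (resp * conf_err).sum()
    l_noobj = (noobj * conf_err).sum()

    # Class term: only cells that contain an object contribute.
    cls_err = ((pred_cls - tgt_cls) ** 2).sum(-1)
    l_class = (cell_has_obj * cls_err).sum()

    return lambda_coord * l_coord + l_obj + lambda_noobj * l_noobj + l_class


# Tiny check: one empty cell whose only error is a confidence of 0.5.
pb = np.zeros((1, 1, 1, 5)); pb[..., 4] = 0.5
tb = np.zeros((1, 1, 1, 5))
cls = np.zeros((1, 1, 1))
empty = np.zeros((1, 1, 1))
print(yolo_loss(pb, cls, tb, cls, empty))   # 0.5 * 0.5**2 = 0.125
```

In the tiny check, the false-positive confidence is penalized at half weight because the cell holds no object, which is the down-weighting role of `lambda_noobj`.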
Real-Time Performance and Limitations
By processing images in a single step, YOLO models achieve remarkable speed, enabling real-time detection on video feeds, though early versions struggled with small or clustered objects.
Inference Speed Advantage
Because the network has no separate region proposal step, inference is fast: the original YOLO was reported to run at 45 frames per second on a Titan X GPU. That speed has made the YOLO family a common choice for video analytics and mobile deployment.
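A frame rate translates directly into a per-frame time budget, which is a convenient way to check whether a detector keeps up with a video stream. The helper names and the latency values below are made up for illustration; they are not measurements of any model.

```python
def frame_budget_ms(target_fps):
    """Time available per frame, in milliseconds, at a target frame rate."""
    return 1000.0 / target_fps


def achieved_fps(latencies_ms):
    """Average throughput for a list of per-frame latencies in milliseconds."""
    return 1000.0 * len(latencies_ms) / sum(latencies_ms)


budget = frame_budget_ms(45)          # roughly 22.2 ms per frame
measured = [20.0, 22.0, 24.0]         # hypothetical per-frame latencies
print(round(budget, 1), round(achieved_fps(measured), 1))
```

At 45 frames per second the whole pipeline, including pre- and post-processing, has to finish in about 22 ms per frame.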
Grid Cell Constraints
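Early YOLO versions limited each grid cell to a small, fixed number of boxes that all share a single class prediction. When several object centers fall into the same cell, the model cannot represent all of them, so it struggles with groups of small objects (like flocks of birds) and with heavily overlapping items.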
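The constraint is easy to see by counting how many object centers land in each cell. The sketch below is illustrative only: the function name `crowded_cells` and the center coordinates are hypothetical, and the 448-pixel image with a 7 x 7 grid is an assumed configuration.

```python
from collections import defaultdict


def crowded_cells(centers, img_w, img_h, S=7):
    """Map each object center to its grid cell and return the cells
    that hold more than one object, as {(row, col): [object indices]}."""
    cells = defaultdict(list)
    for idx, (cx, cy) in enumerate(centers):
        col = min(int(cx / img_w * S), S - 1)
        row = min(int(cy / img_h * S), S - 1)
        cells[(row, col)].append(idx)
    return {cell: ids for cell, ids in cells.items() if len(ids) > 1}


# Three small, tightly grouped objects and one isolated object.
birds = [(100, 100), (110, 105), (120, 95), (400, 400)]
print(crowded_cells(birds, img_w=448, img_h=448))   # {(1, 1): [0, 1, 2]}
```

The three grouped objects compete for one cell's boxes and single class prediction, so with two boxes per cell at least one of them cannot be detected.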