Max Pooling and Average Pooling Layers
Pooling layers perform spatial downsampling on feature maps, reducing the computational footprint and introducing a degree of translation invariance.
Pooling Mechanics
Pooling layers reduce spatial dimensions by summarizing local patches using maximum or average values.
Max Pooling
Max pooling extracts the maximum value within a sliding spatial window. For a window of size \\(P \\times P\\), the window slides across the feature map and, at each position, only the largest activation is kept. Mathematically, it operates on a local region \\(R\\):
\\(y = \\max_{(i, j) \\in R} x_{i, j}\\)
This operation keeps the most prominent features, such as sharp edges or textures, and discards weaker background activations. Because only the peak activation survives, max pooling is robust to small spatial translations of the input features: if a feature shifts slightly but stays within the same pooling window, the output is unchanged. This gives the network a degree of local translation invariance, not invariance to arbitrary shifts.
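The behavior described above can be checked on a tiny input. The following is a minimal sketch using `torch.nn.functional.max_pool2d`; the 4x4 values and the single-peak maps `a`, `b`, and `c` are made up purely for illustration.

```python
import torch
import torch.nn.functional as F

# One 4x4 feature map, shaped (N=1, C=1, H=4, W=4). Values are illustrative.
x = torch.tensor([[1., 3., 2., 0.],
                  [4., 9., 1., 1.],
                  [0., 2., 5., 6.],
                  [1., 1., 7., 8.]]).reshape(1, 1, 4, 4)

# 2x2 windows with stride 2: each output is the max of one non-overlapping patch.
pooled = F.max_pool2d(x, kernel_size=2, stride=2)
print(pooled.squeeze())
# tensor([[9., 2.],
#         [2., 8.]])


def single_peak(row: int, col: int) -> torch.Tensor:
    """Hypothetical helper: an all-zero 4x4 map with one activation of 1.0."""
    t = torch.zeros(1, 1, 4, 4)
    t[0, 0, row, col] = 1.0
    return t


a = single_peak(0, 0)
b = single_peak(1, 1)  # shifted by one pixel, still inside the top-left window
c = single_peak(2, 2)  # shifted far enough to land in a different window

same_window = torch.equal(F.max_pool2d(a, 2, 2), F.max_pool2d(b, 2, 2))
crossed_window = torch.equal(F.max_pool2d(a, 2, 2), F.max_pool2d(c, 2, 2))
print(same_window, crossed_window)  # True False
```

The last line illustrates that the invariance is local: the pooled output only stays the same while the peak remains inside one window.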
Average Pooling
Average pooling computes the average value within the sliding window:
\\(y = \\frac{1}{|R|} \\sum_{(i, j) \\in R} x_{i, j}\\)
This operation smooths the feature map, capturing the general activation level rather than the peak feature. While average pooling was popular in older networks, it can wash out sharp local features. In modern networks, average pooling is primarily used at the end of the network (Global Average Pooling) to collapse spatial dimensions before classification.
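To make the smoothing effect concrete, here is a small sketch that applies `torch.nn.functional.avg_pool2d` and `max_pool2d` to the same made-up 4x4 feature map. The input values are an assumption chosen so the averages are easy to verify by hand.

```python
import torch
import torch.nn.functional as F

# One 4x4 feature map, shaped (N=1, C=1, H=4, W=4). Values are illustrative.
x = torch.tensor([[1., 3., 2., 0.],
                  [4., 9., 1., 1.],
                  [0., 2., 5., 6.],
                  [1., 1., 7., 8.]]).reshape(1, 1, 4, 4)

avg_out = F.avg_pool2d(x, kernel_size=2, stride=2)
max_out = F.max_pool2d(x, kernel_size=2, stride=2)

print(avg_out.squeeze())
# tensor([[4.2500, 1.0000],
#         [1.0000, 6.5000]])
print(max_out.squeeze())
# tensor([[9., 2.],
#         [2., 8.]])
```

In the top-left window the isolated peak of 9 is reported as 4.25 by average pooling, which is the "washing out" of sharp local features mentioned above.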
Pooling Hyperparameters and Shapes
Pooling layers use fixed mathematical functions without learnable weights to downsample features.
Window Size and Stride
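Pooling layers have two main hyperparameters: kernel size (window size) and stride. Unlike convolutional layers, pooling is typically configured with a stride equal to the kernel size (e.g., a 2x2 pool with a stride of 2). This prevents the windows from overlapping. With a 2x2 window and a stride of 2, each spatial dimension is halved (rounded down when the input size is odd), so the feature map keeps one quarter of its original spatial positions.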
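The relationship between kernel size, stride, and output size can be written as a short helper. This is a sketch under stated assumptions: `pooled_size` is a hypothetical function (not a PyTorch API), and it assumes no dilation and floor rounding, which matches the default behavior of PyTorch's 2D pooling layers.

```python
from typing import Optional


def pooled_size(size: int, kernel: int, stride: Optional[int] = None,
                padding: int = 0) -> int:
    """Output length along one spatial dimension after pooling.

    Assumes dilation of 1 and floor rounding (no ceil mode).
    If stride is omitted, it defaults to the kernel size.
    """
    if stride is None:
        stride = kernel
    return (size + 2 * padding - kernel) // stride + 1


print(pooled_size(32, kernel=2, stride=2))  # 16: even input is halved exactly
print(pooled_size(7, kernel=2, stride=2))   # 3: odd input is rounded down
print(pooled_size(32, kernel=3, stride=2))  # 15: overlapping 3x3 windows
```

The third call shows that a stride smaller than the kernel size gives overlapping windows and an output that is not exactly half the input.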
Pooling layers do not contain any learnable parameters; they perform fixed mathematical operations. They therefore add nothing to the model's parameter count, and the smaller feature maps they produce reduce the memory and computation required by subsequent layers.
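The absence of learnable parameters is easy to confirm. The sketch below compares a pooling layer with a strided convolution that performs the same 2x downsampling; the 16-channel convolution and the `count_params` helper are illustrative assumptions.

```python
import torch.nn as nn


def count_params(module: nn.Module) -> int:
    """Hypothetical helper: total number of learnable values in a module."""
    return sum(p.numel() for p in module.parameters())


pool = nn.MaxPool2d(kernel_size=2, stride=2)
# A strided convolution that also halves H and W, shown for comparison only.
conv = nn.Conv2d(in_channels=16, out_channels=16, kernel_size=2, stride=2)

pool_params = count_params(pool)
conv_params = count_params(conv)
print(pool_params)  # 0
print(conv_params)  # 1040 = 16 * 16 * 2 * 2 weights + 16 biases
```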
Global Average Pooling (GAP)
Global Average Pooling collapses the spatial dimensions \\((H, W)\\) of a tensor to \\(1 \\times 1\\) by averaging all values in each channel. For a tensor of shape \\((N, C, H, W)\\), GAP produces a tensor of shape \\((N, C, 1, 1)\\).
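This operation can replace the large fully connected layers at the end of a CNN, typically leaving only a small linear classifier that operates on the \\(C\\) pooled values. This drastically reduces the parameter count and, with it, the risk of overfitting. It also lets the model accept variable input sizes, since the size of the GAP output is determined only by the channel count, not by \\(H\\) or \\(W\\).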
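The parameter savings can be illustrated with a rough comparison. The sketch below assumes a final feature map of shape (4, 64, 7, 7) and 10 output classes; both numbers are chosen only for illustration. GAP is computed here as a plain mean over the spatial dimensions.

```python
import torch
import torch.nn as nn

features = torch.randn(4, 64, 7, 7)  # assumed (N, C, H, W) of the last conv stage

# Global Average Pooling as a mean over H and W, keeping the 1x1 spatial dims.
gap = features.mean(dim=(2, 3), keepdim=True)
print(gap.shape)  # torch.Size([4, 64, 1, 1])

# Classifier on the flattened feature map vs. classifier on the GAP output.
flatten_fc = nn.Linear(64 * 7 * 7, 10)
gap_fc = nn.Linear(64, 10)

flatten_params = sum(p.numel() for p in flatten_fc.parameters())
gap_params = sum(p.numel() for p in gap_fc.parameters())
print(flatten_params)  # 31370 = 3136 * 10 weights + 10 biases
print(gap_params)      # 650 = 64 * 10 weights + 10 biases

logits = gap_fc(gap.flatten(1))
print(logits.shape)  # torch.Size([4, 10])
```

In this setup the classifier on the flattened map is tied to a 7x7 input, while the GAP-based classifier only depends on the 64 channels.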
PyTorch Pooling Implementation
PyTorch offers distinct classes for standard pooling and adaptive global pooling.
nn.MaxPool2d and nn.AvgPool2d
PyTorch provides the nn.MaxPool2d and nn.AvgPool2d classes to perform pooling. The code below shows how to apply these layers to a simulated batch of feature maps.
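A minimal sketch of such an example, assuming a batch of 8 feature maps with 16 channels at 32x32 resolution (sizes chosen only for illustration) and a 2x2 window with a stride of 2:

```python
import torch
import torch.nn as nn

# Simulated batch of feature maps: (N, C, H, W) = (8, 16, 32, 32).
x = torch.randn(8, 16, 32, 32)

# 2x2 windows with stride 2, so windows do not overlap.
max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

max_out = max_pool(x)
avg_out = avg_pool(x)

print(x.shape)        # torch.Size([8, 16, 32, 32])
print(max_out.shape)  # torch.Size([8, 16, 16, 16])
print(avg_out.shape)  # torch.Size([8, 16, 16, 16])
```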
In this example, both pooling layers halve the spatial resolution of the feature maps without changing the number of channels, demonstrating how pooling scales down layer dimensions.
Global Pooling Implementation
We can implement Global Average Pooling in PyTorch using nn.AdaptiveAvgPool2d(1). This layer dynamically adjusts the pooling window size to ensure that the output spatial dimension is exactly \\(1 \\times 1\\), regardless of the input size.
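A short sketch of this, assuming 32-channel feature maps at three arbitrary resolutions (the sizes 7, 14, and 25 are illustrative). It also checks that the result agrees with a plain mean over the spatial dimensions.

```python
import torch
import torch.nn as nn

gap = nn.AdaptiveAvgPool2d(1)  # target output size of 1x1

shapes = []
matches_mean = []
for size in (7, 14, 25):
    x = torch.randn(2, 32, size, size)
    out = gap(x)
    shapes.append(tuple(out.shape))
    # GAP should agree with averaging over H and W directly.
    reference = x.mean(dim=(2, 3), keepdim=True)
    matches_mean.append(torch.allclose(out, reference, atol=1e-5))

print(shapes)        # [(2, 32, 1, 1), (2, 32, 1, 1), (2, 32, 1, 1)]
print(matches_mean)  # [True, True, True]
```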
Adaptive pooling handles varying input dimensions by computing the appropriate stride and kernel parameters internally. This lets the same layers after the pooling step work unchanged when the input image resolution changes.