Max Pooling and Average Pooling Layers

Pooling layers downsample feature maps spatially, which reduces the computation required by later layers and introduces a degree of local translation invariance.

Pooling Mechanics

Pooling layers reduce spatial dimensions by summarizing local patches using maximum or average values.

Max Pooling

Max pooling extracts the maximum value within a sliding spatial window. For a window of size \\(P \\times P\\), max pooling slides across the feature map and selects the largest activation. Mathematically, it operates on a local region \\(R\\):

\\(y = \\max_{(i, j) \\in R} x_{i, j}\\)

This operation keeps the most prominent features, such as sharp edges or textures, and discards weaker background activations. Because only the peak activation survives, max pooling is robust to small spatial translations of the input: if a feature shifts slightly but stays within the same pooling window, the output is unchanged. This invariance is local and approximate; a shift that moves the feature into a neighboring window does change the output.
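The effect of a small shift can be checked on a toy input. The sketch below is only an illustration: the 4x4 feature map and the activation value 9.0 are made up, and it assumes a 2x2 window with stride 2.

```python
import torch
import torch.nn.functional as F

# A 4x4 single-channel feature map with one strong activation at (0, 0).
x = torch.zeros(1, 1, 4, 4)
x[0, 0, 0, 0] = 9.0

# The same feature shifted one pixel right: still inside the top-left 2x2 window.
x_shifted = torch.zeros(1, 1, 4, 4)
x_shifted[0, 0, 0, 1] = 9.0

# The feature shifted two pixels right: now inside the neighboring window.
x_far = torch.zeros(1, 1, 4, 4)
x_far[0, 0, 0, 2] = 9.0

pooled = F.max_pool2d(x, kernel_size=2, stride=2)
pooled_shifted = F.max_pool2d(x_shifted, kernel_size=2, stride=2)
pooled_far = F.max_pool2d(x_far, kernel_size=2, stride=2)

same_window = torch.equal(pooled, pooled_shifted)  # shift within one window
crossed_window = torch.equal(pooled, pooled_far)   # shift across windows

print(same_window)     # True
print(crossed_window)  # False
```

The one-pixel shift leaves the pooled output untouched, while the two-pixel shift moves the peak to a different output cell.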

Average Pooling

Average pooling computes the average value within the sliding window:

\\(y = \\frac{1}{|R|} \\sum_{(i, j) \\in R} x_{i, j}\\)

This operation smooths the feature map, capturing the general activation level rather than the peak feature. While average pooling was popular in older networks, it can wash out sharp local features. In modern networks, average pooling is primarily used at the end of the network (Global Average Pooling) to collapse spatial dimensions before classification.
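To see the smoothing effect on a concrete case, the following sketch applies both operations to a single 2x2 patch that contains one sharp activation. The patch values are invented for illustration.

```python
import torch
import torch.nn.functional as F

# One 2x2 patch with a single strong activation, shape (N=1, C=1, H=2, W=2).
patch = torch.tensor([[[[0.0, 0.0],
                        [0.0, 8.0]]]])

# With kernel_size=2 the whole patch is one pooling window.
max_out = F.max_pool2d(patch, kernel_size=2)
avg_out = F.avg_pool2d(patch, kernel_size=2)

print(max_out.item())  # 8.0
print(avg_out.item())  # 2.0
```

Max pooling preserves the peak, whereas average pooling spreads it over the four positions in the window.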

Pooling Hyperparameters and Shapes

Pooling layers use fixed mathematical functions without learnable weights to downsample features.

Window Size and Stride

Pooling layers have two main hyperparameters: kernel size (window size) and stride. Unlike convolutional layers, pooling is typically configured with a stride equal to the kernel size (e.g., a 2x2 pool with a stride of 2). This prevents overlapping windows. With a 2x2 window and a stride of 2, each spatial dimension is halved (rounded down when the input size is odd, assuming no padding), so the number of spatial positions drops to a quarter.
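More generally, assuming no padding, no dilation, and rounding down (PyTorch's default `ceil_mode=False`), the output size along one dimension is \\(\\lfloor (H - K) / S \\rfloor + 1\\) for input size \\(H\\), kernel size \\(K\\), and stride \\(S\\). A minimal sketch of that arithmetic, where `pooled_size` is a hypothetical helper and not a PyTorch function:

```python
def pooled_size(size: int, kernel: int, stride: int) -> int:
    """Output length along one spatial dimension (hypothetical helper).

    Assumes no padding, no dilation, and floor rounding.
    """
    if size < kernel:
        raise ValueError("input is smaller than the pooling window")
    return (size - kernel) // stride + 1


print(pooled_size(32, kernel=2, stride=2))  # 16: even input, halved exactly
print(pooled_size(7, kernel=2, stride=2))   # 3: odd input, last column dropped
print(pooled_size(32, kernel=3, stride=2))  # 15: overlapping 3x3 windows
```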

Because these operations are fixed, pooling layers add no weights to the model, and the smaller feature maps they produce reduce activation memory and computation in the layers that follow.
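The absence of learnable parameters is easy to confirm by counting them. In the sketch below, the 16-channel 3x3 convolution is a made-up layer included only for comparison.

```python
import torch.nn as nn

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
conv = nn.Conv2d(16, 16, kernel_size=3)  # hypothetical layer for comparison

# Sum the element counts of every registered parameter tensor.
max_pool_params = sum(p.numel() for p in max_pool.parameters())
avg_pool_params = sum(p.numel() for p in avg_pool.parameters())
conv_params = sum(p.numel() for p in conv.parameters())

print(max_pool_params, avg_pool_params, conv_params)  # 0 0 2320
```

The convolution holds 16 x 16 x 3 x 3 weights plus 16 biases, while both pooling layers hold none.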

Global Average Pooling (GAP)

Global Average Pooling collapses the spatial dimensions \\((H, W)\\) of a tensor to \\(1 \\times 1\\) by averaging all pixels in each channel. For a tensor of shape \\((N, C, H, W)\\), GAP produces a tensor of shape \\((N, C, 1, 1)\\).

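This operation can replace the large fully connected layers at the end of a CNN, typically leaving a single linear classifier over the \\(C\\) channel averages. This drastically reduces the parameter count, which in turn lowers the risk of overfitting. It also lets the model accept variable input sizes, because the output shape of GAP depends only on the channel count, not on \\(H\\) or \\(W\\).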
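As a rough illustration of the parameter savings, the sketch below computes GAP as a plain mean over the spatial axes and compares two linear classifiers. The sizes (64 channels, 14x14 maps, 10 classes) are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

# Assumed sizes, for illustration only.
C, H, W, num_classes = 64, 14, 14, 10

x = torch.randn(2, C, H, W)

# GAP is the mean over the two spatial axes of each channel.
gap = x.mean(dim=(2, 3), keepdim=True)
print(gap.shape)  # torch.Size([2, 64, 1, 1])

# A classifier on the flattened feature map vs. one on the channel averages.
flatten_classifier = nn.Linear(C * H * W, num_classes)
gap_classifier = nn.Linear(C, num_classes)

flatten_params = sum(p.numel() for p in flatten_classifier.parameters())
gap_params = sum(p.numel() for p in gap_classifier.parameters())

print(flatten_params)  # 125450
print(gap_params)      # 650
```

With these sizes the classifier on the pooled features needs 650 parameters instead of 125,450, and its input width does not change if \\(H\\) or \\(W\\) changes.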

PyTorch Pooling Implementation

PyTorch offers distinct classes for standard pooling and adaptive global pooling.

nn.MaxPool2d and nn.AvgPool2d

PyTorch provides the nn.MaxPool2d and nn.AvgPool2d classes to perform pooling. The code below shows how to apply these layers to a simulated batch of feature maps.

<pre><code class="language-python">import torch
import torch.nn as nn

x = torch.randn(1, 16, 32, 32)

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

print("Max Pool shape:", max_pool(x).shape)  # [1, 16, 16, 16]
print("Avg Pool shape:", avg_pool(x).shape)  # [1, 16, 16, 16]</code></pre>

In this example, both pooling layers halve the spatial resolution of the feature maps without changing the number of channels, demonstrating how pooling scales down layer dimensions.

Global Pooling Implementation

We can implement Global Average Pooling in PyTorch using nn.AdaptiveAvgPool2d(1). This layer dynamically adjusts the pooling window size to ensure that the output spatial dimension is exactly \\(1 \\times 1\\), regardless of the input size.

<pre><code class="language-python">import torch
import torch.nn as nn

x = torch.randn(2, 64, 14, 14)

global_pool = nn.AdaptiveAvgPool2d(1)
out = global_pool(x)
print("Global Pool shape:", out.shape)  # [2, 64, 1, 1]

# Flatten for classification
out_flat = torch.flatten(out, 1)
print("Flattened shape:", out_flat.shape)  # [2, 64]</code></pre>

Adaptive pooling handles varying input dimensions by computing the appropriate stride and kernel size internally from the input size and the requested output size. This lets the same model accept input images of different resolutions without any change to the layers that follow.
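A quick way to check this behavior is to feed the same layer inputs of several sizes. The resolutions below (14, 17, and 32) are arbitrary choices for the sketch.

```python
import torch
import torch.nn as nn

global_pool = nn.AdaptiveAvgPool2d(1)

output_shapes = []
matches_mean = []
for size in (14, 17, 32):  # arbitrary example resolutions
    x = torch.randn(2, 64, size, size)
    out = global_pool(x)
    output_shapes.append(tuple(out.shape))
    # With output size 1, the result should equal a plain spatial mean.
    reference = x.mean(dim=(2, 3), keepdim=True)
    matches_mean.append(torch.allclose(out, reference, atol=1e-5))

print(output_shapes)  # three identical shapes: (2, 64, 1, 1)
print(matches_mean)
```

Every input produces the same output shape, so a linear classifier attached after flattening sees 64 features regardless of the image resolution.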