Stride and Padding in Convolutional Layers

Stride and padding are key hyperparameters of convolutional layers that regulate output dimensions and preserve spatial boundary information.

Padding Mechanics

Padding adds extra values around the border of the input so that the output does not shrink and edge details are preserved.

Preserving Boundary Information

When a kernel slides across an image without padding, pixels near the boundary fall inside only a few kernel windows (a corner pixel inside just one), whereas central pixels fall inside many. Boundary information is therefore underrepresented in the output, and the spatial dimensions of the feature map shrink at each layer.

Padding addresses this issue by adding rows and columns of values, usually zeros, around the boundary of the input image. Boundary pixels then fall inside more kernel windows, which helps preserve edge details, and the output no longer shrinks at every layer, which makes deeper networks practical.
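The effect on boundary pixels can be counted directly. The following is a minimal one-dimensional sketch at stride 1; the helper name `coverage_counts` is made up for illustration and is not part of any library.

```python
def coverage_counts(n, k, p=0):
    """Count how many stride-1 kernel windows touch each of n input pixels (1-D)."""
    padded = n + 2 * p
    counts = [0] * n
    # Each window starts at index `start` in the padded signal and spans k positions.
    for start in range(padded - k + 1):
        for offset in range(k):
            i = start + offset - p  # map the padded index back to the original signal
            if 0 <= i < n:
                counts[i] += 1
    return counts


no_pad = coverage_counts(5, 3, p=0)
one_pad = coverage_counts(5, 3, p=1)
print(no_pad)   # [1, 2, 3, 2, 1]
print(one_pad)  # [2, 3, 3, 3, 2]
```

With one pixel of padding the edge pixel takes part in two windows instead of one, so the gap to the central pixels narrows without disappearing entirely. In two dimensions the counts multiply: for a 3x3 kernel, a corner pixel goes from 1 window to 4, against 9 for an interior pixel.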

Valid vs. Same Padding

Valid padding adds no padding, so at stride 1 the output shrinks by \\(K - 1\\) along each dimension for a kernel of size \\(K\\). Same padding adds enough padding that the output spatial dimensions match the input spatial dimensions (when stride is 1). For an odd kernel size \\(K\\), the padding \\(P\\) required on each side for same padding is:

\\(P = \\frac{K - 1}{2}\\)
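
This value can be checked against the standard output-size relation for a convolution. As a sketch, write \\(N\\) for the input size and \\(O\\) for the output size along one dimension (both symbols are introduced here for illustration). At stride 1:

\\(O = N + 2P - K + 1\\)

Requiring \\(O = N\\) gives \\(2P = K - 1\\), that is, \\(P = \\frac{K - 1}{2}\\). For \\(K = 3\\) this yields \\(P = 1\\), and for \\(K = 5\\) it yields \\(P = 2\\). An even \\(K\\) gives a non-integer value, so the padding cannot be split evenly between the two sides.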

Using same padding is common in modern architectures because it keeps spatial dimensions stable across layers, making it easier to design deep networks and use skip connections.

Stride Mechanics

Stride determines the step size of the kernel as it traverses the input dimensions.

Controlling Downsampling Rates

A stride of \\(S=1\\) moves the kernel one pixel at a time. A stride of \\(S=2\\) moves it two pixels at a time, so the kernel is applied at every other position and each spatial dimension of the output map is roughly halved (leaving about a quarter as many output positions for a 2D input).
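To make the downsampling rate concrete, here is a small sketch of the usual output-size arithmetic along one dimension. The function name `conv_output_size` is an illustrative choice, and the example assumes a 3-wide kernel with one pixel of padding.

```python
def conv_output_size(n, k, s=1, p=0):
    """Output length along one dimension: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1


for n in (32, 33):
    for s in (1, 2, 3):
        size = conv_output_size(n, k=3, s=s, p=1)
        print(f"input={n} stride={s} -> output={size}")
```

An input of 32 maps to 16 at stride 2, while an input of 33 maps to 17, which is why the halving is only approximate.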

Downsampling reduces the computational cost of subsequent layers and increases the effective receptive field of the neurons, allowing them to capture wider spatial context in deeper layers.
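The growth of the receptive field can be sketched with the standard recurrence: each layer widens the field by (kernel size minus 1) times the current spacing between neighbouring units, and the stride multiplies that spacing. The helper `receptive_field` and the two example stacks below are hypothetical, chosen only to illustrate the effect.

```python
def receptive_field(layers):
    """Receptive field of one output unit after a stack of (kernel_size, stride) layers."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent units
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf


stride1_stack = [(3, 1), (3, 1), (3, 1)]
stride2_stack = [(3, 2), (3, 2), (3, 2)]
print(receptive_field(stride1_stack))  # 7
print(receptive_field(stride2_stack))  # 15
```

Three 3-wide layers see 7 input pixels at stride 1 but 15 at stride 2, so the same depth covers roughly twice the context along each dimension.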

Computational Trade-offs

While a larger stride reduces computational requirements, it also discards spatial details. Choosing between stride-based downsampling and pooling-based downsampling is a key design choice when building convolutional neural networks.
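A rough multiply-accumulate (MAC) count shows the size of the saving. This is a back-of-the-envelope sketch that ignores biases and the cost of the pooling operation itself. The function `conv_macs` and the layer sizes are assumptions for illustration.

```python
def conv_macs(h, w, c_in, c_out, k, s=1, p=0):
    """Approximate multiply-accumulates for one 2D convolution layer."""
    h_out = (h + 2 * p - k) // s + 1
    w_out = (w + 2 * p - k) // s + 1
    # Every output value needs c_in * k * k multiply-accumulates.
    return h_out * w_out * c_out * c_in * k * k


macs_s1 = conv_macs(32, 32, 3, 16, k=3, s=1, p=1)  # stride 1, pooled afterwards
macs_s2 = conv_macs(32, 32, 3, 16, k=3, s=2, p=1)  # stride 2, no pooling needed
print(macs_s1, macs_s2, macs_s1 / macs_s2)  # 442368 110592 4.0
```

Both routes can end at a 16x16 map (the stride-1 route after a 2x2 pooling step), but the stride-1 convolution evaluates the kernel at four times as many positions. In exchange, pooling selects or averages over every position, whereas the strided convolution never evaluates the skipped ones.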

Modern networks like ResNet often use a stride of 2 in convolutional layers to perform downsampling, which simplifies the architecture by reducing the need for separate pooling layers.

PyTorch Configurations

We can configure stride and padding in PyTorch using integer or tuple values.

Customizing Stride and Padding

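In nn.Conv2d, an integer applies the same value to both spatial dimensions, while a tuple sets the height and width values separately (e.g., stride=(2, 1) steps two pixels vertically and one horizontally). This can be useful when processing non-square inputs. A padding value is applied to both sides of its dimension, so a tuple gives per-dimension rather than per-side control.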
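As a brief sketch of the tuple form, the snippet below applies a per-dimension stride to a non-square input. The tensor sizes are arbitrary, and the string form padding="same" assumes a reasonably recent PyTorch release (it is only accepted when the stride is 1).

```python
import torch
import torch.nn as nn

# Non-square input: [batch_size, channels, height, width]
x = torch.randn(1, 3, 32, 64)

# Stride 2 along height, stride 1 along width; one pixel of padding on each dimension
conv_tuple = nn.Conv2d(3, 16, kernel_size=3, stride=(2, 1), padding=(1, 1))
out_tuple = conv_tuple(x)
print(out_tuple.shape)  # torch.Size([1, 16, 16, 64])

# String padding: lets PyTorch choose the amount needed to keep the spatial size
conv_same = nn.Conv2d(3, 16, kernel_size=3, padding="same")
out_same = conv_same(x)
print(out_same.shape)  # torch.Size([1, 16, 32, 64])
```

Only the height is halved in the first layer, since the stride along the width is 1.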

<pre><code class="language-python">import torch
import torch.nn as nn

# Input tensor shape: [batch_size, channels, height, width]
x = torch.randn(1, 3, 32, 32)

# Conv layer with stride of 2 and padding of 1
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=2, padding=1)
out = conv(x)

print("Output spatial shape:", out.shape[2:])  # torch.Size([16, 16])</code></pre>

In this example, each spatial dimension is halved from 32 to 16, since \\(\\lfloor (32 + 2 \\cdot 1 - 3) / 2 \\rfloor + 1 = 16\\). This confirms that the combination of kernel_size=3, stride=2, and padding=1 performs spatial downsampling.

Padding Styles and Tensor Boundaries

PyTorch also supports alternative padding styles, such as reflection padding (nn.ReflectionPad2d) and replication padding (nn.ReplicationPad2d). Reflection padding fills the border by mirroring the input across its edge, while replication padding repeats the outermost pixel values. Both can reduce the boundary artifacts that zero padding introduces in image processing tasks.

Configuring these alternative padding styles can improve performance in tasks like image generation or style transfer, where boundary artifacts can degrade output quality.
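The difference between the padding styles is easiest to see on a tiny tensor. The sketch below pads a single row of four values by two on the left and right; the values and pad widths are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

# One image, one channel, one row: [1, 2, 3, 4]
x = torch.arange(1.0, 5.0).reshape(1, 1, 1, 4)
pad = (2, 2, 0, 0)  # (left, right, top, bottom)

zero = nn.ZeroPad2d(pad)(x).flatten().tolist()
reflect = nn.ReflectionPad2d(pad)(x).flatten().tolist()
replicate = nn.ReplicationPad2d(pad)(x).flatten().tolist()

print(zero)       # [0.0, 0.0, 1.0, 2.0, 3.0, 4.0, 0.0, 0.0]
print(reflect)    # [3.0, 2.0, 1.0, 2.0, 3.0, 4.0, 3.0, 2.0]
print(replicate)  # [1.0, 1.0, 1.0, 2.0, 3.0, 4.0, 4.0, 4.0]
```

Reflection mirrors around the edge value without repeating it, whereas replication repeats the edge value itself. As an alternative to a separate padding layer, nn.Conv2d accepts a padding_mode argument (for example "reflect" or "replicate") that applies the same styles to its own padding.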