Calculating Output Dimensions of Convolutions

Calculating output dimensions is critical for configuring neural architectures, ensuring layers align properly without dimension mismatches.

The Dimension Formula

The output height and width of a 2D convolutional layer depend on the input shape, kernel size, padding, dilation, and stride.

Mathematical Formula Derivation

The output height \\(H_{out}\\) and width \\(W_{out}\\) of a 2D convolutional layer depend on the input shape, kernel size \\(K\\), padding \\(P\\), dilation \\(D\\), and stride \\(S\\). When dilation is 1, the formula is:

\\(H_{out} = \\lfloor \\frac{H_{in} + 2P - K}{S} \\rfloor + 1\\)

The floor \\(\\lfloor \\cdot \\rfloor\\) is needed because the kernel cannot slide past the boundary of the padded input: if the division leaves a remainder, the trailing rows or columns that cannot fill a complete kernel window are skipped. For example, with a 32x32 input, a kernel of size 3, padding of 1, and stride of 2, the output size is: \\(H_{out} = \\lfloor \\frac{32 + 2(1) - 3}{2} \\rfloor + 1 = \\lfloor \\frac{31}{2} \\rfloor + 1 = 15 + 1 = 16\\). Each spatial dimension is halved, so the output has one quarter as many spatial positions as the input.
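The formula above can be sketched as a small pure-Python helper. The function name `conv_out_size` and its argument names are illustrative assumptions, not part of any library, and the sketch covers only the dilation-1 case described here.

```python
def conv_out_size(size_in: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length along one spatial axis for a convolution with dilation 1."""
    numerator = size_in + 2 * padding - kernel
    if numerator < 0:
        raise ValueError("kernel is larger than the padded input")
    # Python's // is floor division, which matches the floor in the formula
    # because the numerator is non-negative here.
    return numerator // stride + 1


# Worked example from the text: 32x32 input, kernel 3, padding 1, stride 2.
print(conv_out_size(32, kernel=3, stride=2, padding=1))  # 16
# With stride 1, the same kernel and padding keep the size unchanged.
print(conv_out_size(32, kernel=3, stride=1, padding=1))  # 32
```

Because height and width are computed independently, the same helper can be called once per axis for non-square inputs.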

The Role of Dilation

Dilation (or atrous convolution) introduces spaces between kernel cells, expanding the kernel's receptive field without adding parameters. For a kernel of size \\(K\\) and dilation \\(D\\), the effective kernel size becomes \\(K_{eff} = D(K - 1) + 1\\). The general formula for output dimensions incorporating dilation is:

\\(H_{out} = \\lfloor \\frac{H_{in} + 2P - D(K - 1) - 1}{S} \\rfloor + 1\\)

When dilation is greater than 1, the kernel looks at pixels that are spaced apart, which is useful for dense prediction tasks like semantic segmentation where wide spatial context is needed without downsampling. Dilated convolutions are widely used in architectures like WaveNet and DeepLab.
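As a rough illustration of the general formula, the sketch below computes the effective kernel size and the output size for a few dilation rates. The helper names (`effective_kernel`, `dilated_out_size`, `same_padding`) are hypothetical, and `same_padding` assumes an odd kernel size and stride 1, where a padding of \\(D(K - 1)/2\\) keeps the spatial size unchanged.

```python
def effective_kernel(kernel: int, dilation: int) -> int:
    """K_eff = D * (K - 1) + 1."""
    return dilation * (kernel - 1) + 1


def dilated_out_size(size_in: int, kernel: int, stride: int = 1,
                     padding: int = 0, dilation: int = 1) -> int:
    """Output length along one axis, using the general formula with dilation."""
    k_eff = effective_kernel(kernel, dilation)
    return (size_in + 2 * padding - k_eff) // stride + 1


def same_padding(kernel: int, dilation: int) -> int:
    """Padding that preserves the size at stride 1 (assumes an odd kernel)."""
    return dilation * (kernel - 1) // 2


for d in (1, 2, 4):
    p = same_padding(3, d)
    out = dilated_out_size(64, kernel=3, padding=p, dilation=d)
    print(f"dilation={d}  K_eff={effective_kernel(3, d)}  padding={p}  out={out}")
# dilation=1  K_eff=3  padding=1  out=64
# dilation=2  K_eff=5  padding=2  out=64
# dilation=4  K_eff=9  padding=4  out=64
```

The receptive field grows from 3 to 9 while the kernel still holds nine weights per channel pair and the resolution stays at 64.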

Floor Division and Edge Cases

Understanding fractional outputs and how frameworks handle pixel discards prevents bugs in deep architectures.

Fractional Results and Pixel Discards

When the numerator of the dimension formula is not divisible by the stride, the division yields a fractional value. Because PyTorch applies the floor and discards the remainder, up to \\(S - 1\\) trailing rows or columns of the padded input are never covered by a kernel window and do not contribute to the forward pass.

This can lead to minor information loss at the bottom and right boundaries. With nonzero padding, the skipped positions may be padding rather than real pixels. If every pixel must be preserved, adjust the padding, kernel size, or input size so that the numerator is divisible by the stride.
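The number of skipped trailing positions is the remainder of the division in the dimension formula. The sketch below computes it for one axis; the name `trailing_unused` is hypothetical, and the split between padding and real pixels assumes symmetric padding, so the right-hand padding is skipped before any real pixel is.

```python
def trailing_unused(size_in: int, kernel: int, stride: int,
                    padding: int = 0, dilation: int = 1) -> tuple[int, int]:
    """Return (unused padded positions, unused real pixels) along one axis."""
    k_eff = dilation * (kernel - 1) + 1
    # Positions left over after the last complete kernel window.
    remainder = (size_in + 2 * padding - k_eff) % stride
    # The trailing padding is skipped first; only the excess hits real pixels.
    return remainder, max(0, remainder - padding)


print(trailing_unused(32, kernel=3, stride=2, padding=0))  # (1, 1)
print(trailing_unused(32, kernel=3, stride=2, padding=1))  # (1, 0)
print(trailing_unused(33, kernel=3, stride=2, padding=0))  # (0, 0)
print(trailing_unused(32, kernel=3, stride=3, padding=0))  # (2, 2)
```

In the second case the only skipped position is the zero padding on the far edge, so no input pixel is lost even though the division is not exact.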

PyTorch Dimension Verification

We can verify these calculations by passing dummy tensors through a convolutional layer in PyTorch and printing the shape of the output tensor.

<pre><code class="language-python">import torch
import torch.nn as nn

# Input: [batch, channels, height, width]
x = torch.randn(1, 3, 32, 32)

# Custom conv layer: kernel=5, stride=2, padding=1, dilation=2
# K_eff = 2*(5-1)+1 = 9
# H_out = floor((32 + 2 - 9) / 2) + 1 = floor(25 / 2) + 1 = 12 + 1 = 13
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=5, stride=2, padding=1, dilation=2)

out = conv(x)
print("Output shape:", out.shape)  # torch.Size([1, 8, 13, 13])</code></pre>

The printed shape matches the hand calculation, which is consistent with the output-size formula given in the PyTorch nn.Conv2d documentation, including the dilation term and the floor.

Architecting Deep CNNs

Stacking layers requires balancing spatial dimensions and channel counts to maintain representation capacity.

Aligning Spatial Dimensions

In deep networks, spatial dimensions must be downsampled progressively while channel dimensions are expanded. If the spatial dimensions shrink too fast, the model loses spatial detail before extracting deep semantic features.

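Aligning these dimensions requires planning the stride and padding at each layer. For example, when a stage downsamples with stride=2, each spatial dimension halves and the number of spatial positions drops by a factor of four. A common convention is to double the channel count at that point (e.g., from 64 to 128) so that the feature maps lose less representational capacity.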
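To make the trade-off concrete, the sketch below traces the spatial size, channel count, and total activation count through a stack of stages using the dimension formula. The stage configuration is a made-up example rather than any published architecture, and `trace_shapes` is a hypothetical helper that assumes square inputs and dilation 1.

```python
# Hypothetical stage list: (out_channels, kernel, stride, padding)
stages = [
    (64, 3, 1, 1),
    (128, 3, 2, 1),
    (256, 3, 2, 1),
    (512, 3, 2, 1),
]


def trace_shapes(size_in: int, stages: list) -> list:
    """Return (channels, spatial size, activation count) after each stage."""
    rows = []
    size = size_in
    for out_c, kernel, stride, padding in stages:
        size = (size + 2 * padding - kernel) // stride + 1
        rows.append((out_c, size, out_c * size * size))
    return rows


for out_c, size, count in trace_shapes(32, stages):
    print(f"channels={out_c:4d}  spatial={size}x{size}  activations={count}")
# channels=  64  spatial=32x32  activations=65536
# channels= 128  spatial=16x16  activations=32768
# channels= 256  spatial=8x8  activations=16384
# channels= 512  spatial=4x4  activations=8192
```

With channels doubling at each stride-2 stage, the activation count halves per stage instead of dropping to a quarter.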

Designing Custom Shape-Matching Blocks

When building complex architectures with residual connections, the output of a block must match the spatial and channel dimensions of the skip connection. If the dimensions do not match, the skip connection cannot be added to the output.

<pre><code class="language-python">import torch
import torch.nn as nn


class ProjectionHelper(nn.Module):
    def __init__(self, in_c, out_c, stride):
        super().__init__()
        # Use 1x1 conv to match channels and stride to match spatial dimensions
        self.proj = nn.Conv2d(in_c, out_c, kernel_size=1, stride=stride, bias=False)

    def forward(self, x):
        return self.proj(x)


x = torch.randn(1, 64, 32, 32)
proj = ProjectionHelper(in_c=64, out_c=128, stride=2)
print("Projected shape:", proj(x).shape)  # torch.Size([1, 128, 16, 16])</code></pre>

Using 1x1 projection convolutions is a standard way to adjust channels and resolution on the shortcut path in models like ResNet, allowing skip connections to merge tensors of differing sizes without a separate resizing step.
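One way such a projection might be wired into a full block is sketched below. This is an illustrative residual block, not the exact ResNet definition: the class name `ResidualBlock`, the choice of two 3x3 convolutions with batch normalization, and the rule for when to project are assumptions made for the example.

```python
import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    """Illustrative residual block with an optional 1x1 projection shortcut."""

    def __init__(self, in_c, out_c, stride=1):
        super().__init__()
        # Main path: the first conv applies the stride, the second keeps the size.
        self.conv1 = nn.Conv2d(in_c, out_c, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_c)
        self.conv2 = nn.Conv2d(out_c, out_c, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_c)
        self.relu = nn.ReLU(inplace=True)
        # Shortcut: project only when channels or resolution change.
        if stride != 1 or in_c != out_c:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_c, out_c, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_c),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # Both tensors now share the same shape, so they can be added.
        return self.relu(out + self.shortcut(x))


x = torch.randn(2, 64, 32, 32)
print(ResidualBlock(64, 128, stride=2)(x).shape)  # torch.Size([2, 128, 16, 16])
print(ResidualBlock(64, 64)(x).shape)             # torch.Size([2, 64, 32, 32])
```

Both paths reduce a 32-pixel axis to 16 under stride 2: the 3x3 convolution with padding 1 gives floor(31 / 2) + 1 = 16, and the 1x1 convolution with no padding gives floor(31 / 2) + 1 = 16 as well.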