The Convolution Operation Explained

The 2D convolution is the core operation of convolutional neural networks (CNNs): it slides a small weight matrix (the kernel) across an input to produce spatial feature maps, relying on local connectivity and parameter sharing.

Mathematical Foundations

A convolution combines two arrays, the input and a kernel, to produce an output activation map.

Continuous vs. Discrete Convolution

Mathematically, a convolution is an operation on two functions that produces a third function expressing how the shape of one is modified by the other. In continuous space, the convolution of functions \\(f\\) and \\(g\\) is defined as \\((f * g)(t) = \\int_{-\\infty}^{\\infty} f(\\tau)g(t - \\tau)d\\tau\\). In deep learning, because image data is discrete, we use discrete 2D convolutions.

For a 2D input image \\(I\\) and a 2D kernel \\(K\\) of size \\(M \\times N\\), the discrete convolution is formulated as:

\\(S(i, j) = (I * K)(i, j) = \\sum_{m=0}^{M-1} \\sum_{n=0}^{N-1} I(i - m, j - n) K(m, n)\\)

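In practice, most deep learning frameworks implement cross-correlation rather than mathematical convolution: the kernel is applied as-is, \\(S(i, j) = \\sum_{m=0}^{M-1} \\sum_{n=0}^{N-1} I(i + m, j + n) K(m, n)\\), without first flipping it along both axes. Because the kernel weights are learned during training, the network simply learns the flipped kernel, so this difference does not affect model performance.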
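The relationship between the two operations can be checked numerically. The sketch below assumes PyTorch is installed and uses arbitrary random tensors; it compares one output value of `torch.nn.functional.conv2d` against the explicit sums, with and without a flipped kernel.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
image = torch.randn(1, 1, 5, 5)   # [batch, channels, height, width]
kernel = torch.randn(1, 1, 3, 3)  # [out_channels, in_channels, kH, kW]

# F.conv2d applies the kernel without flipping it (cross-correlation).
cross_corr = F.conv2d(image, kernel)

# Flipping the kernel along both spatial axes first yields the
# mathematical convolution over the same windows.
true_conv = F.conv2d(image, torch.flip(kernel, dims=[2, 3]))

# Check the top-left output value against the explicit sums.
patch = image[0, 0, 0:3, 0:3]
# Cross-correlation pairs I(m, n) with K(m, n).
manual_cross_corr = (patch * kernel[0, 0]).sum()
# Convolution at (i, j) = (2, 2) pairs I(2 - m, 2 - n) with K(m, n),
# i.e. the same patch multiplied by the flipped kernel.
manual_conv = (patch * torch.flip(kernel[0, 0], dims=[0, 1])).sum()

print(torch.allclose(cross_corr[0, 0, 0, 0], manual_cross_corr, atol=1e-6))
print(torch.allclose(true_conv[0, 0, 0, 0], manual_conv, atol=1e-6))
```

Both checks should agree up to floating-point tolerance, which illustrates that the two operations differ only in the orientation of the kernel.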

Sliding Window and Activations

The convolution operation works by sliding the kernel matrix across the input grid. At each spatial position, the kernel performs element-wise multiplication with the overlapping input patch and sums the results to produce a single value in the output feature map. This process is repeated across all horizontal and vertical positions.

The output feature map records where the feature detected by the kernel appears in the input image. If a patch closely matches the kernel's pattern, the dot product yields a large value, which is then passed through the activation function.
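The sliding-window procedure can be sketched in plain Python. This is a minimal illustration assuming stride 1, no padding, and no kernel flip (the cross-correlation form used by frameworks); the function name `cross_correlate2d` and the edge-detecting kernel are hypothetical choices for the example.

```python
def cross_correlate2d(image, kernel):
    """Slide `kernel` over `image` (both lists of rows), stride 1, no padding."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Multiply the kernel element-wise with the patch under it, then sum.
            total = 0.0
            for m in range(k_h):
                for n in range(k_w):
                    total += image[i + m][j + n] * kernel[m][n]
            output[i][j] = total
    return output


# A 4x4 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical 2x2 kernel that responds to a dark-to-bright horizontal step.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = cross_correlate2d(image, kernel)
print(feature_map)
# [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
```

The middle column of the feature map is the only one with a nonzero response, because that is the only window position where the patch contains the dark-to-bright step the kernel encodes.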

Channel Mixing and Depthwise Projections

Convolutional layers operate across multi-channel inputs, combining spatial and channel details.

Multi-Channel Convolutions

Real-world images have multiple channels (such as RGB). In a multi-channel convolution, the kernel is a 3D tensor of shape \\((C_{in}, K_H, K_W)\\), where \\(C_{in}\\) is the number of input channels. Each of the \\(C_{in}\\) kernel slices is convolved with its corresponding input channel, and the resulting 2D maps are summed across the channel dimension to produce a single output channel.

If the layer has \\(C_{out}\\) output channels, it applies \\(C_{out}\\) independent 3D kernels. This allows the model to mix spatial information across channels, combining different color and feature components to form more complex representations in the next layer.
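As a rough check of this description, the sketch below (assuming PyTorch, with arbitrary random tensors and sizes chosen only for illustration) rebuilds one output channel by convolving each input channel with its own kernel slice and summing the results.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
c_in, c_out = 3, 4
x = torch.randn(1, c_in, 8, 8)           # [batch, C_in, H, W]
weight = torch.randn(c_out, c_in, 3, 3)  # C_out kernels, each [C_in, K_H, K_W]

# Full multi-channel convolution: one output channel per 3D kernel.
full = F.conv2d(x, weight)               # [1, C_out, 6, 6]

# Rebuild output channel 0: convolve each input channel with its own
# kernel slice, then sum the per-channel maps.
per_channel = [
    F.conv2d(x[:, c:c + 1], weight[0:1, c:c + 1])
    for c in range(c_in)
]
rebuilt = torch.stack(per_channel).sum(dim=0)  # [1, 1, 6, 6]

print(full.shape)
print(torch.allclose(full[:, 0:1], rebuilt, atol=1e-5))
```

The two results should match up to floating-point tolerance, and the full output has one channel for each of the `c_out` kernels.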

Parameter Sharing and Receptive Fields

By applying the same kernel across all spatial locations, the network shares parameters, which reduces the model's weight count and enforces translation equivariance. Additionally, as layers are stacked, the receptive field of the neurons increases, allowing the model to capture larger spatial context in deeper layers.

This parameter efficiency helps CNNs scale to deep architectures and process high-resolution images with far fewer weights than fully connected layers would need, which lowers (but does not eliminate) the risk of overfitting. The combination of shared weights and local connectivity forms the core inductive bias of convolutional neural networks.
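To put numbers on both effects, the following sketch uses three hypothetical helper functions. It assumes square kernels, stride 1, no dilation, and a fully connected layer that maps the whole input volume to an output volume of the same spatial size.

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases for a k x k convolution."""
    return c_out * (c_in * k * k) + c_out


def dense_params(c_in, c_out, h, w):
    """Weights plus biases for a fully connected layer mapping a
    c_in x h x w input to a c_out x h x w output."""
    return (c_in * h * w) * (c_out * h * w) + c_out * h * w


def receptive_field(num_layers, k):
    """Receptive field of num_layers stacked k x k convolutions
    (stride 1, no dilation)."""
    return 1 + num_layers * (k - 1)


print(conv_params(3, 16, 3))                       # 448
print(dense_params(3, 16, 32, 32))                 # 50348032
print([receptive_field(n, 3) for n in (1, 2, 3)])  # [3, 5, 7]
```

For a 32x32 RGB input and 16 output channels, the shared 3x3 kernels need 448 parameters where the dense equivalent needs about 50 million, and each additional 3x3 layer widens the receptive field by two pixels.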

PyTorch Implementation

PyTorch's Conv2d module provides an efficient backend for executing convolutional layers.

PyTorch nn.Conv2d API

PyTorch provides the nn.Conv2d module to perform 2D convolutions. The key arguments are in_channels, out_channels, kernel_size, stride, and padding. The module automatically initializes the learnable kernel weights and biases.

<pre><code class="language-python">import torch
import torch.nn as nn


class ConvExample(nn.Module):
    def __init__(self):
        super().__init__()
        # 3 input channels (RGB), 16 output channels, 3x3 kernel
        self.conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)

    def forward(self, x):
        # Input shape: [batch_size, channels, height, width]
        return self.conv(x)


x = torch.randn(4, 3, 32, 32)
model = ConvExample()
out = model(x)
print("Output shape:", out.shape)  # [4, 16, 32, 32]
</code></pre>

The output tensor has shape [4, 16, 32, 32]: with a 3x3 kernel and the default stride of 1, padding=1 preserves the 32x32 spatial dimensions, while the channel dimension expands from 3 to 16.
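The spatial size follows from the output-size formula given in the nn.Conv2d documentation. The helper below, `conv_output_size`, is a hypothetical name for a small sketch of that formula applied to one spatial dimension.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0, dilation=1):
    """Output length along one spatial dimension of a convolution."""
    return (size + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


print(conv_output_size(32, 3, padding=1))            # 32: size preserved
print(conv_output_size(32, 3))                       # 30: no padding loses a 1-pixel border
print(conv_output_size(32, 3, stride=2, padding=1))  # 16: stride 2 halves the size
```

The same function can be applied independently to height and width when they differ.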

Verifying Weights and Bias Attributes

The weights of the convolutional layer are stored in the weight attribute, which has the shape [out_channels, in_channels, kernel_height, kernel_width] (with the default groups=1). The bias vector is stored in the bias attribute, with shape [out_channels]. Both are learnable parameters: backpropagation computes their gradients, and the optimizer uses those gradients to update them.

<pre><code class="language-python">import torch.nn as nn

# 3 input channels, 8 output channels, 5x5 kernel
conv_layer = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=5)
print("Conv Weight Shape:", conv_layer.weight.shape)  # [8, 3, 5, 5]
print("Conv Bias Shape:", conv_layer.bias.shape)  # [8]
</code></pre>

Inspecting these parameters helps verify that the layer was constructed with the expected tensor shapes. If the bias term is disabled by passing bias=False to the constructor, the bias attribute is set to None. This is a common choice when the convolution is followed immediately by batch normalization, whose own learnable shift makes the convolution bias redundant, so dropping it saves out_channels parameters.
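A short sketch of this comparison, assuming PyTorch is installed; `count_params` is a hypothetical helper that sums the elements of every learnable parameter in a module.

```python
import torch.nn as nn

with_bias = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=5)
without_bias = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=5, bias=False)


def count_params(module):
    """Total number of learnable parameters in a module."""
    return sum(p.numel() for p in module.parameters())


print(without_bias.bias is None)   # True
print(count_params(with_bias))     # 608 = 8 * 3 * 5 * 5 weights + 8 biases
print(count_params(without_bias))  # 600 = weights only

# Convolution without bias followed by batch normalization, which
# contributes its own per-channel scale and shift (8 + 8 parameters).
block = nn.Sequential(without_bias, nn.BatchNorm2d(8))
print(count_params(block))         # 616
```

The saving is small for a single layer, but it applies to every convolution that feeds directly into a normalization layer.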