Receptive Fields in CNNs

The receptive field is the specific region of the input image that influences the activation of a particular neuron in a deeper layer of the network.


Receptive Field Calculation

The receptive field grows with depth, at a rate set by the kernel sizes and strides of the preceding layers.

Receptive Field Expansion

In the first convolutional layer, each neuron's receptive field equals the kernel size. Each subsequent layer's receptive field is defined recursively in terms of the previous layer's, so neurons in deeper layers capture wider spatial context.

For a layer \\(l\\) with kernel size \\(k_l\\), stride \\(s_l\\), and receptive field \\(r_{l-1}\\) from the previous layer, the receptive field \\(r_l\\) is calculated as:

\\(r_l = r_{l-1} + (k_l - 1) \\times j_{l-1}\\)

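where \\(j_{l-1}\\) is the cumulative stride up to layer \\(l-1\\), calculated as \\(j_{l-1} = \\prod_{i=1}^{l-1} s_i\\), with base case \\(r_0 = 1\\) and \\(j_0 = 1\\) for a single input pixel. A layer's own stride does not change its own receptive field; it multiplies the contribution of every later layer, which is how stride accelerates receptive field growth.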
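
As a worked example, take a hypothetical stack of three 3x3 convolutions with strides 1, 2, and 1 (illustrative values, not taken from a specific architecture), starting from \\(r_0 = 1\\) and \\(j_0 = 1\\):

\\(r_1 = 1 + (3 - 1) \\times 1 = 3, \\quad j_1 = 1\\)

\\(r_2 = 3 + (3 - 1) \\times 1 = 5, \\quad j_2 = 2\\)

\\(r_3 = 5 + (3 - 1) \\times 2 = 9, \\quad j_3 = 2\\)

The stride of 2 in the second layer leaves \\(r_2\\) unchanged and only shows up in \\(r_3\\), because adjacent outputs of layer 2 are two input pixels apart.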

Effective Receptive Field

The theoretical receptive field grows linearly with depth in a stack of stride-1 layers, and geometrically when strided layers are stacked, but the effective receptive field (ERF) is typically much smaller. The ERF is the part of the input that has a non-negligible impact on the neuron's activation.

Analysis of how strongly each input pixel affects an output unit shows that the influence is approximately Gaussian, peaked at the center of the theoretical receptive field. Peripheral pixels therefore contribute very little, making the effective receptive field smaller than the theoretical one.
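
The Gaussian shape can be illustrated with a rough sketch under strong simplifying assumptions that are not part of the text above: a 1D input, stride 1, uniform kernel weights, and no non-linearities. In that setting the influence of each input position on one output unit is the kernel convolved with itself once per layer. The helper names `convolve` and `influence_profile` are hypothetical.

```python
def convolve(a, b):
    """Full discrete convolution of two lists of floats."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def influence_profile(num_layers, kernel_size=3):
    """Relative influence of each input position on one output unit,
    assuming uniform weights, stride 1, and no non-linearities."""
    kernel = [1.0 / kernel_size] * kernel_size
    profile = [1.0]
    for _ in range(num_layers):
        profile = convolve(profile, kernel)
    return profile


profile = influence_profile(num_layers=10)
width = len(profile)   # theoretical receptive field: 10 * (3 - 1) + 1 = 21
center = width // 2
# Share of the total influence carried by the central third (7 of 21 positions)
central = sum(profile[center - 3 : center + 4])
print(f"theoretical RF = {width}, central 7 positions carry {central:.0%}")
```

The profile is bell-shaped: most of the influence sits in the middle of the field, and the outermost positions contribute almost nothing. Real networks with learned weights and non-linearities will differ in detail.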

Receptive Field Design Trade-offs

A network needs a receptive field large enough to cover the objects it must recognize, which matters most for large objects in visual scenes.

Small vs. Large Kernels

Stacking multiple small kernels (e.g., three 3x3 convolutions) yields the same receptive field as a single large kernel (e.g., one 7x7 convolution), but requires fewer parameters and incorporates more non-linear activations.

Specifically, three 3x3 convolutions have a receptive field of 7 but need only \\(3 \\times (3 \\times 3) = 27\\) weights per input-output channel pair, compared with \\(7 \\times 7 = 49\\) weights for a single 7x7 kernel. This design choice is a key feature of architectures like VGG.
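
The comparison can be scaled to full layers with a short sketch. It assumes every layer has the same number of input and output channels and ignores biases; the function name `conv_weights` and the channel width of 64 are illustrative choices, not values from the text.

```python
def conv_weights(kernel_size, channels, num_layers=1):
    """Weight count (biases ignored) for a stack of square convolutions
    with `channels` input and output channels in every layer."""
    return num_layers * kernel_size * kernel_size * channels * channels


channels = 64  # illustrative channel width (assumption)
stacked = conv_weights(3, channels, num_layers=3)  # 27 * C^2
single = conv_weights(7, channels)                 # 49 * C^2
print(stacked, single, f"{stacked / single:.2f}")  # 110592 200704 0.55
```

The ratio 27/49 is independent of the channel width, so the saving holds at any layer size under these assumptions.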

Receptive Field for Semantic Tasks

Tasks like object detection and semantic segmentation require large receptive fields to capture the context of large objects. Without sufficient receptive field, the model cannot identify large objects because it can only see local details.

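To increase the receptive field without discarding spatial details through pooling, architectures use dilated convolutions. A kernel of size \\(k\\) with dilation rate \\(d\\) covers a footprint of \\(d \\times (k - 1) + 1\\) input positions per side, so the footprint expands while the number of weights stays the same.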
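
A minimal sketch of how dilation enters the receptive field calculation, assuming the footprint of a dilated kernel is d * (k - 1) + 1 and that it is accumulated with the same cumulative-stride rule used for ordinary convolutions. The function name `compute_rf_dilated` and the dilation schedule (1, 2, 4) are hypothetical.

```python
def compute_rf_dilated(layers):
    """layers: list of dicts with 'k' (kernel), 's' (stride), 'd' (dilation)."""
    rf, jump = 1, 1
    for layer in layers:
        # Footprint of the dilated kernel, measured in that layer's input grid
        k_eff = layer['d'] * (layer['k'] - 1) + 1
        rf += (k_eff - 1) * jump
        jump *= layer['s']
    return rf


plain = [{'k': 3, 's': 1, 'd': 1}] * 3
dilated = [{'k': 3, 's': 1, 'd': d} for d in (1, 2, 4)]
print(compute_rf_dilated(plain), compute_rf_dilated(dilated))  # 7 15
```

Both stacks use three 3x3 kernels, so the weight count is identical, yet the dilated stack covers 15 input positions per side instead of 7.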

Receptive Field Verification in PyTorch

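Let's calculate receptive field growth directly to check design choices before building a model. The calculation is plain Python and does not need PyTorch itself; the kernel size and stride in each layer spec correspond to the kernel_size and stride arguments of a convolution layer such as torch.nn.Conv2d.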

Mathematical Calculation Function

We can write a Python helper function to calculate the receptive field size at the output of a stack of CNN layers. This helps verify that the model has a sufficient receptive field for the target task.

<pre><code class="language-python">def compute_rf(layers):
    # layers is a list of dicts: [{'k': kernel_size, 's': stride}]
    rf = 1
    jump = 1
    for layer in layers:
        k = layer['k']
        s = layer['s']
        rf = rf + (k - 1) * jump
        jump = jump * s
    return rf, jump

# Sequence: three 3x3 convs with stride 1
sequence = [{'k': 3, 's': 1}, {'k': 3, 's': 1}, {'k': 3, 's': 1}]
rf, _ = compute_rf(sequence)
print("Theoretical Receptive Field:", rf)  # Expected: 7</code></pre>

Calculating the receptive field before building a deep architecture ensures that neurons in the final layer have access to a large enough portion of the input image to identify target objects.

Receptive Field Growth with Stride

Strided layers accelerate receptive field growth. Consider a two-layer sequence of 3x3 convolutions whose first layer has stride 2: every layer after it grows the receptive field twice as fast as it would with stride 1.

<pre><code class="language-python"># Sequence: Conv (k=3, s=2), Conv (k=3, s=1)
sequence_stride = [{'k': 3, 's': 2}, {'k': 3, 's': 1}]
rf_stride, _ = compute_rf(sequence_stride)
print("Receptive Field with Stride:", rf_stride)  # Expected: 7</code></pre>

In this configuration, the second layer's neurons have a receptive field of 7 instead of the 5 they would have with stride 1 (1 + 2 + 2 × 2 = 7 versus 1 + 2 + 2 = 5), demonstrating how strides expand receptive fields while simultaneously downsampling spatial dimensions.
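
The trade-off between receptive field and resolution can be tracked in one pass. The sketch below assumes the standard convolution output-size formula, floor((n + 2p - k) / s) + 1, and adds a padding entry 'p' to each layer spec; the padding values, the 224-pixel input size, and the function name `rf_and_output_size` are illustrative assumptions.

```python
def rf_and_output_size(layers, input_size):
    """Track receptive field and spatial size through a conv stack.

    layers: list of dicts with 'k' (kernel), 's' (stride), 'p' (padding).
    """
    rf, jump, size = 1, 1, input_size
    for layer in layers:
        k, s, p = layer['k'], layer['s'], layer['p']
        rf += (k - 1) * jump
        jump *= s
        size = (size + 2 * p - k) // s + 1  # standard conv output-size formula
    return rf, size


stride1 = [{'k': 3, 's': 1, 'p': 1}, {'k': 3, 's': 1, 'p': 1}]
stride2 = [{'k': 3, 's': 2, 'p': 1}, {'k': 3, 's': 1, 'p': 1}]
print(rf_and_output_size(stride1, 224))  # (5, 224)
print(rf_and_output_size(stride2, 224))  # (7, 112)
```

With stride 2 in the first layer, the receptive field grows from 5 to 7 while the feature map shrinks from 224 to 112 positions per side.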