ResNet (Residual Networks) and Skip Connections

ResNet introduced skip connections that bypass layers, allowing gradients to flow directly through the network and enabling the training of extremely deep architectures.


The Residual Learning Framework

Residual learning reformulates stacked layers to fit residual mappings instead of direct underlying representations.

Mathematical Formulation

Instead of forcing a stack of layers to fit a desired underlying mapping \\(H(x)\\), ResNet reformulates the layers to learn a residual mapping \\(F(x) := H(x) - x\\). The original mapping is cast into \\(F(x) + x\\), which is implemented by adding a skip connection (or identity shortcut) that feeds the input \\(x\\) directly to the output of the convolutional layers:

\\(H(x) = F(x, \\{W_i\\}) + x\\)

This formulation rests on the hypothesis that a residual mapping is easier to optimize than the original, unreferenced mapping. If the optimal mapping is close to the identity, it is easier for the optimizer to drive the weights of the residual branch, and hence \\(F(x)\\), toward zero than to make a stack of non-linear layers approximate the identity from scratch. In practice this largely alleviates the degradation problem, in which deeper plain networks reach higher training error than shallower ones, and it allows networks to scale to hundreds or even thousands of layers.
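To make the identity argument concrete, the sketch below builds a toy residual branch and zeroes its last layer, so that \\(F(x) = 0\\) and the block returns its input unchanged. The class name `ToyResidual` and the layer sizes are illustrative assumptions, and the final ReLU of a real ResNet block is left out to keep the identity exact.

```python
import torch
import torch.nn as nn


class ToyResidual(nn.Module):
    """Hypothetical minimal block computing H(x) = F(x) + x."""

    def __init__(self, channels):
        super().__init__()
        # Residual branch F(x): conv, ReLU, conv (shape-preserving).
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
        )

    def forward(self, x):
        return self.branch(x) + x


torch.manual_seed(0)
block = ToyResidual(channels=8)

# Zero the last conv so the residual branch outputs F(x) = 0 for every input.
nn.init.zeros_(block.branch[2].weight)

x = torch.randn(2, 8, 16, 16)
is_identity = torch.equal(block(x), x)
print("Block acts as identity:", is_identity)
```

Reaching the identity only requires zeroing one layer's weights, whereas a plain stack of the same layers would have to learn a specific non-trivial weight configuration to pass its input through unchanged.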

Skip Connection Variants (Projection Shortcuts)

When the input \\(x\\) and the output \\(F(x)\\) have the same dimensions, they can be added element-wise directly. However, if the convolutional layers change the spatial dimensions (via stride) or the channel depth, a simple addition is impossible.

In these cases, a projection shortcut is used to adjust dimensions. This is typically implemented using a 1x1 convolution, applied with the same stride as the residual branch, that projects the input \\(x\\) to the target channel count and spatial size:

\\(H(x) = F(x, \\{W_i\\}) + W_s x\\)

where \\(W_s\\) represents the projection weights. This projection ensures that tensors match in size for the final addition operation.
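The shape bookkeeping can be checked with a short sketch. The tensor sizes (64 to 128 channels, stride 2) are assumed for illustration, and a single convolution stands in for the full residual branch \\(F\\).

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

# Residual branch F(x): the strided 3x3 conv halves H and W and doubles the channels.
branch = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1, bias=False)
fx = branch(x)       # shape [1, 128, 16, 16], no longer compatible with x

# W_s: a 1x1 conv with the same stride, so the shortcut matches F(x) in shape.
w_s = nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False)
shortcut = w_s(x)    # shape [1, 128, 16, 16]

h = fx + shortcut    # element-wise addition is now well defined
print(tuple(x.shape), tuple(fx.shape), tuple(h.shape))
```

Attempting `fx + x` directly would raise a shape error, which is exactly the case the projection shortcut handles.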

ResNet-34 vs. ResNet-50 (Bottleneck Design)

Deeper ResNet architectures transition from basic double-conv blocks to triple-conv bottleneck designs to control parameter scaling.

Basic Block vs. Bottleneck Block

Shallow ResNet models (like ResNet-18 and ResNet-34) use the "Basic Block," which consists of two stacked 3x3 convolutional layers. Deep ResNet models (like ResNet-50, ResNet-101, and ResNet-152) use the "Bottleneck Block" to control computational costs.

The bottleneck block uses three stacked convolutions: a 1x1 convolution to reduce channels, a 3x3 convolution at the reduced width, and a final 1x1 convolution to restore the channel depth. Because the expensive 3x3 convolution runs on fewer channels, the block needs far fewer parameters and FLOPs than two full-width 3x3 convolutions would, which keeps much deeper networks affordable in compute and memory.
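A rough weight count illustrates the saving. The widths below (256 channels outside the block, 64 inside the bottleneck) are assumed for illustration, and the count ignores batch-norm parameters and any projection shortcut.

```python
def conv_params(in_c, out_c, k):
    """Weight count of a k x k convolution without bias."""
    return in_c * out_c * k * k


width = 256  # channels entering and leaving the block (assumed)
mid = 64     # reduced width inside the bottleneck (assumed: width // 4)

# Two 3x3 convolutions at full width, as in a basic block.
basic = 2 * conv_params(width, width, 3)

# 1x1 reduce, 3x3 at reduced width, 1x1 restore, as in a bottleneck block.
bottleneck = (
    conv_params(width, mid, 1)
    + conv_params(mid, mid, 3)
    + conv_params(mid, width, 1)
)

print(basic, bottleneck, round(basic / bottleneck, 1))
```

Under these assumptions the two full-width 3x3 convolutions need 1,179,648 weights, while the bottleneck needs 69,632, roughly 17 times fewer. At a fixed spatial resolution the multiply-accumulate count scales by the same factor.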

Impact on Gradient Flow and Loss Landscapes

The introduction of skip connections has a profound impact on the optimization process. Because the identity path contributes its own term to the derivative of each block, gradients can flow directly from the final layers to the early layers during backpropagation, which mitigates the vanishing gradient problem.
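By the chain rule, \\(H(x) = F(x) + x\\) gives \\(\\partial H / \\partial x = \\partial F / \\partial x + I\\), so the gradient keeps an identity term even when \\(\\partial F / \\partial x\\) is small. The following sketch checks this with autograd on a one-layer toy example. The element-wise "layer" and its tiny weights are assumptions chosen to mimic a layer that passes on almost no gradient.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, requires_grad=True)
w = torch.full((4,), 1e-3)  # deliberately tiny weights (assumed for illustration)

# Plain layer: the gradient w.r.t. x is scaled down by the small weights.
plain = (w * x).sum()
(grad_plain,) = torch.autograd.grad(plain, x)

# Residual layer: H(x) = F(x) + x, so dH/dx = dF/dx + 1.
residual = (w * x + x).sum()
(grad_residual,) = torch.autograd.grad(residual, x)

print(grad_plain)     # every entry is about 0.001
print(grad_residual)  # every entry is about 1.001
```

Stacking many such layers multiplies these per-layer factors, which is why the plain path shrinks the gradient toward zero while the residual path keeps it near its original magnitude in this toy setting.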

Additionally, empirical visualizations of loss surfaces suggest that skip connections make the loss landscape considerably smoother, replacing a rugged, chaotic surface with one that is closer to convex around the minima. This helps keep training stable and allows convergence with standard optimizers like SGD, even at extreme depths.

PyTorch Implementation of a Residual Block

Let's build a custom ResNet Basic Block and Bottleneck Block in PyTorch, commenting on shape dimensions.

Building a Custom ResNet Block

The code below shows how to implement a basic Residual Block with a projection shortcut in PyTorch.

<pre><code class="language-python">import torch
import torch.nn as nn


class ResidualBlock(nn.Module):
    def __init__(self, in_c, out_c, stride=1):
        super().__init__()
        # Main path: two 3x3 convs; only the first one may downsample via stride.
        self.conv1 = nn.Conv2d(in_c, out_c, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_c)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(out_c, out_c, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_c)

        # Shortcut: identity by default, 1x1 projection when shapes would differ.
        self.shortcut = nn.Sequential()
        if stride != 1 or in_c != out_c:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_c, out_c, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_c)
            )

    def forward(self, x):
        residual = self.shortcut(x)   # [N, out_c, H/stride, W/stride]
        out = self.conv1(x)           # [N, out_c, H/stride, W/stride]
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)         # shape unchanged
        out = self.bn2(out)
        out += residual               # element-wise addition of both paths
        return self.relu(out)


x = torch.randn(2, 64, 32, 32)
block = ResidualBlock(in_c=64, out_c=128, stride=2)
out = block(x)
print("Residual Block shape:", out.shape)  # [2, 128, 16, 16]</code></pre>

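In this block, the shortcut path switches from an identity to a 1x1 convolution (followed by batch normalization) whenever the stride is not 1 or the input and output channel counts differ. In the example, it maps 64 channels to 128 and halves the spatial size from 32x32 to 16x16, so both paths match for the element-wise addition step.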

Bottleneck Block Implementation

We can extend this modular design to implement a Bottleneck Block, which is crucial for building larger architectures like ResNet-50.

<pre><code class="language-python">import torch
import torch.nn as nn


class BottleneckBlock(nn.Module):
    expansion = 4  # output channels = base_c * expansion

    def __init__(self, in_c, base_c, stride=1):
        super().__init__()
        out_c = base_c * self.expansion
        # 1x1 reduce, 3x3 at reduced width (carries the stride), 1x1 expand.
        self.conv1 = nn.Conv2d(in_c, base_c, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(base_c)
        self.conv2 = nn.Conv2d(base_c, base_c, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(base_c)
        self.conv3 = nn.Conv2d(base_c, out_c, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_c)
        self.relu = nn.ReLU(inplace=True)

        # Shortcut: identity by default, 1x1 projection when shapes would differ.
        self.shortcut = nn.Sequential()
        if stride != 1 or in_c != out_c:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_c, out_c, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_c)
            )

    def forward(self, x):
        residual = self.shortcut(x)                     # [N, out_c, H/stride, W/stride]
        out = self.relu(self.bn1(self.conv1(x)))        # [N, base_c, H, W]
        out = self.relu(self.bn2(self.conv2(out)))      # [N, base_c, H/stride, W/stride]
        out = self.bn3(self.conv3(out))                 # [N, out_c, H/stride, W/stride]
        out += residual
        return self.relu(out)


x = torch.randn(2, 64, 32, 32)
bottleneck = BottleneckBlock(in_c=64, base_c=32)
out_btn = bottleneck(x)
print("Bottleneck shape:", out_btn.shape)  # [2, 128, 32, 32]</code></pre>

The bottleneck block uses a 1x1 convolution to project the input into a lower-dimensional channel space before running the 3x3 convolution, which saves computation. The final 1x1 layer then expands the channels to 4x the base width (here from 32 to 128).