The Evolution of CNNs: LeNet-5

LeNet-5, introduced by Yann LeCun and colleagues in 1998, is an early and influential convolutional neural network that established the now-standard pipeline of stacked convolutions, pooling, and fully connected layers.

LeNet-5 Architecture Structure

Designed for optical character recognition, LeNet-5 uses alternating convolutional and subsampling layers to extract robust features.

Alternating Convolution and Subsampling

The network takes 32x32 grayscale images of handwritten digits as input. It starts with a 5x5 convolution (C1) that yields 6 feature maps of size 28x28, followed by a subsampling layer (S2) that downsamples them to 14x14. A second 5x5 convolution (C3) then yields 16 feature maps of size 10x10, and another subsampling layer (S4) reduces them to 5x5.
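The sizes quoted above follow from the usual output-size formula for an unpadded convolution or pooling window, out = floor((in - kernel) / stride) + 1. A minimal sketch in plain Python, where `out_size` is a hypothetical helper introduced for illustration and a 2x2 window with stride 2 is assumed for the subsampling layers (consistent with the halving from 28 to 14):

```python
def out_size(in_size: int, kernel: int, stride: int = 1) -> int:
    """Spatial size after an unpadded convolution or pooling window."""
    return (in_size - kernel) // stride + 1


# (layer name, kernel size, stride) for the feature extractor described above
layers = [("C1", 5, 1), ("S2", 2, 2), ("C3", 5, 1), ("S4", 2, 2)]

size = 32  # input images are 32x32
sizes = {}
for name, kernel, stride in layers:
    size = out_size(size, kernel, stride)
    sizes[name] = size
    print(f"{name}: {size}x{size}")
```

Each 5x5 convolution trims 4 pixels from every spatial dimension because no padding is used, and each subsampling step halves it.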

This alternating structure established the core pattern of convolutional feature extraction: convolutions detect local patterns, and pooling reduces the spatial dimensions. The subsampling layers in LeNet-5 were average pooling layers that also applied a learnable coefficient and bias to each feature map, unlike modern average pooling layers, which have no parameters.
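A trainable subsampling layer of this kind might be sketched in PyTorch as follows. The class name `TrainableSubsample`, the initial values of the coefficient and bias, and the omission of any nonlinearity after the layer are assumptions made for illustration, not details taken from the original network.

```python
import torch
import torch.nn as nn


class TrainableSubsample(nn.Module):
    """Average pooling followed by a per-channel learnable scale and bias."""

    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)
        # One coefficient and one bias per feature map, broadcast over H and W.
        # Initial values (1 and 0) are an assumption for this sketch.
        self.coeff = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pool(x) * self.coeff + self.bias


s2 = TrainableSubsample(channels=6)
y = s2(torch.randn(4, 6, 28, 28))
n_params = sum(p.numel() for p in s2.parameters())
print(y.shape, n_params)
```

With 6 feature maps the layer carries only 12 trainable parameters, whereas `nn.AvgPool2d` on its own carries none.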

Connection Tables in Layer C3

A unique design choice in LeNet-5 is the connection table between S2 and C3. Instead of connecting every input channel to every output channel, C3 uses a sparse connection matrix where different output channels are connected to specific subsets of input channels (e.g., some look at channels 0,1,2, while others look at 1,2,3).

This sparse connectivity was designed to reduce the number of parameters and to break symmetry in the network, pushing different channels to learn complementary features. Modern networks usually connect every input channel to every output channel, since the extra computation is affordable; where sparsity is still wanted, it takes structured forms such as grouped or depthwise separable convolutions rather than a hand-designed table.
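One way to sketch such a connection table in PyTorch is to multiply the weights of an ordinary convolution by a fixed 0/1 mask. The table below follows the grouping usually given for C3 (six contiguous triples, six contiguous quadruples, three non-contiguous quadruples, and one map connected to all inputs); the exact index assignment and the names `C3_TABLE` and `SparseC3` are assumptions for illustration and should be checked against the original paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed S2 -> C3 connection table: for each of the 16 output maps,
# the input maps (0..5) it reads from.
C3_TABLE = (
    [[(i + k) % 6 for k in range(3)] for i in range(6)]    # 6 contiguous triples
    + [[(i + k) % 6 for k in range(4)] for i in range(6)]  # 6 contiguous quadruples
    + [[0, 1, 3, 4], [1, 2, 4, 5], [0, 2, 3, 5]]           # 3 non-contiguous quadruples
    + [[0, 1, 2, 3, 4, 5]]                                 # 1 map sees all inputs
)


class SparseC3(nn.Module):
    """5x5 convolution whose channel connectivity is restricted by a fixed mask."""

    def __init__(self, table=C3_TABLE, in_channels=6, kernel_size=5):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, len(table), kernel_size)
        mask = torch.zeros(len(table), in_channels, 1, 1)
        for out_ch, in_chs in enumerate(table):
            mask[out_ch, in_chs] = 1.0
        # Buffer: follows the module across devices but is never trained
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Zero out the weights of unconnected (output, input) channel pairs
        return F.conv2d(x, self.conv.weight * self.mask, self.conv.bias)


c3 = SparseC3()
out = c3(torch.randn(4, 6, 14, 14))
connections = int(c3.mask.sum())
active_params = connections * 5 * 5 + 16  # unmasked weights plus biases
dense_params = 16 * 6 * 5 * 5 + 16        # fully connected equivalent
print(out.shape, connections, active_params, dense_params)
```

Under this table the layer uses 60 of the 96 possible channel connections, which gives 1,516 effective parameters against 2,416 for a dense layer. The underlying `nn.Conv2d` still stores all the weights; the masked ones simply receive zero gradient and never affect the output.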

Classifier and Output Layers

The final layers of LeNet-5 map high-level feature vectors to character predictions using Euclidean distances.

Radial Basis Functions (RBF)

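At the end of the feature extraction pipeline, LeNet-5 has a layer called C5 with 120 units. C5 is formally a convolutional layer, but its 5x5 kernels cover the entire 5x5 output of S4, so it behaves as a fully connected layer. It is followed by a fully connected layer (F6) with 84 units. The output layer does not use a standard Softmax activation. Instead, it uses Euclidean Radial Basis Function (RBF) units to compute the output scores.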

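Each RBF unit computes the squared Euclidean distance between the 84-dimensional F6 output and a prototype vector representing its class. In the original design these prototypes were fixed vectors of +1 and -1 values depicting a stylized 7x12 bitmap of each character, which is where the 84 units of F6 come from. The class with the smallest distance is chosen as the prediction, so classification amounts to finding the character template closest to the F6 output.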
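The distance computation can be sketched in PyTorch as below. The class name `RBFOutput` is hypothetical, and the prototypes here are random +1/-1 placeholders chosen only to make the example runnable; they are not the templates of the original network.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class RBFOutput(nn.Module):
    """Scores each class by squared Euclidean distance to a fixed prototype."""

    def __init__(self, in_features: int = 84, num_classes: int = 10):
        super().__init__()
        # Placeholder prototypes with random +1/-1 entries (an assumption
        # for illustration, not the original character templates)
        protos = torch.randint(0, 2, (num_classes, in_features)).float() * 2 - 1
        self.register_buffer("prototypes", protos)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, in_features] -> distances: [batch, num_classes]
        diff = x.unsqueeze(1) - self.prototypes.unsqueeze(0)
        return diff.pow(2).sum(dim=-1)


rbf = RBFOutput()
f6 = rbf.prototypes[[3, 7]].clone()  # inputs that coincide with prototypes 3 and 7
distances = rbf(f6)
pred = distances.argmin(dim=1)       # smallest distance wins
print(distances.shape, pred)
```

Unlike Softmax logits, these scores are distances, so the prediction is taken with `argmin` rather than `argmax`.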

Limitations and Modern Evolution

While LeNet-5 was a breakthrough, it had limitations that kept it from scaling to larger datasets. It relied on saturating activations (a scaled tanh), which cause gradients to vanish as networks get deeper. It also predates modern normalization layers like Batch Normalization and regularizers like Dropout.

The network was also constrained by the hardware of the late 1990s and was trained on CPUs. GPU training, combined with larger datasets and ReLU activations, later made much deeper networks practical, such as AlexNet (2012), which could handle high-resolution color images.

PyTorch Implementation of LeNet-5

Let's build a PyTorch implementation of the LeNet-5 architecture, modernizing its pooling and dense layers.

Building LeNet-5 in PyTorch

The code below implements the LeNet-5 layer layout in PyTorch with three simplifications: the trainable subsampling layers become MaxPool2d, the sparse C3 connection table becomes a dense Conv2d, and the RBF output layer becomes a Linear layer that produces class logits.

<pre><code class="language-python">import torch
import torch.nn as nn


class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # Input: [batch, 1, 32, 32]
            nn.Conv2d(1, 6, kernel_size=5),         # [batch, 6, 28, 28]
            nn.Tanh(),
            nn.MaxPool2d(kernel_size=2, stride=2),  # [batch, 6, 14, 14]
            nn.Conv2d(6, 16, kernel_size=5),        # [batch, 16, 10, 10]
            nn.Tanh(),
            nn.MaxPool2d(kernel_size=2, stride=2),  # [batch, 16, 5, 5]
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),             # [batch, 120]
            nn.Tanh(),
            nn.Linear(120, 84),                     # [batch, 84]
            nn.Tanh(),
            nn.Linear(84, num_classes),             # [batch, num_classes]
        )

    def forward(self, x):
        return self.classifier(self.features(x))


x = torch.randn(4, 1, 32, 32)
model = LeNet5()
out = model(x)
print("LeNet-5 output shape:", out.shape)  # [4, 10]</code></pre>

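Keeping the nn.Tanh activations and the 120-84-10 classifier preserves the layer sizes and nonlinearity of the classical architecture, even though the pooling and output layers are modernized. The model runs on the CPU by default; moving the model and its inputs to a GPU with .to("cuda") enables PyTorch's GPU acceleration.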

Modern Updates to the Classic Design

When using LeNet-5 for modern benchmarks, developers typically replace the Tanh activations with ReLU, which tends to speed up training and mitigates vanishing gradients.

Additionally, inserting Batch Normalization after the convolutions and adding Dropout in front of the fully connected layers can improve generalization, allowing this simple architecture to achieve high accuracy on datasets like MNIST and Fashion-MNIST.
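A minimal sketch of such a modernized variant is shown below. The class name `ModernLeNet`, the dropout rate of 0.5, and the exact placement of the Dropout layers are assumptions for illustration rather than a canonical recipe.

```python
import torch
import torch.nn as nn


class ModernLeNet(nn.Module):
    """LeNet-5 layout with ReLU, Batch Normalization, and Dropout."""

    def __init__(self, num_classes: int = 10, p_drop: float = 0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),         # [batch, 6, 28, 28]
            nn.BatchNorm2d(6),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),  # [batch, 6, 14, 14]
            nn.Conv2d(6, 16, kernel_size=5),        # [batch, 16, 10, 10]
            nn.BatchNorm2d(16),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),  # [batch, 16, 5, 5]
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(p_drop),                     # assumed placement
            nn.Linear(16 * 5 * 5, 120),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, num_classes),             # class logits
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))


model = ModernLeNet().eval()  # eval mode disables Dropout and uses BatchNorm running stats
with torch.no_grad():
    logits = model(torch.randn(4, 1, 32, 32))
print(logits.shape)
```

MNIST and Fashion-MNIST images are 28x28, so they would need to be padded to 32x32 before being fed to this model, or the first convolution could be given `padding=2` instead.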