The Problem with Flattening Images for MLPs

Attempting to process raw digital images with Multi-Layer Perceptrons (MLPs) by flattening them into 1D vectors destroys crucial 2D spatial structures and leads to an unsustainable parameter explosion.

Loss of Spatial Topology

MLPs require 1D input vectors, forcing 2D images to be flattened row-by-row or column-by-column.

Destroying Neighboring Relationships

In digital images, nearby pixels are highly correlated; together they form textures, edges, and objects. Flattening a 2D grid of shape \\((H, W)\\) row by row into a 1D vector of length \\(H \\times W\\) keeps horizontally adjacent pixels next to each other but pulls vertically adjacent pixels \\(W\\) positions apart. For example, in a \\(256 \\times 256\\) image, the pixel at \\((0, 0)\\) has flat index 0 while the pixel directly below it, \\((1, 0)\\), has flat index 256, so 255 other elements sit between them in the flattened vector.

This separation makes it hard for fully connected layers to capture local correlations, because the model must learn the relationship between these distant indices from scratch. A fully connected layer has no built-in notion of which inputs are neighbors: applying the same fixed permutation to the pixels of every image would leave the learning problem unchanged. MLPs therefore discard the spatial topology of the image, which tends to hurt generalization on visual tasks.
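The index arithmetic can be checked with a minimal sketch. It assumes row-major flattening of a single-channel image, and the helper name `flat_index` is purely illustrative, not part of any library.

<pre><code class="language-python"># Hypothetical helper: row-major flat index of pixel (row, col) in an H x W grid.
def flat_index(row, col, width):
    return row * width + col

H, W = 256, 256

top = flat_index(0, 0, W)      # pixel (0, 0)
right = flat_index(0, 1, W)    # horizontal neighbor (0, 1)
below = flat_index(1, 0, W)    # vertical neighbor (1, 0)

horizontal_gap = right - top   # horizontal neighbors stay adjacent
vertical_gap = below - top     # vertical neighbors end up a full row apart

print("horizontal gap:", horizontal_gap)
print("vertical gap:", vertical_gap)</code></pre>

With column-by-column flattening the roles swap: vertical neighbors stay adjacent and horizontal neighbors are pulled apart instead.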

Lack of Translation Invariance

A key property of visual patterns is translation invariance: an object (like a cat) is still the same object regardless of where it appears in the image. However, fully connected layers assign a separate weight to every coordinate of the flattened input, so when a feature shifts, it lands on different flattened indices and is processed by a different set of weights.

As a result, an MLP has no inherent translation invariance. The model must learn to recognize the same object at every possible location in the image separately, which requires a massive amount of training data and leads to poor robustness when objects appear in unexpected positions.
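As a rough illustration, the sketch below places the same small patch at two positions in a flattened image and lists which input indices it occupies. The 8x8 image size, the 2x2 patch, and the helper `feature_indices` are assumptions chosen for readability.

<pre><code class="language-python"># Assumption: a 2x2 "feature" inside an 8x8 single-channel image,
# flattened row by row. All sizes are illustrative only.
W = 8

def feature_indices(top, left, size=2, width=W):
    """Flat (row-major) indices covered by a size x size patch."""
    return {
        (top + dr) * width + (left + dc)
        for dr in range(size)
        for dc in range(size)
    }

original = feature_indices(top=1, left=1)
shifted = feature_indices(top=3, left=4)   # same patch, moved down 2 and right 3

# Indices (and therefore first-layer weights) used by both placements.
shared = original.intersection(shifted)

print("original indices:", sorted(original))
print("shifted indices: ", sorted(shifted))
print("shared indices:  ", sorted(shared))</code></pre>

Because the two placements share no input indices, nothing a dense layer learns about the patch at the first position transfers automatically to the second.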

Parameter Explosion

Dense connections in MLPs lead to a massive number of weights as image resolution scales.

Mathematical Dimension Scaling

Consider a moderate-resolution RGB image with \\(256 \\times 256 \\times 3 = 196,608\\) input values (65,536 pixels, each with three color channels). If the first hidden layer has 1,024 neurons, that layer alone will have \\(196,608 \\times 1,024 = 201,326,592\\) weights, plus 1,024 biases. Stored as 32-bit floats, these weights occupy about 805 MB before counting gradients or optimizer state, which makes training expensive in both computation and memory.

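The parameter count grows linearly with the number of input values, so the problem worsens as resolution increases. A standard 1080p image (\\(1920 \\times 1080\\)) contains over 2 million pixels, or about 6.2 million values across three color channels; connecting them to the same 1,024-neuron layer would take roughly 6.4 billion weights. This scaling makes plain MLPs impractical for high-resolution computer vision tasks and drives the need for more efficient architectures.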
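The scaling can be reproduced with plain arithmetic. The sketch below assumes a single dense layer with 1,024 hidden units on a flattened RGB image; `dense_params` is a hypothetical helper, and the memory figure assumes 4 bytes per parameter (float32).

<pre><code class="language-python"># Assumption: one dense layer with 1,024 hidden units on a flattened RGB image.
def dense_params(height, width, channels=3, hidden=1024):
    inputs = height * width * channels   # length of the flattened input
    weights = inputs * hidden            # one weight per (input, neuron) pair
    biases = hidden                      # one bias per neuron
    return weights + biases

resolutions = [("256x256", 256, 256), ("1080p", 1080, 1920)]

for name, h, w in resolutions:
    total = dense_params(h, w)
    gigabytes = total * 4 / 1e9          # float32: 4 bytes per parameter
    print(f"{name}: {total:,} parameters, about {gigabytes:.2f} GB as float32")</code></pre>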

Overfitting and Generalization Gaps

Having a large number of parameters in the first layer makes the model highly susceptible to overfitting. The network has enough capacity to simply memorize the pixel values of the training images instead of learning general visual features, leading to low training error but poor test performance.

To counter this overfitting, developers would need heavy regularization or millions of training samples, which is often infeasible. Convolutional networks avoid the problem by replacing global dense connections with local receptive fields and shared weights.

Transition to Shared Weights

Convolutional layers resolve the limitations of MLPs by leveraging localized weights and parameter sharing.

Local Connectivity and Parameter Sharing

Convolutional layers address the issues of MLPs by enforcing two key inductive biases: local connectivity and parameter sharing. Local connectivity means that each neuron in a convolutional layer only connects to a local spatial neighborhood of the input, drastically reducing the number of weights per neuron.

Parameter sharing means that the same weight matrix (kernel) is applied across all spatial locations of the input. This sharing enforces translation equivariance, meaning that if a feature shifts in the input, the corresponding activation shifts by the same amount in the output map, allowing the model to detect features regardless of their location.
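The equivariance property can be verified on a toy example. The sketch below slides one shared 3x3 kernel over a small grid in plain Python; it assumes wrap-around (circular) borders so that the property holds exactly, whereas a zero-padded layer behaves this way only away from the image borders. The helpers `correlate` and `shift` and the example values are illustrative.

<pre><code class="language-python"># Minimal sketch: one shared 3x3 kernel applied at every position of a grid,
# with wrap-around (circular) borders so shift equivariance is exact.
def correlate(image, kernel):
    h, w = len(image), len(image[0])
    k = len(kernel) // 2
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # The same kernel weights are reused at every (r, c).
            out[r][c] = sum(
                kernel[dr + k][dc + k] * image[(r + dr) % h][(c + dc) % w]
                for dr in range(-k, k + 1)
                for dc in range(-k, k + 1)
            )
    return out

def shift(grid, down, right):
    """Circularly shift a 2D grid down and to the right."""
    h, w = len(grid), len(grid[0])
    return [[grid[(r - down) % h][(c - right) % w] for c in range(w)]
            for r in range(h)]

# Illustrative 6x6 image with one bright 2x2 blob, and an edge-like kernel.
image = [[0] * 6 for _ in range(6)]
for r, c in [(1, 1), (1, 2), (2, 1), (2, 2)]:
    image[r][c] = 1
kernel = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]

shift_then_conv = correlate(shift(image, 2, 3), kernel)
conv_then_shift = shift(correlate(image, kernel), 2, 3)

print("equivariant:", shift_then_conv == conv_then_shift)</code></pre>

Shifting the input and then applying the kernel gives the same map as applying the kernel and then shifting the output, which is what lets a single set of weights respond to a feature wherever it appears.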

Parameter Count Comparison

Let's compare the parameter count of a convolutional layer versus a fully connected layer in PyTorch. The code below demonstrates the difference in parameters when processing a simulated image batch.

<pre><code class="language-python">import torch
import torch.nn as nn

# Simulated input: batch of 1 image, 3 channels, 256x256 size
x = torch.randn(1, 3, 256, 256)

# 1. Fully connected layer (MLP approach)
x_flat = x.view(1, -1)  # Flatten to 1D: [1, 196608]
fc = nn.Linear(3 * 256 * 256, 64)
out_fc = fc(x_flat)

# 2. Convolutional layer (CNN approach)
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
out_conv = conv(x)

print("FC parameters:", sum(p.numel() for p in fc.parameters()))
print("Conv parameters:", sum(p.numel() for p in conv.parameters()))</code></pre>

The output shows that the fully connected layer requires over 12 million parameters (12,582,976 including its 64 biases), while the convolutional layer requires only 1,792 parameters (calculated as \\(3 \\text{ input channels} \\times 64 \\text{ output channels} \\times 3 \\times 3 \\text{ kernel size} + 64 \\text{ biases}\\)). This comparison highlights the efficiency of convolutions for processing high-dimensional image data.
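One caveat is that the two layers do not produce outputs of the same size: the dense layer emits 64 values, while the convolution emits 64 feature maps of 256x256 values each. The sketch below redoes the counts by arithmetic alone and adds a hypothetical dense layer sized to match the convolution's output; that third layer is an assumption introduced only for comparison and is far too large to actually instantiate.

<pre><code class="language-python"># Parameter counts by arithmetic only (no tensors are allocated).
in_features = 3 * 256 * 256            # flattened RGB input: 196,608 values

# Layers from the comparison above.
fc_params = in_features * 64 + 64      # nn.Linear(196608, 64)
conv_params = 3 * 64 * 3 * 3 + 64      # nn.Conv2d(3, 64, kernel_size=3)

# Hypothetical dense layer with as many outputs as the conv layer produces
# (64 channels x 256 x 256 positions).
out_features = 64 * 256 * 256
matched_fc_params = in_features * out_features + out_features

print(f"FC (64 outputs):        {fc_params:,}")
print(f"Conv (64 feature maps): {conv_params:,}")
print(f"FC (matched outputs):   {matched_fc_params:,}")</code></pre>

On a like-for-like basis the gap is therefore much larger than the 12 million versus 1,792 figures suggest, since the convolution reuses the same 1,792 parameters at every spatial position.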