Multilayer Perceptron (MLP) Architecture

A Multilayer Perceptron (MLP) is a feedforward artificial neural network consisting of an input layer, one or more hidden layers, and an output layer. Stacking layers and applying non-linear activations is what lets an MLP approximate complex, non-linear functions; without the non-linearities, any stack of linear layers would collapse into a single linear map.


Structural Components of an MLP

An MLP is organized as a feedforward graph of layers, where signals flow in one direction from input to output.

Input, Hidden, and Output Layers

The input layer receives the raw features and does not perform any computation. The hidden layers apply linear transformations followed by non-linear activations, extracting increasingly abstract representations. The output layer produces the final prediction, with its activation chosen to suit the task (e.g., softmax for multi-class classification, sigmoid for binary classification, identity for regression).

Each layer contains a set of units (neurons). The design of the network involves deciding the number of hidden layers and the number of units per layer, which are key hyperparameters determining the model's capacity and risk of overfitting.

Fully Connected Layers

MLPs are typically fully connected (or dense), meaning every neuron in layer $l$ is connected to every neuron in layer $l+1$. Each unit therefore sees every output of the previous layer, so the network can combine all input features when forming its representations.

The downside of full connectivity is the large number of parameters. For an input of size $N$ and a hidden layer of size $M$, the linear layer requires $N \times M$ weights and $M$ biases. This can lead to high computational costs and memory requirements for high-dimensional inputs like images.
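To make the cost concrete, the following sketch counts the weights and biases of the linear layers in an MLP. The helper name `count_mlp_parameters` and the example sizes (a 224 × 224 RGB image flattened into a vector and fed to 1000 hidden units) are illustrative assumptions, not taken from a specific model.

```python
def count_mlp_parameters(layer_sizes):
    """Count weights and biases of the linear layers only.

    layer_sizes is [input_dim, hidden_1, ..., output_dim].
    Parameters of other layers (e.g. batch normalization) are not included.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out  # N x M weights
        total += n_out         # M biases
    return total


# A small MLP: 10 -> 64 -> 32 -> 5
small = count_mlp_parameters([10, 64, 32, 5])

# A single dense layer on a flattened 224 x 224 RGB image (assumed sizes)
image = count_mlp_parameters([224 * 224 * 3, 1000])

print(small)  # 2949
print(image)  # 150529000
```

Under these assumptions, a single dense layer on the image input already needs roughly 150 million parameters, which illustrates why full connectivity scales poorly with input dimension.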

Mathematical Representation of Layers

The forward pass of an MLP can be represented compactly using matrix algebra and vector compositions.

Vectorized Layer Operations

For a layer $l$ with $n_l$ units, weight matrix $\mathbf{W}^{[l]} \in \mathbb{R}^{n_l \times n_{l-1}}$, and bias vector $\mathbf{b}^{[l]} \in \mathbb{R}^{n_l}$, the activation vector $\mathbf{a}^{[l]}$ is computed from the previous layer's activation $\mathbf{a}^{[l-1]}$ as:

$$\mathbf{z}^{[l]} = \mathbf{W}^{[l]} \mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}$$

$$\mathbf{a}^{[l]} = g^{[l]}(\mathbf{z}^{[l]})$$

where $\mathbf{z}^{[l]}$ is the pre-activation vector and $g^{[l]}$ is the element-wise activation function. For the first layer, $\mathbf{a}^{[0]} = \mathbf{x}$, the input vector.
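The two equations above can be sketched directly with tensor operations. This is a minimal illustration, assuming ReLU as the activation $g^{[l]}$ and arbitrary sizes; the names `n_prev`, `n_curr`, `A_prev`, and so on are hypothetical. The batched form stores one sample per row, which is why the weight matrix appears transposed there.

```python
import torch

torch.manual_seed(0)

n_prev, n_curr = 3, 4             # units in layer l-1 and layer l (assumed sizes)
W = torch.randn(n_curr, n_prev)   # W^{[l]}, shape (n_curr, n_prev)
b = torch.randn(n_curr)           # b^{[l]}, shape (n_curr,)

# Single sample: z = W a_prev + b, a = g(z)
a_prev = torch.randn(n_prev)
z = W @ a_prev + b
a = torch.relu(z)

# Batch of samples stored as rows: Z = A_prev W^T + b (b is broadcast over rows)
A_prev = torch.randn(5, n_prev)
Z = A_prev @ W.T + b
A = torch.relu(Z)

print(a.shape, A.shape)
```

PyTorch's `nn.Linear` stores its weight with shape `(out_features, in_features)`, which matches the $\mathbf{W}^{[l]}$ convention used here.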

Function Composition

Mathematically, an MLP represents a composition of functions. For a network with $L$ layers, the mapping from input $\mathbf{x}$ to output $\hat{\mathbf{y}}$ is represented as:

$$\hat{\mathbf{y}} = f(\mathbf{x}; \mathbf{\Theta}) = (g^{[L]} \circ f^{[L]} \circ \dots \circ g^{[1]} \circ f^{[1]})(\mathbf{x})$$

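where $f^{[l]}(\mathbf{a}) = \mathbf{W}^{[l]} \mathbf{a} + \mathbf{b}^{[l]}$ is the affine transformation of layer $l$, $g^{[l]}$ is its activation function, and $\mathbf{\Theta}$ is the collection of all weights and biases. This compositional structure lets the network build hierarchical features, with each layer operating on the representations produced by the previous one.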
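The composition can be made explicit in a few lines of plain Python. This is only a sketch: the helpers `affine`, `relu`, `identity`, and `compose` are hypothetical names, the weights are hand-picked for easy arithmetic, and ReLU with an identity output activation is an assumed choice.

```python
from functools import reduce


def affine(W, b):
    """Return the map a -> W a + b, with W given as a list of rows."""
    def f(a):
        return [sum(w * x for w, x in zip(row, a)) + b_i
                for row, b_i in zip(W, b)]
    return f


def relu(z):
    return [max(0.0, z_i) for z_i in z]


def identity(z):
    return list(z)


def compose(*funcs):
    """compose(f1, g1, f2, g2)(x) applies f1 first, then g1, f2, g2."""
    return lambda x: reduce(lambda acc, fn: fn(acc), funcs, x)


# Two-layer network: 2 inputs -> 2 hidden units -> 1 output
f1 = affine([[1.0, -1.0], [0.5, 2.0]], [0.0, -1.0])
f2 = affine([[1.0, 1.0]], [0.5])

mlp = compose(f1, relu, f2, identity)
y_hat = mlp([2.0, 1.0])
print(y_hat)  # [3.5]
```

Here `f1` maps the input to `[1.0, 2.0]`, ReLU leaves both values unchanged, and `f2` sums them and adds the bias, giving `3.5`.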

PyTorch Implementation

We can construct an MLP in PyTorch using the nn.Sequential container or by defining explicit layers in a custom class.
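A minimal sketch of the `nn.Sequential` route, assuming a 10 → 64 → 32 → 5 layout with ReLU activations and leaving out batch normalization and dropout for brevity:

```python
import torch
import torch.nn as nn

# Layers are applied in the order they are listed
model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 5),   # output layer returns raw logits
)

x = torch.randn(4, 10)  # batch of 4 samples, 10 features each
logits = model(x)       # shape: [4, 5]
print(logits.shape)
```

`nn.Sequential` is convenient when the forward pass is a plain chain of layers. A custom `nn.Module` subclass is the better fit once the forward pass needs branching, skip connections, or other custom logic.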

Coding a Multilayer MLP

Here is a complete PyTorch implementation of an MLP with a configurable list of hidden layers (two in the test at the end), each followed by batch normalization, a ReLU activation, and dropout:

<pre><code class="language-python">import torch
import torch.nn as nn


class DeepMLP(nn.Module):
    def __init__(self, input_dim, hidden_dims, output_dim):
        super().__init__()
        # Stack layers using nn.ModuleList
        self.layers = nn.ModuleList()
        prev_dim = input_dim
        for h_dim in hidden_dims:
            self.layers.append(nn.Linear(prev_dim, h_dim))
            self.layers.append(nn.BatchNorm1d(h_dim))
            self.layers.append(nn.ReLU())
            self.layers.append(nn.Dropout(0.2))
            prev_dim = h_dim
        self.output_layer = nn.Linear(prev_dim, output_dim)

    def forward(self, x):
        # x shape: [batch_size, input_dim]
        for layer in self.layers:
            x = layer(x)
        logits = self.output_layer(x)  # [batch_size, output_dim]
        return logits


# Test the MLP
model = DeepMLP(input_dim=10, hidden_dims=[64, 32], output_dim=5)
x = torch.randn(4, 10)  # batch of 4 samples
print(model(x).shape)  # torch.Size([4, 5])
</code></pre>

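In this code, we use nn.ModuleList to stack layers dynamically. Unlike a plain Python list, nn.ModuleList registers each layer with the parent module, so its parameters are visible to the optimizer. Batch normalization helps stabilize training, and dropout helps reduce overfitting as the network gets deeper.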
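Both of these layers behave differently in training and evaluation mode in PyTorch: dropout is active only in training mode, and batch normalization switches from batch statistics to running averages in evaluation mode. A small sketch of the effect, assuming a single hidden block with the same layer types as above (the variable names are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

block = nn.Sequential(
    nn.Linear(10, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Dropout(0.2),
)
x = torch.randn(4, 10)

# Training mode: dropout samples a new random mask on every call
block.train()
train_a, train_b = block(x), block(x)

# Evaluation mode: dropout is disabled and batch norm uses running statistics
block.eval()
with torch.no_grad():
    eval_a, eval_b = block(x), block(x)

print(torch.equal(train_a, train_b))  # False
print(torch.equal(eval_a, eval_b))    # True
```

For this reason, a model containing these layers is normally switched to evaluation mode with `model.eval()` before inference and back with `model.train()` before further training.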

Dimension Matching and Tensor Flows

When designing an MLP, the output size of layer $l$ must match the input size of layer $l+1$. In our example, the first linear layer maps $10$ input features to $64$ outputs, so the next layer must accept $64$ inputs.

Mismatches in layer dimensions are a common source of runtime errors in PyTorch. Annotating shapes in comments (e.g. # [batch_size, output_dim]) does not guarantee correctness, but it makes mismatches easier to spot and improves the readability of complex neural network pipelines.
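Shapes can also be checked at runtime rather than only documented. The sketch below assumes a simple `nn.Sequential` model and a hypothetical helper `trace_shapes` that records the tensor shape after every layer. It then builds a deliberately broken model to show the kind of error a mismatch produces.

```python
import torch
import torch.nn as nn


def trace_shapes(model, x):
    """Run x through each layer of a sequential model and record the shapes."""
    shapes = [tuple(x.shape)]
    for layer in model:
        x = layer(x)
        shapes.append(tuple(x.shape))
    return shapes


model = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 5),
)
shapes = trace_shapes(model, torch.randn(4, 10))
print(shapes)
# [(4, 10), (4, 64), (4, 64), (4, 32), (4, 32), (4, 5)]

# Deliberate mismatch: the last layer expects 32 inputs but receives 64
broken = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(32, 5))
try:
    broken(torch.randn(4, 10))
    mismatch_error = None
except RuntimeError as err:
    mismatch_error = str(err)

print(mismatch_error is not None)  # True
```

Running a single small batch through a new model in this way is a cheap check before starting a full training run.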