Forward Propagation Mechanics

Forward propagation is the process of passing input data through a neural network's layers in sequence to produce a prediction. Each layer computes an affine transformation (a weight matrix multiplication plus a bias) followed by a non-linear activation function, mapping its input features into latent representations.

Graph-Level Execution

Forward propagation can be represented as a directed acyclic computational graph, in which data flows from the input nodes to the output nodes.

Data Flow and Computation Stacking

In a feedforward neural network, execution is organized as a sequence of layer computations. The input vector $\mathbf{a}^{[0]} = \mathbf{x}$ is passed to the first hidden layer, which computes a pre-activation vector $\mathbf{z}^{[1]} = \mathbf{W}^{[1]} \mathbf{a}^{[0]} + \mathbf{b}^{[1]}$ and applies the activation function $\mathbf{a}^{[1]} = g^{[1]}(\mathbf{z}^{[1]})$.

This output is fed to the next layer in the sequence. For a network with $L$ layers, this process repeats until the final output layer computes the prediction $\mathbf{a}^{[L]} = \hat{\mathbf{y}}$. The intermediate activations $\mathbf{a}^{[l]}$ represent feature extraction states that are stored in memory for use during the backward pass.
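The layer-by-layer recurrence above can be sketched as a short loop. This is a minimal illustration rather than a reference implementation: the layer sizes, the random initialization, and the choice of ReLU for hidden layers with an identity output layer are all assumptions made for the example.

```python
import torch

torch.manual_seed(0)

# Hypothetical layer sizes: 4 inputs -> 8 hidden -> 8 hidden -> 3 outputs
layer_sizes = [4, 8, 8, 3]
L = len(layer_sizes) - 1  # number of layers that have parameters

# W^[l] has shape [D_out, D_in]; b^[l] has shape [D_out]
weights = [torch.randn(layer_sizes[l + 1], layer_sizes[l]) * 0.1 for l in range(L)]
biases = [torch.zeros(layer_sizes[l + 1]) for l in range(L)]


def forward(x):
    """Return the prediction and the list of activations a^[0], ..., a^[L]."""
    activations = [x]  # a^[0] = x
    a = x
    for l in range(L):
        z = weights[l] @ a + biases[l]  # pre-activation of the next layer
        # Assumption: ReLU on hidden layers, identity on the output layer
        a = torch.relu(z) if l < L - 1 else z
        activations.append(a)  # kept so a backward pass could reuse it
    return a, activations


x = torch.randn(4)
y_hat, cache = forward(x)
print(y_hat.shape)                      # torch.Size([3])
print([tuple(a.shape) for a in cache])  # [(4,), (8,), (8,), (3,)]
```

The returned list plays the role of the stored intermediate activations: it holds one entry per layer plus the input.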

Hidden Representation Space

As features propagate forward, each layer shifts, rotates, and distorts the input space. This transforms the input manifold into a representation space where the classes become increasingly separable. Early layers extract simple features (e.g. edges in images), while deeper layers combine these to represent complex patterns.

This hierarchical representation learning is a large part of what makes deep networks powerful. By composing simple operations, the model can represent non-linear decision surfaces that a single linear layer, which can only draw a linear boundary, cannot express.
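A classic illustration of this point is XOR, which no single linear layer can separate but which one hidden ReLU layer can compute. The weights below are hand-picked for the example (they are not learned, and they are only one of many possible choices).

```python
import torch

# The four XOR inputs, one per row
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])

# Hand-picked (hypothetical) parameters for a 2 -> 2 -> 1 network
W1 = torch.tensor([[1., 1.], [1., 1.]])  # [hidden, in]
b1 = torch.tensor([0., -1.])
w2 = torch.tensor([1., -2.])             # output weights, no output bias

hidden = torch.relu(X @ W1.T + b1)  # hidden representation, shape [4, 2]
y = hidden @ w2                     # network output, shape [4]

print(hidden)
print(y)  # tensor([0., 1., 1., 0.])
```

In the hidden representation the two inputs labelled 1 collapse onto the same point, so a single linear readout is enough to produce the XOR pattern.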

Matrix Operations and Parallelism

Neural networks process samples in parallel batches, converting vector equations into optimized matrix multiplications.

Vectorization and Batching

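Rather than executing the forward pass for one sample at a time, we stack $B$ samples, each with $N$ input features, as the rows of a matrix $\mathbf{X} \in \mathbb{R}^{B \times N}$, so that $\mathbf{A}^{[0]} = \mathbf{X}$. Because samples are now rows rather than column vectors, the weight matrix appears transposed, and the forward equation for layer $l$ becomes: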

$$\mathbf{Z}^{[l]} = \mathbf{A}^{[l-1]} (\mathbf{W}^{[l]})^T + \mathbf{b}^{[l]}$$

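where $\mathbf{A}^{[l-1]}$ has shape $[B, D_{in}]$, $\mathbf{W}^{[l]}$ has shape $[D_{out}, D_{in}]$ so that $(\mathbf{W}^{[l]})^T$ has shape $[D_{in}, D_{out}]$, and the result $\mathbf{Z}^{[l]}$ has shape $[B, D_{out}]$. The bias vector $\mathbf{b}^{[l]} \in \mathbb{R}^{D_{out}}$ is broadcast across the batch dimension, so the same bias is added to every row.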
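To check the batched equation against the per-sample one, the sketch below computes a single layer's pre-activations both ways and compares them. The sizes and random values are arbitrary choices for the example.

```python
import torch

torch.manual_seed(0)
B, D_in, D_out = 4, 5, 3  # hypothetical batch and layer sizes

A_prev = torch.randn(B, D_in)  # A^[l-1], one sample per row
W = torch.randn(D_out, D_in)   # W^[l]
b = torch.randn(D_out)         # b^[l]

# Batched form: b is broadcast across the B rows
Z = A_prev @ W.T + b           # shape [B, D_out]

# Per-sample form: z = W a + b for each row, then stacked
Z_loop = torch.stack([W @ A_prev[i] + b for i in range(B)])

print(Z.shape)
print(torch.allclose(Z, Z_loop, atol=1e-5))
```

Both forms give the same numbers up to floating-point rounding; the batched form simply expresses the work as one matrix multiplication.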

Hardware Acceleration

Vectorizing the forward pass allows the computation to leverage hardware accelerators like GPUs and TPUs. These devices contain thousands of arithmetic units designed to execute matrix multiplications in parallel, significantly reducing training times compared to CPUs.

On GPUs, matrix multiplications are dispatched to specialized kernel libraries such as cuBLAS. These libraries tile the computation to make good use of on-chip memory and registers, which maximizes throughput. Larger batches generally keep more of the hardware busy, making batch size a key hyperparameter for training efficiency.
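In PyTorch, the same batched computation can run on an accelerator simply by placing the tensors on that device. The sketch below falls back to the CPU when no CUDA device is present; the sizes are arbitrary, and no timing claims are made here.

```python
import torch

# Use a GPU if one is visible to PyTorch, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

B, D_in, D_out = 256, 512, 128  # hypothetical sizes
A_prev = torch.randn(B, D_in, device=device)
W = torch.randn(D_out, D_in, device=device)
b = torch.zeros(D_out, device=device)

# The expression is identical on either device; only the backend differs
Z = A_prev @ W.T + b
print(Z.device.type, tuple(Z.shape))
```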

PyTorch Execution Model

We can trace the tensor dimensions during forward propagation in PyTorch by inspecting intermediate activations.

Coding a Custom Forward Pass

This PyTorch model shows how data flows through sequential layers, with comments tracing the shape of tensors at each step:

<pre><code class="language-python">import torch
import torch.nn as nn


class FeedForwardNet(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        # Input x shape: [batch_size, input_dim]
        z1 = self.fc1(x)       # [batch_size, hidden_dim]
        a1 = self.relu(z1)     # [batch_size, hidden_dim]
        logits = self.fc2(a1)  # [batch_size, output_dim]
        return logits


# Initialize model and process a batch of 4 samples
model = FeedForwardNet(input_dim=5, hidden_dim=10, output_dim=3)
x = torch.randn(4, 5)
out = model(x)
print("Forward output shape:", out.shape)  # torch.Size([4, 3])</code></pre>

In this code, the forward method explicitly defines the computation steps. The input tensor is mapped to a hidden dimension of 10, then to an output dimension of 3, keeping the batch dimension constant throughout.
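Intermediate shapes can also be recorded at run time instead of being written as comments. One possible approach uses forward hooks, as sketched below. To keep the snippet self-contained it uses an `nn.Sequential` stand-in with the same 5, 10, 3 dimensions rather than the class above; the `shapes` dictionary and `make_hook` helper are names invented for this example.

```python
import torch
import torch.nn as nn

# Stand-in model with the same layer sizes as the example above
model = nn.Sequential(nn.Linear(5, 10), nn.ReLU(), nn.Linear(10, 3))

shapes = {}  # layer name -> output shape, filled in by the hooks


def make_hook(name):
    def hook(module, inputs, output):
        shapes[name] = tuple(output.shape)
    return hook


# Register one hook per child module; keep the handles for cleanup
handles = [m.register_forward_hook(make_hook(n)) for n, m in model.named_children()]

out = model(torch.randn(4, 5))

for h in handles:
    h.remove()  # detach the hooks once the shapes are recorded

print(shapes)  # {'0': (4, 10), '1': (4, 10), '2': (4, 3)}
```

The batch dimension of 4 appears unchanged in every recorded shape, while the feature dimension follows the layer widths.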

The Backward Cache Requirement

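During the forward pass in training, PyTorch's autograd engine saves the intermediate tensors (such as $x$ and $a_1$, the inputs to the two linear layers) that are needed to compute derivatives during backpropagation. These saved tensors stay in memory until the backward pass has run, so the training memory footprint grows with both network depth and batch size.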

When evaluating the model for inference, we can disable this caching using the torch.no_grad() context manager. This reduces memory usage and speeds up computation, as no backward graph is constructed.
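The difference is visible on the output tensor itself. The sketch below runs the same input with and without gradient tracking, again using an `nn.Sequential` stand-in so that it runs on its own; the call to `model.eval()` is an extra, commonly paired step and has no effect on this particular model.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(5, 10), nn.ReLU(), nn.Linear(10, 3))
x = torch.randn(4, 5)

# Default mode: the output is attached to a backward graph
out_train = model(x)
print(out_train.requires_grad, out_train.grad_fn is not None)  # True True

# Inference mode: no graph is recorded inside the context manager
model.eval()
with torch.no_grad():
    out_eval = model(x)
print(out_eval.requires_grad, out_eval.grad_fn is None)  # False True
```

The numerical outputs are the same in both cases; only the bookkeeping for the backward pass is skipped.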