Linear Transformations in Neural Networks
Linear transformations, executed as matrix multiplications, are the core mathematical operations in hidden layers. They project input features into new geometric coordinate systems, preparing them for non-linear activations.
Geometry of Linear Transformations
A linear transformation maps vectors from one vector space to another, preserving vector addition and scalar multiplication.
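The two preservation properties can be written out explicitly. The symbols $T$, $\mathbf{u}$, $\mathbf{v}$, and $c$ are introduced here only for illustration: $T$ is any linear transformation, $\mathbf{u}$ and $\mathbf{v}$ are vectors, and $c$ is a scalar.
$$T(\mathbf{u} + \mathbf{v}) = T(\mathbf{u}) + T(\mathbf{v}), \qquad T(c\,\mathbf{u}) = c\,T(\mathbf{u})$$
Matrix multiplication satisfies both properties, since $\mathbf{W}(\mathbf{u} + \mathbf{v}) = \mathbf{W}\mathbf{u} + \mathbf{W}\mathbf{v}$ and $\mathbf{W}(c\,\mathbf{u}) = c\,\mathbf{W}\mathbf{u}$.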
Rotation, Scaling, and Shearing
Mathematically, multiplying an input vector $\mathbf{x}$ by a weight matrix $\mathbf{W}$ performs a linear transformation $\mathbf{y} = \mathbf{W}\mathbf{x}$. Geometrically, this transformation can rotate, scale, or shear the input space. For example, scaling stretches or shrinks the space along an axis, rotation changes the coordinate directions, and shearing slants the space by adding a multiple of one coordinate to another.
This lets the network re-express its inputs along directions that are more useful for the task, loosely analogous to how Principal Component Analysis (PCA) re-expresses data along its directions of greatest variance. Unlike PCA, the network does not choose these directions to maximize variance: the weights are the transformation coefficients that training adjusts to reduce the loss.
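The rotation, scaling, and shearing cases described above can be sketched with small hand-written $2 \times 2$ matrices. The specific matrices, the 90-degree angle, and the variable names below are illustrative assumptions, not weights a trained network would necessarily learn.

```python
import math
import torch

theta = math.pi / 2  # illustrative angle: 90 degrees counter-clockwise

# Rotation: turns every vector by theta without changing its length.
rotation = torch.tensor([[math.cos(theta), -math.sin(theta)],
                         [math.sin(theta),  math.cos(theta)]])

# Scaling: stretches the first axis by 2 and shrinks the second by half.
scaling = torch.tensor([[2.0, 0.0],
                        [0.0, 0.5]])

# Shear: adds the second coordinate to the first.
shear = torch.tensor([[1.0, 1.0],
                      [0.0, 1.0]])

x = torch.tensor([1.0, 1.0])

rotated = rotation @ x  # approximately [-1.0, 1.0]
scaled = scaling @ x    # [2.0, 0.5]
sheared = shear @ x     # [2.0, 1.0]

print(rotated, scaled, sheared)
```

A general weight matrix combines these effects, so a learned $\mathbf{W}$ rarely corresponds to a single pure rotation, scaling, or shear.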
Dimensionality Alteration
The dimensions of the weight matrix determine whether the transformation projects the inputs to a higher-dimensional space or compresses them to a lower-dimensional space. An input vector $\mathbf{x} \in \mathbb{R}^N$ multiplied by $\mathbf{W} \in \mathbb{R}^{M \times N}$ results in a vector $\mathbf{y} \in \mathbb{R}^M$.
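Expanding dimensions ($M > N$) embeds the data in a higher-dimensional space. The linear map alone still confines the outputs to a subspace of dimension at most $N$, but combined with a non-linear activation the extra dimensions let the network separate patterns that are not linearly separable in the input space. Compressing dimensions ($M < N$) forces the network to learn a bottleneck representation: some information must be discarded, so training pushes the layer to keep what matters most for the task.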
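As a minimal sketch of how the shape of $\mathbf{W}$ sets the output dimension, the snippet below applies one expanding and one compressing matrix to the same input. The sizes (4, 8, 2) and the random weights are arbitrary choices made for illustration.

```python
import torch

torch.manual_seed(0)

N = 4
x = torch.randn(N)              # input vector in R^N

W_expand = torch.randn(8, N)    # M = 8 > N: expands the dimension
W_compress = torch.randn(2, N)  # M = 2 < N: compresses the dimension

y_expand = W_expand @ x
y_compress = W_compress @ x

print(y_expand.shape)    # torch.Size([8])
print(y_compress.shape)  # torch.Size([2])

# The expanding map has rank at most N, so its outputs still lie in a
# subspace of R^8 whose dimension is at most N.
rank_expand = torch.linalg.matrix_rank(W_expand).item()
print(rank_expand)
```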
Affine Transformations and Biases
Adding a bias vector translates the output of a linear transformation, turning the linear mapping into an affine one.
Translation and Affine Mappings
A pure linear transformation must map the zero vector to the zero vector ($\mathbf{W}\mathbf{0} = \mathbf{0}$), meaning the decision boundaries are constrained to pass through the origin. Adding a bias vector $\mathbf{b}$ performs a translation, shifting the transformed space away from the origin. This combined operation is called an affine transformation:
$$\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$$
Translation is critical because real-world data points do not necessarily cluster around the origin. The bias acts as a learnable offset, or threshold, that allows the model to position its decision boundaries anywhere in the feature space.
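A one-dimensional worked example makes the role of the bias concrete. The values $w = 2$ and $b = -6$ are chosen purely for illustration. The decision boundary is where the output is zero:
$$z = wx + b = 2x - 6 = 0 \quad\Longrightarrow\quad x = -\frac{b}{w} = 3$$
Without the bias, $z = 2x = 0$ only at $x = 0$, so the boundary would be pinned to the origin no matter which non-zero weight is learned.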
Matrix Notation and Batch Processing
In practice, neural networks process batches of samples simultaneously. For a batch of $B$ samples, the inputs are represented as a matrix $\mathbf{X} \in \mathbb{R}^{B \times N}$. The affine transformation is written as:
$$\mathbf{Z} = \mathbf{X}\mathbf{W}^T + \mathbf{b}$$
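where $\mathbf{W} \in \mathbb{R}^{M \times N}$ is the weight matrix and $\mathbf{b} \in \mathbb{R}^M$ is the bias vector. Each row of $\mathbf{X}$ is one sample, which is why $\mathbf{W}$ appears transposed, and the result is $\mathbf{Z} \in \mathbb{R}^{B \times M}$. PyTorch automatically broadcasts the bias vector $\mathbf{b}$ across the batch dimension during addition, adding it to every row.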
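The batched formula can be checked against the per-sample form $\mathbf{z} = \mathbf{W}\mathbf{x} + \mathbf{b}$. The following is a small sketch with arbitrary sizes and random tensors, all chosen for illustration.

```python
import torch

torch.manual_seed(0)

B, N, M = 3, 4, 2
X = torch.randn(B, N)  # batch of B samples, one per row
W = torch.randn(M, N)  # weight matrix
b = torch.randn(M)     # bias vector

# (B, N) @ (N, M) -> (B, M); b of shape (M,) is broadcast over the B rows.
Z = X @ W.T + b

# Apply the single-sample formula to each row and stack the results.
Z_rows = torch.stack([W @ X[i] + b for i in range(B)])

print(Z.shape)  # torch.Size([3, 2])
matches = torch.allclose(Z, Z_rows, atol=1e-6)
print(matches)
```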
PyTorch Linear Layer
Let's inspect how PyTorch implements linear transformations using its built-in linear modules.
Using nn.Linear
PyTorch provides the nn.Linear module to perform affine transformations. It manages the initialization of weights and biases automatically, storing them as internal parameters:
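A minimal sketch of creating a layer and inspecting its parameters follows. The sizes (4 input features, 2 output features, a batch of 3) are arbitrary values chosen for illustration.

```python
import torch
from torch import nn

# in_features corresponds to N and out_features to M in the notation above.
layer = nn.Linear(in_features=4, out_features=2)

print(layer.weight.shape)  # torch.Size([2, 4])
print(layer.bias.shape)    # torch.Size([2])

# Forward pass on a batch of 3 samples.
x = torch.randn(3, 4)
z = layer(x)
print(z.shape)             # torch.Size([3, 2])
```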
Printing the shapes of the weight matrix and bias vector shows how the parameters are laid out. Notice that the weight matrix is stored with shape (out_features, in_features), that is $M \times N$, so for a batch of row vectors the forward computation is x @ weight.T + bias.
Mathematical Verification of PyTorch Linear Pass
To verify the linear operation, we can execute the calculation manually using tensor operations. By extracting the weights and bias from nn.Linear, performing matrix multiplication, and adding the bias, we can verify that the outputs match PyTorch's internal calculations.
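One way to carry out this check is sketched below. The layer sizes, the random input, and the comparison tolerance are illustrative assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)

layer = nn.Linear(in_features=4, out_features=2)
x = torch.randn(3, 4)

# Gradients are not needed for a numerical comparison.
with torch.no_grad():
    z_module = layer(x)                         # PyTorch's forward pass
    z_manual = x @ layer.weight.T + layer.bias  # same computation by hand

# The two results should agree up to floating-point tolerance.
same = torch.allclose(z_module, z_manual, atol=1e-6)
print(same)
```

Comparing with torch.allclose rather than exact equality allows for small rounding differences between the two code paths.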
This verification confirms that nn.Linear computes a standard matrix multiplication followed by a bias addition. Under the hood, PyTorch typically dispatches these operations to optimized libraries such as Intel MKL on CPUs or NVIDIA cuBLAS on GPUs, which run them in parallel.