PyTorch Basics: Tensors and Gradients

Tensors are the fundamental multi-dimensional array structures in PyTorch, acting similarly to NumPy arrays but with GPU acceleration and automatic gradient tracking.

Tensor Operations and Hardware Acceleration

Tensors form the mathematical foundation of deep learning, supporting quick matrix arithmetic and hardware acceleration.

Creating and Manipulating Tensors

Tensors are multi-dimensional arrays whose elements all share one data type (such as float32 or int64). PyTorch offers several ways to create them: `torch.tensor` builds a tensor from Python lists, `torch.from_numpy` wraps a NumPy array and shares its memory buffer, and factory functions such as `torch.zeros`, `torch.ones`, and `torch.randn` produce constant-filled or randomly initialized tensors. Methods like `view`, `reshape`, `permute`, and `squeeze` rearrange tensor dimensions to match model input requirements.
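The creation and reshaping calls named above can be sketched as follows. This is a minimal illustration that assumes NumPy is installed alongside PyTorch; the variable names are chosen for this example only.

```python
import numpy as np
import torch

# From a Python list; Python floats are stored as float32 by default
a = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])  # shape [2, 3]

# From NumPy: the tensor and the array share one memory buffer
arr = np.zeros(3, dtype=np.float32)
shared = torch.from_numpy(arr)
arr[0] = 7.0  # the change is visible through the tensor as well

# Factory functions
zeros = torch.zeros(2, 3)
ones = torch.ones(2, 3, dtype=torch.int64)
noise = torch.randn(2, 3)  # samples from a standard normal distribution

# Shape manipulation
flat = a.view(6)                           # shape [6], shares storage with a
col = a.reshape(3, 2)                      # shape [3, 2]
swapped = a.permute(1, 0)                  # shape [3, 2], axes reordered
squeezed = torch.zeros(1, 3, 1).squeeze()  # shape [3], size-1 axes removed

print(shared)         # first element reflects the NumPy write
print(swapped.shape)
```

Note that `reshape(3, 2)` and `permute(1, 0)` give the same shape here but different element orders: the first refills the rows in storage order, the second swaps the axes.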

Understanding memory layout is crucial when reshaping tensors. The `view` method returns a tensor that shares the same underlying data buffer, so the requested shape must be compatible with the tensor's existing strides; in practice this usually means the tensor must be contiguous in memory. If an operation such as a transpose has produced a non-contiguous tensor, call `contiguous()` (which copies the data into a contiguous buffer) before calling `view`. Alternatively, `reshape` returns a view when possible and copies the data only when necessary.
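The difference between `view`, `contiguous()`, and `reshape` is easiest to see on a transposed tensor. The following is a small sketch with illustrative variable names:

```python
import torch

x = torch.arange(6).reshape(2, 3)
t = x.t()  # transpose: same storage, different strides

print(t.is_contiguous())  # False

# view cannot reinterpret the transposed layout as a flat tensor
view_failed = False
try:
    t.view(6)
except RuntimeError as err:
    view_failed = True
    print("view failed:", err)

flat_copy = t.contiguous().view(6)  # explicit copy, then view
flat_auto = t.reshape(6)            # reshape makes the copy itself here

# On a contiguous tensor, view shares storage with the original
v = x.view(6)
v[0] = 100
print(x[0, 0].item())  # 100
```

Because `flat_copy` and `flat_auto` are copies made before the write to `v`, they are unaffected by it.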

GPU Allocation with CUDA Tensors

One of PyTorch's primary advantages over standard array libraries is its integration with CUDA-enabled NVIDIA GPUs. Tensors can be allocated directly on the GPU or moved from CPU memory with `.cuda()` or `.to(device)`. Running computations on the GPU parallelizes matrix multiplications, which can speed up neural network training dramatically, depending on the model and hardware.

When working with GPU tensors, all tensors involved in an operation must generally reside on the same device. Adding a CPU tensor to a GPU tensor, for example, raises a `RuntimeError`. The usual way to manage tensor placement is to write device-agnostic code: check `torch.cuda.is_available()`, define a target device object once, and create or move every tensor and model to that device.
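A common device-agnostic pattern looks like the sketch below. It falls back to the CPU when no CUDA device is present, so it should run on either kind of machine; the tensor names are illustrative.

```python
import torch

# Pick the GPU when one is available, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

a = torch.randn(3, 3, device=device)  # allocated directly on the target device
b = torch.randn(3, 3).to(device)      # created on the CPU, then moved

c = a @ b  # valid because both operands live on the same device
print(c.device)
```

Defining `device` once and passing it everywhere keeps the rest of the code free of hardware-specific branches.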

Autograd: Automatic Differentiation Engine

PyTorch's autograd engine automates the computation of backward passes by building a dynamic execution graph.

Computation Graphs and requires_grad

During the forward pass of a model, PyTorch records the executed operations in a directed acyclic graph (DAG). The leaves of this graph are the input tensors, the root is the output tensor, and the nodes in between are the functions that produced each intermediate result. If a tensor has `requires_grad=True`, the autograd engine tracks all operations performed on it, which makes it possible to compute gradients with respect to that tensor during backpropagation.

The DAG is built dynamically: every time an operation is executed, a corresponding backward function is added to the graph to compute its derivatives. When `.backward()` is called on the root (usually the loss tensor), PyTorch traverses the graph in reverse, computes gradients via the chain rule, and stores them in the `.grad` attribute of each leaf tensor that requires gradients.
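The graph can be inspected through the `is_leaf` and `grad_fn` attributes of each tensor. The short sketch below uses a toy computation (the sum of squares of `w`) chosen so the gradient is easy to check by hand:

```python
import torch

w = torch.tensor([2.0, 3.0], requires_grad=True)  # leaf tensor created by the user
y = w * w       # intermediate result, recorded in the graph
loss = y.sum()  # scalar root of the graph

print(w.is_leaf, w.grad_fn)  # True None
print(y.is_leaf)             # False
print(loss.grad_fn)          # the backward function for the sum

loss.backward()
print(w.grad)  # d/dw sum(w^2) = 2w, so tensor([4., 6.])
```

Only `w` receives a `.grad` value here; `y` is an intermediate tensor, so its gradient is used during the traversal but not retained by default.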

Disabling Gradient Tracking

During model evaluation and inference, computing gradients is unnecessary because weights are not being updated. To prevent PyTorch from building the computation graph and storing intermediate activations, wrap the forward pass in the `torch.no_grad()` context manager. This reduces memory consumption and speeds up inference.
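As a brief illustration, the sketch below runs the same forward pass with and without `torch.no_grad()`. The `torch.nn.Linear` layer is only a stand-in for a real model, and the input is random.

```python
import torch

model = torch.nn.Linear(4, 2)  # stand-in for a trained model
inputs = torch.randn(8, 4)

model.eval()  # switch layers such as dropout to evaluation behavior
with torch.no_grad():
    preds = model(inputs)

print(preds.requires_grad)  # False
print(preds.grad_fn)        # None: no graph was recorded

# Outside the context manager, tracking is active again
tracked = model(inputs)
print(tracked.requires_grad)  # True
```

Note that `model.eval()` and `torch.no_grad()` do different jobs: the first changes layer behavior, the second disables graph construction.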

Alternatively, calling `.detach()` on a tensor returns a new tensor that shares the same storage but has `requires_grad=False`, cutting it off from the gradient propagation chain. This is useful when passing intermediate model outputs to external packages, or when gradients should not flow back into earlier layers, as with a frozen feature extractor in transfer learning.
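The effect of `.detach()` on gradient flow can be seen in a small sketch. The computation is a toy one chosen so the result can be verified by hand, and the final conversion assumes NumPy is installed.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
hidden = w * 3

frozen = hidden.detach()  # same storage, no graph history
print(frozen.requires_grad)                    # False
print(frozen.data_ptr() == hidden.data_ptr())  # True

# Gradients flow only through the non-detached path
out = (hidden * 2).sum() + (frozen * 100).sum()
out.backward()
print(w.grad)  # tensor([6., 6.]): the detached term contributes nothing

# A detached CPU tensor can be handed to NumPy-based code
as_numpy = frozen.numpy()
```

Because `frozen` shares storage with `hidden`, an in-place modification of one is visible through the other.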

Practical Gradient Calculations

Let's write a complete PyTorch code block demonstrating tensor creation, forward pass execution, and backward gradient computation.

Gradient Calculation Example

We will define input features and learnable weights, perform a matrix multiplication, calculate a loss, and compute gradients using autograd.

<pre><code class="language-python">import torch

# Define inputs and target (data needs no gradients)
x = torch.tensor([[1.0, 2.0]], requires_grad=False)     # Shape: [1, 2]
y_target = torch.tensor([[5.0]], requires_grad=False)   # Shape: [1, 1]

# Define weight and bias with gradient tracking enabled
w = torch.tensor([[0.5], [1.5]], requires_grad=True)    # Shape: [2, 1]
b = torch.tensor([0.1], requires_grad=True)             # Shape: [1]

# Forward pass: y = xW + b
y_pred = torch.matmul(x, w) + b                         # Shape: [1, 1]

# Loss: Mean Squared Error, reduced to a scalar with .mean()
loss = ((y_pred - y_target) ** 2).mean()

# Backward pass: fills w.grad and b.grad
loss.backward()

print("Predicted y:", y_pred.item())   # about 3.6
print("Loss:", loss.item())            # about 1.96
print("Weight gradients dw:", w.grad)  # about [[-2.8], [-5.6]]
print("Bias gradient db:", b.grad)     # about [-2.8]</code></pre>

In this code, calling `.backward()` populates `w.grad` and `b.grad` with the partial derivatives of the loss with respect to `w` and `b`. If we run another forward and backward pass, PyTorch accumulates (adds) the new gradients into the existing values in `.grad`. To prevent this accumulation during training, we must reset the gradients explicitly, either by calling `.grad.zero_()` on each tensor or by calling the optimizer's `zero_grad()` method.
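The accumulation behavior is easy to observe on a one-element tensor. This sketch uses a toy function whose gradient is always 2, so any other value in `.grad` must come from accumulation:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

(w * 2).sum().backward()
first = w.grad.clone()   # tensor([2.])

(w * 2).sum().backward()
second = w.grad.clone()  # tensor([4.]): added to the previous value

w.grad.zero_()           # reset the gradient in place
(w * 2).sum().backward()
third = w.grad.clone()   # tensor([2.])

print(first, second, third)
```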

Managing Gradient Accumulation

Because PyTorch accumulates gradients in leaf tensors, failing to reset them before a new training iteration leads to incorrect weight updates. In a standard training loop, the optimizer's `zero_grad()` method or the model's `zero_grad()` method is called at the start of each step. This clears the `.grad` field of every parameter; recent PyTorch versions set it to `None` by default rather than filling it with zeros.
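A standard training step then follows the order reset, forward, backward, update. The sketch below fits a small linear model to three made-up data points; the data, learning rate, and step count are assumptions picked for illustration.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

# Toy data generated by y = 1*x1 + 2*x2
x = torch.tensor([[1.0, 2.0], [2.0, 1.0], [0.0, 1.0]])
y = torch.tensor([[5.0], [4.0], [2.0]])

losses = []
for step in range(50):
    optimizer.zero_grad()        # clear gradients left over from the last step
    loss = loss_fn(model(x), y)  # forward pass
    loss.backward()              # compute fresh gradients
    optimizer.step()             # update the parameters
    losses.append(loss.item())

print("first loss:", losses[0])
print("last loss:", losses[-1])
```

Calling `zero_grad()` at the top of the loop rather than the bottom guarantees a clean state even if earlier code left gradients behind.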

In some advanced training setups, gradient accumulation is used intentionally. If a model is too large to fit in GPU memory with a desired batch size, developers can use a smaller batch size and accumulate gradients over multiple mini-batches before calling `optimizer.step()`. With each mini-batch loss scaled by the number of accumulation steps, this mimics a larger batch size without the memory cost of processing the full batch at once.
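One way to sketch intentional accumulation is to compare it against a full-batch gradient. The example below assumes a mean-reduced loss and equally sized mini-batches, which is why each mini-batch loss is divided by `accum_steps`; the model, data, and sizes are made up for illustration.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()  # mean reduction

data = torch.randn(16, 4)
targets = torch.randn(16, 1)
accum_steps = 4  # 4 mini-batches of 4 samples stand in for one batch of 16

# Reference: gradient of the loss on the full batch
model.zero_grad()
loss_fn(model(data), targets).backward()
full_grad = model.weight.grad.clone()

# Accumulation: several small backward passes, no optimizer step in between
model.zero_grad()
for xb, yb in zip(data.split(4), targets.split(4)):
    loss = loss_fn(model(xb), yb) / accum_steps  # scale so the sum is a mean
    loss.backward()                              # adds into .grad
accum_grad = model.weight.grad.clone()

print(torch.allclose(full_grad, accum_grad, atol=1e-5))

optimizer.step()       # one update for the whole accumulated batch
optimizer.zero_grad()  # reset before the next accumulation cycle
```

The match holds for this simple model; layers whose behavior depends on batch statistics, such as batch normalization, would still see the smaller mini-batches.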