Introduction to PyTorch: Tensors and Autograd

PyTorch is a popular deep learning framework that provides two main features: a multi-dimensional tensor library with GPU acceleration, and a dynamic autograd engine for automatic differentiation.

Tensors as Multi-Dimensional Arrays

Tensors are the fundamental data structures in PyTorch, representing multi-dimensional arrays optimized for parallel arithmetic.

Tensor Creation and Attributes

A PyTorch tensor is a multi-dimensional array that resembles a NumPy array but can also run on hardware accelerators such as GPUs. Every tensor has three main attributes: dtype (the data type, such as float32 or int64), device (where the tensor's memory lives, for example the CPU or a CUDA GPU), and shape (the size of each dimension).

We can create tensors from existing Python lists or NumPy arrays using torch.tensor(), which copies the data, or initialize them directly using helper functions like torch.zeros(), torch.ones(), or torch.randn() for values drawn from a standard normal distribution.
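The creation functions and attributes described above can be sketched as follows. This is a minimal illustration that assumes NumPy is installed and that PyTorch's default dtype has not been changed; the variable names are only for demonstration.

```python
import numpy as np
import torch

# Build a tensor from a nested Python list; Python floats are stored as float32.
a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])

# torch.tensor() copies the data of a NumPy array and keeps its float64 dtype.
b = torch.tensor(np.arange(6, dtype=np.float64).reshape(2, 3))

# Helper constructors take the shape and, optionally, an explicit dtype.
zeros = torch.zeros(2, 3)
ones = torch.ones(2, 3, dtype=torch.int64)
noise = torch.randn(2, 3)  # samples from a standard normal distribution

# Print the three main attributes of each tensor.
for name, t in [("a", a), ("b", b), ("zeros", zeros), ("ones", ones), ("noise", noise)]:
    print(name, t.dtype, t.device, tuple(t.shape))
```

All five tensors are created on the CPU here, since no device argument is passed.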

Tensor Operations and Broadcasting

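PyTorch supports hundreds of tensor operations, including slicing, reshaping, transposition, and matrix multiplication. Reshaping is commonly performed using view() or reshape(). view() never copies: it returns a tensor that shares the original storage, and it raises an error when the requested shape is incompatible with the tensor's memory layout. reshape() returns a view when it can and otherwise falls back to copying the data.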
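The difference between the two reshaping methods shows up on a non-contiguous tensor. The sketch below uses a transposed tensor as the example case; the flags view_failed and shares_storage are illustrative names, not part of PyTorch.

```python
import torch

x = torch.arange(6)        # contiguous storage: 0, 1, 2, 3, 4, 5
v = x.view(2, 3)           # a view: same storage, new shape
v[0, 0] = 100              # writing through the view changes x as well
print(x[0].item())         # 100

t = v.t()                  # transpose: a non-contiguous view of shape (3, 2)
print(t.is_contiguous())   # False

# view() cannot flatten the non-contiguous tensor and raises RuntimeError.
try:
    t.view(6)
    view_failed = False
except RuntimeError:
    view_failed = True

# reshape() succeeds by copying the data into new storage.
r = t.reshape(6)
shares_storage = r.data_ptr() == x.data_ptr()
print(view_failed, shares_storage)   # True False
```

When it matters whether a copy is made, view() is the stricter choice because it fails loudly instead of copying silently.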

Additionally, PyTorch implements broadcasting rules: during element-wise operations, shapes are aligned from the trailing dimension, and any dimension of size 1 (or a missing leading dimension) is virtually stretched to match the larger tensor. This makes operations like batch-wise bias addition possible without materializing a repeated copy of the smaller tensor.
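A batch-wise bias addition might look like the following sketch. The shapes and values are made up for illustration; expand() is used only to make the no-copy behavior visible.

```python
import torch

batch = torch.zeros(4, 3)               # 4 samples, 3 features each
bias = torch.tensor([0.1, 0.2, 0.3])    # one bias value per feature

# Shapes (4, 3) and (3,) are aligned from the right, so the bias is added to every row.
out = batch + bias
print(out.shape)

# expand() exposes the same mechanism explicitly: a stride-0 view, not a copy.
expanded = bias.expand(4, 3)
print(expanded.stride())                        # (0, 1)
print(expanded.data_ptr() == bias.data_ptr())   # True
```

A stride of 0 along the first dimension means every row reads the same three values in memory.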

Dynamic Graphs and Autograd

PyTorch uses a define-by-run autograd engine to calculate gradients dynamically during execution.

The requires_grad Flag and backward()

To compute gradients automatically, we set a tensor's requires_grad attribute to True (only floating-point and complex tensors support this). This instructs PyTorch to record the operations involving that tensor, building a dynamic computational graph during the forward pass.

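When we call backward() on a scalar output tensor (typically the loss), PyTorch traverses this graph in reverse, computing gradients using the chain rule. The derivatives are accumulated in the .grad attribute of each leaf tensor that requires gradients, such as a model parameter. Each call adds to the stored value rather than overwriting it, so training loops reset gradients between steps.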
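The accumulation behavior is easy to overlook, so here is a small sketch of it. The tensor w and the toy loss 3w are assumptions chosen so the gradient is always 3.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)

# First pass: loss = 3w, so d(loss)/dw = 3.
(3.0 * w).sum().backward()
first = w.grad.clone()

# Second pass without a reset: the new gradient is added to the stored one.
(3.0 * w).sum().backward()
second = w.grad.clone()

# Zero the stored gradient in place, then run a third pass.
w.grad.zero_()
(3.0 * w).sum().backward()
third = w.grad.clone()

print(first.item(), second.item(), third.item())   # 3.0 6.0 3.0
```

In a training loop, the same reset is usually done through the optimizer's zero_grad() method.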

Detaching and Context Management

During model inference or validation, we do not need to compute gradients. We can disable gradient tracking by wrapping the code in a torch.no_grad() context manager, which reduces memory usage and speeds up computation by skipping graph construction.
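The effect of the context manager can be checked by inspecting requires_grad and grad_fn on the results. This is a minimal sketch with an illustrative tensor w standing in for model parameters.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# Normal forward pass: the result is attached to the graph.
tracked = (w * 2.0).sum()
print(tracked.requires_grad, tracked.grad_fn is not None)   # True True

# Inside no_grad(), no graph is recorded for the same computation.
with torch.no_grad():
    untracked = (w * 2.0).sum()
print(untracked.requires_grad, untracked.grad_fn)           # False None

# Tracking resumes as soon as the block exits.
print((w * 2.0).requires_grad)                              # True
```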

Alternatively, we can use .detach() to create a new tensor that shares storage with the original tensor but does not track gradients. Because the storage is shared, in-place changes to the detached tensor also affect the original. Detaching is useful when extracting features or passing predictions to external plotting libraries.
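A typical use of .detach() is converting a result to NumPy for plotting. The sketch below assumes NumPy is installed and that the tensors live on the CPU; the names are illustrative.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 2.0                             # part of the graph

d = y.detach()                          # same storage, no gradient history
print(d.requires_grad)                  # False
print(d.data_ptr() == y.data_ptr())     # True

# A detached CPU tensor converts to NumPy; y.numpy() would raise RuntimeError.
values = d.numpy()
print(values)

# The original tensor is still connected to the graph.
y.sum().backward()
print(x.grad)
```

For a tensor on a GPU, a .cpu() call is also needed before .numpy().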

PyTorch Implementation

We can write a PyTorch script to demonstrate tensor arithmetic, dynamic gradient calculation, and device transfers.

Basic Autograd Demo

This code shows how to define parameters, perform operations, and evaluate gradients automatically:

<pre><code class="language-python">import torch

# Define a tensor with gradient tracking enabled
x = torch.tensor([3.0], requires_grad=True)

# Compute y = x^2 + 5x
y = x ** 2 + 5.0 * x

# Compute gradients: dy/dx = 2x + 5
y.backward()

print("Output y:", y.item())  # 3^2 + 5(3) = 24.0
print("Gradient of y wrt x:", x.grad.item())  # 2(3) + 5 = 11.0</code></pre>

The script prints the output and its gradient. Because y holds a single element, backward() needs no arguments. The autograd engine evaluates the derivative of the quadratic expression at $x=3$, and the result matches the analytical derivative ($2x+5=11$).

GPU Acceleration

We can accelerate tensor operations by transferring the tensors to a GPU device. Checking CUDA availability and using the .to() method is standard practice in PyTorch pipelines:

<pre><code class="language-python">import torch

# Use the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create tensor in host memory (CPU)
x_cpu = torch.randn(1000, 1000)

# Move tensor to the selected device (a no-op when device is the CPU)
x_gpu = x_cpu.to(device)

print("Tensor device:", x_gpu.device)</code></pre>

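This code moves the tensor to the GPU if CUDA is available; otherwise device is "cpu" and the tensor stays in host memory. The .to() method is not in-place: x_cpu itself stays on the CPU, so we use the returned tensor. Operations on x_gpu then run on whichever device holds it, and every operand in an operation must live on that same device.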