Understanding CUDA Cores and VRAM

Graphics Processing Units (GPUs) accelerate deep learning computations using CUDA cores for standard arithmetic and Tensor cores for mixed-precision matrix multiplication. Effectively managing Video RAM (VRAM) is essential for scaling models and preventing Out of Memory (OOM) errors.


GPU Architecture and CUDA Cores

GPUs parallelize computations across Streaming Multiprocessors, executing instructions concurrently using standard CUDA cores and specialized Tensor cores.

CUDA Execution Model

NVIDIA's CUDA (Compute Unified Device Architecture) defines the programming and execution model for parallel processing on GPUs. At the hardware level, a GPU is divided into multiple Streaming Multiprocessors (SMs). Each SM contains a collection of execution units called CUDA Cores, along with a register file, shared memory, and cache. The execution model follows a Single Instruction, Multiple Threads (SIMT) paradigm. Threads are grouped into blocks, and each block is assigned to one SM. Within an SM, threads are scheduled and executed in groups of 32 called warps.
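To make the thread, block, and warp grouping concrete, here is a small arithmetic sketch. It assumes a launch with one thread per data element and a block size of 256 threads; both the block size and the helper name `launch_shape` are illustrative choices, not values taken from the text.

```python
import math

WARP_SIZE = 32  # threads per warp, as described above


def launch_shape(num_elements, threads_per_block=256):
    """Blocks and warps needed when each thread handles one element.

    threads_per_block=256 is an assumed, commonly used block size.
    """
    blocks = math.ceil(num_elements / threads_per_block)
    warps_per_block = math.ceil(threads_per_block / WARP_SIZE)
    return {
        "blocks": blocks,
        "warps_per_block": warps_per_block,
        "total_warps": blocks * warps_per_block,
    }


shape = launch_shape(1_000_000)
print(shape)
```

For one million elements this gives 3907 blocks of 8 warps each. The last block is only partly filled, which is why real kernels usually guard against out-of-range thread indices.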

All threads within a warp execute the same instruction at the same time on different data elements. If threads in the same warp take different paths at a conditional branch (e.g., an if/else statement), the warp executes each path in turn while the threads on the other path sit idle. This effect, known as warp divergence, can substantially reduce throughput. A branch that every thread in a warp resolves the same way carries no such penalty. Kernel designers therefore try to minimize data-dependent branching within a warp so that all of its threads perform the same operations in parallel.
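The cost of divergence can be illustrated with a toy model in plain Python. The sketch below assumes that a warp needs one pass per distinct branch outcome among its threads; this is a simplification for illustration and not a simulation of real hardware scheduling. The function name `total_passes` is hypothetical.

```python
WARP_SIZE = 32


def total_passes(predicates):
    """Count serialized passes over an if/else under a toy model.

    Each warp of 32 threads needs one pass per distinct branch outcome
    taken by its threads (1 if uniform, 2 if divergent).
    """
    total = 0
    for start in range(0, len(predicates), WARP_SIZE):
        warp = predicates[start:start + WARP_SIZE]
        total += len(set(warp))
    return total


# 64 threads whose branch outcome alternates: both warps diverge.
interleaved = [i % 2 == 0 for i in range(64)]
# Same outcomes, but grouped so that each warp is uniform.
grouped = sorted(interleaved)

print(total_passes(interleaved))  # 4
print(total_passes(grouped))      # 2
```

Grouping the data so that neighbouring threads take the same path halves the number of passes in this model, which is the intuition behind arranging work to keep warps uniform.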

Tensor Cores

NVIDIA introduced Tensor Cores with the Volta architecture to accelerate the matrix operations at the heart of deep learning. While a standard CUDA core computes one fused multiply-add per clock cycle, a Volta Tensor Core performs an entire matrix multiply-accumulate operation \\( \\mathbf{D} = \\mathbf{A} \\times \\mathbf{B} + \\mathbf{C} \\) on \\( 4 \\times 4 \\) matrices per clock cycle, which amounts to 64 multiply-adds. On Volta, \\( \\mathbf{A} \\) and \\( \\mathbf{B} \\) are half-precision (FP16) matrices, and later architectures add further input formats such as BF16, while \\( \\mathbf{C} \\) and \\( \\mathbf{D} \\) can be in half or single precision (FP32).
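To see how much work one such matrix operation represents, the following plain-Python sketch computes the multiply-accumulate for 4 by 4 matrices and counts the scalar multiply-adds involved. It only illustrates the arithmetic; it does not use Tensor Cores, and the helper name `matmul_accumulate` is hypothetical.

```python
def matmul_accumulate(a, b, c):
    """Compute D = A x B + C for square list-of-lists matrices.

    Returns the result together with the number of scalar multiply-adds.
    """
    n = len(a)
    fma_count = 0
    d = [row[:] for row in c]  # start from C, then accumulate A x B
    for i in range(n):
        for j in range(n):
            for k in range(n):
                d[i][j] += a[i][k] * b[k][j]
                fma_count += 1
    return d, fma_count


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
ones = [[1.0] * 4 for _ in range(4)]

d, fma_count = matmul_accumulate(identity, ones, ones)
print(fma_count)  # 64
```

A unit that issues one multiply-add per cycle would need 64 cycles for the same result, which is where the per-operation advantage of a matrix unit comes from.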

This design greatly speeds up the core computations of deep learning (linear layers and convolutions), which reduce to large matrix multiplications. For this reason, many model architectures choose layer dimensions (such as channel counts and hidden sizes) that are multiples of 8 or 16. This alignment makes it easier for the underlying libraries to map the operations onto Tensor Cores without padding or falling back to slower code paths.
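A common way to apply this alignment rule is to round a dimension up to the next multiple before building the layer. The helper below is a minimal sketch; the name `pad_to_multiple` and the example vocabulary size are illustrative assumptions.

```python
def pad_to_multiple(size, multiple=8):
    """Round size up to the nearest multiple (8 by default)."""
    return ((size + multiple - 1) // multiple) * multiple


# Example: a vocabulary of 50257 entries padded for alignment.
print(pad_to_multiple(50257, 8))   # 50264
print(pad_to_multiple(50257, 16))  # 50272
print(pad_to_multiple(4096, 8))    # 4096 (already aligned)
```

The extra rows or columns are simply unused, so the padding costs a little memory in exchange for a shape that matrix units handle efficiently.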

Video RAM (VRAM) Management

Managing the GPU memory hierarchy is crucial, and PyTorch provides dedicated APIs to profile and manage VRAM allocations.

Memory Hierarchy

The GPU memory hierarchy consists of several tiers with different capacities and latencies. Global memory (VRAM) is the largest tier (typically 8 to 80 GB), but it has the highest access latency. To speed up calculations, SMs rely on fast on-chip shared memory and registers. Shared memory is a small, software-managed on-chip memory shared among the threads in a block, while registers are private to individual threads. Moving data between global VRAM and the SMs is a frequent bottleneck in deep learning operations.

If a model performs many element-wise operations (such as ReLU activations or additions) sequentially, the GPU cores spend more time waiting for memory transfers than executing calculations. This scenario is memory-bandwidth bound. Techniques like operator fusion combine multiple sequential operations into a single kernel, reducing global VRAM access and improving throughput.
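The benefit of fusion can be estimated with simple arithmetic. The sketch below assumes a chain of unary element-wise operations on FP32 data, where every kernel reads its input from global memory once and writes its output once; the function name `elementwise_traffic_bytes` is hypothetical and caching effects are ignored.

```python
def elementwise_traffic_bytes(num_elements, num_ops, bytes_per_element=4, fused=False):
    """Bytes moved to and from global memory for a chain of element-wise ops.

    Assumes one read and one write of the full tensor per kernel launch.
    A fused chain runs as a single kernel.
    """
    kernels = 1 if fused else num_ops
    return kernels * 2 * num_elements * bytes_per_element


n = 1_000_000
unfused = elementwise_traffic_bytes(n, num_ops=3)
fused = elementwise_traffic_bytes(n, num_ops=3, fused=True)
print(unfused, fused, unfused / fused)
```

Under these assumptions, fusing three operations cuts global memory traffic to a third, because the intermediate results stay in registers instead of making a round trip through VRAM.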

PyTorch Memory Profiling

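PyTorch uses a caching memory allocator to speed up VRAM management. When a tensor is deleted in Python, PyTorch does not immediately return its memory to the GPU driver; instead, it keeps the block in a cache so that future allocations can reuse it without a costly driver call. As a result, the reserved memory reported by PyTorch (and by tools such as nvidia-smi) can be larger than the memory actually occupied by live tensors. The following script demonstrates how to profile and manage this cache: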

<pre><code class="language-python">import torch
import torch.nn as nn

# Check if CUDA is available
if torch.cuda.is_available():
    device = torch.device("cuda")

    # Simulate model loading
    model = nn.Sequential(nn.Linear(10000, 10000), nn.ReLU()).to(device)
    x = torch.randn(100, 10000, device=device)

    # Run forward pass
    out = model(x)

    # Log memory metrics
    allocated = torch.cuda.memory_allocated(device) / (1024**2)  # Convert to MB
    reserved = torch.cuda.memory_reserved(device) / (1024**2)  # Convert to MB
    print(f"Allocated VRAM: {allocated:.2f} MB")
    print(f"Cached/Reserved VRAM: {reserved:.2f} MB")

    # Clear the caching allocator cache to free unused memory
    del out, x
    torch.cuda.empty_cache()
    print("Cache cleared.")</code></pre>
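As a sanity check on the numbers the script reports, the expected footprint of its tensors can be estimated by hand. The sketch below assumes FP32 storage (4 bytes per value) and counts only the weight, bias, input, and final output. It should be read as a lower bound, since the allocator rounds block sizes and autograd may keep further intermediates alive. The helper name `tensor_bytes` is hypothetical.

```python
BYTES_FP32 = 4


def tensor_bytes(*shape, bytes_per_element=BYTES_FP32):
    """Size in bytes of a dense tensor with the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


weight = tensor_bytes(10000, 10000)  # nn.Linear(10000, 10000) weight
bias = tensor_bytes(10000)           # nn.Linear bias
x = tensor_bytes(100, 10000)         # input batch
out = tensor_bytes(100, 10000)       # model output

estimate = weight + bias + x + out
print(f"Estimated minimum: {estimate / 1024**2:.2f} MB")
```

The weight matrix dominates this estimate, so the allocated figure printed by the script should sit somewhat above the weight size, with the reserved figure at or above the allocated one.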

Out of Memory (OOM) Errors and Optimization

Out of Memory errors occur when activations and parameters exceed VRAM capacity, requiring optimization techniques to resolve.

Causes of OOM

The notorious "CUDA Out of Memory" error occurs when a requested VRAM allocation exceeds the global memory still available. During model training, VRAM must hold four main components: the model parameters (weights and biases), their gradients, the optimizer states (e.g., the first- and second-moment estimates in Adam, which together require twice the memory of the weights), and the intermediate activations computed during the forward pass. These activations must be kept in memory because they are needed to calculate gradients during the backward pass.
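The fixed part of this budget, meaning everything except activations, can be estimated from the parameter count alone. The sketch below assumes FP32 storage throughout and an Adam-style optimizer with two state values per parameter. It also counts one gradient value per parameter, which training stores alongside the weights. The function name `training_state_bytes` is hypothetical.

```python
def training_state_bytes(num_params, bytes_per_value=4):
    """Memory for weights, gradients, and Adam states, excluding activations.

    Assumes every value is stored at the same precision (FP32 by default).
    """
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value
    adam_states = 2 * num_params * bytes_per_value  # two moment estimates
    return weights + gradients + adam_states


one_billion = 1_000_000_000
total = training_state_bytes(one_billion)
print(f"{total / 1e9:.0f} GB before any activations")
```

Under these assumptions a model costs 16 bytes per parameter before a single activation is stored, so a one-billion-parameter model already needs about 16 GB.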

As batch size, sequence length, or image dimensions increase, the memory occupied by activations grows with them, and it grows again with every additional layer. In deep models this activation cache can easily exceed VRAM limits and trigger an OOM error. Reducing the batch size, using mixed precision, or discarding activations and recomputing them as needed are common strategies to keep VRAM usage in check.
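A back-of-the-envelope model shows how these factors multiply. The sketch below makes a strong simplifying assumption: each layer saves exactly one activation tensor of shape (batch, sequence length, hidden size). Real layers save several tensors, so the absolute numbers are only indicative, but the scaling behaviour carries over. The function name and the example sizes are hypothetical.

```python
def activation_bytes(batch_size, seq_len, hidden_size, num_layers, bytes_per_value=4):
    """Rough activation memory, assuming one saved tensor per layer."""
    return batch_size * seq_len * hidden_size * num_layers * bytes_per_value


base = activation_bytes(batch_size=8, seq_len=1024, hidden_size=4096, num_layers=32)
doubled_batch = activation_bytes(batch_size=16, seq_len=1024, hidden_size=4096, num_layers=32)
half_precision = activation_bytes(8, 1024, 4096, 32, bytes_per_value=2)

print(base / 1024**3, doubled_batch / 1024**3, half_precision / 1024**3)
```

Doubling the batch size doubles the estimate, while storing activations in half precision halves it, which is why batch size and precision are usually the first settings to adjust after an OOM error.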

VRAM Optimization Techniques

Several optimization techniques can prevent OOM errors. The first is gradient accumulation: instead of updating the model weights using a large batch of size \\( N \\), we process the data in smaller micro-batches of size \\( M \\) (where \\( M < N \\)), accumulate the gradients over \\( N/M \\) steps, and then execute a single optimizer update. If the loss of each micro-batch is averaged and divided by \\( N/M \\), the accumulated gradient matches the gradient of the large batch, while activation memory stays within the limits of the smaller micro-batch. The equivalence does not hold exactly for layers whose output depends on batch statistics, such as batch normalization.
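The procedure can be sketched in PyTorch as follows. This is a minimal CPU example under stated assumptions: a tiny linear model, random data, a mean-reduced MSE loss, and plain SGD, none of which come from the text. It compares the accumulated gradient with the gradient of the full batch to check that the two agree.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Linear(16, 1)                      # toy model (assumption)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()                        # mean reduction

inputs = torch.randn(32, 16)                  # full batch, N = 32
targets = torch.randn(32, 1)
micro_batch_size = 8                          # M = 8
accumulation_steps = inputs.shape[0] // micro_batch_size  # N / M = 4

# Reference: gradient from a single pass over the full batch.
optimizer.zero_grad()
loss_fn(model(inputs), targets).backward()
full_batch_grad = model.weight.grad.clone()

# Gradient accumulation: several small backward passes, one update.
optimizer.zero_grad()
micro_batches = zip(inputs.split(micro_batch_size), targets.split(micro_batch_size))
for x, y in micro_batches:
    # Divide by the number of steps so the sum equals the full-batch mean.
    loss = loss_fn(model(x), y) / accumulation_steps
    loss.backward()                           # gradients add up in .grad
accumulated_grad = model.weight.grad.clone()

optimizer.step()                              # single weight update
optimizer.zero_grad()

print(torch.allclose(full_batch_grad, accumulated_grad, atol=1e-6))
```

Only one micro-batch of activations is alive at any time, since each `backward()` call frees the graph of the micro-batch it just processed.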

The second technique is activation checkpointing (also called gradient checkpointing). Instead of caching all intermediate activations during the forward pass, we keep the activations of only a subset of layers, the checkpoints. During the backward pass, we recompute the missing intermediate activations on the fly by rerunning the forward pass between checkpoints. This trade-off costs roughly one extra forward pass, commonly quoted as about 30% more compute time, but it drastically reduces activation VRAM usage and allows developers to train much larger models.
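PyTorch exposes this technique through `torch.utils.checkpoint.checkpoint`. The sketch below is a minimal CPU example with an assumed toy block of two linear layers. It checks that the checkpointed forward and backward passes give the same output and input gradient as the ordinary ones. The memory saving itself is not measured here.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)

# Toy block (assumption); in practice this would be a transformer or ResNet block.
block = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
x = torch.randn(4, 64, requires_grad=True)

# Ordinary pass: all intermediate activations inside the block are kept.
out_plain = block(x)
out_plain.sum().backward()
grad_plain = x.grad.clone()

# Reset gradients before the second run.
x.grad = None
block.zero_grad()

# Checkpointed pass: intermediates inside the block are not stored and are
# recomputed from x during backward.
out_ckpt = checkpoint(block, x, use_reentrant=False)
out_ckpt.sum().backward()
grad_ckpt = x.grad.clone()

print(torch.allclose(out_plain, out_ckpt), torch.allclose(grad_plain, grad_ckpt))
```

In a deep model, each repeated block is typically wrapped this way, so that only the block inputs are stored and everything in between is recomputed when gradients are needed.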