Hardware: CPU vs. GPU vs. TPU processing

Deep learning models require massive computational resources for training and inference. Understanding the architectural differences between CPUs (latency-oriented), GPUs (throughput-oriented), and TPUs (matrix-oriented) is critical for optimizing execution profiles.

Architectural Comparisons

CPUs focus on low-latency sequential execution, while GPUs leverage thousands of simple cores to maximize parallel throughput.

CPU Architecture (Latency-Oriented)

Central Processing Units (CPUs) are designed for general-purpose computing. A CPU architecture is latency-oriented, optimized to execute a single thread of instructions as quickly as possible. To achieve this, CPUs feature a small number of highly complex cores (typically 4 to 64) with large caches (L1, L2, L3) and sophisticated control units that handle out-of-order execution, branch prediction, and instruction-level pipelining.

This design is ideal for tasks that require complex decision-making, sequential logic, and low-latency execution. However, deep learning computations consist of millions of simple, repetitive arithmetic operations (like matrix multiplications) that can be run in parallel. A CPU's limited core count restricts its parallel throughput, making it inefficient for training large deep learning networks.

GPU Architecture (Throughput-Oriented)

Graphics Processing Units (GPUs) are throughput-oriented processors designed for massive parallel execution. Instead of a few complex cores, a GPU features thousands of smaller, simpler cores optimized for floating-point arithmetic. The hardware architecture prioritizes raw compute cores over large caches and control units, allocating most of the silicon area to Arithmetic Logic Units (ALUs).

This layout is designed for SIMD (Single Instruction, Multiple Data) or SIMT (Single Instruction, Multiple Threads) execution models, where the same operation (such as multiplying two matrices) is executed simultaneously across different threads. This parallel capacity allows GPUs to compute large matrix operations in a fraction of the time required by a CPU, making them the standard hardware for deep learning training.

Tensor Processing Units (TPUs)

TPUs optimize matrix math using specialized hardware like systolic arrays, integrated into PyTorch via the XLA compiler.

Systolic Arrays

Tensor Processing Units (TPUs), developed by Google, are Application-Specific Integrated Circuits (ASICs) designed specifically for deep learning workloads. The core innovation in TPUs is the Systolic Array. In standard CPU and GPU architectures, the processor must constantly read and write intermediate results to registers or caches between operations, consuming memory bandwidth and power.

A systolic array addresses this by connecting ALUs directly in a 2D grid. Data flows through the grid in a continuous stream, with each cell performing a multiply-accumulate operation and passing the result directly to its neighbor without accessing registers. This design achieves massive computational density and energy efficiency, allowing TPUs to compute matrix multiplications (such as \\( 256 \\times 256 \\) matrices) in a single clock cycle.

PyTorch XLA Integration

To execute PyTorch models on TPUs, we use the Accelerated Linear Algebra (XLA) compiler via the torch_xla library. XLA compiles PyTorch's dynamic computational graphs into a set of optimized machine instructions for TPUs. Unlike GPUs, which run operations eagerly, XLA traces the model's execution, groups operations together, and compiles them into a static execution graph.

This compilation step reduces overhead, fuses adjacent layers, and optimizes memory access. However, because XLA compiles static graphs, any change in input tensor shapes during training triggers recompilation, causing significant delays. Practitioners must ensure that all input batches are padded to a fixed shape to maintain training performance on TPUs.

Performance Profiles and Bandwidth

Hardware selection depends on memory bandwidth and the balance between compute-bound and memory-bound operations.

Memory Bandwidth Bottlenecks

The execution speed of a deep learning model is determined by two limits: the computational capacity of the processor (measured in FLOPS) and the memory bandwidth (the rate at which data can be transferred from RAM to the processor cores). Operations like matrix multiplications are compute-bound: they perform many calculations per byte of data, so performance is limited by the processor's speed.

Operations like element-wise activations, batch normalization, and dropout are memory-bound: they perform few calculations per byte of data, so performance is limited by the rate at which data can be read from memory. GPUs and TPUs use specialized high-bandwidth memory (such as GDDR6 or HBM2) to mitigate these bottlenecks, but memory access remains a key limit in deep learning optimization.

Execution Latency vs. Throughput

When selecting hardware, practitioners must balance latency and throughput. For training large models, throughput is the primary metric: we want to process as many samples per second as possible, making TPUs and GPUs the ideal choice. For real-time inference (such as autonomous driving or speech synthesis), latency is the primary metric: we want to generate a prediction for a single sample as quickly as possible.

At low batch sizes (e.g., a batch size of 1), the overhead of parallel GPU initialization can make it slower than a high-frequency CPU. Understanding these trade-offs is essential for selecting the right hardware architecture for each phase of the deep learning pipeline.