Limitations of Floating Point Math in Computing
Computers store numbers in binary using a finite number of bits. As a result, most decimal fractions, even something as simple as 0.1, cannot be represented exactly. In a single operation the rounding error is tiny, but across the billions of operations in a deep network these errors can accumulate and surface as NaN values, diverging training runs, and subtly wrong model outputs.
Why 0.1 + 0.2 ≠ 0.3
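A 64-bit float (FP64/float64) stores a number as a sign bit, an 11-bit exponent, and a 52-bit mantissa. The mantissa can distinguish only $2^{52}$ values between any two consecutive powers of two. In binary, 0.1 is a repeating fraction ($0.000110011\ldots_2$), just as $1/3$ is in decimal, so it is rounded to the nearest representable value. The same happens to 0.2 and 0.3, and the rounding errors do not cancel, so 0.1 + 0.2 evaluates to 0.30000000000000004 rather than 0.3.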
The Classic Floating Point Surprise
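A minimal sketch of the effect, using only the Python standard library. The variable names are illustrative. `Decimal` and `float.hex` are used here only to show the value that is actually stored for 0.1.

```python
import math
from decimal import Decimal

total = 0.1 + 0.2

# Exact comparison fails because each operand was rounded before the addition.
print(total == 0.3)   # False
print(repr(total))    # 0.30000000000000004

# The value actually stored for 0.1, written out exactly in decimal.
stored_tenth = Decimal(0.1)
print(stored_tenth)   # 0.1000000000000000055511151231257827021181583404541015625

# The hexadecimal form shows the repeating binary pattern (…999a) cut off at 52 bits.
tenth_hex = (0.1).hex()
print(tenth_hex)      # 0x1.999999999999ap-4

# Compare floats with a tolerance instead of ==.
close_enough = math.isclose(total, 0.3)
print(close_enough)   # True
```

In practice this is why numerical code compares floats with a tolerance (`math.isclose`, or `numpy.allclose` for arrays) rather than with `==`.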
Float Precision in AI: FP64, FP32, FP16, BF16
Different training scenarios use different precisions, trading accuracy for memory and speed:
- FP64 (float64): 64 bits, highest precision. Used in scientific computing but rarely in deep learning, where the extra precision rarely justifies the added compute and memory cost.
- FP32 (float32): 32 bits, the standard for training neural networks. Good balance of precision and speed.
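- FP16 (float16): 16 bits, half the memory of FP32 and up to about 2× faster on hardware with native FP16 support. Its 5-bit exponent gives a narrow range (largest finite value 65504), so it is prone to overflow and underflow, and training usually needs loss scaling.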
- BF16 (bfloat16): 16 bits with the same 8-bit exponent as FP32, so a much larger range than FP16, at the cost of a shorter mantissa (7 bits versus FP16's 10). Preferred on modern GPUs/TPUs for stability.
Specifying Dtype in NumPy
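The following sketch shows one way to choose a precision in NumPy through the `dtype` argument and `astype`, and what each choice costs. The array and variable names are illustrative. Core NumPy has no built-in bfloat16 type, so only FP64, FP32, and FP16 appear here.

```python
import numpy as np

# The same value stored at three precisions, selected with the dtype argument.
x64 = np.array([0.1], dtype=np.float64)
x32 = np.array([0.1], dtype=np.float32)
x16 = np.array([0.1], dtype=np.float16)

for arr in (x64, x32, x16):
    info = np.finfo(arr.dtype)  # machine limits for this float type
    print(
        arr.dtype,
        f"{arr.itemsize} bytes",
        f"stored={float(arr[0]):.20f}",
        f"eps={info.eps}",
        f"max={info.max}",
    )

# An existing array can be converted with astype (this rounds to the new precision).
y16 = x64.astype(np.float16)

# FP16 has a narrow range: large values overflow to inf, tiny ones flush to zero.
with np.errstate(over="ignore", under="ignore"):
    big = np.float32(70000.0).astype(np.float16)   # inf, above the FP16 max of 65504
    tiny = np.float32(1e-8).astype(np.float16)     # 0.0, below the smallest FP16 subnormal
print(big, tiny)
```

The printed `stored` column makes the rounding visible: the fewer bits in the type, the further the stored value drifts from 0.1.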
Mixed Precision Training
Modern deep learning uses mixed precision: keep a master copy of the weights in FP32 (so that small updates accumulate accurately), but run the forward and backward passes in FP16 (for speed). A loss scaler multiplies the loss by a large factor before backpropagation so that small gradients do not underflow to zero in FP16; the gradients are then divided by the same factor before the optimizer step.
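To see why loss scaling helps, here is a small numerical sketch in NumPy. It is not a training loop. The gradient value `grad_true` and the scale factor `scale` are hypothetical numbers chosen for illustration.

```python
import numpy as np

grad_true = 1e-8      # hypothetical small gradient value
scale = 2.0 ** 16     # hypothetical loss scale factor

# Without scaling, the gradient is below FP16's smallest subnormal and becomes zero.
unscaled = np.float16(grad_true)

# With scaling, the scaled gradient fits comfortably in FP16 ...
scaled_fp16 = np.float16(grad_true * scale)

# ... and is unscaled in FP32 before it would be applied to the FP32 master weights.
recovered = np.float32(scaled_fp16) / np.float32(scale)

print(unscaled)     # 0.0
print(recovered)    # close to 1e-8
```

Multiplying the loss by `scale` scales every gradient by the same factor through the chain rule, which is why scaling the loss once is enough.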
Mixed Precision in PyTorch
On modern GPUs, mixed precision training typically gives a 2–3× speedup with negligible accuracy loss.