Limitations of Floating Point Math in Computing

Computers store numbers in binary using a finite number of bits. This means most decimal fractions, even something as simple as 0.1, cannot be represented exactly. In a single operation, the error is tiny. Across billions of operations in a deep network, these errors can accumulate into NaN values, diverging training runs, and subtly wrong model outputs.
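As a small illustration of how per-operation rounding adds up, the sketch below adds 0.1 ten times with a plain loop and compares the result with `math.fsum`, which compensates for the lost low-order bits. The helper name `naive_sum` is made up for this example.

```python
import math

def naive_sum(values):
    """Add values left to right, rounding after every addition."""
    total = 0.0
    for v in values:
        total += v
    return total

tenths = [0.1] * 10

print(naive_sum(tenths))         # 0.9999999999999999
print(naive_sum(tenths) == 1.0)  # False
print(math.fsum(tenths) == 1.0)  # True: fsum tracks the rounding error
```

Ten additions already lose the last digit; a training run performs vastly more operations, usually at lower precision than FP64.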


Why 0.1 + 0.2 ≠ 0.3

A 64-bit float (FP64/float64) stores a number as a sign bit, an 11-bit exponent, and a 52-bit mantissa. With 52 stored bits, the mantissa can distinguish only $2^{52}$ values between consecutive powers of two, and 0.1 in binary is a repeating fraction, like $1/3$ in decimal, so it gets rounded to the nearest representable value.
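To see that rounding directly, the standard library can print the exact value the literal 0.1 is stored as. This is a minimal sketch using `decimal` and `fractions`; the variable names are illustrative only.

```python
from decimal import Decimal
from fractions import Fraction

# Decimal(float) converts the stored binary value exactly, with no extra rounding.
stored = Decimal(0.1)
print(stored)  # 0.1000000000000000055511151231257827021181583404541015625

# The same value as an exact ratio: the denominator is a power of two.
ratio = Fraction(0.1)
print(ratio.denominator == 2**55)  # True

# The hex form shows the repeating bit pattern (hex digit 9),
# cut off after 52 bits and rounded up to 'a' in the last digit.
hex_form = (0.1).hex()
print(hex_form)  # 0x1.999999999999ap-4
```

The stored value is slightly larger than 0.1, which is why sums built from it can drift away from the decimal result.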

The Classic Floating Point Surprise

<pre><code class="language-python">print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False

# The right way to compare floats:
import numpy as np
print(np.isclose(0.1 + 0.2, 0.3))     # True
print(abs((0.1 + 0.2) - 0.3) < 1e-9)  # True
</code></pre>

Float Precision in AI: FP64, FP32, FP16, BF16

Different training scenarios use different precisions, trading accuracy for memory and speed:

  • FP64 (float64): 64 bits, highest precision. Used in scientific computing but rarely in deep learning: it doubles memory use relative to FP32 and runs much slower on most GPUs.
  • FP32 (float32): 32 bits, the standard for training neural networks. Good balance of precision and speed.
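  • FP16 (float16): 16 bits, half the memory of FP32 and often around 2× faster on hardware with native FP16 support. Its 5-bit exponent makes it prone to underflow and overflow (the largest finite value is 65504), so training usually needs loss scaling.
  • BF16 (bfloat16): 16 bits with the same 8-bit exponent as FP32, so a much larger range than FP16, at the cost of fewer mantissa bits (7 versus 10). Preferred on modern GPUs/TPUs for stability.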
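The range-versus-precision trade-off between FP16 and BF16 can be sketched with the standard library alone. The helpers `to_fp16` and `to_bf16_truncated` below are hypothetical names for this example. The BF16 helper keeps the top 16 bits of the FP32 encoding, which truncates rather than rounding to nearest as real hardware usually does, so treat it as an approximation.

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip x through IEEE 754 half precision (struct format 'e')."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses to pack values beyond the FP16 range;
        # hardware would typically produce infinity instead.
        return float("inf") if x > 0 else float("-inf")

def to_bf16_truncated(x: float) -> float:
    """Approximate bfloat16 by keeping the top 16 bits of the FP32 encoding.

    Assumption: truncation instead of round-to-nearest, for simplicity.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

for value in (0.1, 70000.0, 1e-8):
    print(value, to_fp16(value), to_bf16_truncated(value))
```

In this sketch, 70000.0 overflows to infinity in FP16 and 1e-8 underflows to zero, while the BF16 approximation keeps both values finite and nonzero, only with coarser precision.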

Specifying Dtype in NumPy

<pre><code class="language-python">import numpy as np

# FP32: standard training precision
arr32 = np.array([1.0, 2.0, 3.0], dtype=np.float32)
print(arr32.dtype)   # float32
print(arr32.nbytes)  # 12 bytes (4 bytes × 3)

# FP64: default NumPy
arr64 = np.array([1.0, 2.0, 3.0], dtype=np.float64)
print(arr64.nbytes)  # 24 bytes (8 bytes × 3)

# FP16: smaller, faster, less precise
arr16 = np.array([1.0, 2.0, 3.0], dtype=np.float16)
print(arr16.nbytes)  # 6 bytes
</code></pre>

Mixed Precision Training

Modern deep learning often uses mixed precision: keep a master copy of the weights in FP32 (for accurate gradient accumulation), but run the forward and backward passes in FP16 (for speed). A loss scaler multiplies the loss by a large factor before backprop so that small gradients do not underflow to zero in FP16; the gradients are divided by the same factor again before the optimizer step.
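The effect of loss scaling can be sketched on a single number, without any deep learning framework. The gradient value, the scale factor `SCALE`, and the helper `to_fp16` are all assumptions chosen for illustration; a real scaler adjusts its factor dynamically.

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip x through IEEE 754 half precision (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

SCALE = 65536.0   # 2**16, an illustrative power-of-two scale factor
true_grad = 1e-8  # a gradient too small for FP16 to represent

# Without scaling, the gradient rounds to zero when stored in FP16.
unscaled = to_fp16(true_grad)

# With scaling, the value lands in FP16's normal range and survives.
scaled = to_fp16(true_grad * SCALE)

# Unscaling happens in higher precision (Python floats are FP64 here).
recovered = scaled / SCALE

print(unscaled)   # 0.0
print(recovered)  # close to 1e-8, within FP16 rounding error
```

Using a power of two as the scale factor means the multiplication and division change only the exponent, so the scaling itself adds no rounding error.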

Mixed Precision in PyTorch

<pre><code class="language-python">import torch
from torch.cuda.amp import autocast, GradScaler

model = ...  # your model
optimizer = torch.optim.Adam(model.parameters())
scaler = GradScaler()  # manages loss scaling

# dataloader and loss_fn are assumed to be defined elsewhere
for X, y in dataloader:
    optimizer.zero_grad()

    with autocast():  # eligible ops run in FP16
        output = model(X)
        loss = loss_fn(output, y)

    scaler.scale(loss).backward()  # scales loss to avoid underflow
    scaler.step(optimizer)         # unscales gradients, then updates in FP32
    scaler.update()                # adjusts the scale factor for the next step
</code></pre>

This typically gives a 2–3× speedup on modern GPUs with dedicated low-precision hardware, with negligible accuracy loss; the actual gain depends on the model and the hardware.