Weight Quantization (INT8)
Weight quantization converts a model's weights, and often its activations, from 32-bit floating point (FP32) to lower-precision formats such as 8-bit integers (INT8). Storing each value in one byte instead of four shrinks the model by roughly 4x and can accelerate inference on hardware with native integer arithmetic.
Mathematical Principles of Quantization
Quantization maps continuous float values to discrete integer grids using scale factors and zero-point parameters.
Uniform Quantization
Uniform quantization maps a continuous range of real-valued floating-point numbers \\( [r_{min}, r_{max}] \\) to a discrete grid of \\( B \\)-bit integers \\( [q_{min}, q_{max}] \\) (for INT8, the range is \\( [-128, 127] \\) when signed or \\( [0, 255] \\) when unsigned). The mapping is defined as:
\\( q = \\text{clamp}\\left(\\text{round}\\left(\\frac{r}{S}\\right) + Z, q_{min}, q_{max}\\right) \\)
Here \\( S \\) is the scale (a positive floating-point number), \\( Z \\) is the zero-point (the integer that represents the real value 0 exactly), and the clamp saturates any result that falls outside \\( [q_{min}, q_{max}] \\). The scale \\( S \\) and zero-point \\( Z \\) are calculated as: \\( S = \\frac{r_{max} - r_{min}}{q_{max} - q_{min}} \\) and \\( Z = \\text{round}\\left(\\frac{-r_{min}}{S}\\right) + q_{min} \\). The de-quantization step maps the integer back to an approximate float: \\( \\hat{r} = S \\cdot (q - Z) \\). For in-range values the rounding step introduces an error of at most \\( S/2 \\), while values outside the calibrated range are clipped. Both effects can degrade model accuracy if the dynamic range is not calibrated correctly.
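The formulas above can be traced by hand with a few lines of plain Python. This is a minimal sketch, not a library implementation: the function names are illustrative, the example range \\( [-1.0, 3.0] \\) is arbitrary, and the range is assumed to be non-degenerate. Python's built-in `round` rounds halves to even, which may differ from the rounding mode of a particular backend.

```python
def quant_params(r_min, r_max, q_min=-128, q_max=127):
    """Return (S, Z) for the affine mapping described above."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    r_min, r_max = min(r_min, 0.0), max(r_max, 0.0)
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = round(-r_min / scale) + q_min
    return scale, zero_point


def quantize(r, scale, zero_point, q_min=-128, q_max=127):
    q = round(r / scale) + zero_point
    return max(q_min, min(q_max, q))  # clamp to the integer grid


def dequantize(q, scale, zero_point):
    return scale * (q - zero_point)


S, Z = quant_params(-1.0, 3.0)          # S = 4/255, Z = -64
values = [-1.0, 0.0, 0.5, 3.0]
codes = [quantize(v, S, Z) for v in values]      # [-128, -64, -32, 127]
approx = [dequantize(q, S, Z) for q in codes]    # each within S/2 of the input
```

Note that 0.0 maps to the zero-point and de-quantizes back to exactly 0.0, while the other values come back with a small rounding error.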
Symmetric vs. Asymmetric Quantization
Quantization schemes are classified as symmetric or asymmetric. Asymmetric quantization maps the minimum and maximum values of the float range directly to the minimum and maximum values of the integer range, deriving the zero-point \\( Z \\) from the observed range. This approach uses the full integer range efficiently, but a non-zero zero-point adds extra terms to every integer multiply-accumulate, which costs compute during inference.
Symmetric quantization constrains the float range to be symmetric around zero: \\( [-\\alpha, \\alpha] \\) where \\( \\alpha = \\max(|r_{min}|, |r_{max}|) \\). With a signed integer grid, this constraint forces the zero-point \\( Z \\) to be 0, simplifying the de-quantization calculation to: \\( \\hat{r} = S \\cdot q \\). While symmetric quantization simplifies hardware acceleration, it can be less accurate if the distribution of values is highly skewed, because integer codes are spent on values that never occur. Non-negative activations after a ReLU, for example, leave the entire negative half of the grid unused.
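To make the trade-off concrete, the sketch below compares the two schemes on a skewed range of \\( [0, 6] \\), such as the output of a clipped ReLU. The helper names are illustrative, and the symmetric variant assumes the common restricted grid \\( [-127, 127] \\).

```python
def symmetric_scale(r_min, r_max, q_max=127):
    # Range is forced to [-alpha, alpha]; the zero-point is implicitly 0.
    alpha = max(abs(r_min), abs(r_max))
    return alpha / q_max


def asymmetric_scale(r_min, r_max, q_min=-128, q_max=127):
    return (r_max - r_min) / (q_max - q_min)


def codes_used(r_min, r_max, scale):
    # Number of distinct integer codes that values in [r_min, r_max] can hit.
    return round(r_max / scale) - round(r_min / scale) + 1


r_min, r_max = 0.0, 6.0
s_sym = symmetric_scale(r_min, r_max)
s_asym = asymmetric_scale(r_min, r_max)

sym_codes = codes_used(r_min, r_max, s_sym)     # 128
asym_codes = codes_used(r_min, r_max, s_asym)   # 256
step_ratio = s_sym / s_asym                     # 255/127, about 2.0
```

On this range the symmetric scheme reaches only half of the available codes, so its step size, and therefore its worst-case rounding error, is about twice that of the asymmetric scheme.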
Quantization Paradigms
Quantization can be applied after training via Post-Training Quantization or during training using Quantization-Aware Training.
Post-Training Quantization (PTQ)
Post-Training Quantization (PTQ) quantizes a pre-trained FP32 model without retraining. In dynamic PTQ, the weights are quantized to INT8 ahead of time, while the activations stay in FP32 between layers and are quantized on the fly during inference, with their ranges computed from each input. This requires no calibration data and works well for layers whose runtime is dominated by loading weights from memory, such as LSTM and linear layers.
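The split between offline and runtime work in dynamic PTQ can be sketched in plain Python. This is a simplified illustration, not how any particular framework implements it: `sym_quantize` and `dynamic_linear` are hypothetical names, quantization is symmetric and per-tensor, and the "layer" is a single dot product.

```python
def sym_quantize(values, q_max=127):
    """Symmetric per-tensor quantization: returns (integer codes, scale)."""
    alpha = max(abs(v) for v in values) or 1.0  # avoid a zero scale for all-zero input
    scale = alpha / q_max
    codes = [max(-q_max, min(q_max, round(v / scale))) for v in values]
    return codes, scale


# Offline: the weights are quantized once and stored as integers.
weights = [0.8, -0.5, 0.25, 0.1]
w_q, w_scale = sym_quantize(weights)


def dynamic_linear(x):
    # At inference time: the activation scale is derived from this input alone.
    x_q, x_scale = sym_quantize(x)
    acc = sum(wq * xq for wq, xq in zip(w_q, x_q))  # pure integer accumulation
    return acc * w_scale * x_scale                  # rescale back to float


x = [1.0, 2.0, -3.0, 0.5]
y_int8 = dynamic_linear(x)
y_fp32 = sum(w * v for w, v in zip(weights, x))     # -0.9 up to float rounding
```

The integer result lands close to the FP32 reference; the gap is the accumulated rounding error of the two quantization steps.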
In static PTQ, both weights and activations are quantized ahead of time. Because the activation ranges vary depending on the inputs, static PTQ requires a calibration dataset. We pass a representative subset of the training data through the model to monitor the activation distributions, calculate scale and zero-point parameters from them, and freeze those parameters for deployment. Static PTQ avoids the runtime cost of computing activation ranges, but it depends on representative calibration data: inputs that fall outside the calibrated range are clipped, which causes accuracy drops.
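The calibration loop described above can be sketched as a simple min/max tracker. `RangeTracker` is a hypothetical helper written for this illustration, the calibration batches are made-up numbers, and real toolchains often use more robust statistics (histograms or moving averages) than a raw minimum and maximum.

```python
class RangeTracker:
    """Hypothetical helper: records the min/max of all activations it sees."""

    def __init__(self):
        self.r_min = float("inf")
        self.r_max = float("-inf")

    def observe(self, batch):
        self.r_min = min(self.r_min, min(batch))
        self.r_max = max(self.r_max, max(batch))

    def freeze(self, q_min=0, q_max=255):
        # Same scale and zero-point formulas as in the uniform quantization section.
        r_min, r_max = min(self.r_min, 0.0), max(self.r_max, 0.0)
        scale = (r_max - r_min) / (q_max - q_min)
        zero_point = round(-r_min / scale) + q_min
        return scale, zero_point


def quantize(r, scale, zero_point, q_min=0, q_max=255):
    return max(q_min, min(q_max, round(r / scale) + zero_point))


# Calibration: pass representative batches through the tracker.
calibration_batches = [[0.0, 1.2, 3.4], [0.3, 5.1, 2.2], [0.0, 4.0, 0.9]]
tracker = RangeTracker()
for batch in calibration_batches:
    tracker.observe(batch)

scale, zero_point = tracker.freeze()          # frozen for deployment

in_range = quantize(2.0, scale, zero_point)   # 100
clipped = quantize(6.0, scale, zero_point)    # 255: saturates, 6.0 was never seen
```

The last line shows why the calibration set must be representative: a deployment-time value of 6.0 is clipped to the largest code and de-quantizes to about 5.1.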
Quantization-Aware Training (QAT)
For models that suffer from significant accuracy loss under PTQ (compact architectures such as MobileNets are a common example), Quantization-Aware Training (QAT) is the preferred alternative. In QAT, the model is trained with quantization effects modeled in the forward pass. We insert fake quantization modules on the weights and on the activation outputs. These modules quantize the values to INT8 and immediately de-quantize them back to FP32, simulating the rounding errors of quantization during training.
During backpropagation, because the round operation has zero gradient almost everywhere (and is undefined at the rounding boundaries), we use a Straight-Through Estimator (STE). The STE passes the gradients through the round function unchanged: \\( \\frac{\\partial \\hat{r}}{\\partial r} = 1 \\). This allows the optimizer to adjust the weights to compensate for the rounding errors, maintaining high accuracy when the model is converted to INT8.
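The effect of the STE can be seen on a one-parameter toy problem, written here in plain Python with a hand-derived gradient rather than an autograd framework. Everything in this sketch is an assumption chosen for illustration: the fixed step `SCALE`, the target value 0.37, the learning rate, and the function names.

```python
SCALE = 0.1  # assumed fixed quantization step for this toy example


def fake_quant(w, q_min=-128, q_max=127):
    # Quantize, then immediately de-quantize, as a fake quantization module does.
    q = max(q_min, min(q_max, round(w / SCALE)))
    return SCALE * q


def train(use_ste, steps=200, lr=0.05):
    w, x, y = 0.0, 1.0, 0.37  # latent FP32 weight, input, target output
    for _ in range(steps):
        w_hat = fake_quant(w)                    # forward pass sees the quantized weight
        grad_w_hat = 2.0 * (w_hat * x - y) * x   # dL/dw_hat for L = (w_hat * x - y)^2
        # The true derivative of round() is 0 almost everywhere; the STE substitutes 1.
        d_w_hat_d_w = 1.0 if use_ste else 0.0
        w -= lr * grad_w_hat * d_w_hat_d_w
    return w, fake_quant(w)


w_plain, w_hat_plain = train(use_ste=False)  # the weight never moves from 0.0
w_ste, w_hat_ste = train(use_ste=True)       # quantized weight ends at 0.3 or 0.4
```

Without the STE the gradient is zero and the weight is stuck at its initial value. With the STE the latent weight moves toward the target and settles near a rounding boundary, so the quantized weight ends on one of the two grid points adjacent to 0.37.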
PyTorch Quantization Implementation
PyTorch provides a dynamic quantization API that replaces supported layers with INT8 implementations. The resulting speedup comes from backend kernels that use specialized hardware vector instructions.
Dynamic PTQ
This PyTorch script demonstrates how to apply dynamic quantization to an LSTM model, converting targeted layers to INT8 representations:
<pre><code class="language-python">import torch
import torch.nn as nn


class LSTMModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(16, 32, batch_first=True)
        self.fc = nn.Linear(32, 5)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])


model_fp32 = LSTMModel()

# Apply dynamic quantization to the LSTM and Linear layers
# This converts float weights to qint8
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32,            # FP32 model
    {nn.LSTM, nn.Linear},  # Target layer types
    dtype=torch.qint8      # Target integer type
)

print("FP32 Model structure:\n", model_fp32)
print("\nQuantized Model structure:\n", model_int8)
# The quantized model runs inference using optimized integer kernels
</code></pre>
Hardware Acceleration
Simply converting weights to INT8 does not speed up inference unless the deployment hardware supports integer vector instructions. Modern CPUs use instruction sets like AVX-512 VNNI (Vector Neural Network Instructions), and NVIDIA GPUs use specialized DP4A instructions, which execute 8-bit integer dot products in parallel.
These instruction sets perform several INT8 multiplications and accumulate the products into a 32-bit integer in a single instruction, increasing computational throughput. Additionally, because INT8 tensors occupy a quarter of the bytes of FP32 tensors, they need roughly 4x less memory bandwidth, so the hardware spends less time waiting for memory transfers, which accelerates memory-bound layers.
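The arithmetic these instructions perform can be emulated in plain Python to show what one step computes. This is a conceptual sketch only: `dp4a` here is an ordinary Python function, not a binding to any hardware intrinsic, it assumes signed operands on both sides and wrap-around accumulation (some real instructions take an unsigned operand or saturate instead), and it runs no faster than a normal loop.

```python
def dp4a(a4, b4, acc):
    """Emulate a 4-way INT8 dot product accumulated into a 32-bit integer."""
    assert len(a4) == len(b4) == 4
    assert all(-128 <= v <= 127 for v in a4 + b4)
    acc += sum(a * b for a, b in zip(a4, b4))
    # Wrap to signed 32-bit to mimic a fixed-width hardware accumulator.
    return (acc + 2**31) % 2**32 - 2**31


def int8_dot(a, b):
    # Assumes len(a) == len(b) and that the length is a multiple of 4.
    acc = 0
    for i in range(0, len(a), 4):
        acc = dp4a(a[i:i + 4], b[i:i + 4], acc)  # one emulated instruction per 4 pairs
    return acc


a = [127, -128, 5, 0, 1, 2, 3, 4]
b = [127, 127, -2, 9, 4, 3, 2, 1]
result = int8_dot(a, b)  # -117, computed in two emulated instructions
```

The wide accumulator matters: a single product such as 127 * 127 already exceeds the INT8 range, so the sum must be kept in 32 bits and rescaled to float only at the end.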