Edge Computing for ML Models
Edge ML runs inference directly on devices — smartphones, IoT sensors, embedded systems — eliminating round-trip latency to a cloud server and enabling real-time predictions even without internet connectivity.
Why Deploy to the Edge
Edge inference is essential when latency, connectivity, privacy, or cost make cloud-based prediction impractical.
Use Cases and Benefits
- Real-time applications: Autonomous vehicles, speech recognition, AR/VR (latency budgets are often a few tens of milliseconds or less, sometimes under 10 ms)
- Connectivity-constrained: Industrial IoT, remote monitoring, wearables
- Privacy-sensitive: Medical devices, in-ear assistants (data never leaves the device)
- Cost reduction: No cloud compute or data transfer costs at scale
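Whether a model fits a real-time budget such as the one mentioned above is best checked by measurement on the target device. The following is a minimal sketch, not a full benchmarking harness: `measure_latency_ms`, `meets_budget`, and `dummy_predict` are hypothetical names introduced for illustration, and `dummy_predict` stands in for a real inference call.

```python
import statistics
import time


def measure_latency_ms(predict, sample, warmup=10, runs=200):
    """Time repeated calls to predict(sample); return latency stats in ms."""
    for _ in range(warmup):  # warm up caches before timing
        predict(sample)
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    p95_index = min(len(timings) - 1, int(0.95 * len(timings)))
    return {
        "p50_ms": statistics.median(timings),
        "p95_ms": timings[p95_index],
        "max_ms": timings[-1],
    }


def meets_budget(stats, budget_ms):
    """Compare tail latency (p95), not the average, against the budget."""
    return stats["p95_ms"] <= budget_ms


def dummy_predict(features):
    # Hypothetical stand-in for a real model: a tiny linear threshold rule.
    weights = (0.2, -0.1, 0.4, 0.3)
    return sum(w * x for w, x in zip(weights, features)) > 0.5


stats = measure_latency_ms(dummy_predict, (5.1, 3.5, 1.4, 0.2))
print(stats, "within 10 ms budget:", meets_budget(stats, budget_ms=10.0))
```

Checking the 95th percentile rather than the mean is a design choice here: occasional slow calls are what break a real-time loop.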
Exporting Models to Edge-Compatible Formats
Models trained in the cloud are usually converted to a portable, runtime-friendly format such as ONNX before they are deployed to edge devices.
Exporting Scikit-Learn to ONNX
<pre><code class="language-python">from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime as rt
import numpy as np

# Train a small model to export
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=10, random_state=42)
model.fit(X, y)

# Convert to ONNX; None leaves the batch dimension dynamic
initial_type = [("float_input", FloatTensorType([None, X.shape[1]]))]
onnx_model = convert_sklearn(model, initial_types=initial_type)
with open("rf_iris.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

# Run inference with ONNX Runtime (cross-platform, optimised)
sess = rt.InferenceSession("rf_iris.onnx")
input_name = sess.get_inputs()[0].name
# The converter declared a float32 input, so cast before running
pred = sess.run(None, {input_name: X[:3].astype(np.float32)})
print("ONNX predictions:", pred[0])  # pred[0] holds the predicted labels</code></pre>
Model Quantization for Smaller Footprint
Quantization converts model weights from 32-bit floats to 8-bit integers, which can shrink weight storage by up to ~4x and speed up CPU inference, usually with only a small accuracy loss.
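The arithmetic behind 8-bit quantization can be sketched in a few lines of plain Python. This is an illustrative affine (scale plus zero-point) scheme, not the exact procedure any particular runtime uses; `quantize_uint8` and `dequantize` are hypothetical helper names and the weights are made-up values.

```python
def quantize_uint8(weights):
    """Affine-quantize a list of floats to integers in [0, 255]."""
    # Include 0.0 in the range so that zero is represented exactly.
    lo = min(min(weights), 0.0)
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / 255.0
    if scale == 0.0:  # all-zero input: any positive scale works
        scale = 1.0
    zero_point = round(-lo / scale)
    quantized = [max(0, min(255, round(w / scale) + zero_point)) for w in weights]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Map 8-bit integers back to approximate float values."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-0.62, -0.1, 0.0, 0.31, 1.4]  # made-up example weights
q, scale, zero_point = quantize_uint8(weights)
restored = dequantize(q, scale, zero_point)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

fp32_bytes = len(weights) * 4  # 4 bytes per 32-bit float
int8_bytes = len(q) * 1        # 1 byte per 8-bit integer
print("quantized:", q)
print(f"scale={scale:.5f} zero_point={zero_point} max_error={max_err:.5f}")
print(f"storage: {fp32_bytes} B -> {int8_bytes} B")
```

The rounding error per weight is bounded by about half the scale, which is where the "small accuracy loss" comes from, and the 4-bytes-to-1-byte change is the source of the roughly 4x size reduction. The ONNX Runtime call that follows applies the same idea to a saved model file.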
<pre><code class="language-python">import os

from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic quantization (post-training, no calibration data needed)
# Note: it mainly targets MatMul/Gemm-style operators, so a tree ensemble
# such as rf_iris.onnx may shrink very little; neural networks benefit most.
quantize_dynamic(
    model_input="rf_iris.onnx",
    model_output="rf_iris_quantized.onnx",
    weight_type=QuantType.QUInt8,
)

# Compare file sizes before and after quantization
orig_size = os.path.getsize("rf_iris.onnx") / 1024
quant_size = os.path.getsize("rf_iris_quantized.onnx") / 1024
print(f"Original: {orig_size:.1f} KB | Quantized: {quant_size:.1f} KB")</code></pre>
Optimisation Techniques for Edge Deployment
Beyond quantization, several model compression techniques reduce memory and compute requirements for resource-constrained devices.
Compression Techniques Overview
- Quantization: FP32 → INT8 weights (up to 4x size reduction, often 2–4x speedup on CPU)
- Pruning: Remove near-zero weights to create sparse models (some models tolerate up to 90% sparsity with <1% accuracy drop)
- Knowledge Distillation: Train a small student model to mimic a large teacher model's outputs
- Hardware-Specific Compilation: Apache TVM, TensorFlow Lite (TFLite), and Core ML compile or convert models into device-optimised formats for maximum throughput
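The pruning and distillation entries above can be illustrated with a small plain-Python sketch. It assumes unstructured magnitude pruning and a commonly used temperature-softened KL loss; `magnitude_prune`, `softmax`, and `distillation_loss` are hypothetical helper names, and the weights and logits are made-up values rather than outputs of a real model.

```python
import math


def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    k = int(len(weights) * sparsity)  # number of weights to remove
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    pruned, removed = [], 0
    for w in weights:
        if abs(w) <= threshold and removed < k:
            pruned.append(0.0)
            removed += 1
        else:
            pruned.append(w)
    return pruned


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; higher temperature gives softer outputs."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


weights = [0.8, -0.02, 0.01, -0.6, 0.03, 0.4, -0.005, 0.9, 0.02, -0.3]
pruned = magnitude_prune(weights, sparsity=0.5)
print("pruned weights:", pruned)

teacher = [4.0, 1.0, 0.5]      # made-up teacher logits
good_student = [3.8, 1.1, 0.4]
poor_student = [0.5, 1.0, 4.0]
print("loss (close student):", distillation_loss(good_student, teacher))
print("loss (poor student): ", distillation_loss(poor_student, teacher))
```

In practice the distillation term is usually combined with the ordinary label loss during student training, and pruned models need a sparse-aware runtime or storage format to turn the zeros into actual memory and speed gains.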