Edge Computing for ML Models

Edge ML runs inference directly on devices — smartphones, IoT sensors, embedded systems — eliminating round-trip latency to a cloud server and enabling real-time predictions even without internet connectivity.


Why Deploy to the Edge

Edge inference is essential when latency, connectivity, privacy, or cost make cloud-based prediction impractical.

Use Cases and Benefits

  • Real-time applications: Autonomous vehicles, speech recognition, AR/VR (tight latency budgets, in some cases under 10 ms per inference, which a network round trip alone can exceed)
  • Connectivity-constrained: Industrial IoT, remote monitoring, wearables
  • Privacy-sensitive: Medical devices, in-ear assistants (data never leaves the device)
  • Cost reduction: Inference incurs no per-request cloud compute or data transfer costs, which adds up at scale
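
Whether a model actually fits a real-time budget has to be measured on the target device. The following is a minimal sketch of such a check using only the standard library. The `dummy_predict` function and the 10 ms budget are illustrative assumptions, not part of any framework; in practice the callable would wrap the real inference call.

```python
import statistics
import time


def measure_latency_ms(predict, sample, warmup=10, runs=200):
    """Return (median, p95) latency in milliseconds for predict(sample)."""
    # Warm-up calls let caches and lazy initialisation settle before timing.
    for _ in range(warmup):
        predict(sample)

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1000.0)

    timings.sort()
    median = statistics.median(timings)
    # Approximate 95th percentile: the value at index 0.95 * n of the sorted timings.
    p95 = timings[min(len(timings) - 1, int(0.95 * len(timings)))]
    return median, p95


def dummy_predict(features):
    # Hypothetical stand-in for a real model call.
    return sum(features) > 10.0


median_ms, p95_ms = measure_latency_ms(dummy_predict, [5.1, 3.5, 1.4, 0.2])
BUDGET_MS = 10.0  # illustrative budget; real budgets depend on the application
print(f"median={median_ms:.4f} ms | p95={p95_ms:.4f} ms | within budget: {p95_ms <= BUDGET_MS}")
```

Tail latency (p95 here) is usually a better guide than the mean for real-time work, because occasional slow calls are what break a deadline.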

Exporting Models to Edge-Compatible Formats

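Models trained in the cloud are usually converted to a compact, portable format such as ONNX before deployment, so that a lightweight runtime on the edge device can execute them without the original training framework.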

Exporting Scikit-Learn to ONNX

<pre><code class="language-python">import numpy as np
import onnxruntime as rt
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=10, random_state=42)
model.fit(X, y)

# Convert to ONNX; None leaves the batch dimension dynamic
initial_type = [("float_input", FloatTensorType([None, X.shape[1]]))]
onnx_model = convert_sklearn(model, initial_types=initial_type)
with open("rf_iris.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

# Run inference with ONNX Runtime (cross-platform, optimised)
sess = rt.InferenceSession("rf_iris.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
# ONNX Runtime expects float32 input; the first output holds the predicted labels
pred = sess.run(None, {input_name: X[:3].astype(np.float32)})
print("ONNX predictions:", pred[0])</code></pre>

Model Quantization for Smaller Footprint

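Quantization stores model weights as 8-bit integers instead of 32-bit floats, shrinking weight storage by about 4x and often speeding up CPU inference with little accuracy loss. The benefit comes mainly from weight-heavy neural-network operators such as matrix multiplications; a tree ensemble like the random forest exported above has few such weights, so its file size may barely change.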

<pre><code class="language-python">import os

from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic quantization (post-training, no calibration data needed)
quantize_dynamic(
    model_input="rf_iris.onnx",
    model_output="rf_iris_quantized.onnx",
    weight_type=QuantType.QUInt8,
)

# Compare file sizes on disk. Tree-ensemble operators are generally not
# quantized, so expect little change for this random forest; the saving
# shows up on neural-network models.
orig_size = os.path.getsize("rf_iris.onnx") / 1024
quant_size = os.path.getsize("rf_iris_quantized.onnx") / 1024
print(f"Original: {orig_size:.1f} KB | Quantized: {quant_size:.1f} KB")</code></pre>
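
To make the 32-bit to 8-bit mapping concrete, here is a minimal NumPy sketch of affine (scale and zero-point) quantization applied to a random weight matrix. It illustrates the general idea only and is not a description of what ONNX Runtime does internally; the helper names `quantize_uint8` and `dequantize` are made up for this example.

```python
import numpy as np


def quantize_uint8(weights):
    """Affine-quantize a float array to uint8; return (q, scale, zero_point)."""
    # Extend the range to include 0.0 so that zero maps to an exact integer.
    w_min = min(float(weights.min()), 0.0)
    w_max = max(float(weights.max()), 0.0)
    scale = (w_max - w_min) / 255.0
    if scale == 0.0:
        scale = 1.0  # all-zero input: any scale works
    zero_point = int(round(-w_min / scale))
    q = np.clip(np.round(weights / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map uint8 codes back to approximate float32 values."""
    return ((q.astype(np.float32) - zero_point) * scale).astype(np.float32)


rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.5, size=(256, 128)).astype(np.float32)

q, scale, zero_point = quantize_uint8(weights)
restored = dequantize(q, scale, zero_point)

print("size ratio:", weights.nbytes / q.nbytes)  # 4.0 (4 bytes vs 1 byte per weight)
print("max abs error:", float(np.abs(weights - restored).max()))
print("half a quantization step:", scale / 2)
```

The rounding error per weight is bounded by roughly half a quantization step, which is why accuracy usually degrades only slightly when the weight range is not dominated by outliers.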

Optimisation Techniques for Edge Deployment

Beyond quantization, several model compression techniques reduce memory and compute requirements for resource-constrained devices.

Compression Techniques Overview

  • Quantization: FP32 → INT8 weights (roughly 4x smaller; CPU speedups of 2–4x are commonly cited but depend on hardware and operator support)
  • Pruning: Remove near-zero weights to create sparse models (some networks tolerate up to 90% sparsity with <1% accuracy drop, though results vary by model and task)
  • Knowledge Distillation: Train a small student model to mimic a large teacher model's outputs
  • Hardware-Specific Compilation: Apache TVM compiles models into device-native code, while TensorFlow Lite and Core ML convert them to formats optimised for their on-device runtimes
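
Pruning and knowledge distillation from the list above can both be sketched in a few lines of NumPy. This is an illustration under simplifying assumptions rather than a production recipe: `magnitude_prune` does one-shot global magnitude pruning with no fine-tuning, and `distillation_loss` computes only the soft-target term (a KL divergence between temperature-softened teacher and student distributions). The function names, the temperature of 2.0, and the example logits are all hypothetical.

```python
import numpy as np


def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitudes."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    # The k-th smallest absolute value becomes the pruning threshold.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.where(np.abs(weights) <= threshold, 0.0, weights).astype(weights.dtype)


def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) over temperature-softened distributions."""
    eps = 1e-12  # guards against log(0)
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    # Scaling by T^2 is a common convention that keeps gradient magnitudes
    # comparable across temperatures.
    return float(np.mean(kl) * temperature ** 2)


rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 64))
pruned = magnitude_prune(weights, sparsity=0.9)
print("fraction of zeros:", float(np.mean(pruned == 0.0)))

teacher_logits = np.array([[4.0, 1.0, 0.5], [0.2, 3.0, 0.1]])
student_logits = np.array([[2.5, 1.2, 0.3], [0.5, 2.0, 0.4]])
print("distillation loss:", distillation_loss(student_logits, teacher_logits))
```

In practice a pruned model is usually fine-tuned to recover accuracy, and the distillation term is combined with an ordinary loss on the true labels. Note too that zeroed weights only save memory or time when the storage format or runtime exploits sparsity.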