Unit Testing Machine Learning Code
Unit testing ML code is more nuanced than testing deterministic software: instead of asserting exact numerical outputs, you test data pipeline correctness, model API contracts, and output constraints.
What to Test in ML Code
Focus on testing the things you control: data transformers, feature engineering functions, model interface compliance, and output shape and type. Do not assert the model's exact predictions.
Testing Transformers and Feature Engineering
<pre><code class="language-python"># tests/test_transformers.py
import numpy as np
import pytest
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import StandardScaler


def test_standard_scaler_zero_mean():
    X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    # Each column should be centered on zero after scaling.
    np.testing.assert_allclose(X_scaled.mean(axis=0), [0, 0], atol=1e-10)


def test_standard_scaler_unit_variance():
    X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    # StandardScaler and ndarray.std() both use the population std (ddof=0).
    np.testing.assert_allclose(X_scaled.std(axis=0), [1, 1], atol=1e-10)


def test_scaler_does_not_transform_validation_before_fit():
    scaler = StandardScaler()
    # Assert the specific NotFittedError so unrelated failures do not pass the test.
    with pytest.raises(NotFittedError):
        scaler.transform(np.array([[1.0, 2.0]]))</code></pre>
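The same approach carries over to hand-written feature engineering code. The sketch below is illustrative only: `log1p_income` is a hypothetical helper (not part of scikit-learn or of the tests above), and it assumes negative incomes should be clipped to zero before the log transform. The tests check properties such as shape, finiteness, and ordering rather than exact values.

```python
# tests/test_features.py (hypothetical module name)
import numpy as np


def log1p_income(income):
    """Hypothetical feature: log(1 + income), with negatives clipped to 0."""
    income = np.asarray(income, dtype=float)
    return np.log1p(np.clip(income, 0.0, None))


def test_log1p_income_preserves_shape():
    x = np.array([0.0, 50_000.0, 80_000.0])
    assert log1p_income(x).shape == x.shape


def test_log1p_income_is_finite_for_zero_and_negative_input():
    out = log1p_income([-10.0, 0.0, 1.0])
    assert np.isfinite(out).all()
    # Negative and zero incomes both map to log1p(0) == 0.
    np.testing.assert_allclose(out[:2], [0.0, 0.0])


def test_log1p_income_preserves_ordering():
    x = np.array([10.0, 100.0, 1_000.0])
    out = log1p_income(x)
    # A monotonic transform must keep strictly increasing inputs increasing.
    assert np.all(np.diff(out) > 0)
```

Property-style checks like these tend to survive refactoring better than hard-coded expected arrays, because they describe what the feature must guarantee rather than one specific output.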
Testing Model Output Shape and Type
<pre><code class="language-python"># tests/test_model.py
import numpy as np
import pytest
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier


@pytest.fixture
def trained_model():
    # A small forest with a fixed seed keeps the fixture fast and repeatable.
    X, y = load_iris(return_X_y=True)
    model = RandomForestClassifier(n_estimators=10, random_state=42)
    model.fit(X, y)
    return model, X


def test_predict_returns_correct_shape(trained_model):
    model, X = trained_model
    preds = model.predict(X[:10])
    assert preds.shape == (10,)


def test_predict_proba_sums_to_one(trained_model):
    model, X = trained_model
    probas = model.predict_proba(X[:10])
    # Each row is a probability distribution over the classes.
    np.testing.assert_allclose(probas.sum(axis=1), np.ones(10), atol=1e-6)


def test_predict_classes_are_valid(trained_model):
    model, X = trained_model
    preds = model.predict(X)
    # Iris labels are encoded as 0, 1, 2.
    assert set(preds).issubset({0, 1, 2})</code></pre>
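Another contract that can be checked without pinning exact predictions is reproducibility: two fits with the same seed should agree with each other. The following is a minimal sketch under that assumption. `fit_forest` is a helper introduced here for illustration, and a plain function is used in place of a pytest fixture so that each test can train more than one model.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier


def fit_forest(seed):
    """Fit a small forest on Iris with a fixed seed (helper for this sketch)."""
    X, y = load_iris(return_X_y=True)
    model = RandomForestClassifier(n_estimators=10, random_state=seed)
    return model.fit(X, y), X


def test_same_seed_gives_identical_predictions():
    model_a, X = fit_forest(seed=42)
    model_b, _ = fit_forest(seed=42)
    # Compare two runs with each other instead of against hard-coded labels.
    np.testing.assert_array_equal(model_a.predict(X), model_b.predict(X))
    np.testing.assert_allclose(model_a.predict_proba(X), model_b.predict_proba(X))


def test_predict_proba_has_one_column_per_class():
    model, X = fit_forest(seed=0)
    probas = model.predict_proba(X[:5])
    assert probas.shape == (5, len(model.classes_))
    # Every entry must be a valid probability.
    assert ((probas >= 0) & (probas <= 1)).all()
```

If a test like the first one starts failing, the usual suspect is an unseeded source of randomness somewhere in the training path.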
Testing Pipeline and Data Contracts
Data contracts verify that input data meets the expected schema and constraints, catching upstream data quality issues before they silently corrupt model predictions.
Data Schema Tests
<pre><code class="language-python"># tests/test_data.py
import pandas as pd
import pytest


@pytest.fixture
def sample_df():
    return pd.DataFrame({
        "age": [25, 32, 45],
        "income": [50000.0, 80000.0, 60000.0],
        "label": [0, 1, 0],
    })


def test_required_columns_present(sample_df):
    required = ["age", "income", "label"]
    assert all(col in sample_df.columns for col in required)


def test_no_negative_income(sample_df):
    assert (sample_df["income"] >= 0).all()


def test_label_is_binary(sample_df):
    assert set(sample_df["label"]).issubset({0, 1})</code></pre>
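The tests above only exercise a frame that is already valid. One way to also confirm that bad data is caught is to move the checks into a reusable validator and test it against deliberately broken frames. This is a rough sketch: `find_contract_violations` and `REQUIRED_COLUMNS` are hypothetical names, and the rules simply mirror the three checks above plus a missing-value check added as an assumption.

```python
import pandas as pd

# Hypothetical contract for the sample frame used above.
REQUIRED_COLUMNS = ("age", "income", "label")


def find_contract_violations(df):
    """Return a list of human-readable violations; empty means the frame passes."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        # The remaining checks need these columns, so report and stop early.
        return [f"missing columns: {missing}"]
    violations = []
    if not pd.api.types.is_numeric_dtype(df["income"]):
        violations.append("income is not numeric")
    elif (df["income"] < 0).any():
        violations.append("income contains negative values")
    if df[list(REQUIRED_COLUMNS)].isna().any().any():
        violations.append("required columns contain missing values")
    if not set(df["label"].dropna()).issubset({0, 1}):
        violations.append("label is not binary")
    return violations


def test_valid_frame_has_no_violations():
    df = pd.DataFrame({"age": [25, 32], "income": [50000.0, 80000.0], "label": [0, 1]})
    assert find_contract_violations(df) == []


def test_negative_income_is_reported():
    df = pd.DataFrame({"age": [25], "income": [-1.0], "label": [0]})
    assert find_contract_violations(df) == ["income contains negative values"]


def test_missing_column_is_reported():
    df = pd.DataFrame({"age": [25], "income": [50000.0]})
    assert find_contract_violations(df) == ["missing columns: ['label']"]
```

Returning a list of violations instead of raising on the first one makes it easier to report every problem in a bad batch at once. The same function could also be called at pipeline runtime, not only from tests.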
Parametrized Tests for Robustness
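<pre><code class="language-python"># Test that prediction doesn't crash on various input sizes.
# Assumes the trained_model fixture above is in scope (same module or conftest.py).
import pytest


@pytest.mark.parametrize("n_samples", [1, 10, 100, 1000])
def test_predict_various_sizes(trained_model, n_samples):
    model, X = trained_model
    # Iris has only 150 rows, so larger requests are capped at len(X).
    X_sub = X[:min(n_samples, len(X))]
    preds = model.predict(X_sub)
    assert len(preds) == len(X_sub)</code></pre>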
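Robustness also covers inputs the model should refuse. The sketch below parametrizes over malformed arrays and expects scikit-learn's input validation to raise `ValueError`. That matches the behavior of recent scikit-learn releases as far as I know, but it is worth confirming against the version you have installed. `fit_iris_forest` is a helper introduced here for illustration.

```python
import numpy as np
import pytest
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier


def fit_iris_forest():
    """Fit a small forest on the 4-feature Iris data (helper for this sketch)."""
    X, y = load_iris(return_X_y=True)
    return RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)


@pytest.mark.parametrize(
    "bad_input",
    [
        np.zeros((3, 2)),   # too few features
        np.zeros((3, 10)),  # too many features
        np.zeros(4),        # 1-D array instead of a 2-D batch
    ],
    ids=["too_few_features", "too_many_features", "one_dimensional"],
)
def test_predict_rejects_malformed_input(bad_input):
    model = fit_iris_forest()
    # The model was trained on 2-D input with 4 columns; anything else should fail loudly.
    with pytest.raises(ValueError):
        model.predict(bad_input)
```

The `ids` argument gives each case a readable name in the pytest report, which helps when only one of the malformed inputs regresses.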
Running Tests and Coverage
Use pytest with pytest-cov to run your test suite and measure code coverage. A reasonable target for production ML code is at least 80% coverage.
Pytest Commands
<pre><code class="language-python"># Install test dependencies
# pip install pytest pytest-cov

# Run all tests with coverage report
# pytest tests/ -v --cov=src --cov-report=term-missing

# Run only tests matching a keyword
# pytest tests/ -k "transformer" -v

# Fail if coverage drops below 80%
# pytest tests/ --cov=src --cov-fail-under=80</code></pre>