Unit Testing Machine Learning Code

Unit testing ML code is more nuanced than testing deterministic software: instead of asserting exact numerical outputs, you test data pipeline correctness, model API contracts, and output constraints.


What to Test in ML Code

Focus on testing the things you control: data transformers, feature engineering functions, model interface compliance, and output shape/type — not the model's exact predictions.
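As a rough illustration of asserting properties instead of exact values, the sketch below tests a small helper for what must always hold (the output sums to 1, keeps its shape, and stays in range) and compares floats with a tolerance. The `normalize` function is a hypothetical example, not part of any library.

```python
import numpy as np


def normalize(weights):
    """Hypothetical helper: scale non-negative weights so they sum to 1."""
    weights = np.asarray(weights, dtype=float)
    return weights / weights.sum()


def test_normalize_sums_to_one():
    result = normalize([0.1, 0.2, 0.3, 0.4, 0.7])
    # Assert a property within a tolerance, not exact floating-point values.
    np.testing.assert_allclose(result.sum(), 1.0, rtol=1e-9)


def test_normalize_preserves_shape_and_range():
    result = normalize([3, 1, 1])
    assert result.shape == (3,)
    assert ((result >= 0) & (result <= 1)).all()
```

Tests written this way tend to survive refactoring and library upgrades, because they pin down the contract and not the last decimal place.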

Testing Transformers and Feature Engineering

<pre><code class="language-python"># tests/test_transformers.py
import numpy as np
import pytest
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import StandardScaler


def test_standard_scaler_zero_mean():
    X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    # Each column should be centered at zero after scaling.
    np.testing.assert_allclose(X_scaled.mean(axis=0), [0, 0], atol=1e-10)


def test_standard_scaler_unit_variance():
    X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    # StandardScaler uses the population standard deviation, as does ndarray.std().
    np.testing.assert_allclose(X_scaled.std(axis=0), [1, 1], atol=1e-10)


def test_scaler_does_not_transform_validation_before_fit():
    scaler = StandardScaler()
    # Calling transform() on an unfitted scaler must fail loudly.
    with pytest.raises(NotFittedError):
        scaler.transform(np.array([[1.0, 2.0]]))
</code></pre>
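The same approach applies to feature engineering code you write yourself. The sketch below assumes a hypothetical `add_log_income` function (the name and the `income` column are illustrative) and checks that it adds the expected column, keeps the row count, leaves its input untouched, and handles a zero value.

```python
import numpy as np
import pandas as pd


def add_log_income(df):
    """Hypothetical feature function: add log1p(income) without mutating the input."""
    out = df.copy()
    out["log_income"] = np.log1p(out["income"])
    return out


def test_add_log_income_adds_column_and_keeps_rows():
    df = pd.DataFrame({"income": [0.0, 50000.0, 80000.0]})
    result = add_log_income(df)
    assert "log_income" in result.columns
    assert len(result) == len(df)


def test_add_log_income_does_not_mutate_input():
    df = pd.DataFrame({"income": [0.0, 50000.0]})
    add_log_income(df)
    # The caller's DataFrame must still have only its original column.
    assert list(df.columns) == ["income"]


def test_add_log_income_handles_zero():
    df = pd.DataFrame({"income": [0.0]})
    result = add_log_income(df)
    # log1p(0) is exactly 0, so an exact comparison is safe here.
    assert result["log_income"].iloc[0] == 0.0
```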

Testing Model Output Shape and Type

<pre><code class="language-python"># tests/test_model.py
import numpy as np
import pytest
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier


@pytest.fixture
def trained_model():
    # Train a small, seeded model so the tests are fast and repeatable.
    X, y = load_iris(return_X_y=True)
    model = RandomForestClassifier(n_estimators=10, random_state=42)
    model.fit(X, y)
    return model, X


def test_predict_returns_correct_shape(trained_model):
    model, X = trained_model
    preds = model.predict(X[:10])
    assert preds.shape == (10,)


def test_predict_proba_sums_to_one(trained_model):
    model, X = trained_model
    probas = model.predict_proba(X[:10])
    # Each row of class probabilities must sum to 1.
    np.testing.assert_allclose(probas.sum(axis=1), np.ones(10), atol=1e-6)


def test_predict_classes_are_valid(trained_model):
    model, X = trained_model
    preds = model.predict(X)
    # Iris has exactly three classes, labeled 0, 1, and 2.
    assert set(preds).issubset({0, 1, 2})
</code></pre>
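Two further output constraints that may be worth checking are reproducibility under a fixed seed and probabilities staying within [0, 1]. The sketch below is one way to express them, using a plain helper instead of a fixture so the functions can also be called directly; `fit_model` is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier


def fit_model(seed):
    """Hypothetical helper: train a small random forest on iris with a fixed seed."""
    X, y = load_iris(return_X_y=True)
    model = RandomForestClassifier(n_estimators=10, random_state=seed)
    model.fit(X, y)
    return model, X


def test_same_seed_gives_same_predictions():
    model_a, X = fit_model(seed=42)
    model_b, _ = fit_model(seed=42)
    # Same data and same seed should yield identical predictions.
    np.testing.assert_array_equal(model_a.predict(X), model_b.predict(X))


def test_probabilities_are_within_unit_interval():
    model, X = fit_model(seed=42)
    probas = model.predict_proba(X[:10])
    # Ten samples, three iris classes.
    assert probas.shape == (10, 3)
    assert ((probas >= 0.0) & (probas <= 1.0)).all()
```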

Testing Pipeline and Data Contracts

Data contracts verify that input data meets expected schemas and constraints — catching upstream data quality issues before they silently corrupt model predictions.

Data Schema Tests

<pre><code class="language-python"># tests/test_data.py
import pandas as pd
import pytest


@pytest.fixture
def sample_df():
    return pd.DataFrame({
        "age": [25, 32, 45],
        "income": [50000.0, 80000.0, 60000.0],
        "label": [0, 1, 0],
    })


def test_required_columns_present(sample_df):
    required = ["age", "income", "label"]
    assert all(col in sample_df.columns for col in required)


def test_no_negative_income(sample_df):
    assert (sample_df["income"] >= 0).all()


def test_label_is_binary(sample_df):
    assert set(sample_df["label"]).issubset({0, 1})
</code></pre>
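In practice, schema checks like these are often bundled into a validation function that runs in the pipeline itself, and the unit tests then confirm that bad data is rejected. The following is a minimal sketch under that assumption; `validate_training_frame`, `REQUIRED_COLUMNS`, and the error messages are hypothetical.

```python
import pandas as pd
import pytest

REQUIRED_COLUMNS = ("age", "income", "label")


def validate_training_frame(df):
    """Hypothetical contract check: raise ValueError on the first violation found."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if df[list(REQUIRED_COLUMNS)].isna().any().any():
        raise ValueError("null values in required columns")
    if (df["income"] < 0).any():
        raise ValueError("negative income")
    if not set(df["label"]).issubset({0, 1}):
        raise ValueError("label must be 0 or 1")
    return df


def test_valid_frame_passes():
    df = pd.DataFrame({"age": [25], "income": [50000.0], "label": [0]})
    assert validate_training_frame(df) is df


def test_missing_column_is_rejected():
    df = pd.DataFrame({"age": [25], "income": [50000.0]})
    with pytest.raises(ValueError, match="missing columns"):
        validate_training_frame(df)


def test_negative_income_is_rejected():
    df = pd.DataFrame({"age": [25], "income": [-1.0], "label": [1]})
    with pytest.raises(ValueError, match="negative income"):
        validate_training_frame(df)
```

Testing the failure paths matters as much as the happy path: a validator that never raises gives no protection against upstream data issues.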

Parametrized Tests for Robustness

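<pre><code class="language-python"># Assumes pytest is imported and the trained_model fixture from
# tests/test_model.py is in scope (same file or conftest.py).
# Test that prediction doesn't crash on various input sizes.
# Iris has 150 rows, so requests above that are capped at the full dataset.
@pytest.mark.parametrize("n_samples", [1, 10, 100, 1000])
def test_predict_various_sizes(trained_model, n_samples):
    model, X = trained_model
    X_sub = X[:min(n_samples, len(X))]
    preds = model.predict(X_sub)
    assert len(preds) == len(X_sub)
</code></pre>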
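Parametrization is equally useful for inputs that should be rejected. As a sketch, the test below feeds the model arrays with the wrong number of features (iris has four) and expects scikit-learn to raise a `ValueError`; `build_model` is a hypothetical helper used here in place of a fixture.

```python
import numpy as np
import pytest
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier


def build_model():
    """Hypothetical helper: train a small seeded random forest on iris."""
    X, y = load_iris(return_X_y=True)
    return RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)


# The model was trained on 4 features, so 1, 3, and 5 are all invalid widths.
@pytest.mark.parametrize("n_features", [1, 3, 5])
def test_predict_rejects_wrong_feature_count(n_features):
    model = build_model()
    bad_input = np.zeros((2, n_features))
    with pytest.raises(ValueError):
        model.predict(bad_input)
```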

Running Tests and Coverage

Use pytest with pytest-cov to run your test suite and measure code coverage, targeting ≥80% for production ML code.

Pytest Commands

<pre><code class="language-python"># Install test dependencies
# pip install pytest pytest-cov

# Run all tests with coverage report
# pytest tests/ -v --cov=src --cov-report=term-missing

# Run only tests matching a keyword
# pytest tests/ -k "transformer" -v

# Fail if coverage drops below 80%
# pytest tests/ --cov=src --cov-fail-under=80
</code></pre>