Saving Models with Pickle and Joblib

Once a model is trained, serializing it to disk lets you reload it for inference without retraining. The two standard tools in Python are pickle (part of the standard library) and joblib (a third-party package that handles large NumPy arrays more efficiently).


Saving and Loading with Pickle

pickle serializes most Python objects to a byte stream (exceptions include open file handles and lambdas). It ships with the standard library, so it is always available, but it can be slow for models with large internal NumPy arrays.

Pickle Workflow

<pre><code class="language-python">import pickle

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Save: open the file in binary write mode
with open("rf_model.pkl", "wb") as f:
    pickle.dump(model, f)

# Load: open the file in binary read mode
with open("rf_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)

print(loaded_model.predict(X[:3]))</code></pre>

Saving and Loading with Joblib

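joblib is built on top of pickle but is optimized for objects that contain large NumPy arrays: it can compress the output, and it can memory-map uncompressed files on load. It is a common choice for scikit-learn estimators and is covered in the scikit-learn model persistence documentation.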

Joblib Workflow with Compression

<pre><code class="language-python">import os

import joblib

# `model` and `X` come from the pickle example above.
# Save (compress=3 is a good balance of speed vs. file size)
joblib.dump(model, "rf_model.joblib", compress=3)

# Load
loaded_model = joblib.load("rf_model.joblib")
print(loaded_model.predict(X[:3]))

# Check file size
print(f"File size: {os.path.getsize('rf_model.joblib') / 1024:.1f} KB")</code></pre>
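The trade-off between compression and memory mapping can be sketched as follows. This is an illustration only, assuming numpy and joblib are installed; the zero-filled `weights` array and the file names are hypothetical stand-ins for a real model, and real models will usually compress far less than an array of zeros.

```python
import os
import tempfile

import joblib
import numpy as np

# Hypothetical stand-in for a model's internal arrays. Zeros are highly
# compressible, so the size difference here is exaggerated.
weights = np.zeros((1000, 100))

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "weights_raw.joblib")
    packed_path = os.path.join(tmp, "weights_packed.joblib")

    joblib.dump(weights, raw_path)                 # no compression
    joblib.dump(weights, packed_path, compress=3)  # compressed

    raw_size = os.path.getsize(raw_path)
    packed_size = os.path.getsize(packed_path)

    # Memory mapping is only available for uncompressed files.
    mapped = joblib.load(raw_path, mmap_mode="r")
    first_row_sum = float(mapped[0].sum())
    del mapped  # release the mapping before the directory is removed

    # A compressed file is read fully into memory on load.
    restored = joblib.load(packed_path)
    round_trip_ok = bool(np.array_equal(restored, weights))

print(f"uncompressed: {raw_size} bytes, compressed: {packed_size} bytes")
print(f"round trip ok: {round_trip_ok}")
```

Compression saves disk space and transfer time at the cost of slower dumps and loads, while an uncompressed file can be memory-mapped so that array data is read lazily from disk.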

Versioning Saved Models

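<pre><code class="language-python">import datetime

import joblib

# `model` is the fitted estimator from the earlier examples.
# The timestamp makes every saved file name unique and sortable.
timestamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"rf_model_v1_{timestamp}.joblib"
joblib.dump(model, filename)
print(f"Model saved as: {filename}")</code></pre>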
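Because the `%Y%m%d_%H%M%S` timestamp sorts lexicographically in chronological order, the most recent file can be found by sorting file names. The helper below is a minimal sketch using only the standard library; `latest_model_path` is a hypothetical name, and the demonstration creates empty placeholder files rather than real models.

```python
import tempfile
from pathlib import Path


def latest_model_path(directory, prefix="rf_model_v1_", suffix=".joblib"):
    """Return the matching file with the latest timestamp in its name, or None."""
    candidates = sorted(Path(directory).glob(f"{prefix}*{suffix}"))
    return candidates[-1] if candidates else None


# Demonstration with empty placeholder files (no real models are written).
with tempfile.TemporaryDirectory() as tmp:
    nothing_yet = latest_model_path(tmp)  # empty directory, so None

    for stamp in ("20240101_090000", "20240315_120000", "20231231_235959"):
        (Path(tmp) / f"rf_model_v1_{stamp}.joblib").touch()

    newest = latest_model_path(tmp)
    newest_name = newest.name if newest else None

print(newest_name)
```

This relies on the file name alone, so it only works when every file follows the same naming scheme.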

Security and Compatibility Considerations

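Unpickling can execute arbitrary code, so never load a pickle file from an untrusted source. The same applies to joblib files, which use pickle internally. Additionally, a saved model is not guaranteed to load correctly, or to behave identically, under a different scikit-learn or Python version.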
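To make the code-execution risk concrete, the sketch below uses only the standard library. `Payload` is a hypothetical class for illustration, and the expression it evaluates is deliberately harmless; a malicious file could name any importable callable instead.

```python
import pickle


class Payload:
    """Hypothetical stand-in for a malicious object, for illustration only."""

    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call eval('6 * 7')".
        # The callable and its arguments are chosen by whoever wrote the file.
        return (eval, ("6 * 7",))


data = pickle.dumps(Payload())

# Loading calls eval; the result is not a Payload instance at all.
result = pickle.loads(data)
print(type(result).__name__, result)
```

The loader has no say in which callable runs, which is why the only safe policy is to load files whose origin you trust.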

Best Practices

  • Always record the scikit-learn version alongside the saved file (sklearn.__version__)
  • Use environment lock files (requirements.txt or conda env export) to reproduce the exact environment
  • For production or cross-language serving, consider ONNX or PMML export instead of pickle
  • Store model files in versioned storage (S3, GCS, DVC) rather than local disk
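The first two practices can be partly automated by writing a small metadata file next to each saved model. The sketch below uses only the standard library; `write_model_metadata`, `installed_version`, the `.meta.json` suffix, and the recorded fields (including the SHA-256 checksum) are assumptions chosen for illustration, not a scikit-learn or joblib convention.

```python
import hashlib
import json
import platform
import tempfile
from datetime import datetime, timezone
from importlib import metadata
from pathlib import Path


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def write_model_metadata(model_path, packages=("scikit-learn", "joblib", "numpy")):
    """Write <model file name>.meta.json next to the model and return its path."""
    model_path = Path(model_path)
    info = {
        "model_file": model_path.name,
        "sha256": hashlib.sha256(model_path.read_bytes()).hexdigest(),
        "saved_at_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "packages": {name: installed_version(name) for name in packages},
    }
    meta_path = model_path.with_name(model_path.name + ".meta.json")
    meta_path.write_text(json.dumps(info, indent=2))
    return meta_path


# Demonstration on placeholder bytes standing in for a saved model file.
with tempfile.TemporaryDirectory() as tmp:
    fake_model = Path(tmp) / "rf_model.joblib"
    fake_model.write_bytes(b"placeholder model bytes")
    meta_file = write_model_metadata(fake_model)
    recorded = json.loads(meta_file.read_text())

print(recorded["python_version"], recorded["packages"])
```

At load time, comparing the recorded versions with the running environment gives an early warning before a version mismatch turns into a silent change in predictions.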