Ridge Regression (L2 Penalty)

Ridge regression adds the sum of squared coefficients, scaled by a regularisation strength, to the OLS loss. This shrinks all coefficients toward zero, though never exactly to zero, which reduces overfitting.
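The shrinkage behaviour can be checked empirically. The sketch below is an illustration only: the synthetic data from `make_regression` and the grid of alpha values are assumptions chosen for the demo, not part of the original example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic data (an assumption for this demo): 6 features, all informative
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)

alphas = [0.01, 1.0, 100.0, 10000.0]  # illustrative grid, small to large
norms = []
nonzero_counts = []
for alpha in alphas:
    coef = Ridge(alpha=alpha).fit(X, y).coef_
    norms.append(float(np.linalg.norm(coef)))           # overall coefficient size
    nonzero_counts.append(int(np.count_nonzero(coef)))  # coefficients not exactly zero
    print(f"alpha={alpha:>8}: ||coef||={norms[-1]:.3f}, "
          f"non-zero={nonzero_counts[-1]}/{coef.size}")
```

The coefficient norm should fall as alpha grows, while the count of non-zero coefficients stays at the full feature count.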


Ridge Loss Function

Ridge minimises SSR + α Σᵢ βᵢ², where SSR is the sum of squared residuals and α (alpha) ≥ 0 is the regularisation strength; α = 0 recovers OLS. The L2 penalty keeps all features in the model, making Ridge a good fit when many features each contribute a weak signal.
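To make the objective concrete, the following sketch evaluates it directly with NumPy and compares the Ridge and OLS solutions. The helper `ridge_objective`, the synthetic data, and the choice `alpha = 10.0` are hypothetical and exist only for illustration. The intercept is switched off so that the fitted model matches the formula term by term.

```python
import numpy as np
from sklearn.linear_model import Ridge

def ridge_objective(X, y, beta, alpha):
    """Hypothetical helper: SSR + alpha * sum(beta_i^2), no intercept."""
    residuals = y - X @ beta
    return residuals @ residuals + alpha * (beta @ beta)

# Synthetic data (an assumption for this demo)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_beta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_beta + rng.normal(scale=0.5, size=100)

alpha = 10.0
beta_ridge = Ridge(alpha=alpha, fit_intercept=False).fit(X, y).coef_
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Ridge should score lower on its own objective than the OLS solution does
loss_ridge = ridge_objective(X, y, beta_ridge, alpha)
loss_ols = ridge_objective(X, y, beta_ols, alpha)
print(f"Ridge objective at Ridge solution: {loss_ridge:.3f}")
print(f"Ridge objective at OLS solution:   {loss_ols:.3f}")
print(f"||beta_ridge|| = {np.linalg.norm(beta_ridge):.3f}, "
      f"||beta_ols|| = {np.linalg.norm(beta_ols):.3f}")
```

The OLS solution has the smaller SSR, but once the penalty term is added the Ridge solution wins, and its coefficient vector is shorter.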

Fitting Ridge in scikit-learn

<pre><code class="language-python">from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import numpy as np

data = fetch_california_housing()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Standardise features, then let RidgeCV pick the best alpha by cross-validation
ridge_cv = make_pipeline(
    StandardScaler(),
    RidgeCV(alphas=np.logspace(-3, 3, 50)),
)
ridge_cv.fit(X_tr, y_tr)

best_alpha = ridge_cv.named_steps["ridgecv"].alpha_
print(f"Best alpha: {best_alpha:.4f}")
print(f"Test R²: {r2_score(y_te, ridge_cv.predict(X_te)):.3f}")
</code></pre>

When to Use Ridge

Ridge excels when you believe most features are genuinely informative but their signals are noisy or correlated.

Ridge vs. OLS on Collinear Data

When features are highly correlated, XᵀX is close to singular and (XᵀX)⁻¹ becomes numerically unstable. Ridge adds αI to XᵀX before inversion, giving (XᵀX + αI)⁻¹. For any α > 0 this matrix is positive definite and therefore invertible, and its condition number, (λ_max + α) / (λ_min + α), falls as α grows. This is why Ridge is a standard remedy for multicollinearity.
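The effect on conditioning can be seen in a small NumPy sketch. The two nearly identical features, the noise scales, and `alpha = 1.0` are all assumptions made for the demonstration. The intercept is omitted so the closed-form expressions apply directly.

```python
import numpy as np

# Two almost perfectly collinear features (synthetic, for illustration)
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-4, size=n)  # nearly a copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=n)

gram = X.T @ X
alpha = 1.0
ridge_matrix = gram + alpha * np.eye(X.shape[1])

# Condition number before and after adding alpha * I
cond_ols = np.linalg.cond(gram)
cond_ridge = np.linalg.cond(ridge_matrix)
print(f"cond(X^T X)           = {cond_ols:.3e}")
print(f"cond(X^T X + alpha I) = {cond_ridge:.3e}")

# Closed-form solutions, no intercept
beta_ols = np.linalg.solve(gram, X.T @ y)
beta_ridge = np.linalg.solve(ridge_matrix, X.T @ y)
print("OLS coefficients:  ", beta_ols)
print("Ridge coefficients:", beta_ridge)
```

With features this collinear, the OLS coefficients are highly sensitive to noise, whereas Ridge splits the shared signal almost evenly between the two columns.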