Overfitting vs. Underfitting in Regression

Overfitting and underfitting are the two most common failure modes in supervised learning, and diagnosing them correctly is the first step toward fixing them.

Recognising Each Problem

The clearest diagnostic signal is the gap between training and validation error: a large gap, with training error low and validation error high, signals overfitting, while high error on both signals underfitting.
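The rule above can be sketched as a small helper. This is only an illustration: `diagnose_fit`, `acceptable_error`, and `max_gap` are hypothetical names, and the thresholds are assumptions that depend entirely on the task and the error metric.

```python
def diagnose_fit(train_error, val_error, acceptable_error, max_gap):
    """Rule-of-thumb diagnosis from a training and a validation error.

    acceptable_error and max_gap are hypothetical, task-specific thresholds.
    """
    if train_error > acceptable_error:
        # The model cannot fit even the data it was trained on.
        return "underfitting"
    if val_error - train_error > max_gap:
        # Training error is fine, but it does not carry over to unseen data.
        return "overfitting"
    return "no obvious problem"


print(diagnose_fit(0.90, 0.95, acceptable_error=0.30, max_gap=0.10))  # underfitting
print(diagnose_fit(0.05, 0.60, acceptable_error=0.30, max_gap=0.10))  # overfitting
print(diagnose_fit(0.20, 0.25, acceptable_error=0.30, max_gap=0.10))  # no obvious problem
```

Checking training error first is deliberate: a model that underfits can still show a small gap, so the gap alone is not enough to rule out a problem.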

Underfitting (High Bias)

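An underfit model has high training error and high validation error, and both curves plateau at a suboptimal level because the model is too simple to capture the structure in the data. The fix is to increase model complexity: add more informative features, raise the polynomial degree, switch to a more expressive model class, or weaken any regularisation already in place.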
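As a minimal sketch of fixing underfitting by raising the polynomial degree, the example below fits a straight line and a quadratic to synthetic data. The data-generating function, noise level, and split are assumptions made up for illustration, and `mse_for_degree` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): a quadratic trend plus noise.
x = rng.uniform(-3, 3, size=200)
y = 0.5 * x**2 + x + rng.normal(scale=0.3, size=200)

# Simple hold-out split: first 150 points for training, the rest for validation.
x_train, x_val = x[:150], x[150:]
y_train, y_val = y[:150], y[150:]


def mse_for_degree(degree):
    """Fit a polynomial of the given degree; return (train MSE, validation MSE)."""
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_mse = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    return train_mse, val_mse


linear_train, linear_val = mse_for_degree(1)  # too simple for curved data
quad_train, quad_val = mse_for_degree(2)      # matches the assumed trend

print(f"degree 1: train MSE={linear_train:.3f}, val MSE={linear_val:.3f}")
print(f"degree 2: train MSE={quad_train:.3f}, val MSE={quad_val:.3f}")
```

With the straight line, both errors should sit well above the noise level, which is the underfitting signature. Moving to degree 2 should bring both down together.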

Overfitting (High Variance)

An overfit model has low training error but high validation error. The model has fit the noise and quirks of the training set rather than the underlying pattern. The fix is to gather more data, reduce complexity, apply regularisation, or use dropout or early stopping in neural networks.
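One of these fixes, regularisation, can be sketched by fitting the same high-degree polynomial with and without an L2 penalty. The synthetic data, the degree of 12, and `alpha=1.0` are all assumptions chosen for illustration, and `fit_and_score` is a hypothetical helper. In practice the penalty strength would be tuned by cross-validation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)


def make_data(n):
    """Synthetic data (assumed): a sine trend plus Gaussian noise."""
    x = rng.uniform(-1, 1, size=(n, 1))
    y = np.sin(3 * x[:, 0]) + rng.normal(scale=0.3, size=n)
    return x, y


X_train, y_train = make_data(25)   # deliberately small training set
X_val, y_val = make_data(200)


def fit_and_score(regressor):
    """Fit a degree-12 polynomial model; return (train MSE, validation MSE)."""
    model = make_pipeline(
        PolynomialFeatures(degree=12, include_bias=False),
        StandardScaler(),
        regressor,
    )
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    return train_mse, val_mse


plain_train, plain_val = fit_and_score(LinearRegression())
ridge_train, ridge_val = fit_and_score(Ridge(alpha=1.0))

print(f"no penalty: train MSE={plain_train:.3f}, val MSE={plain_val:.3f}")
print(f"ridge:      train MSE={ridge_train:.3f}, val MSE={ridge_val:.3f}")
```

The unpenalised fit will have the lower training error, since the penalty can only restrict the fit. What matters is the validation column, where the penalised model typically does better in a setup like this.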

Learning Curves as a Diagnostic Tool

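A learning curve plots training and validation error as a function of training set size, revealing which regime you are in. When the model underfits, the two curves converge quickly to a similar, high error, and adding data does not help. When it overfits, training error stays well below validation error, and the gap tends to narrow as the training set grows.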

Generating Learning Curves

<pre><code class="language-python">from sklearn.model_selection import learning_curve
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing
import numpy as np

data = fetch_california_housing()

# Refit the model on 10 increasing training-set sizes, each scored with 5-fold CV.
train_sizes, train_scores, val_scores = learning_curve(
    LinearRegression(),
    data.data,
    data.target,
    train_sizes=np.linspace(0.1, 1.0, 10),
    scoring="neg_mean_squared_error",
    cv=5,
)

# Scores are negated MSE, so flip the sign before reporting.
print("Train sizes:", train_sizes)
print("Mean val MSE:", (-val_scores.mean(axis=1)).round(3))</code></pre>
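To read the arrays that `learning_curve` returns, it can help to put the mean training error, mean validation error, and their gap side by side for each training set size. The sketch below does this on synthetic data from `make_regression`, so it runs without downloading anything. The dataset parameters are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Synthetic regression problem (assumed parameters, for illustration only).
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    LinearRegression(),
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    scoring="neg_mean_squared_error",
    cv=5,
)

# Each row is one training set size, each column one CV fold.
# Average over folds and flip the sign to recover MSE.
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)
gap = val_mse - train_mse

for n, tr, va, g in zip(train_sizes, train_mse, val_mse, gap):
    print(f"n={int(n):4d}  train MSE={tr:9.2f}  val MSE={va:9.2f}  gap={g:9.2f}")
```

A gap that shrinks toward zero while both errors stay high points to underfitting. A gap that stays wide at the largest training size points to overfitting.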