Polynomial Regression for Non-Linear Data

Real-world relationships are rarely straight lines. Polynomial regression bends the line into a curve by adding powers of the features while still using the familiar ordinary least squares (OLS) machinery.


How Polynomial Features Work

Rather than switching to a non-linear model, polynomial regression expands each original feature x into [x, x², x³, ...] and then fits a standard linear model on the expanded features. The fitted curve is non-linear in x, but the estimator is still linear in its parameters, which is why ordinary least squares still applies.

Creating Polynomial Features

<pre><code class="language-python">import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (80, 1))
y = 0.5 * X.ravel()**2 - X.ravel() + 2 + rng.normal(0, 0.5, 80)

# Degree-2 polynomial regression
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)
print("Polynomial model fitted.")</code></pre>

Choosing the Degree

The polynomial degree is a hyperparameter. A degree that is too low underfits (misses the curve), while one that is too high overfits (chases noise). Use cross-validation to select the best degree.
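One way to run that selection is sketched below. It regenerates the same synthetic quadratic data used in the example above and scores each candidate degree with 5-fold cross-validation. The candidate range of 1 to 8, the fold count, and mean squared error as the metric are assumptions made for illustration, not recommendations.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same synthetic quadratic data as in the fitting example.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (80, 1))
y = 0.5 * X.ravel() ** 2 - X.ravel() + 2 + rng.normal(0, 0.5, 80)

# Score each candidate degree with 5-fold cross-validation.
mean_cv_mse = {}
for degree in range(1, 9):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    # scikit-learn reports negated MSE so that larger is better; flip the sign.
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    mean_cv_mse[degree] = -scores.mean()

# Pick the degree with the lowest average validation error.
best_degree = min(mean_cv_mse, key=mean_cv_mse.get)

for degree, mse in mean_cv_mse.items():
    print(f"degree={degree}: mean CV MSE = {mse:.3f}")
print("Selected degree:", best_degree)
```

On data like this, the error typically drops sharply from degree 1 to degree 2 and then flattens or creeps up. When several degrees score almost the same, preferring the lowest of them is a common and defensible tie-break.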

Trade-offs and Risks

Higher-degree polynomials can fit the training data almost perfectly yet extrapolate catastrophically: outside the training range the highest powers dominate, so predictions can grow rapidly in magnitude.
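The extrapolation risk can be illustrated with a small experiment. The sketch below fits a degree-2 and a degree-10 model to the same synthetic quadratic data and then asks both for a prediction at x = 6, well outside the training range of [-3, 3]. The query point and the two degrees are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Training data drawn only from [-3, 3].
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, (80, 1))
y = 0.5 * X.ravel() ** 2 - X.ravel() + 2 + rng.normal(0, 0.5, 80)

def true_function(x):
    """Noise-free curve used to generate the data."""
    return 0.5 * x**2 - x + 2

# Query point far outside the training range (illustrative choice).
x_outside = np.array([[6.0]])
target = true_function(6.0)

errors = {}
for degree in (2, 10):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X, y)
    prediction = model.predict(x_outside)[0]
    errors[degree] = abs(prediction - target)
    print(f"degree={degree}: prediction at x=6 is {prediction:.2f} (true value {target:.2f})")

error_low, error_high = errors[2], errors[10]
```

Both models look reasonable inside the training range. The difference only shows up once the query leaves it, which is why held-out error measured on in-range data says little about extrapolation.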

Overfitting and Runge's Phenomenon

High-degree polynomials can oscillate wildly between data points, most visibly near the edges of the interval when the points are evenly spaced. This behavior is known as Runge's phenomenon. For highly non-linear data, consider splines, tree-based models, or neural networks rather than very high-degree polynomials.
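The oscillation can be reproduced with the standard textbook example, the function 1 / (1 + 25x²) on [-1, 1], which is brought in here purely for illustration. The sketch below passes a polynomial of degree n exactly through n + 1 evenly spaced samples and measures the worst-case gap to the true function on a dense grid. This is exact interpolation rather than noisy regression, which is the setting where the phenomenon is classically described. The degrees 4, 10, and 16 are arbitrary.

```python
import numpy as np
from numpy.polynomial import Polynomial

def runge(x):
    """Standard test function for Runge's phenomenon."""
    return 1.0 / (1.0 + 25.0 * x**2)

# Dense grid for measuring the error between the sample points.
x_dense = np.linspace(-1, 1, 2001)

max_error = {}
for degree in (4, 10, 16):
    # degree + 1 evenly spaced nodes, so the fit passes through every sample.
    x_nodes = np.linspace(-1, 1, degree + 1)
    interpolant = Polynomial.fit(x_nodes, runge(x_nodes), deg=degree)
    # Worst-case deviation from the true function across the interval.
    max_error[degree] = np.max(np.abs(interpolant(x_dense) - runge(x_dense)))
    print(f"degree={degree}: max error between nodes = {max_error[degree]:.2f}")
```

The fit is exact at every node, yet the error between nodes grows as the degree increases, with the largest swings near x = -1 and x = 1. Adding more evenly spaced points and raising the degree makes the problem worse, not better.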