Lasso Regression (L1 Penalty) and Feature Selection

Lasso regression adds an L1 penalty to the least-squares objective. Unlike an L2 penalty, it can shrink some coefficients to exactly zero, so feature selection happens as part of fitting the model.


The L1 Penalty and Sparsity

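Lasso minimises SSR + α Σ|βᵢ|, where SSR is the sum of squared residuals and α ≥ 0 sets the strength of the penalty. The constraint region defined by the L1 penalty (the L1 ball) has corners on the coordinate axes. The solution often lands on one of these corners or edges, where one or more coefficients are exactly zero, which is how non-essential coefficients get dropped.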
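The geometric picture has a simple algebraic counterpart in one special case. Assuming the predictor columns are orthonormal, the Lasso solution is obtained by soft-thresholding each least-squares coefficient: every coefficient moves toward zero by a fixed amount, and any coefficient smaller than that amount becomes exactly zero. The sketch below illustrates the operator. The names `soft_threshold`, `ols_coefs`, and `threshold` are made up for this example, and the exact relation between the threshold and α depends on how the loss is scaled.

```python
import numpy as np


def soft_threshold(z, threshold):
    """Move each entry of z toward zero by `threshold`; entries with |z| <= threshold become 0."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)


# Hypothetical unpenalised (least-squares) coefficients.
ols_coefs = np.array([3.0, -0.5, 0.2, -2.0])
threshold = 1.0

lasso_coefs = soft_threshold(ols_coefs, threshold)
print(lasso_coefs)
```

The two small coefficients are removed entirely and the two large ones are each pulled one unit toward zero. A ridge (L2) penalty would instead rescale all four coefficients and leave none of them at zero.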

Fitting Lasso in scikit-learn

<pre><code class="language-python">from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data = fetch_california_housing()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Standardise first: the L1 penalty treats all coefficients alike,
# so the features need to be on a comparable scale.
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
lasso.fit(X_tr, y_tr)

# LassoCV chooses alpha by 5-fold cross-validation on the training data.
lasso_model = lasso.named_steps["lassocv"]
print(f"Best alpha: {lasso_model.alpha_:.4f}")

# Keep only the features whose coefficient was not shrunk to zero.
selected = [(n, c) for n, c in zip(data.feature_names, lasso_model.coef_) if c != 0]
print("Non-zero features:", selected)</code></pre>
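LassoCV chooses a single value of alpha, which hides how strongly alpha controls sparsity. The following sketch sweeps alpha by hand. It uses synthetic data from `make_regression` rather than the housing data (an assumption made so the example runs without a download), and the fractions of `alpha_max` are arbitrary illustrative choices. `alpha_max` is the smallest alpha at which scikit-learn's Lasso objective, which scales the squared error by 1/(2n), sets every coefficient to zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative assumption): 20 features, only 5 informative.
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)
X = StandardScaler().fit_transform(X)

# Smallest alpha for which all coefficients are zero under scikit-learn's scaling.
alpha_max = np.max(np.abs(X.T @ (y - y.mean()))) / len(y)

nonzero_counts = {}
for frac in (0.001, 0.1, 0.5, 1.01):
    model = Lasso(alpha=frac * alpha_max, max_iter=10_000).fit(X, y)
    nonzero_counts[frac] = int(np.count_nonzero(model.coef_))
    print(f"alpha = {frac:>5} * alpha_max -> {nonzero_counts[frac]} non-zero coefficients")
```

Just above `alpha_max` the model is empty, and features enter as alpha decreases. Cross-validation, as in LassoCV, is one way to choose a point along this range.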

Limitations of Lasso

Lasso tends to arbitrarily select one feature from a group of correlated features and discard the rest, which can be misleading when correlated features are jointly important.
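A deliberately extreme case makes the arbitrariness visible. In the sketch below, two columns are exact copies of each other, so they carry identical information, and the only thing that changes between the two fits is the column order. The feature names and data are hypothetical. Which copy is kept reflects scikit-learn's default cyclic coordinate descent visiting columns in order; treat that as an observation about this solver, not a guarantee.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 300
temp_c = rng.normal(size=n)      # hypothetical feature
temp_copy = temp_c.copy()        # exact duplicate: the extreme case of correlation
other = rng.normal(size=n)       # independent feature
y = 3.0 * temp_c + 2.0 * other + 0.1 * rng.normal(size=n)

columns = {"temp_c": temp_c, "temp_copy": temp_copy, "other": other}


def selected_features(order):
    """Fit Lasso with columns in the given order; return names with non-negligible weight."""
    X = np.column_stack([columns[name] for name in order])
    coef = Lasso(alpha=0.1).fit(X, y).coef_
    return [name for name, c in zip(order, coef) if abs(c) > 1e-6]


first = selected_features(["temp_c", "temp_copy", "other"])
second = selected_features(["temp_copy", "temp_c", "other"])
print(first)
print(second)
```

The two fits predict equally well but name different features as selected, so a reader who interprets the dropped copy as unimportant would be misled.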

Lasso vs. Ridge: When to Choose Which

Use Lasso when you believe only a few features truly matter and want an interpretable, sparse model. Use Ridge when many features contribute and correlated features should be kept together. Use Elastic Net when you want sparsity but also want correlated features to be selected as a group.
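The three behaviours can be compared side by side on a small synthetic problem. In this sketch, columns 0 and 1 are identical copies of the one feature that drives the target, and the remaining five columns are pure noise. The data and the penalty strengths are illustrative assumptions, not tuned values.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

rng = np.random.default_rng(0)
n = 300
signal = rng.normal(size=n)
noise_features = rng.normal(size=(n, 5))                  # unrelated to y
X = np.column_stack([signal, signal, noise_features])     # columns 0 and 1 are identical
y = 4.0 * signal + 0.5 * rng.normal(size=n)

models = {
    "lasso": Lasso(alpha=0.2),
    "ridge": Ridge(alpha=1.0),
    "elastic_net": ElasticNet(alpha=0.2, l1_ratio=0.5),
}
coefs = {name: model.fit(X, y).coef_ for name, model in models.items()}

for name, c in coefs.items():
    print(f"{name:12s}", np.round(c, 3))
```

In this setup Lasso keeps one copy of the signal and nothing else, Ridge keeps every column and splits the signal's weight evenly across the two copies, and Elastic Net gives the two copies roughly equal weight while still shrinking hard on the noise columns. Real data with partial correlation will show the same tendencies less cleanly.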