Setting up a Supervised Learning Problem

Supervised learning begins long before any model is trained — it starts with carefully defining what you are trying to predict and what data you have to predict it from.

Inputs, Outputs, and Labels

Every supervised learning problem has two sides: a set of features (inputs, X) and a target (output, y). This definition drives the choices that follow: which data you collect, which models are applicable, and which metrics make sense.

Features and Target Variables

Features are the measurable properties used to make a prediction. The target is the quantity you want to predict: a continuous number for regression or a class label for classification. Choosing relevant features and a well-defined target is among the most consequential design decisions in supervised learning, because no model can recover a signal that the features do not contain.
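As a small illustration of the distinction, the sketch below builds one feature matrix and two possible targets for it. The housing columns, the prices, and the 200.0 threshold are all made up for the example; they are assumptions, not data from any real source.

```python
import numpy as np

# Hypothetical toy data: each row is one house, columns are [area_m2, n_rooms].
X = np.array([
    [50.0, 2],
    [80.0, 3],
    [120.0, 4],
    [65.0, 2],
])

# Regression target: a continuous quantity (price, in made-up units).
y_regression = np.array([150.0, 230.0, 340.0, 180.0])

# Classification target: a discrete label, derived here from an arbitrary threshold.
y_classification = (y_regression > 200.0).astype(int)

print(X.shape)             # (4, 2): n_samples, n_features
print(y_regression.shape)  # (4,)
print(y_classification)    # [0 1 1 0]
```

The same X can serve either task; only the target changes, and with it the kind of model and metric that apply.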

Splitting Data into Train and Test Sets

To evaluate how well a model generalises, you must hold out data it never sees during training. scikit-learn makes this straightforward:

<pre><code class="language-python">from sklearn.model_selection import train_test_split
import numpy as np

X = np.random.rand(200, 4)  # 200 samples, 4 features
y = np.random.rand(200)     # continuous target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)  # (160, 4) (40, 4)
</code></pre>

Common Pitfalls When Framing the Problem

Poorly framed problems lead to models that learn the wrong thing, even if training accuracy looks high.

Data Leakage

Data leakage occurs when information from the future or from the test set bleeds into training, causing unrealistically optimistic evaluation scores. A classic example is scaling your entire dataset before the train/test split: the scaler's statistics are then computed partly from test rows, so information about the test set reaches the model through the scaled training features. Always fit transformers on the training set only, then apply them unchanged to the test set.
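One way to keep the scaler away from the test rows is to split first and wrap the preprocessing and the model in a scikit-learn pipeline. The following is a minimal sketch on synthetic data; the random arrays and the choice of StandardScaler with LinearRegression are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 4))
y = rng.normal(size=200)

# Split first, so the test rows are set aside before any statistics are computed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fit() fits the scaler on X_train only; predict() reuses those
# training statistics to transform X_test before the model sees it.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# The fitted scaler's mean matches the training rows alone.
scaler = model.named_steps["standardscaler"]
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))  # True
```

Because the scaler lives inside the pipeline, the same object can also be passed to cross-validation utilities, which then refit it on each training fold.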

Label Quality

Garbage labels produce garbage models. Before training, audit your target variable for noise, class imbalance, and ambiguous labelling criteria. Even a 5% label error rate can significantly degrade model performance on rare classes, because the mislabelled examples assigned to a rare class can rival its genuine examples in number.
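A quick audit can start with class counts, followed by a back-of-the-envelope estimate of what label noise does to the rare class. The sketch below uses a hypothetical label vector and assumes that errors flip labels uniformly at random, which real labelling noise may not do.

```python
import numpy as np

# Hypothetical label vector: 95 negatives and 5 positives.
y = np.array([0] * 95 + [1] * 5)

# Class balance: how many samples carry each label.
classes, counts = np.unique(y, return_counts=True)
class_share = dict(zip(classes.tolist(), (counts / y.size).tolist()))
print(class_share)  # {0: 0.95, 1: 0.05}

# Assumption: 5% of labels are flipped, uniformly at random.
error_rate = 0.05
n_neg, n_pos = counts.tolist()
wrongly_positive = n_neg * error_rate      # true negatives labelled positive
still_positive = n_pos * (1 - error_rate)  # true positives that keep their label

# Expected share of examples labelled "positive" that are actually negative.
noisy_positive_share = wrongly_positive / (wrongly_positive + still_positive)
print(round(noisy_positive_share, 2))  # 0.5
```

Under these assumptions, roughly half of the examples labelled positive would be wrong, even though the overall error rate is only 5%.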

A Minimal scikit-learn Workflow

scikit-learn provides a consistent fit / predict API that applies to nearly every supervised model.

The Estimator API

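<pre><code class="language-python">from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# X_train, X_test, y_train, y_test come from the train/test split shown earlier.
model = LinearRegression()
model.fit(X_train, y_train)      # learn coefficients from the training data
y_pred = model.predict(X_test)   # predict on the held-out data

print("MSE:", mean_squared_error(y_test, y_pred))
</code></pre>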
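The same fit / predict pattern carries over to classification. The sketch below swaps in LogisticRegression and accuracy_score on synthetic data; the rule that generates the labels is an assumption chosen only so that the example has something to learn.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary problem: the label depends on the first two features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# stratify=y keeps the class proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = LogisticRegression()
clf.fit(X_train, y_train)     # same call as for the regressor
y_pred = clf.predict(X_test)  # returns class labels instead of numbers

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
```

Only the estimator and the metric change; the surrounding code is identical to the regression case.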