The F1-Score: Balancing Metrics

The F1-score is the harmonic mean of precision and recall, providing a single metric that rewards models that are both precise and sensitive.

Computing the F1-Score

F1 = 2 × (Precision × Recall) / (Precision + Recall). The harmonic mean is used instead of the arithmetic mean because it is dominated by the lower of the two values: a model with near-zero recall cannot hide behind high precision.
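
As a rough illustration of why the harmonic mean is used, the sketch below compares the two means for a model with high precision and near-zero recall. The helper names (`arithmetic_mean`, `f1_from_pr`) and the example values are assumptions chosen for illustration; they are not part of any library.

```python
def arithmetic_mean(precision: float, recall: float) -> float:
    """Plain average of precision and recall."""
    return (precision + recall) / 2


def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1-score).

    Returns 0.0 when both inputs are zero to avoid dividing by zero.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical model: very precise, but it misses almost every positive
precision, recall = 0.95, 0.05

print(f"Arithmetic mean: {arithmetic_mean(precision, recall):.3f}")
print(f"F1 (harmonic):   {f1_from_pr(precision, recall):.3f}")
```

With these values the arithmetic mean is 0.5, while F1 is roughly 0.095, much closer to the weaker of the two inputs.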

F1 in scikit-learn

<pre><code class="language-python">from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Binary dataset; label 1 (benign) is the positive class by default
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# max_iter is raised so the solver has room to converge on unscaled features
clf = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

print(f"F1 (binary):   {f1_score(y_te, y_pred):.4f}")
print(f"F1 (macro):    {f1_score(y_te, y_pred, average='macro'):.4f}")
print(f"F1 (weighted): {f1_score(y_te, y_pred, average='weighted'):.4f}")</code></pre>

F-Beta Score: Tilting the Balance

The F1-score weights precision and recall equally. The Fβ score adjusts this: Fβ = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall).
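
Three special cases, worked out directly from the formula above, may help build intuition. Here P and R are shorthand for Precision and Recall, both assumed to be nonzero:

```latex
\begin{align*}
\beta = 1:\quad & F_1 = \frac{(1 + 1)\,P R}{P + R} = \frac{2 P R}{P + R} \\
\beta \to 0:\quad & F_\beta \to \frac{P R}{R} = P \\
\beta \to \infty:\quad & F_\beta \approx \frac{\beta^2 P R}{\beta^2 P} = R
\end{align*}
```

In other words, β slides the score between pure precision (small β) and pure recall (large β), with F1 in the middle.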

Choosing Beta

β = 0.5 weights precision twice as heavily as recall (use when false positives are more costly). β = 2 weights recall twice as heavily as precision (use when false negatives are more costly, e.g., disease detection). β = 1 recovers the standard F1.
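
To see how the choice of β shifts the score, the following minimal sketch evaluates the Fβ formula for a single precision/recall pair. The function name `fbeta_from_pr` and the example values are hypothetical and used only for illustration.

```python
def fbeta_from_pr(precision: float, recall: float, beta: float) -> float:
    """F-beta computed directly from precision and recall."""
    if beta <= 0:
        raise ValueError("beta must be positive")
    denom = beta**2 * precision + recall
    if denom == 0:
        return 0.0  # both precision and recall are zero
    return (1 + beta**2) * precision * recall / denom


# Hypothetical model: precision is high, recall is low
precision, recall = 0.8, 0.4

for beta in (0.5, 1.0, 2.0):
    score = fbeta_from_pr(precision, recall, beta)
    print(f"beta={beta}: F-beta = {score:.4f}")
```

Because precision is the stronger metric here, F0.5 comes out highest (about 0.667), F1 sits in between (about 0.533), and F2 is lowest (about 0.444), since it leans on the weak recall.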

<pre><code class="language-python">from sklearn.metrics import fbeta_score

# F2: penalises missed positives more
# (reuses y_te and y_pred from the scikit-learn example above)
print(f"F2 score: {fbeta_score(y_te, y_pred, beta=2):.4f}")</code></pre>

Macro vs. Weighted Averaging

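For multi-class problems, macro averaging computes F1 per class and takes the unweighted mean, treating all classes equally regardless of size. Weighted averaging weights each class's F1 by its support (the number of true instances of that class), giving larger classes more influence. Macro is preferred when every class should count equally, including rare ones; weighted better reflects overall performance when class sizes differ and frequent classes matter more.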
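
The difference between the two averages can be sketched in plain Python on a small imbalanced example. The helpers below (`per_class_f1`, `macro_f1`, `weighted_f1`) and the toy labels are assumptions made for illustration; they follow the definitions in the paragraph above and are not scikit-learn's implementation.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, treating each class in turn as the positive class."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # F1 = 2TP / (2TP + FP + FN), equivalent to the harmonic-mean form
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of the per-class F1 scores, weighted by class support."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in scores) / len(y_true)


# Toy labels: class 0 is frequent (8 samples), classes 1 and 2 are rare
y_true = [0] * 8 + [1] * 2 + [2] * 2
y_pred = [0] * 8 + [0, 1] + [2, 2]  # one class-1 sample is missed

print(f"Per-class F1: {per_class_f1(y_true, y_pred)}")
print(f"Macro F1:     {macro_f1(y_true, y_pred):.4f}")
print(f"Weighted F1:  {weighted_f1(y_true, y_pred):.4f}")
```

Here the weak score on rare class 1 pulls the macro average down to about 0.869, while the weighted average stays near 0.905 because the well-predicted majority class dominates.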