The F1-Score: Balancing Metrics
The F1-score is the harmonic mean of precision and recall, providing a single metric that rewards models that are both precise and sensitive.
Computing the F1-Score
F1 = 2 × (Precision × Recall) / (Precision + Recall). The harmonic mean is used instead of the arithmetic mean because it is dominated by the lower of the two values, so a model with near-zero recall cannot hide behind high precision.
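To make the contrast concrete, here is a minimal sketch that compares the two means for a single model. The helper name `f1_from_pr` and the precision and recall values are made up for illustration.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical model: almost perfect precision, almost no recall.
precision, recall = 0.99, 0.01

arithmetic = (precision + recall) / 2
f1 = f1_from_pr(precision, recall)

print(f"arithmetic mean: {arithmetic:.3f}")
print(f"F1 (harmonic):   {f1:.3f}")
```

With these numbers the arithmetic mean is 0.5, while F1 is roughly 0.02, which is much closer to the weaker of the two values.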
F1 in scikit-learn
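A minimal sketch of computing F1 with scikit-learn's `f1_score`. The label lists below are hypothetical values made up for illustration; in practice `y_te` would hold the held-out true labels and `y_pred` the model's predictions.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical binary labels (1 = positive class), made up for illustration.
y_te = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]    # true labels
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # model predictions

precision = precision_score(y_te, y_pred)
recall = recall_score(y_te, y_pred)
f1 = f1_score(y_te, y_pred)

print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 score:  {f1:.4f}")
```

Here there are 2 true positives, 1 false positive, and 2 false negatives, so precision is 2/3, recall is 1/2, and F1 is 4/7.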
F-Beta Score: Tilting the Balance
The F1-score weights precision and recall equally. The Fβ score adjusts this: Fβ = (1 + β²) × Precision × Recall / (β² × Precision + Recall).
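As a sanity check on the formula, the sketch below evaluates it by hand and compares the result with scikit-learn's `fbeta_score`. The helper `fbeta_from_pr` and the label lists are hypothetical, chosen only for illustration.

```python
from sklearn.metrics import fbeta_score, precision_score, recall_score


def fbeta_from_pr(precision: float, recall: float, beta: float) -> float:
    """F-beta computed directly from the formula; 0.0 if the denominator is zero."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom


# Hypothetical binary labels (1 = positive class), made up for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

p = precision_score(y_true, y_pred)  # 2/3
r = recall_score(y_true, y_pred)     # 1/2

scores = {}
for beta in (0.5, 1.0, 2.0):
    manual = fbeta_from_pr(p, r, beta)
    library = fbeta_score(y_true, y_pred, beta=beta)
    scores[beta] = (manual, library)
    print(f"beta={beta}: manual={manual:.4f}  sklearn={library:.4f}")
```

Because precision is higher than recall in this example, the score falls as β grows: a larger β shifts the emphasis toward the weaker recall.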
Choosing Beta
β = 0.5 treats precision as twice as important as recall (use when false positives are more costly). β = 2 treats recall as twice as important as precision (use when false negatives are more costly, e.g., disease detection). β = 1 is the standard F1.
<pre><code class="language-python">from sklearn.metrics import fbeta_score

# F2: penalises missed positives more
print(f"F2 score: {fbeta_score(y_te, y_pred, beta=2):.4f}")</code></pre>
Macro vs. Weighted Averaging
For multi-class problems, macro averaging computes F1 per class and takes the unweighted mean, treating all classes equally regardless of size. Weighted averaging weights each class's F1 by its support (the number of true instances of that class), giving larger classes more influence. Use macro when every class matters equally, including rare ones; use weighted when the score should reflect the class distribution in the data.
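The difference can be sketched with `f1_score` and its `average` parameter. The three-class labels below are hypothetical and deliberately imbalanced (class 0 has six samples, classes 1 and 2 have two each).

```python
from sklearn.metrics import f1_score

# Hypothetical imbalanced 3-class labels, made up for illustration.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 2, 1, 0, 0, 0]

per_class = f1_score(y_true, y_pred, average=None)      # one F1 per class
macro = f1_score(y_true, y_pred, average="macro")       # unweighted mean of per-class F1
weighted = f1_score(y_true, y_pred, average="weighted") # mean weighted by class support

print("Per-class F1:", [round(float(s), 4) for s in per_class])
print(f"Macro F1:     {macro:.4f}")
print(f"Weighted F1:  {weighted:.4f}")
```

In this example the model never predicts class 2 correctly, so that class's F1 is 0. Macro averaging lets that failure count for a full third of the score, while weighted averaging discounts it to the class's 20% share of the samples, so the weighted score comes out higher.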