Identifying Outliers Using Z-Scores

A Z-score measures how many standard deviations a data point lies from the column mean. Values with |z| > 3 are conventionally flagged as outliers. The method is fast and interpretable, but it assumes the column is approximately normally distributed.

Computing Z-Scores

The Z-score of a value x is z = (x − μ) / σ, where μ is the column mean and σ is the column standard deviation. Under a normal distribution, only about 0.27% of values satisfy |z| > 3 (both tails combined), so such values are strong outlier candidates.
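To make the formula concrete, the following is a minimal sketch that applies it by hand with NumPy. The `ages` array is made-up illustration data, not taken from any real dataset. Note that `np.std` and `scipy.stats.zscore` default to the population standard deviation (`ddof=0`), whereas pandas `Series.std()` defaults to the sample version (`ddof=1`), so results can differ slightly between libraries.

```python
import numpy as np

# Hypothetical sample: 19 typical ages plus one implausible entry (120).
ages = np.array([31, 29, 35, 33, 30, 32, 34, 28, 36, 31,
                 30, 33, 29, 32, 34, 31, 35, 30, 32, 120], dtype=float)

mu = ages.mean()          # column mean
sigma = ages.std(ddof=0)  # population standard deviation

# z = (x - mu) / sigma, applied element-wise
z = (ages - mu) / sigma

# Keep the values that lie more than 3 standard deviations from the mean
outliers = ages[np.abs(z) > 3]
print(outliers)  # only the 120 entry is flagged
```

With this sample the extreme entry gets a Z-score of roughly 4.3, while every other value stays within half a standard deviation of the mean.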

Detecting Outliers with SciPy

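<pre><code class="language-python">import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_csv("data.csv")

# Absolute Z-score of every value, computed per column
z_scores = np.abs(stats.zscore(df[["age", "income", "spend"]]))

# Flag a row if any of its columns is more than 3 standard deviations out
outlier_mask = (z_scores > 3).any(axis=1)
print(f"Outliers detected: {outlier_mask.sum()}")

# Keep only the rows that were not flagged
df_clean = df[~outlier_mask]</code></pre>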

Limitations and Best Practices

Z-scores are sensitive to the very outliers they aim to detect: a single extreme value shifts the mean and inflates the standard deviation, which shrinks every |z| and can mask other outliers. For heavily skewed data, use the interquartile range (IQR) method instead.
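As a rough illustration of the IQR alternative, the sketch below flags values that fall outside the usual fences at Q1 − 1.5·IQR and Q3 + 1.5·IQR. The helper name `iqr_outlier_mask`, the multiplier `k=1.5`, and the `income` values are assumptions chosen for this example, not part of the original workflow.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask that is True where a value lies outside the IQR fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical right-skewed incomes with one extreme value
income = np.array([32_000, 35_000, 38_000, 41_000, 45_000,
                   48_000, 52_000, 60_000, 75_000, 900_000], dtype=float)

mask = iqr_outlier_mask(income)
print(income[mask])  # the 900,000 entry

# For comparison: the Z-score of the same entry stays just under 3,
# because the extreme value inflates the standard deviation.
z = (income - income.mean()) / income.std()
print(np.abs(z).max())
```

Because quartiles barely move when one value is extreme, the fences stay anchored to the bulk of the data, which is why the IQR rule catches the entry that the Z-score threshold misses here.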

When Z-Scores Work and When They Don't

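Z-scores work well on normally distributed, unimodal columns. They fail on skewed distributions, multi-modal data, or small samples. Small samples are a problem because the largest attainable |z| is bounded by the sample size: with the population standard deviation that scipy.stats.zscore uses by default, |z| can never exceed √(n − 1), so no value in a sample of 10 or fewer points can cross the threshold of 3. Always visualize the distribution with a histogram before applying Z-score filtering. For robust detection, consider the modified Z-score, which replaces the mean with the median and the standard deviation with the median absolute deviation (MAD). The constant 0.6745 rescales the MAD so that the score is comparable to a standard Z-score under normality, and values with a modified Z-score above 3.5 in absolute value are flagged.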

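<pre><code class="language-python">import numpy as np
from scipy.stats import median_abs_deviation

# df is the DataFrame loaded earlier
median = df["income"].median()
# Unscaled MAD; skip NaN to match pandas' median, which ignores missing values
mad = median_abs_deviation(df["income"], nan_policy="omit")

# The score is undefined when mad == 0 (over half the values equal the median)
modified_z = 0.6745 * (df["income"] - median) / mad
outliers = df[np.abs(modified_z) > 3.5]</code></pre>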
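The masking effect can be seen on a small made-up sample. In the sketch below, two extreme values inflate the standard deviation enough that neither crosses |z| > 3, while the modified Z-score flags both. The array `x` is hypothetical, and the MAD is computed by hand with NumPy, which should match the unscaled default of `median_abs_deviation`.

```python
import numpy as np

# Hypothetical sample: ten values near 11 plus two extremes (50 and 55)
x = np.array([10, 11, 12, 11, 10, 12, 11, 10, 12, 11, 50, 55], dtype=float)

# Standard Z-score: mean and standard deviation are both pulled by the extremes
z = (x - x.mean()) / x.std()
standard_flags = x[np.abs(z) > 3]

# Modified Z-score: median and MAD are barely affected by the extremes
median = np.median(x)
mad = np.median(np.abs(x - median))  # unscaled median absolute deviation
modified_z = 0.6745 * (x - median) / mad
robust_flags = x[np.abs(modified_z) > 3.5]

print(standard_flags)  # empty: both extremes are masked
print(robust_flags)    # 50 and 55 are flagged
```

Here the median is 11 and the MAD is 1, so the two extremes score far above 3.5, whereas their standard Z-scores stay below 2.5.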