Identifying Outliers using Z-Scores
A Z-score measures how many standard deviations a data point lies from the column mean. Values with |z| > 3 are conventionally flagged as outliers. The method is fast and interpretable, but it assumes the data are approximately normally distributed.
Computing Z-Scores
The Z-score of a value x is z = (x − μ) / σ, where μ is the column mean and σ is the column standard deviation. Under a normal distribution, only about 0.27% of values fall in the two tails with |z| > 3, so such values are strong outlier candidates.
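The formula can be sketched with the standard library alone. This is an illustration only: the `z_scores` helper and the `readings` data are hypothetical, and the population standard deviation is assumed for σ.

```python
import statistics

def z_scores(values):
    """Return (x - mean) / std for each value, using the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("standard deviation is zero; Z-scores are undefined")
    return [(x - mu) / sigma for x in values]

# Hypothetical readings: 20 values near 50 plus one extreme value
readings = [50.0] * 10 + [49.0] * 5 + [51.0] * 5 + [120.0]

z = z_scores(readings)

# Flag every reading whose absolute Z-score exceeds the conventional cutoff of 3
flagged = [x for x, score in zip(readings, z) if abs(score) > 3]
print(flagged)  # [120.0]
```

Using the sample standard deviation (`statistics.stdev`) instead would give slightly smaller scores on small datasets; which convention applies is a choice the text above does not specify.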
Detecting Outliers with SciPy
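A minimal sketch of this step, assuming `scipy.stats.zscore` is the routine intended here. The `values` array is made-up illustration data, not taken from any real dataset.

```python
import numpy as np
from scipy import stats

# Hypothetical data: 30 values between 48 and 52 plus one extreme value
values = np.append(np.tile([48.0, 50.0, 52.0], 10), 95.0)

# zscore uses the population standard deviation by default (ddof=0)
z = stats.zscore(values)

# Keep the values whose absolute Z-score exceeds the conventional cutoff of 3
outliers = values[np.abs(z) > 3]
print(outliers)  # [95.]
```

For a pandas column, the same idea would apply by passing something like `df["income"].to_numpy()` and using the boolean mask to index the DataFrame.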
Limitations and Best Practices
Z-scores are sensitive to the very outliers they aim to detect: a single extreme value pulls the mean toward itself and inflates the standard deviation, which shrinks every Z-score and can mask other outliers. For heavily skewed data, use the IQR method instead.
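The masking effect and the IQR alternative can be illustrated with a small sketch. The data are hypothetical, and the 1.5 × IQR fences are the common Tukey convention, assumed here rather than taken from the text above.

```python
import numpy as np

# Hypothetical data: 12 values between 10 and 12 plus two extreme values
values = np.array([10, 11, 12] * 4 + [100, 105], dtype=float)

# Z-score rule: the two extremes inflate the standard deviation so much
# that neither of them reaches |z| > 3
z = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z) > 3]

# IQR rule: fences are built from quartiles, which the extremes barely move
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

print(z_outliers)    # []
print(iqr_outliers)  # [100. 105.]
```

Here both extreme values have Z-scores of roughly 2.4 and 2.5, so the ±3 rule misses them, while the quartile-based fences flag both.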
When Z-Scores Work and When They Don't
Z-scores work well on normally distributed, unimodal columns. They fail on skewed distributions, multi-modal data, and small samples: with 10 or fewer points, no value can reach |z| > 3 at all. Always visualize the distribution with a histogram before applying Z-score filtering. For more robust detection, consider the modified Z-score, which replaces the mean with the median and the standard deviation with the median absolute deviation (MAD).
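<pre><code class="language-python">import numpy as np
from scipy.stats import median_abs_deviation

# Assumes df is an existing pandas DataFrame with a numeric "income" column
mad = median_abs_deviation(df["income"])  # unscaled MAD (default scale=1.0)
median = df["income"].median()
# 0.6745 rescales the MAD so the score is comparable to a standard Z-score
modified_z = 0.6745 * (df["income"] - median) / mad
# 3.5 is the commonly recommended cutoff for the modified Z-score
outliers = df[np.abs(modified_z) > 3.5]</code></pre>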