Identifying Outliers Using the IQR Method

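The interquartile range (IQR) method flags as outliers any points that fall more than 1.5 × IQR below Q1 or above Q3. Because quartiles are barely affected by extreme values, the rule holds up on skewed or heavy-tailed data, where Z-scores can be distorted: they depend on the mean and standard deviation, which the outliers themselves inflate. It is the same rule that box plots use to draw their whiskers.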
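As a rough illustration of how the two rules can disagree, the sketch below applies a Z-score threshold of 3 and the 1.5 × IQR rule to the same small sample. The data and variable names are made up for this example, and the threshold of 3 is a common convention rather than something this article prescribes.

<pre><code class="language-python">import pandas as pd

# Hypothetical sample: nine ordinary values and one extreme value.
values = pd.Series([10, 12, 11, 13, 12, 11, 14, 13, 12, 100])

# Z-score rule: flag points more than 3 standard deviations from the mean.
z_scores = (values - values.mean()) / values.std()
z_flagged = values[z_scores.abs() > 3]

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
iqr_flagged = values[~values.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print("Z-score flags:", z_flagged.tolist())
print("IQR flags:", iqr_flagged.tolist())</code></pre>

In this sample the Z-score rule flags nothing, because the value 100 inflates the mean and standard deviation enough to keep its own Z-score below 3. The IQR rule flags only the value 100.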


Computing the IQR Fence

The IQR is the distance between the 25th percentile (Q1) and the 75th percentile (Q3), that is, Q3 − Q1. The standard Tukey fences are Q1 − 1.5 × IQR (lower) and Q3 + 1.5 × IQR (upper). Points outside these fences are flagged as outliers.
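A small worked example may make the arithmetic concrete. The sketch below uses a made-up sample and pandas' default quantile interpolation (linear). Other quartile conventions can give slightly different fences.

<pre><code class="language-python">import pandas as pd

# Hypothetical sample chosen so that the quartiles are round numbers.
values = pd.Series([2, 4, 4, 5, 6, 7, 8, 9, 30])

q1 = values.quantile(0.25)    # 4.0
q3 = values.quantile(0.75)    # 8.0
iqr = q3 - q1                 # 4.0

lower_fence = q1 - 1.5 * iqr  # 4.0 - 6.0 = -2.0
upper_fence = q3 + 1.5 * iqr  # 8.0 + 6.0 = 14.0

# between() is inclusive, so values exactly on a fence are not flagged.
outliers = values[~values.between(lower_fence, upper_fence)]
print(lower_fence, upper_fence, outliers.tolist())  # -2.0 14.0 [30]</code></pre>

Only the value 30 lies above the upper fence of 14. No value falls below the lower fence of −2.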

Implementing IQR Outlier Removal

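<pre><code class="language-python">import pandas as pd

df = pd.read_csv("data.csv")


def remove_outliers_iqr(df, col):
    """Keep only the rows whose value in `col` lies within the Tukey fences."""
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    # Rows with NaN in `col` fail both comparisons and are dropped as well.
    return df[(df[col] >= lower) & (df[col] <= upper)]


# Columns are filtered one after another, so each column's fences are
# computed on the rows that survived the previous filters.
for col in ["age", "income", "spend"]:
    df = remove_outliers_iqr(df, col)

print(df.shape)</code></pre>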

Handling vs Removing Outliers

Removing outliers is not always the right answer. Sometimes the outliers are the most interesting or highest-value cases (e.g., fraudulent transactions or rare diseases). Consider capping extreme values at a chosen percentile (winsorization) as an alternative to outright removal.
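When the outliers themselves are the cases of interest, another option that keeps every row is to flag them rather than drop them. The sketch below is one possible way to do this. The helper `flag_outliers_iqr`, the `amount` column, and the sample data are all hypothetical names invented for this example.

<pre><code class="language-python">import pandas as pd


def flag_outliers_iqr(series, k=1.5):
    """Return a boolean Series that is True where a value lies outside the fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    # Treat missing values as "inside" so they are not reported as outliers.
    inside = series.between(q1 - k * iqr, q3 + k * iqr) | series.isna()
    return ~inside


# Hypothetical transaction amounts with one suspiciously large value.
df = pd.DataFrame({"amount": [20, 25, 22, 24, 23, 21, 5000]})
df["amount_is_outlier"] = flag_outliers_iqr(df["amount"])

# All rows are kept; the flag can be used for review or as a model feature.
print(df[df["amount_is_outlier"]])</code></pre>

Keeping the flag as a separate column leaves the original values untouched, so the decision to drop, cap, or investigate can be made later.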

Winsorization: Capping Instead of Dropping

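<pre><code class="language-python">import numpy as np


def winsorize_col(series, lower_pct=0.01, upper_pct=0.99):
    """Cap values below the 1st and above the 99th percentile (by default)."""
    lower = series.quantile(lower_pct)
    upper = series.quantile(upper_pct)
    return series.clip(lower=lower, upper=upper)


df["income"] = winsorize_col(df["income"])

# SciPy also has a winsorize function. It is an alternative to the line above,
# not an additional step; limits are the fractions to cap at each tail.
from scipy.stats.mstats import winsorize

# winsorize returns a masked array, so convert it to a plain ndarray first.
df["income"] = np.asarray(winsorize(df["income"], limits=[0.01, 0.01]))</code></pre>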