Identifying Outliers Using the IQR Method
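The interquartile range (IQR) method flags a point as an outlier when it falls more than 1.5 × IQR below the first quartile (Q1) or above the third quartile (Q3). Quartiles barely move when a few extreme values are present, so the rule holds up on skewed or heavy-tailed data. Z-scores are less reliable there because the mean and standard deviation they depend on are themselves pulled by the outliers. It is the same rule standard box plots use to decide which points are drawn individually beyond the whiskers.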
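To make the contrast with Z-scores concrete, here is a minimal sketch on a made-up sample. The values and the cutoff of 3 standard deviations are assumptions chosen for illustration, not taken from any real dataset.

```python
import pandas as pd

# Hypothetical sample: nine ordinary values and one extreme value (40).
values = pd.Series([10, 11, 12, 12, 13, 13, 14, 15, 16, 40])

# Z-score rule: flag points more than 3 standard deviations from the mean.
# The extreme value inflates both the mean and the standard deviation.
z_scores = (values - values.mean()) / values.std()
z_flags = z_scores.abs() > 3

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
iqr_flags = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

print(z_flags.any())               # False
print(values[iqr_flags].tolist())  # [40]
```

In this sample the value 40 stretches the standard deviation enough that its own Z-score stays below 3, while the quartiles are unaffected and the IQR rule still catches it.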
Computing the IQR Fence
The IQR is the distance between the 25th percentile (Q1) and the 75th percentile (Q3): IQR = Q3 − Q1. The standard Tukey fences are Q1 − 1.5 × IQR (lower) and Q3 + 1.5 × IQR (upper). Points outside these fences are flagged as outliers.
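The fence arithmetic can be sketched in a few lines. The helper name `tukey_fences`, its `k` parameter, and the sample ages are hypothetical, introduced only for this example. Quartiles come from pandas' default linear interpolation, so other tools may give slightly different fences on small samples.

```python
import pandas as pd

def tukey_fences(series, k=1.5):
    """Return the (lower, upper) Tukey fences for a numeric Series.

    tukey_fences and k are illustrative names, not part of pandas.
    """
    q1 = series.quantile(0.25)  # 25th percentile
    q3 = series.quantile(0.75)  # 75th percentile
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical ages with one implausible entry.
ages = pd.Series([22, 25, 27, 30, 31, 33, 35, 38, 95])

lower, upper = tukey_fences(ages)
outliers = ages[(ages < lower) | (ages > upper)]

print(lower, upper)       # 15.0 47.0
print(outliers.tolist())  # [95]
```

Here Q1 = 27 and Q3 = 35, so IQR = 8 and the fences sit at 27 − 12 = 15 and 35 + 12 = 47. Passing `k=3` gives the wider fences Tukey used for "far out" points.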
Implementing IQR Outlier Removal
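<pre><code class="language-python">import pandas as pd

df = pd.read_csv("data.csv")

def remove_outliers_iqr(df, col):
    """Return the rows of df whose value in col lies within the Tukey fences."""
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    # Rows with NaN in col fail both comparisons, so they are dropped too.
    return df[(df[col] >= lower) & (df[col] <= upper)]

# Columns are filtered one after another, so each column's fences are
# computed on the rows that survived the previous filters.
for col in ["age", "income", "spend"]:
    df = remove_outliers_iqr(df, col)

print(df.shape)</code></pre>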
Handling vs Removing Outliers
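Removing outliers is not always the right answer. Sometimes outliers are the most interesting or highest-value cases (e.g., fraudulent transactions or rare diseases), and dropping them discards exactly the rows you care about. Consider capping (winsorization), which replaces extreme values with a threshold value instead of deleting the row, as an alternative to outright removal.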
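One way to keep such cases available for inspection is to mark them instead of deleting them. The sketch below is only an illustration of that idea: the helper `flag_outliers_iqr`, the `_is_outlier` column suffix, and the transaction amounts are all hypothetical.

```python
import pandas as pd

def flag_outliers_iqr(df, col, k=1.5):
    """Return a copy of df with a boolean column marking IQR outliers in col.

    The function name and the "_is_outlier" suffix are illustrative choices.
    """
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    is_outlier = (df[col] < q1 - k * iqr) | (df[col] > q3 + k * iqr)
    # assign() returns a new DataFrame, so the input is left unchanged.
    return df.assign(**{f"{col}_is_outlier": is_outlier})

# Hypothetical transaction amounts with one suspiciously large payment.
transactions = pd.DataFrame({"amount": [20, 25, 22, 30, 27, 24, 5000]})

flagged = flag_outliers_iqr(transactions, "amount")
print(flagged[flagged["amount_is_outlier"]])
```

No rows are lost, so the flagged records can be reviewed, modeled separately, or excluded later depending on the analysis.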
Winsorization: Capping Instead of Dropping
<pre><code class="language-python">import numpy as np

def winsorize_col(series, lower_pct=0.01, upper_pct=0.99):
    """Cap values below the lower_pct quantile and above the upper_pct quantile."""
    lower = series.quantile(lower_pct)
    upper = series.quantile(upper_pct)
    return series.clip(lower=lower, upper=upper)

df["income"] = winsorize_col(df["income"])

# SciPy also has a winsorize function. Use it instead of the line above,
# not in addition to it. limits is the fraction of values to cap at each end.
from scipy.stats.mstats import winsorize
df["income"] = winsorize(df["income"], limits=[0.01, 0.01])</code></pre>