Isolation Forests for Anomaly Detection

Isolation Forests detect anomalies by exploiting the fact that outliers are rare and different — they require fewer random splits to isolate than normal, densely packed points.

How Isolation Forest Works

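Each tree is built on a random subsample of the data by recursively picking a random feature and a random split threshold between that feature's minimum and maximum, until every point sits alone in a leaf or a depth limit is reached. Anomalies, being isolated in sparse regions, are separated after fewer splits and so have shorter average path lengths from root to leaf than normal points.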
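The splitting idea can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: it follows only the branch containing the query point instead of building full trees, and the names isolation_path_length and mean_path are hypothetical helpers introduced here.

```python
import numpy as np

def isolation_path_length(x, X, rng, max_depth=50):
    """Count the random splits needed to isolate x from the rows of X."""
    depth = 0
    while len(X) > 1 and depth < max_depth:
        feature = rng.integers(X.shape[1])           # random feature
        lo, hi = X[:, feature].min(), X[:, feature].max()
        if lo == hi:                                 # nothing left to split on
            break
        threshold = rng.uniform(lo, hi)              # random threshold in range
        # Keep only the side of the split that contains x.
        if x[feature] < threshold:
            X = X[X[:, feature] < threshold]
        else:
            X = X[X[:, feature] >= threshold]
        depth += 1
    return depth

rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 0.3, size=(256, 2))        # dense "normal" points
outlier = np.array([4.0, 4.0])                       # one far-away point
data = np.vstack([cluster, outlier])
# Pick the cluster point closest to the origin as a typical inlier.
inlier = cluster[np.argmin(np.linalg.norm(cluster, axis=1))]

def mean_path(x, n_trees=200):
    """Average path length of x over many independent random splits."""
    return np.mean([isolation_path_length(x, data, rng) for _ in range(n_trees)])

outlier_depth = mean_path(outlier)
inlier_depth = mean_path(inlier)
print(f"outlier: {outlier_depth:.1f} splits, inlier: {inlier_depth:.1f} splits")
```

The far-away point is typically cut off after one or two splits, while a point in the middle of the cluster needs many more.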

Anomaly Score

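The anomaly score is based on the average path length h(x) across all trees: s(x, n) = 2^{-E[h(x)] / c(n)}, where E[h(x)] is the mean path length of x over the trees and c(n) = 2H(n-1) - 2(n-1)/n is the average path length of an unsuccessful search in a binary search tree of n points (H(i) is the i-th harmonic number, approximately ln(i) + 0.5772). Dividing by c(n) normalizes path lengths so that scores are comparable across sample sizes. Scores close to 1 indicate anomalies; scores well below 0.5 indicate normal points; if all scores are near 0.5, the sample contains no distinct anomalies.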
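As a worked example, the score formula can be evaluated directly. The sketch below assumes the normalizer from the original Isolation Forest paper, c(n) = 2H(n-1) - 2(n-1)/n with H(i) approximated by ln(i) + 0.5772; the function names c and anomaly_score are illustrative, not part of any library.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def c(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA   # approximation of H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """s(x, n) = 2^(-E[h(x)] / c(n))."""
    return 2.0 ** (-mean_path_length / c(n))

n = 256
print(f"c({n}) = {c(n):.2f}")
for path in (2.0, c(n), 20.0):
    print(f"E[h(x)] = {path:5.2f} -> s = {anomaly_score(path, n):.3f}")
```

A point whose mean path length equals c(n) scores exactly 0.5; shorter paths push the score toward 1 and longer paths push it toward 0.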

IsolationForest in scikit-learn

IsolationForest predicts -1 for anomalies and 1 for inliers, making it easy to filter outliers in preprocessing pipelines.
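A minimal sketch of the filtering pattern: fit on reference data, then use predict as a mask on new points. The data and the names X_train, X_new and X_clean are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_train = 0.3 * rng.randn(300, 2)              # reference data near the origin
X_new = np.array([[0.0, 0.1],                  # typical point
                  [3.5, -3.5]])                # far from the training data

detector = IsolationForest(n_estimators=100, random_state=0).fit(X_train)
labels = detector.predict(X_new)               # 1 = inlier, -1 = anomaly

inlier_mask = labels == 1
X_clean = X_new[inlier_mask]                   # keep only the inliers
print(labels, X_clean.shape)
```

Fitting once and calling predict on later batches keeps the threshold fixed, whereas fit_predict refits the model on whatever data it is given.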

Basic Usage

<pre><code class="language-python">import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

# Generate normal data and inject anomalies
rng = np.random.RandomState(42)
X_normal = 0.3 * rng.randn(500, 2)
X_outliers = rng.uniform(-4, 4, size=(20, 2))
X = np.vstack([X_normal, X_outliers])

iso = IsolationForest(
    n_estimators=100,
    contamination=0.04,  # expected fraction of outliers
    random_state=42
)
predictions = iso.fit_predict(X)  # -1 = anomaly, 1 = inlier
scores = iso.score_samples(X)     # lower score = more anomalous

print(f"Anomalies detected: {(predictions == -1).sum()}")

plt.scatter(X[predictions == 1, 0], X[predictions == 1, 1],
            c='blue', alpha=0.5, label='Inlier')
plt.scatter(X[predictions == -1, 0], X[predictions == -1, 1],
            c='red', s=80, label='Anomaly')
plt.legend()
plt.title('Isolation Forest')
plt.show()</code></pre>

Contamination Parameter

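contamination sets the decision threshold: the contamination fraction of training points with the most anomalous scores is labeled as outliers. If the true anomaly rate is unknown, set contamination='auto' (the default since scikit-learn 0.22), which uses the fixed threshold from the original paper, an anomaly score of 0.5. Because score_samples returns the negated score, this corresponds to flagging points whose score_samples value is below -0.5.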
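The effect of the two settings can be compared on synthetic data. This is a rough sketch: the dataset is invented, and the exact fraction flagged under 'auto' depends on the data, so no particular value should be expected.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = np.vstack([0.3 * rng.randn(500, 2),              # dense cluster
               rng.uniform(-4, 4, size=(20, 2))])    # scattered points

fixed = IsolationForest(contamination=0.05, random_state=42).fit(X)
auto = IsolationForest(contamination="auto", random_state=42).fit(X)

# With a numeric value, roughly that fraction of the training data is flagged.
fixed_fraction = np.mean(fixed.predict(X) == -1)
# With 'auto', the threshold is fixed and the flagged fraction follows the data.
auto_fraction = np.mean(auto.predict(X) == -1)

print(f"contamination=0.05 -> {fixed_fraction:.3f} flagged, offset_={fixed.offset_:.3f}")
print(f"contamination='auto' -> {auto_fraction:.3f} flagged, offset_={auto.offset_:.3f}")

# decision_function is score_samples shifted by offset_; negative means anomaly.
shifted = auto.score_samples(X) - auto.offset_
```

The fitted offset_ attribute holds the threshold in score_samples units, so it is a convenient place to check which cutoff a model is actually using.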

Applications and Advantages

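Isolation Forests are cheap to train and apply: building t trees on subsamples of size ψ costs O(t · ψ log ψ), and scoring N points costs O(N · t log ψ), so with t and ψ fixed the total cost is linear in N. They therefore scale to large datasets and avoid the distance computations that make density-based methods suffer from the curse of dimensionality, although many irrelevant features can still dilute the signal in very high dimensions.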
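The fixed subsample size is what keeps the cost low. The sketch below inspects a fitted scikit-learn model on made-up data; it relies on the documented default max_samples='auto', which uses min(256, n_samples) points per tree.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.randn(10_000, 5)                       # 10,000 points, 5 features

iso = IsolationForest(n_estimators=50, random_state=0).fit(X)

# Each tree sees only a small subsample, regardless of dataset size.
subsample_size = iso.max_samples_
# Tree depth is bounded by the subsample size, not by N.
deepest_tree = max(tree.get_depth() for tree in iso.estimators_)

print(f"points per tree: {subsample_size}, deepest tree: {deepest_tree} levels")
```

Because every tree is small, growing the dataset mainly increases scoring time, which grows linearly with the number of points.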

Common Use Cases

Fraud detection: Flag unusual transactions in financial data.
Network intrusion: Identify anomalous network traffic patterns.
Manufacturing: Detect defective products from sensor readings.
Data cleaning: Remove extreme outliers before model training.