Isolation Forests for Anomaly Detection
Isolation Forests detect anomalies by exploiting two properties of outliers: they are few, and their feature values differ markedly from those of normal points. As a result, they need fewer random splits to isolate than normal, densely packed points.
How Isolation Forest Works
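Each tree is built on a random subsample of the data by recursive splitting: pick a feature at random, pick a threshold uniformly between that feature's minimum and maximum in the current node, and repeat on both sides until every point is isolated or a height limit is reached. Anomalies sit in sparse regions, so they have shorter average path lengths from root to leaf than normal points.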
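The splitting procedure above can be sketched in plain Python. This is a simplified illustration, not the reference algorithm: it follows only the branch that contains the query point instead of building full trees, uses the whole dataset rather than a subsample, and the names isolation_depth and mean_depth are hypothetical helpers introduced for this example.

```python
import random

def isolation_depth(point, data, rng, max_depth=50):
    """Count the random splits needed to isolate `point` within `data`."""
    subset = data
    depth = 0
    n_features = len(point)
    while len(subset) > 1 and depth < max_depth:
        # Only features that still vary in this subset can be split.
        splittable = [f for f in range(n_features)
                      if min(r[f] for r in subset) < max(r[f] for r in subset)]
        if not splittable:
            break
        feature = rng.choice(splittable)
        values = [r[feature] for r in subset]
        threshold = rng.uniform(min(values), max(values))
        # Keep only the side of the split that contains the query point.
        keep_left = point[feature] < threshold
        subset = [r for r in subset if (r[feature] < threshold) == keep_left]
        depth += 1
    return depth

rng = random.Random(0)
inlier = (0.0, 0.0)    # sits in the dense centre of the cloud
outlier = (8.0, 8.0)   # sits far away from everything else
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
data += [inlier, outlier]

def mean_depth(point, trials=200):
    """Average isolation depth over many independent random split sequences."""
    return sum(isolation_depth(point, data, rng) for _ in range(trials)) / trials

outlier_depth = mean_depth(outlier)
inlier_depth = mean_depth(inlier)
print(f"outlier: {outlier_depth:.1f} splits, inlier: {inlier_depth:.1f} splits")
```

The outlier should need far fewer splits on average than the point at the centre, which is the signal the forest aggregates across its trees.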
Anomaly Score
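The anomaly score is based on the average path length E[h(x)] of a point x across all trees: s(x, n) = 2^{-E[h(x)] / c(n)}, where n is the number of points used to build each tree and c(n) = 2H(n-1) - 2(n-1)/n is the average path length of an unsuccessful search in a binary search tree over n points (H(i) is the i-th harmonic number, approximately ln(i) + 0.5772). Scores close to 1 indicate anomalies; scores well below 0.5 indicate normal points; if every point scores about 0.5, the sample contains no distinct anomalies.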
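As a worked example of the formula, the following sketch computes c(n) and s(x, n) directly. The function names c and anomaly_score are chosen here for illustration, and the harmonic number uses the logarithmic approximation ln(i) + 0.5772, so values for very small n are approximate.

```python
import math

EULER_GAMMA = 0.5772156649015329

def c(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximates H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """s(x, n) = 2^(-E[h(x)] / c(n))."""
    return 2.0 ** (-mean_path_length / c(n))

n = 256  # a typical subsample size per tree
print(f"c({n}) = {c(n):.2f}")
print(f"short path (2 splits):  s = {anomaly_score(2, n):.2f}")
print(f"average path (c(n)):    s = {anomaly_score(c(n), n):.2f}")
print(f"long path (20 splits):  s = {anomaly_score(20, n):.2f}")
```

A point whose average path length equals c(n) scores exactly 0.5; shorter paths push the score toward 1 and longer paths push it toward 0.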
IsolationForest in scikit-learn
IsolationForest predicts -1 for anomalies and 1 for inliers, making it easy to filter outliers in preprocessing pipelines.
Basic Usage
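A minimal sketch of typical usage, assuming scikit-learn and NumPy are installed. The synthetic data (a Gaussian cloud plus a few far-away points) and the variable names are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: 200 inliers around the origin plus 5 distant points.
rng = np.random.RandomState(42)
X_inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_outliers = rng.uniform(low=6.0, high=8.0, size=(5, 2))
X = np.vstack([X_inliers, X_outliers])

model = IsolationForest(n_estimators=100, contamination="auto", random_state=42)
labels = model.fit_predict(X)    # 1 = inlier, -1 = anomaly
scores = model.score_samples(X)  # lower values = more anomalous

# Keep only the rows labeled as inliers, e.g. before training another model.
X_clean = X[labels == 1]
print(f"flagged {np.sum(labels == -1)} of {len(X)} points as anomalies")
```

Note that score_samples follows the scikit-learn convention of "higher is more normal", so it is the negative of the score s(x, n) defined earlier.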
Contamination Parameter
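contamination sets the threshold: the fraction contamination of points with the most anomalous scores is labeled as outliers. If the true anomaly rate is unknown, set contamination='auto' (available since scikit-learn 0.20 and the default since 0.22), which uses the theoretical threshold of 0.5 on the score s(x, n). Because score_samples returns the negated score, this threshold appears in the fitted model as offset_ = -0.5.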
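The effect of the two settings can be checked with a short sketch, assuming scikit-learn and NumPy are installed. The data is synthetic and the variable names are illustrative only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 2))

# Fixed contamination: the threshold is placed so that roughly this
# fraction of the training points is labeled -1.
fixed = IsolationForest(contamination=0.05, random_state=0).fit(X)
flagged_fraction = np.mean(fixed.predict(X) == -1)

# 'auto': the threshold is fixed at offset_ = -0.5 regardless of the data.
auto = IsolationForest(contamination="auto", random_state=0).fit(X)

# decision_function is score_samples shifted by offset_, so negative
# values correspond to predicted anomalies.
gap = np.max(np.abs(auto.decision_function(X)
                    - (auto.score_samples(X) - auto.offset_)))

print(f"fixed 0.05 -> flagged fraction: {flagged_fraction:.3f}")
print(f"auto -> offset_: {auto.offset_}")
```

With a fixed value the number of flagged points is decided in advance, even if the data contains no real anomalies; with 'auto' the count depends on how many points actually cross the score threshold.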
Applications and Advantages
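With t trees and a fixed subsample size ψ per tree, training costs O(t · ψ log ψ) and scoring N points costs O(N · t · log ψ), so the total cost grows linearly with N and Isolation Forests scale to large datasets. They also need no pairwise distance or density estimates, which is where distance- and density-based methods suffer from the curse of dimensionality, although a large number of irrelevant features can still dilute the random splits.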
Common Use Cases
- Fraud detection: Flag unusual transactions in financial data.
- Network intrusion: Identify anomalous network traffic patterns.
- Manufacturing: Detect defective products from sensor readings.
- Data cleaning: Remove extreme outliers before model training.