Random Forests: Architecture

Random Forests combine bagging with random feature subsets at each split to build an ensemble of decorrelated trees that collectively achieve strong generalization.


Building the Forest

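Each tree is grown on a bootstrap sample of the training data, that is, rows drawn with replacement. At every node, only a random subset of max_features features is considered for the split, so a single dominant feature cannot drive the top splits of every tree. This per-split randomness is what decorrelates the trees.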
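The two sources of randomness described above can be sketched with the standard library alone. This is an illustration, not scikit-learn's internal implementation; the helper names bootstrap_indices and candidate_features and the dataset shape are hypothetical.

```python
import math
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def candidate_features(n_features, rng):
    """Pick the sqrt(n_features) feature indices considered at one split."""
    k = max(1, int(math.sqrt(n_features)))
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(0)            # fixed seed so the sketch is repeatable
n_samples, n_features = 1000, 30  # hypothetical dataset shape

rows = bootstrap_indices(n_samples, rng)
unique_fraction = len(set(rows)) / n_samples
oob_rows = set(range(n_samples)) - set(rows)  # rows this tree never sees

print(f"Unique rows in bootstrap sample: {unique_fraction:.2f}")
print(f"Out-of-bag rows: {len(oob_rows)}")
print("Features tried at one split:", candidate_features(n_features, rng))
```

On average a bootstrap sample contains roughly 63% of the distinct rows (1 - 1/e). The rows left out are the out-of-bag samples that oob_score=True uses for evaluation.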

Training a Random Forest

<pre><code class="language-python">from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rf = RandomForestClassifier(
    n_estimators=200,      # number of trees
    max_features='sqrt',   # sqrt(n_features) features per split
    max_depth=None,        # fully grown trees
    oob_score=True,
    random_state=42,
    n_jobs=-1,
)
rf.fit(X_train, y_train)

print(f"OOB Score: {rf.oob_score_:.3f}")
print(f"Test Score: {rf.score(X_test, y_test):.3f}")</code></pre>

Key Hyperparameters

Random Forests have several hyperparameters that control the size and diversity of the ensemble.

Tuning n_estimators and max_features

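n_estimators: more trees reduce the variance of the ensemble but add training and prediction cost; 100–500 is typical, and gains usually flatten beyond that range. max_features: 'sqrt' is the standard choice for classification; 'log2', a fixed integer, or a float fraction of the features are alternatives, with smaller values giving more decorrelated trees. min_samples_leaf: sets the minimum number of training samples in each leaf; larger values produce shallower trees and smoother predictions.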
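One way to see the effect of n_estimators is to grow a single forest incrementally and record the out-of-bag accuracy at each size. The sketch below assumes the same breast cancer dataset; the list of forest sizes is an arbitrary choice for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# warm_start=True keeps the trees already fitted and only adds new ones
rf = RandomForestClassifier(
    warm_start=True, oob_score=True, max_features="sqrt", random_state=42
)

oob_scores = {}
for n in [50, 100, 200, 400]:  # illustrative sizes, not a recommendation
    rf.set_params(n_estimators=n)
    rf.fit(X, y)
    oob_scores[n] = rf.oob_score_
    print(f"n_estimators={n:3d}  OOB accuracy={rf.oob_score_:.3f}")
```

If the OOB accuracy stops improving between two sizes, adding further trees mostly adds computation.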

GridSearch Tuning

<pre><code class="language-python">from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Reuses X_train and y_train from the training example above
param_grid = {
    'n_estimators': [100, 200],
    'max_features': ['sqrt', 'log2'],
    'min_samples_leaf': [1, 2, 4],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,        # 5-fold cross-validation for each combination
    n_jobs=-1,
)
grid.fit(X_train, y_train)

print("Best params:", grid.best_params_)</code></pre>

Prediction Aggregation

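For classification, the trees' outputs can be aggregated in two ways: by majority vote over each tree's predicted class (hard voting), or by averaging the per-tree class probabilities and choosing the class with the highest mean (soft voting). scikit-learn's RandomForestClassifier uses soft voting: predict returns the class with the highest mean probability reported by predict_proba.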
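The two aggregation schemes can be reproduced by hand from the fitted trees in rf.estimators_. In the sketch below, n_estimators=51 and min_samples_leaf=5 are illustrative choices: an odd tree count avoids ties in the binary case, and impure leaves let the two schemes differ.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
rf = RandomForestClassifier(
    n_estimators=51, min_samples_leaf=5, random_state=42
).fit(X_train, y_train)

# Per-tree class probabilities, shape (n_trees, n_samples, n_classes)
tree_probas = np.stack([tree.predict_proba(X_test) for tree in rf.estimators_])

# Soft voting: average the probabilities, then take the most probable class
soft_proba = tree_probas.mean(axis=0)
soft_pred = rf.classes_[soft_proba.argmax(axis=1)]

# Hard voting: each tree casts one vote for its own most probable class
tree_votes = tree_probas.argmax(axis=2)  # shape (n_trees, n_samples)
n_classes = len(rf.classes_)
vote_counts = np.stack(
    [(tree_votes == k).sum(axis=0) for k in range(n_classes)], axis=1
)
hard_pred = rf.classes_[vote_counts.argmax(axis=1)]

matches_sklearn = np.allclose(soft_proba, rf.predict_proba(X_test))
n_disagree = int((hard_pred != soft_pred).sum())
print("Manual soft vote matches rf.predict_proba:", matches_sklearn)
print("Samples where hard and soft voting disagree:", n_disagree)
```

With fully grown trees and pure leaves, every tree outputs probabilities of 0 or 1, so the two schemes would give the same predictions.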

Probability Outputs

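Calling rf.predict_proba(X_test) returns, for each sample, the mean of the trees' class probabilities. With fully grown trees whose leaves are pure, this equals the fraction of trees voting for each class. These estimates can be thresholded or passed to downstream decision-making, but they are not guaranteed to be well calibrated, so check calibration before treating them as true probabilities.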
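A minimal sketch of both uses follows: applying a custom decision threshold, and checking calibration with the Brier score and a reliability table. The 0.7 threshold and the five bins are arbitrary values chosen for illustration.

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

proba = rf.predict_proba(X_test)  # shape (n_samples, n_classes)
p_pos = proba[:, 1]               # probability of class 1

# Stricter-than-default decision rule (0.7 is an illustrative value)
threshold = 0.7
y_pred_strict = (p_pos >= threshold).astype(int)
print("Predicted positives at 0.5:", int((p_pos >= 0.5).sum()))
print(f"Predicted positives at {threshold}:", int(y_pred_strict.sum()))

# Calibration check: lower Brier score is better
brier = brier_score_loss(y_test, p_pos)
print(f"Brier score: {brier:.3f}")

# Well-calibrated bins have observed fraction close to mean prediction
frac_pos, mean_pred = calibration_curve(y_test, p_pos, n_bins=5)
for mp, fp in zip(mean_pred, frac_pos):
    print(f"mean predicted={mp:.2f}  observed fraction={fp:.2f}")
```

If the observed fractions drift far from the mean predictions, wrapping the forest in sklearn.calibration.CalibratedClassifierCV is one option to consider.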