Splitting Criteria: Information Gain (Entropy)

Information gain measures how much a split reduces uncertainty (entropy) in the class labels. At each node, the split with the highest gain is chosen.


Shannon Entropy

Entropy H = -Σ p_i · log2(p_i), where p_i is the proportion of samples in class i, quantifies disorder in a distribution. A pure node (one class) has H = 0; a balanced binary node has H = 1 bit.

Computing Entropy

<pre><code class="language-python">import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    probs = counts / counts.sum()
    # Avoid log(0) by filtering zero probabilities
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

y_node = np.array([0]*50 + [1]*50)
print(f"Entropy: {entropy(y_node):.4f}")  # 1.0000 (max disorder)</code></pre>

Information Gain Calculation

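For a binary split, information gain (IG) is the parent entropy minus the size-weighted entropy of the two children: IG = H(parent) - (N_L/N) H(left) - (N_R/N) H(right), where N is the number of samples at the parent and N_L and N_R are the numbers sent to the left and right child.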
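The formula can be sketched directly in NumPy. This is a minimal illustration, not a library routine: the helper name `information_gain` and the example label counts are assumptions chosen for the demonstration.

```python
import numpy as np

def entropy(y):
    """Shannon entropy (in bits) of the class labels in y."""
    _, counts = np.unique(y, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

def information_gain(y_parent, y_left, y_right):
    """IG = H(parent) - (N_L/N) H(left) - (N_R/N) H(right)."""
    n = len(y_parent)
    weighted_children = (
        (len(y_left) / n) * entropy(y_left)
        + (len(y_right) / n) * entropy(y_right)
    )
    return entropy(y_parent) - weighted_children

# Hypothetical node: 50/50 parent split into two 80/20 children
y_parent = np.array([0] * 50 + [1] * 50)
y_left = np.array([0] * 40 + [1] * 10)
y_right = np.array([0] * 10 + [1] * 40)

ig = information_gain(y_parent, y_left, y_right)
print(f"IG: {ig:.4f}")  # 0.2781
```

Each child here has entropy of about 0.722 bits, so the split removes roughly 0.278 of the parent's 1 bit of uncertainty.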

Why Maximize IG?

A high-IG split means the children are much purer than the parent, i.e., the chosen feature and threshold separate the classes well. Tree induction is greedy: at each node it evaluates the candidate splits and keeps the one with the highest IG, without looking ahead to later splits.
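The greedy choice can be sketched for a single numeric feature as follows. The function name `best_threshold` is hypothetical, and placing candidate thresholds at midpoints between consecutive distinct values is a common convention assumed here, not something the text above prescribes.

```python
import numpy as np

def entropy(y):
    """Shannon entropy (in bits) of the class labels in y."""
    _, counts = np.unique(y, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

def best_threshold(x, y):
    """Return (threshold, gain) with the highest IG for splits of the form x <= t."""
    n = len(y)
    parent_entropy = entropy(y)
    best_t, best_gain = None, 0.0
    values = np.unique(x)  # sorted distinct feature values
    # Assumed candidate thresholds: midpoints between consecutive values
    for t in (values[:-1] + values[1:]) / 2:
        left, right = y[x <= t], y[x > t]
        gain = (
            parent_entropy
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right)
        )
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# Toy data: the classes separate cleanly between 3.0 and 10.0
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])

t, gain = best_threshold(x, y)
print(f"threshold={t}, gain={gain:.4f}")  # threshold=6.5, gain=1.0000
```

A full tree builder would repeat this search over every feature at every node and recurse into the two children; this sketch covers only the per-feature step.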

Using Entropy Criterion in sklearn

<pre><code class="language-python">from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf_entropy = DecisionTreeClassifier(criterion='entropy', max_depth=4, random_state=42)
scores = cross_val_score(clf_entropy, X, y, cv=5)
print(f"CV Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")</code></pre>

Gain Ratio (Addressing Bias)

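Information gain favors features with many unique values, because splitting into many small partitions makes each one look pure. Gain ratio, used in the C4.5 algorithm, corrects this bias by dividing IG by the split's intrinsic information: the entropy of the partition sizes themselves.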
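The bias and its correction can be illustrated with a small sketch. It assumes multiway splits on categorical features (one child per distinct value, as in C4.5); the helper names and the toy data are hypothetical.

```python
import numpy as np

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    _, counts = np.unique(values, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log2(probs))

def info_gain_categorical(feature, y):
    """IG of a multiway split: one child per distinct feature value."""
    n = len(y)
    weighted = 0.0
    for v in np.unique(feature):
        mask = feature == v
        weighted += (mask.sum() / n) * entropy(y[mask])
    return entropy(y) - weighted

def gain_ratio(feature, y):
    """IG divided by the intrinsic information of the split."""
    split_info = entropy(feature)  # entropy of the partition sizes
    if split_info == 0:
        return 0.0  # feature has a single value, so there is no split
    return info_gain_categorical(feature, y) / split_info

y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
row_id = np.arange(8)                      # unique per sample, no predictive value
useful = np.array(["a"] * 4 + ["b"] * 4)   # two values that separate the classes

for name, feat in [("row_id", row_id), ("useful", useful)]:
    print(f"{name}: IG={info_gain_categorical(feat, y):.3f}, "
          f"gain ratio={gain_ratio(feat, y):.3f}")
# row_id: IG=1.000, gain ratio=0.333
# useful: IG=1.000, gain ratio=1.000
```

Both features reach the maximum IG of 1 bit, but the identifier-like feature pays for its eight-way partition (intrinsic information of 3 bits), so gain ratio ranks the two-valued feature higher.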

When to Use Entropy

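Entropy is a reasonable choice when you want each split's quality to have a direct information-theoretic reading (gain measured in bits). Gini avoids the logarithm, so it is slightly cheaper to compute, and the two criteria usually produce similar trees and accuracy; Gini remains the practical default in scikit-learn.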
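One way to see why the two criteria tend to behave alike is to compare them on a binary node as the positive-class proportion p varies. The sketch below uses the standard binary forms (entropy in bits, Gini as 1 - p² - (1 - p)²); the function names are illustrative.

```python
import numpy as np

def binary_entropy(p):
    """Entropy in bits of a binary node with positive-class proportion p."""
    p = np.asarray(p, dtype=float)
    out = np.zeros_like(p)
    mask = (p > 0) & (p < 1)  # entropy is 0 at p = 0 and p = 1
    q = p[mask]
    out[mask] = -(q * np.log2(q) + (1 - q) * np.log2(1 - q))
    return out

def binary_gini(p):
    """Gini impurity of a binary node with positive-class proportion p."""
    p = np.asarray(p, dtype=float)
    return 1.0 - p**2 - (1.0 - p)**2

p = np.array([0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0])
h = binary_entropy(p)
g = binary_gini(p)
for pi, hi, gi in zip(p, h, g):
    print(f"p={pi:.1f}  entropy={hi:.3f}  gini={gi:.3f}")
```

Both measures are zero for pure nodes, symmetric around p = 0.5, and peak there (1 bit for entropy, 0.5 for Gini), which is consistent with the two criteria often ranking candidate splits similarly.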