Splitting Criteria: Information Gain (Entropy)
Information gain measures how much a split reduces uncertainty (entropy) in the class labels. At each node, the candidate split with the highest gain is chosen.
Shannon Entropy
Entropy H = -Σ p_i · log2(p_i) quantifies disorder in a distribution. A pure node (one class) has H = 0; a balanced binary node has H = 1 bit.
Computing Entropy
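The entropy formula can be sketched in a few lines of plain Python. The function name `entropy` and the sample label lists are illustrative choices, not part of any library.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    n = len(labels)
    # Counter only holds classes that actually occur, so every
    # probability is > 0 and log2 is always defined.
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


pure = entropy(["a", "a", "a", "a"])  # one class only
balanced = entropy([0, 1, 0, 1])      # two classes, 50/50
skewed = entropy([0, 0, 0, 1])        # two classes, 75/25

print(pure, balanced, round(skewed, 3))
```

The pure node gives 0 and the balanced node gives 1 bit, matching the two reference points stated earlier; the 75/25 node falls in between, at roughly 0.81 bits.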
Information Gain Calculation
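Information gain (IG) is the parent's entropy minus the size-weighted entropy of the two children: IG = H(parent) - (N_L/N) H(left) - (N_R/N) H(right), where N_L and N_R are the numbers of samples sent to the left and right child and N = N_L + N_R.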
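As a rough illustration of the IG formula for a binary split, the sketch below uses plain Python. The helper names `entropy` and `information_gain` and the toy label list are assumptions made for this example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """IG = H(parent) - (N_L/N) H(left) - (N_R/N) H(right)."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children


parent = [0, 0, 0, 0, 1, 1, 1, 1]

perfect = information_gain(parent, parent[:4], parent[4:])     # both children pure
partial = information_gain(parent, parent[:6], parent[6:])     # only the right child is pure
useless = information_gain(parent, parent[::2], parent[1::2])  # children mirror the parent

print(perfect, round(partial, 3), useless)
```

The perfect split recovers the full 1 bit of parent entropy, the useless split recovers none, and the partial split lands at about 0.31 bits.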
Why Maximize IG?
A high IG split means the children are much purer than the parent, i.e., the feature provides strong discriminative power. The algorithm greedily picks the split with the highest IG at each node.
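The greedy choice can be sketched for a single numeric feature as follows. This is a simplified illustration, not how any particular library implements it: the function name `best_threshold`, the use of midpoints between consecutive distinct values as candidate thresholds, and the toy data are all assumptions. A full tree learner would repeat this search over every feature at every node.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def best_threshold(xs, ys):
    """Return (threshold, gain) with the highest IG for the rule x <= threshold."""
    values = sorted(set(xs))
    parent_h = entropy(ys)
    best_t, best_gain = None, 0.0
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        gain = parent_h - weighted
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain


xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [0, 0, 0, 1, 1, 1]
threshold, gain = best_threshold(xs, ys)
print(threshold, gain)
```

On this toy data the search settles on the midpoint between 3.0 and 10.0, the only threshold that separates the two classes completely.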
Using Entropy Criterion in sklearn
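A minimal sketch of selecting the entropy criterion in scikit-learn, using `DecisionTreeClassifier(criterion="entropy")`. The iris dataset, the 70/30 split, `max_depth=3`, and the random seeds are arbitrary choices for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# criterion="entropy" makes the tree score splits by information gain
# instead of the default Gini impurity.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
root_entropy = clf.tree_.impurity[0]  # entropy (in bits) of the root node

print(f"test accuracy: {accuracy:.3f}")
print(f"root entropy:  {root_entropy:.3f} bits")
```

With this criterion, the `impurity` values stored on the fitted tree are entropies, so they can be read directly in bits.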
Gain Ratio (Addressing Bias)
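Information gain is biased toward features with many unique values, because splitting on them produces many small, nearly pure partitions. Gain ratio, used in the C4.5 algorithm, corrects this bias by dividing IG by the split's intrinsic information (the entropy of the partition sizes).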
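The bias and its correction can be illustrated with a small sketch. It assumes a multiway split on a categorical feature (one child per distinct value), which is how gain ratio is usually described for C4.5. The function name `gain_ratio` and the toy features `row_id` and `useful` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy, in bits, of a sequence of discrete values."""
    n = len(values)
    return sum(-(c / n) * log2(c / n) for c in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information gain of a multiway split divided by its intrinsic information."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    children = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - children
    # Intrinsic information: entropy of the partition sizes themselves.
    split_info = entropy(feature_values)
    return gain / split_info if split_info > 0 else 0.0


labels = [0, 0, 1, 1, 0, 1, 0, 1]
row_id = list(range(8))                            # unique per row, no predictive value
useful = ["a", "a", "b", "b", "a", "b", "a", "b"]  # two values that match the classes

print(round(gain_ratio(row_id, labels), 3), gain_ratio(useful, labels))
```

Both features reach the maximum information gain of 1 bit here, since every child is pure. Gain ratio separates them: the identifier-like feature is penalized by its 3 bits of intrinsic information, while the two-valued feature is not.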
When to Use Entropy
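Entropy is a reasonable choice when you want splits that follow information-theoretic principles and impurity values you can read directly in bits. Gini impurity typically gives similar accuracy and is slightly cheaper to compute because it needs no logarithm, so it remains the practical default in scikit-learn.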