Splitting Criteria: Gini Impurity
Gini impurity is the probability of misclassifying a randomly chosen sample from a node if that sample were labeled at random according to the node's class distribution. Lower values mean purer nodes.
The Gini Impurity Formula
For a node with C classes, Gini impurity is defined as G = 1 - Σ_i p_i^2, where p_i is the proportion of class i at that node and the sum runs over all C classes. A perfectly pure node has G = 0; a maximally impure binary node has G = 0.5.
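The formula translates almost directly into code. The following is a minimal sketch, not taken from any library: the function name gini_impurity is hypothetical, and treating an empty node as pure is an assumption made here for convenience.

```python
from collections import Counter


def gini_impurity(labels):
    """Return G = 1 - sum(p_i^2) for a sequence of class labels."""
    n = len(labels)
    if n == 0:
        # Assumption for this sketch: an empty node counts as pure.
        return 0.0
    counts = Counter(labels)
    # p_i is the share of samples belonging to class i.
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


pure_node = gini_impurity(["A", "A", "A", "A"])    # single class
mixed_node = gini_impurity(["A", "A", "B", "B"])   # 50/50 binary node
```

Here pure_node evaluates to 0 and mixed_node to 0.5, matching the two boundary cases stated above.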
Computing Gini by Hand
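As an illustration, consider a hypothetical node holding 10 samples, 6 of class A and 4 of class B (the counts are made up for this example). Applying G = 1 - Σ_i p_i^2 step by step:

```latex
\begin{aligned}
p_A &= \tfrac{6}{10} = 0.6, \qquad p_B = \tfrac{4}{10} = 0.4 \\
G   &= 1 - \left(p_A^2 + p_B^2\right) = 1 - (0.36 + 0.16) = 0.48
\end{aligned}
```

The result, 0.48, sits just below the binary maximum of 0.5, which reflects a node that is close to evenly mixed.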
Weighted Gini for Split Evaluation
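When a split divides a node into left and right children, the impurity of the split is the weighted average of the child Gini values, with each child weighted by its share of the parent's samples: G_split = (n_left / n) * G_left + (n_right / n) * G_right, where n = n_left + n_right.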
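A short sketch of this weighted average, assuming each child is given as a plain list of labels. The helper names gini and weighted_gini, and the example labels, are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of one node, G = 1 - sum(p_i^2)."""
    n = len(labels)
    if n == 0:
        return 0.0  # assumption: an empty node counts as pure
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_gini(left, right):
    """Average the child impurities, weighted by each child's sample share."""
    n = len(left) + len(right)
    if n == 0:
        raise ValueError("cannot score a split with no samples")
    return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)


# Hypothetical split of a 10-sample parent (6 A, 4 B) into two children.
left = ["A", "A", "A", "A", "B"]    # G_left  = 1 - (0.64 + 0.04) = 0.32
right = ["A", "A", "B", "B", "B"]   # G_right = 1 - (0.16 + 0.36) = 0.48
split_score = weighted_gini(left, right)
```

With equal-sized children the weights are both 0.5, so split_score is approximately 0.40. That is lower than the parent's impurity of 0.48, which means this split reduces impurity.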
Choosing the Best Split
The tree algorithm evaluates every possible threshold for every feature and selects the split that minimizes the weighted Gini impurity. scikit-learn uses this criterion by default (criterion='gini') for classification trees.
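The search can be sketched for a single numeric feature as follows. This is a simplified illustration, not scikit-learn's actual implementation: the function name best_threshold is hypothetical, and placing candidate thresholds at midpoints between consecutive distinct values is an assumption of this sketch. A full tree builder would repeat the search for every feature and keep the overall minimum.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of one node, G = 1 - sum(p_i^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) with the lowest weighted Gini."""
    pairs = sorted(zip(values, labels), key=lambda pair: pair[0])
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # equal values cannot be separated by a threshold
        threshold = (lo + hi) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical feature values where the classes separate cleanly at 3.5.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = ["A", "A", "A", "B", "B", "B"]
threshold, score = best_threshold(values, labels)
```

For this toy data the search returns a threshold of 3.5 with a weighted Gini of 0, since both children are pure.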
Setting the Criterion in sklearn
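A minimal sketch of setting the criterion explicitly, assuming scikit-learn is installed. The iris dataset and random_state=0 are choices made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# criterion="gini" is the default; it is spelled out here for clarity.
clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X, y)

# Impurity of the root node, as stored on the fitted tree.
root_gini = clf.tree_.impurity[0]
```

Iris contains three equally sized classes, so root_gini comes out at about 0.667, which agrees with 1 - 3 * (1/3)^2.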
Gini vs. Entropy
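Gini impurity is computationally cheaper than entropy (no logarithm required), and the two criteria tend to produce similar trees. When they differ, Gini tends to isolate the most frequent class in its own branch, while entropy tends to produce slightly more balanced trees.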
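To make the comparison concrete, the sketch below evaluates both measures for a binary node in which class 1 has proportion p. The function names gini_binary and entropy_binary are hypothetical, and entropy is measured in bits (log base 2).

```python
import math


def gini_binary(p):
    """Gini impurity of a two-class node: only squares, no logarithm."""
    return 1.0 - (p ** 2 + (1.0 - p) ** 2)


def entropy_binary(p):
    """Shannon entropy in bits; terms with zero probability contribute 0."""
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0)


# Both measures are 0 for a pure node and peak at p = 0.5.
comparison = [(p, gini_binary(p), entropy_binary(p)) for p in (0.0, 0.1, 0.3, 0.5)]
```

Both curves have the same shape: zero at a pure node and a maximum at p = 0.5 (0.5 for Gini, 1.0 for entropy). This is one reason the two criteria usually rank candidate splits similarly.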
Practical Comparison
In practice, treat the criterion as a hyperparameter: try both criterion='gini' and criterion='entropy' during tuning with cross-validation. The difference in accuracy is usually small, but the best choice is dataset-dependent.
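One way to run that comparison is a grid search over the criterion parameter. This is a sketch assuming scikit-learn is installed; the iris dataset, cv=5, and random_state=0 are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"criterion": ["gini", "entropy"]},
    cv=5,  # 5-fold cross-validation, chosen for this example
)
search.fit(X, y)

best_criterion = search.best_params_["criterion"]
mean_scores = dict(
    zip(
        (params["criterion"] for params in search.cv_results_["params"]),
        search.cv_results_["mean_test_score"],
    )
)
```

Inspecting mean_scores alongside best_criterion shows how close the two options are on a given dataset. In a real tuning run the grid would usually include other parameters, such as max_depth, at the same time.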