Splitting Criteria: Gini Impurity

Gini impurity measures how often a randomly chosen sample at a node would be misclassified if it were labeled at random according to that node's class distribution. Lower values indicate a purer node.

The Gini Impurity Formula

For a node with C classes, Gini impurity is defined as G = 1 - Σ p_i^2 (summed over i = 1, ..., C), where p_i is the proportion of class i at that node. A perfectly pure node has G = 0; a maximally impure binary node has G = 0.5.
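The boundary values quoted above can be checked directly from the formula. The sketch below is illustrative only: gini_from_probs is a hypothetical helper that takes class proportions rather than labels, and the four-class case is added to show that a uniform distribution over C classes gives 1 - 1/C.

```python
import numpy as np

def gini_from_probs(probs):
    """Gini impurity G = 1 - sum(p_i^2) for a vector of class proportions."""
    probs = np.asarray(probs, dtype=float)
    return float(1.0 - np.sum(probs ** 2))

pure = gini_from_probs([1.0, 0.0])             # every sample in one class
balanced_binary = gini_from_probs([0.5, 0.5])  # two equally likely classes
balanced_four = gini_from_probs([0.25] * 4)    # uniform over C = 4 classes

print(f"{pure:.2f} {balanced_binary:.2f} {balanced_four:.2f}")  # 0.00 0.50 0.75
```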

Computing Gini by Hand

<pre><code class="language-python">import numpy as np

def gini(y):
    # Count how many samples fall into each class at this node
    classes, counts = np.unique(y, return_counts=True)
    probs = counts / counts.sum()
    return 1 - np.sum(probs ** 2)

# Example: node with 80 class-0 and 20 class-1 samples
y_node = np.array([0]*80 + [1]*20)
print(f"Gini impurity: {gini(y_node):.4f}")  # 0.3200</code></pre>

Weighted Gini for Split Evaluation

When a split divides a node into left and right children, the impurity of the split is the weighted average of the two child Gini values, where each weight is the fraction of the parent's samples that lands in that child: G_split = (n_L / n) G_L + (n_R / n) G_R.
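As a minimal sketch of this weighted average, the example below splits a 100-sample node into two children. The function name weighted_gini and the particular split (75 samples left, 25 right) are assumptions made for illustration.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array."""
    _, counts = np.unique(y, return_counts=True)
    probs = counts / counts.sum()
    return 1.0 - np.sum(probs ** 2)

def weighted_gini(y_left, y_right):
    """Average the child impurities, weighting each by its share of samples."""
    n_left, n_right = len(y_left), len(y_right)
    n = n_left + n_right
    return (n_left / n) * gini(y_left) + (n_right / n) * gini(y_right)

# Hypothetical split of a node holding 80 class-0 and 20 class-1 samples
y_left = np.array([0] * 70 + [1] * 5)    # 75 samples, mostly class 0
y_right = np.array([0] * 10 + [1] * 15)  # 25 samples, mixed

parent = gini(np.concatenate([y_left, y_right]))
split = weighted_gini(y_left, y_right)
print(f"Parent Gini: {parent:.4f}")   # 0.3200
print(f"Weighted Gini: {split:.4f}")  # 0.2133
```

The split lowers impurity from 0.32 to about 0.21, so it would be preferred over leaving the node unsplit.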

Choosing the Best Split

The tree algorithm evaluates every candidate threshold for every feature under consideration and selects the split that minimizes the weighted Gini impurity of the resulting children. scikit-learn uses this criterion by default (criterion='gini') for classification trees.
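The search described above can be sketched as a brute-force loop. This is an illustration under stated assumptions, not scikit-learn's implementation: best_threshold and best_split are hypothetical helpers, candidate thresholds are assumed to be the midpoints between consecutive distinct feature values, and the toy data is made up.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array."""
    _, counts = np.unique(y, return_counts=True)
    probs = counts / counts.sum()
    return 1.0 - np.sum(probs ** 2)

def best_threshold(x, y):
    """Try each midpoint between distinct values of one feature."""
    values = np.unique(x)                      # sorted distinct values
    candidates = (values[:-1] + values[1:]) / 2.0
    best_t, best_score = None, np.inf
    n = len(y)
    for t in candidates:
        left = x <= t
        n_left = left.sum()
        score = (n_left / n) * gini(y[left]) + ((n - n_left) / n) * gini(y[~left])
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

def best_split(X, y):
    """Return (feature index, threshold, weighted Gini) of the best split."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        t, score = best_threshold(X[:, j], y)
        if t is not None and score < best[2]:
            best = (j, t, score)
    return best

# Toy data: feature 0 separates the classes cleanly, feature 1 does not
X = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]])
y = np.array([0, 0, 1, 1])

feature, threshold, score = best_split(X, y)
print(f"feature={feature}, threshold={threshold:.2f}, weighted Gini={score:.4f}")
# feature=0, threshold=2.50, weighted Gini=0.0000
```

Recomputing Gini from scratch at every threshold keeps the sketch short; a practical implementation would typically sort each feature once and update class counts incrementally.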

Setting the Criterion in sklearn

<pre><code class="language-python">from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)

# Gini is the default criterion
clf_gini = DecisionTreeClassifier(criterion='gini', max_depth=4, random_state=0)
clf_gini.fit(X, y)
print("Gini tree depth:", clf_gini.get_depth())</code></pre>

Gini vs. Entropy

Gini impurity is cheaper to compute than entropy because it needs no logarithm, and the two criteria tend to produce similar trees. Entropy can yield slightly more balanced splits in some edge cases.
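To see why the two criteria usually agree, it can help to evaluate both on a binary node as the class-1 proportion p varies. The helpers gini_binary and entropy_binary below are hypothetical names for this sketch; entropy is measured in bits, and zero-probability classes are dropped so that 0 * log(0) counts as 0.

```python
import numpy as np

def gini_binary(p):
    """Gini impurity of a two-class node with class-1 proportion p."""
    return 1.0 - (p ** 2 + (1.0 - p) ** 2)

def entropy_binary(p):
    """Shannon entropy in bits of a two-class node with class-1 proportion p."""
    probs = np.array([p, 1.0 - p])
    probs = probs[probs > 0]  # drop empty classes: 0 * log(0) is taken as 0
    return float(np.sum(probs * np.log2(1.0 / probs)))

# Both measures are 0 for a pure node and peak at p = 0.5
for p in (0.0, 0.1, 0.3, 0.5):
    print(f"p={p:.1f}  gini={gini_binary(p):.4f}  entropy={entropy_binary(p):.4f}")
```

Both curves are zero at a pure node, rise with p up to 0.5, and peak there (0.5 for Gini, 1 bit for entropy), which is consistent with the two criteria ranking most candidate splits the same way.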

Practical Comparison

In practice, treat the criterion as a hyperparameter: compare criterion='gini' and criterion='entropy' under cross-validation during tuning. The difference in accuracy is usually small, but the better choice depends on the dataset.
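One way to run this comparison is a small grid search, sketched below. The grid values (max_depth of 3, 4, 5), the 5-fold setting, and the accuracy metric are assumptions chosen for illustration, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Search over the criterion alongside tree depth (illustrative grid)
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [3, 4, 5],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,                # 5-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)

print("Best params:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
```

The per-combination scores are available in search.cv_results_, which makes it easy to check whether the gap between the two criteria is larger than the fold-to-fold variation.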