Decision Trees: Structure and Nodes

A decision tree is a flowchart-like model that splits data into branches based on feature values, ending at leaf nodes that hold predictions.

Anatomy of a Decision Tree

Every decision tree is composed of a root node (the first split), internal nodes (subsequent splits), branches (outcomes of a test), and leaf nodes (final predictions).
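As a rough illustration of these parts, the sketch below counts the node types in a small fitted tree. It assumes scikit-learn's DecisionTreeClassifier and the iris dataset. The names tree, is_leaf, n_leaves, and n_internal are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)

tree = clf.tree_  # low-level array representation of the fitted tree

# In this representation a leaf is a node whose left and right child
# indices are equal (both are set to -1).
is_leaf = tree.children_left == tree.children_right
n_leaves = int(is_leaf.sum())
n_internal = int(tree.node_count - n_leaves)  # the root (node 0) is counted here

print("total nodes:", tree.node_count)
print("internal nodes (root included):", n_internal)
print("leaf nodes:", n_leaves)
print("depth:", clf.get_depth())
```

Because every split is binary, a tree built this way has exactly one more leaf than it has internal nodes.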

Nodes and Splits

At each internal node, the tree tests one feature against a threshold. For a numeric test such as age <= 30, samples go to the left child if the test is true and to the right child if it is false. This recursive binary splitting continues until a stopping criterion is met (e.g., maximum depth or minimum samples per leaf) or a node contains samples of only one class.
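The routing rule can be sketched by walking a sample through a fitted tree by hand. This is a minimal sketch, assuming scikit-learn's tree_ arrays (feature, threshold, children_left, children_right). The helper trace is hypothetical and not part of the library.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)
tree = clf.tree_

def trace(sample):
    """Walk one sample from the root to a leaf and return the leaf's node index."""
    # scikit-learn stores inputs as float32 internally, so mirror that here.
    sample = np.asarray(sample, dtype=np.float32)
    node = 0  # the root is always node 0
    # A node is a leaf when its two child indices are equal (both -1).
    while tree.children_left[node] != tree.children_right[node]:
        if sample[tree.feature[node]] <= tree.threshold[node]:
            node = tree.children_left[node]   # test is true: go left
        else:
            node = tree.children_right[node]  # test is false: go right
    return int(node)

leaf_ids = [trace(x) for x in X]
print("first sample ends in leaf node", leaf_ids[0])
```

The result can be compared against clf.apply(X), which returns the leaf index each sample reaches.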

Fitting a Tree with scikit-learn

<pre><code class="language-python">from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Load the data once and reuse it for fitting and for feature names
iris = load_iris()
X, y = iris.data, iris.target

clf = DecisionTreeClassifier(max_depth=3, random_state=42)
clf.fit(X, y)

# Print a text representation of the tree
print(export_text(clf, feature_names=iris.feature_names))</code></pre>

Depth, Leaves, and Predictions

Tree depth controls model complexity: a tree that is too shallow tends to underfit, while a very deep tree tends to overfit by memorizing the training data. Each leaf node stores the majority class (classification) or the mean target value (regression) of the training samples that reached it.
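One way to see the effect of depth is to compare training accuracy with cross-validated accuracy at a few depth limits. The sketch below assumes the iris dataset and 5-fold cross-validation. The depth values and the results dictionary are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

results = {}
for depth in (1, 3, None):  # None means no depth limit
    clf_d = DecisionTreeClassifier(max_depth=depth, random_state=42)
    # Accuracy on held-out folds estimates how well the tree generalizes
    cv_acc = cross_val_score(clf_d, X, y, cv=5).mean()
    # Accuracy on the data the tree was fitted on
    train_acc = clf_d.fit(X, y).score(X, y)
    results[depth] = (train_acc, cv_acc)
    print(f"max_depth={depth}: train={train_acc:.3f}, cv={cv_acc:.3f}")
```

A large gap between the training and cross-validated scores suggests overfitting, while low scores on both suggest underfitting.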

Leaf Node Predictions

For classification, a leaf predicts the class with the highest count among the training samples at that node. For regression (DecisionTreeRegressor), it predicts their mean target value. Inspecting clf.tree_.value shows the per-class distribution at every node, stored as weighted counts or as proportions depending on the scikit-learn version.
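Both rules can be checked by hand against a fitted model. The sketch below is a rough check, assuming the iris dataset. The regression task (predicting petal width from the other three features) is a made-up setup for illustration only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X, y = load_iris(return_X_y=True)

# Classification: the predicted class is the argmax of the leaf's class distribution.
clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)
leaf_ids = clf.apply(X)                        # leaf node index for each sample
leaf_values = clf.tree_.value[leaf_ids, 0, :]  # shape (n_samples, n_classes)
manual_pred = clf.classes_[np.argmax(leaf_values, axis=1)]
print("matches clf.predict:", np.array_equal(manual_pred, clf.predict(X)))

# Regression (hypothetical task): each leaf predicts the mean target of its samples.
X_reg, y_reg = X[:, :3], X[:, 3]
reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_reg, y_reg)
reg_leaf_ids = reg.apply(X_reg)
leaf_means = {int(leaf): y_reg[reg_leaf_ids == leaf].mean()
              for leaf in np.unique(reg_leaf_ids)}
for leaf, mean in leaf_means.items():
    print(f"leaf {leaf}: stored={reg.tree_.value[leaf, 0, 0]:.4f}, mean={mean:.4f}")
```

The argmax is unaffected by whether tree_.value holds counts or proportions, so the check does not depend on that detail.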

Visualizing the Tree

scikit-learn provides built-in utilities to visualize trees as text or graphical plots.

Plotting with plot_tree

<pre><code class="language-python">import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import plot_tree

# clf is the classifier fitted in the earlier example
iris = load_iris()
fig, ax = plt.subplots(figsize=(12, 6))
plot_tree(clf, feature_names=iris.feature_names,
          class_names=list(iris.target_names),  # plain list of str
          filled=True, ax=ax)
plt.show()</code></pre>