Decision Trees: Structure and Nodes
A decision tree is a flowchart-like model that splits data into branches based on feature values, ending at leaf nodes that hold predictions.
Anatomy of a Decision Tree
Every decision tree is composed of a root node (the first split), internal nodes (subsequent splits), branches (outcomes of a test), and leaf nodes (final predictions).
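To make these parts concrete, here is a minimal sketch of a tree as a plain Python data structure. The Node class, the count_nodes helper, and the age and income features are hypothetical names chosen for illustration; this is not how scikit-learn stores trees internally.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Node:
    """One node of a binary decision tree (hypothetical structure for illustration)."""
    feature: Optional[str] = None      # feature tested at this node (None for leaves)
    threshold: Optional[float] = None  # the test is `feature <= threshold`
    left: Optional["Node"] = None      # branch taken when the test is true
    right: Optional["Node"] = None     # branch taken when the test is false
    prediction: Optional[str] = None   # set only on leaf nodes

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def count_nodes(node: Node) -> Tuple[int, int]:
    """Return (decision_nodes, leaf_nodes) for the subtree rooted at `node`."""
    if node.is_leaf():
        return 0, 1
    left_decisions, left_leaves = count_nodes(node.left)
    right_decisions, right_leaves = count_nodes(node.right)
    return 1 + left_decisions + right_decisions, left_leaves + right_leaves


# Hand-built example: the root tests age, one internal node tests income.
root = Node(
    feature="age", threshold=30,
    left=Node(prediction="no"),
    right=Node(
        feature="income", threshold=50_000,
        left=Node(prediction="no"),
        right=Node(prediction="yes"),
    ),
)

decision_nodes, leaf_nodes = count_nodes(root)
print(decision_nodes, leaf_nodes)  # 2 decision nodes (root + 1 internal), 3 leaves
```

Each branch is simply a reference from a parent to a child, so the tree needs no separate branch objects.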
Nodes and Splits
At each internal node, the tree tests one feature against a threshold. For a test on a numeric feature, such as age <= 30, a sample goes to the left child if the test is true and to the right child if it is false. This recursive binary splitting continues until a stopping criterion is met (e.g., maximum depth or minimum samples per leaf) or a node contains samples of only one class.
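The routing rule can be checked against a fitted scikit-learn tree by walking its node arrays by hand. The sketch below assumes the iris dataset and a depth limit of 3 purely for illustration; route_to_leaf is a hypothetical helper, while children_left, children_right, feature, and threshold are attributes of the fitted tree_ object.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree = clf.tree_


def route_to_leaf(sample: np.ndarray) -> int:
    """Follow the `feature <= threshold` tests from the root down to a leaf."""
    node = 0  # the root is always node 0
    # Leaves are marked with a child index of -1.
    while tree.children_left[node] != -1:
        if sample[tree.feature[node]] <= tree.threshold[node]:
            node = tree.children_left[node]   # test is true: go left
        else:
            node = tree.children_right[node]  # test is false: go right
    return int(node)


# scikit-learn converts dense inputs to float32 internally, so cast to match.
X32 = X.astype(np.float32)
manual_leaves = np.array([route_to_leaf(row) for row in X32])

# Compare the hand-rolled traversal with the library's own leaf assignment.
print(np.array_equal(manual_leaves, clf.apply(X)))
```

clf.apply returns the index of the leaf each sample lands in, which makes it a convenient reference for the manual walk.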
Fitting a Tree with scikit-learn
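A minimal fitting sketch follows. The iris dataset, the 75/25 split, and the particular max_depth and min_samples_leaf values are assumptions chosen for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small built-in dataset and hold out a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# max_depth and min_samples_leaf act as stopping criteria for the splitting.
clf = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
clf.fit(X_train, y_train)

print("depth:", clf.get_depth())
print("leaves:", clf.get_n_leaves())
print("test accuracy:", clf.score(X_test, y_test))
```

Fixing random_state makes the result reproducible, since ties between equally good splits are otherwise broken at random.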
Depth, Leaves, and Predictions
Tree depth controls model complexity: a tree that is too shallow tends to underfit, while a very deep tree tends to overfit by memorizing the training data. Leaf nodes store the majority class (classification) or mean target value (regression) of the training samples that reached them.
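The effect of depth can be seen by fitting the same data at several depth limits and comparing training and test accuracy. This is a sketch on the iris dataset (an assumption for illustration); on a dataset this small and clean the overfitting gap may be slight, so treat the numbers as indicative only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

results = {}
for max_depth in (1, 3, None):  # None lets the tree grow until leaves are pure
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    clf.fit(X_train, y_train)
    results[max_depth] = {
        "depth": clf.get_depth(),
        "leaves": clf.get_n_leaves(),
        "train_acc": clf.score(X_train, y_train),
        "test_acc": clf.score(X_test, y_test),
    }

for max_depth, stats in results.items():
    print(max_depth, stats)
```

A depth-1 tree (a "stump") has only two leaves, so it cannot separate three classes; the unrestricted tree fits the training set far more closely.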
Leaf Node Predictions
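For classification, a leaf predicts the class with the highest count among the training samples at that node. For regression (DecisionTreeRegressor), it predicts the mean target value of those samples. For a fitted classifier clf, inspecting clf.tree_.value reveals the per-class distribution at every node, stored as weighted sample counts or as class fractions depending on the scikit-learn version.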
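Both rules can be verified directly from a fitted tree. The sketch below assumes the built-in iris and diabetes datasets and a depth limit of 3 for illustration; it uses argmax over tree_.value, which gives the same answer whether the stored values are counts or fractions.

```python
import numpy as np
from sklearn.datasets import load_diabetes, load_iris
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: the predicted class is the argmax of the leaf's class distribution.
X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
leaf_ids = clf.apply(X)                        # leaf index reached by each sample
leaf_values = clf.tree_.value[leaf_ids, 0, :]  # shape (n_samples, n_classes)
manual_pred = clf.classes_[np.argmax(leaf_values, axis=1)]
print(np.array_equal(manual_pred, clf.predict(X)))

# Regression: each leaf stores the mean target of the training samples in it.
Xr, yr = load_diabetes(return_X_y=True)
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(Xr, yr)
reg_leaf_ids = reg.apply(Xr)
leaf_means_match = all(
    np.isclose(yr[reg_leaf_ids == leaf].mean(), reg.tree_.value[leaf, 0, 0])
    for leaf in np.unique(reg_leaf_ids)
)
print(leaf_means_match)
```

The middle axis of tree_.value indexes outputs, so it has length 1 for the single-output problems used here.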
Visualizing the Tree
scikit-learn provides built-in utilities to visualize trees: export_text renders a fitted tree as indented text rules, and plot_tree draws it as a graphical plot with matplotlib.
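As a brief sketch, the text form can be produced with sklearn.tree.export_text. The iris dataset and the depth limit of 2 are assumptions chosen to keep the printed output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(iris.data, iris.target)

# Render the fitted tree as indented if/else style rules.
report = export_text(clf, feature_names=list(iris.feature_names))
print(report)
```

For a graphical version, sklearn.tree.plot_tree(clf, feature_names=iris.feature_names, filled=True) draws the same tree onto a matplotlib axes, which requires matplotlib to be installed.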