
Decision trees explained simply: how tree-based models make decisions

Linear models work great for simple relationships but struggle when patterns get complex. What if the relationship between your features and target isn’t a straight line? What if different rules apply to different groups in your data? That’s where decision trees shine.

Understanding tree-based models gives you access to some of the most powerful and interpretable machine learning algorithms available. Trees handle non-linear relationships naturally, work with mixed data types, and show you exactly how they make decisions. No black box mystery.

Explained simply, decision trees are algorithms that make predictions by asking a series of yes/no questions about your data. Each question splits the data into groups, and this process continues until the tree reaches a final prediction. Think of it like playing 20 questions to guess what something is.

How decision trees actually work

Imagine you’re trying to predict whether someone will buy a product. You have data about their age, income, and whether they clicked on ads. A decision tree learns which questions to ask and in what order.

The tree might start by asking: is income greater than 50,000? If yes, go down one branch. If no, go down another branch. Each branch then asks another question, splitting the data further.

Eventually you reach a leaf node that makes the final prediction. Someone with income over 50,000 who clicked ads and is under 40 gets predicted as likely to buy. The path from the root to that leaf shows exactly why the tree made that prediction.
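
As a minimal sketch of that scenario, here is a tiny tree trained on made-up age, income (in thousands), and ad-click data; the numbers, labels, and feature names are purely illustrative, and export_text prints the learned rules as nested questions.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: [age, income in thousands, clicked_ad], label 1 = bought
X_buyers = np.array([
    [25, 60, 1], [35, 80, 1], [45, 30, 0], [52, 55, 0],
    [30, 70, 1], [61, 90, 0], [22, 40, 1], [38, 65, 1],
])
y_buyers = np.array([1, 1, 0, 0, 1, 0, 0, 1])

buyer_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
buyer_tree.fit(X_buyers, y_buyers)

# Print the learned questions as nested if/else rules
print(export_text(buyer_tree, feature_names=["age", "income_k", "clicked_ad"]))

# Follow the path for a new customer: 28 years old, 75k income, clicked ads
print(buyer_tree.predict([[28, 75, 1]]))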

The beauty of decision trees is their interpretability. You can literally draw out the tree and see the decision rules. Compare this to neural networks, where millions or billions of parameters interact in ways that are impossible to visualize.

Trees work for both classification and regression. Classification trees predict categories like yes/no or spam/not spam. Regression trees predict continuous values like price or temperature. The structure is the same but the final predictions differ.

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

# Create sample data
X, y = make_classification(
    n_samples=200, 
    n_features=4, 
    n_informative=3,
    n_redundant=1,
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train decision tree
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X_train, y_train)

# Visualize the tree
plt.figure(figsize=(15, 10))
plot_tree(tree, filled=True, feature_names=[f'feature_{i}' for i in range(4)])
plt.savefig('decision_tree.png')

print(f"Training accuracy: {tree.score(X_train, y_train):.4f}")
print(f"Test accuracy: {tree.score(X_test, y_test):.4f}")

How trees choose where to split

The critical question is how the tree decides which feature to split on and at what value. Random splits would create useless trees. The tree needs a systematic way to find the best splits.

Decision trees use metrics that measure how mixed or pure the groups are after a split. A pure group has all examples from the same class. A mixed group has examples from multiple classes. Good splits create purer groups.

Gini impurity is one common metric. It measures the probability of incorrectly classifying a random element if you randomly assigned it a class based on the group’s class distribution.

A group with all examples from one class has Gini impurity of 0, perfectly pure. A group with equal examples from two classes has Gini impurity of 0.5, maximally mixed. The tree tries splits that minimize weighted average Gini impurity across both resulting groups.

Entropy is another metric that measures disorder or randomness. Pure groups have entropy of 0. Maximally mixed groups have higher entropy. Information gain measures how much entropy decreases after a split. The tree chooses splits that maximize information gain.

import numpy as np

def gini_impurity(y):
    """Calculate Gini impurity for a group"""
    _, counts = np.unique(y, return_counts=True)
    probabilities = counts / len(y)
    return 1 - np.sum(probabilities ** 2)

def entropy(y):
    """Calculate entropy for a group"""
    _, counts = np.unique(y, return_counts=True)
    probabilities = counts / len(y)
    # counts from np.unique are never zero, so log2 is safe here
    return -np.sum(probabilities * np.log2(probabilities))

# Example groups
pure_group = np.array([1, 1, 1, 1, 1])
mixed_group = np.array([0, 0, 1, 1, 1])

print(f"Pure group Gini: {gini_impurity(pure_group):.4f}")
print(f"Mixed group Gini: {gini_impurity(mixed_group):.4f}")
print(f"Pure group entropy: {entropy(pure_group):.4f}")
print(f"Mixed group entropy: {entropy(mixed_group):.4f}")

In practice, Gini and entropy usually produce similar trees. Gini is slightly faster to compute. Entropy might create slightly more balanced trees. Most practitioners stick with Gini as the default.
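
If you want to check this on your own data, the criterion parameter switches between the two. A quick comparison reusing the training and test splits from earlier:

for criterion in ["gini", "entropy"]:
    clf = DecisionTreeClassifier(criterion=criterion, max_depth=3, random_state=42)
    clf.fit(X_train, y_train)
    print(f"{criterion}: test accuracy = {clf.score(X_test, y_test):.4f}")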

The tree evaluates every possible split for every feature. For a numerical feature, it considers a threshold between each pair of consecutive sorted values. For a categorical feature, it considers splits based on the categories (scikit-learn expects these to be encoded as numbers first). It picks whichever split reduces impurity most.

This process repeats recursively. After the first split, the tree has two groups. It evaluates all possible splits for each group and picks the best ones. This continues until some stopping condition is met.
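
A brute-force sketch of that search for a single numeric feature, reusing the gini_impurity helper above (feature 0 of the training data is an arbitrary choice):

def best_split(feature_values, y):
    """Try every observed value as a threshold and return the one with the lowest weighted Gini"""
    best_threshold, best_score = None, float("inf")
    for threshold in np.unique(feature_values):
        left, right = y[feature_values <= threshold], y[feature_values > threshold]
        if len(left) == 0 or len(right) == 0:
            continue
        # Weighted average impurity of the two resulting groups
        score = (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / len(y)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

threshold, score = best_split(X_train[:, 0], y_train)
print(f"Best threshold for feature 0: {threshold:.4f} (weighted Gini: {score:.4f})")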

Controlling tree growth and preventing overfitting

Without any constraints, decision trees keep splitting until every leaf is pure, often with a single training example per leaf. The tree memorizes the training data, achieving perfect or near-perfect training accuracy. But it fails miserably on new data because it has overfit.

Overfitting is the biggest problem with decision trees. Deep trees with many splits learn noise and random patterns in training data rather than genuine signals that generalize.

Maximum depth limits how many questions the tree can ask in sequence. A max depth of 3 means the tree can split at most 3 times from root to leaf. This prevents the tree from becoming too complex and specialized.

Minimum samples per split requires that a node has at least this many examples before considering a split. Setting this to 10 means nodes with fewer than 10 examples become leaves without further splitting.

Minimum samples per leaf requires that each final prediction group has at least this many examples. This prevents the tree from creating tiny leaves with just one or two examples.

Maximum leaf nodes limits the total number of final prediction groups. The tree stops growing once it reaches this many leaves even if other stopping criteria aren’t met.

# Tree with no constraints (overfitting)
overfit_tree = DecisionTreeClassifier(random_state=42)
overfit_tree.fit(X_train, y_train)

# Tree with proper constraints
constrained_tree = DecisionTreeClassifier(
    max_depth=5,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=42
)
constrained_tree.fit(X_train, y_train)

print("Overfitting tree:")
print(f"  Training accuracy: {overfit_tree.score(X_train, y_train):.4f}")
print(f"  Test accuracy: {overfit_tree.score(X_test, y_test):.4f}")

print("\nConstrained tree:")
print(f"  Training accuracy: {constrained_tree.score(X_train, y_train):.4f}")
print(f"  Test accuracy: {constrained_tree.score(X_test, y_test):.4f}")

The overfit tree has perfect or near-perfect training accuracy but poor test accuracy. The constrained tree has lower training accuracy but better test accuracy because it learned general patterns instead of memorizing noise.

Finding the right constraints requires experimentation. Too strict and the tree underfits; too loose and it overfits. Cross-validation helps find the sweet spot where test performance peaks.
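
One simple way to run that search is cross_val_score over a range of depths (the range of 2 through 10 here is an arbitrary choice for illustration):

from sklearn.model_selection import cross_val_score

for depth in range(2, 11):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    print(f"max_depth={depth}: mean CV accuracy = {scores.mean():.4f}")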

Pruning is another approach to control overfitting. Grow a large tree first, then remove branches that don’t improve performance on validation data. This can work better than setting constraints upfront but requires more computation.
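
scikit-learn implements this idea as cost-complexity pruning. A minimal sketch: compute candidate pruning strengths with cost_complexity_pruning_path, then refit with each ccp_alpha. For simplicity this picks the alpha with the best test accuracy; a proper setup would use a separate validation set or cross-validation.

# Candidate pruning strengths from the fully grown tree
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_acc = 0.0, 0.0
for alpha in path.ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative values from floating point
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    pruned.fit(X_train, y_train)
    acc = pruned.score(X_test, y_test)
    if acc > best_acc:
        best_alpha, best_acc = alpha, acc

print(f"Best ccp_alpha: {best_alpha:.4f}, test accuracy: {best_acc:.4f}")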

Advantages and disadvantages of decision trees

Decision trees have compelling strengths that make them popular despite their limitations. Understanding both helps you know when to use them.

The biggest advantage is interpretability. You can visualize the tree and understand exactly why it made each prediction. Business stakeholders can follow the logic without machine learning expertise. This matters for applications needing explainability like loan approvals or medical diagnoses.
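
As a sketch of what following the logic looks like in code, you can trace the exact path a single example takes using decision_path and the fitted tree's internal arrays (this reuses the tree trained on the synthetic classification data earlier):

# Trace the questions asked for one test example
sample = X_test[:1]
node_indicator = tree.decision_path(sample)
feature, threshold = tree.tree_.feature, tree.tree_.threshold

for node_id in node_indicator.indices:
    if tree.tree_.children_left[node_id] == -1:
        continue  # leaf node: no question asked, just a prediction
    direction = "<=" if sample[0, feature[node_id]] <= threshold[node_id] else ">"
    print(f"node {node_id}: feature_{feature[node_id]} {direction} {threshold[node_id]:.3f}")

print(f"Prediction: {tree.predict(sample)[0]}")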

Trees handle mixed data types naturally. Numerical and categorical features can work together without elaborate preprocessing: you don't need to scale features, and many tree implementations split on categories directly (scikit-learn still expects categorical features to be encoded as numbers).

Trees capture non-linear relationships and interactions automatically. If the relationship between features and target is complex, trees adapt by creating appropriate splits. Linear models struggle with these patterns.

Trees require minimal data preparation. Many implementations handle missing values directly, and recent versions of scikit-learn support this for decision trees. Outliers don't distort the model because trees split on thresholds. You can often train trees on raw data with little preprocessing.

The disadvantages are equally important. Single decision trees overfit easily, especially on noisy data. They’re unstable, meaning small changes in training data can produce very different trees. They struggle with linear relationships that linear models capture easily.

Trees create discontinuous predictions with sharp boundaries. A tiny change in input can flip the prediction if it crosses a split threshold. This can be undesirable for some applications.

Trees are biased toward features with more unique values. They tend to split on these features more often even if they’re not actually more informative. Categorical features with many categories can dominate the tree.

Real world applications of decision trees

Despite their limitations, decision trees solve many real problems effectively. Understanding where they excel helps you apply them appropriately.

Credit scoring uses trees to decide loan approvals. The tree learns which combinations of income, credit history, and other factors predict default risk. Banks can explain rejections by showing the decision path.

Medical diagnosis systems use trees to classify diseases from symptoms and test results. Doctors can verify that the tree’s reasoning matches medical knowledge rather than trusting an opaque black box.

Customer segmentation uses trees to group customers based on behavior and demographics. Marketing teams can understand what defines each segment and target them specifically.

Fraud detection uses trees to flag suspicious transactions. The transparent rules let investigators understand why transactions triggered alerts rather than blindly trusting algorithm outputs.

As standalone models, decision trees work well for smaller datasets and when interpretability matters most. For maximum predictive performance, trees are typically combined into ensembles like random forests or gradient boosting.

Decision trees, explained simply, make predictions by asking a series of questions about your data. The intuitive structure, natural handling of complex patterns, and complete transparency make them valuable tools. Their tendency to overfit means they often work best as building blocks for ensemble methods. Ready to see how combining multiple trees creates dramatically better models? Check out our guide on random forest vs gradient boosting to learn how ensemble methods overcome the weaknesses of individual trees.