Decision Trees
A decision tree is a data structure used in machine learning for both regression and classification. As the name suggests, a decision tree is based on the binary tree from computer science: each internal node is a decision point with two child nodes, a left child and a right child. Unlike biological trees, computer scientists imagine trees growing downward, with the root at the top and the leaves toward the bottom.

A decision tree works by taking a single data point and passing it down from the root of the tree to a leaf node. At each internal node, a decision is made to pass the data point to either the left child or the right child by comparing one of its feature values against a threshold value stored at that node. Once the data point arrives at a leaf node, it is assigned the label associated with that leaf.
Decision trees are fast, interpretable, robust for serving, and support numerical and categorical features with equal ease. They are used directly or within ensemble models such as random forests. Decision trees are widely applied in credit scoring, fraud detection, and customer segmentation.
In the above explanation, we have skipped the learning process, such as how the depth of the decision tree is determined, the choice of features, and the thresholds at each node. These are important considerations when learning (also called growing) a decision tree from labeled data.
By the end of this lesson, you'll have a better idea of how to answer the following commonly asked questions:
- Can we use batches of data to train a decision tree?
- How do we train a decision tree model if the entire dataset cannot be loaded into memory during the training process?
- How do we update an existing Decision Tree model with new incremental data?
We'll provide the answers at the end of this lesson.
Overview
Algorithm
Let $f_1, f_2, \ldots, f_m$ be the feature names and $c$ (for class) be the label name. Each data point is a row that contains a label value $y$ and feature values $x_1, x_2, \ldots, x_m$. We shall use a generic score function $S$ in the description below to capture the fact that the criteria for growing trees differ yet are closely related concepts for regression trees and classification trees.
For classification decision trees, we use entropy reduction (or Gini impurity gain), while for regression decision trees, we use the reduction in variance from a node to its child nodes as the score function to decide the split.
The training process is as follows:
1. Initialize the hyperparameters for the decision tree: max_depth and min_samples. max_depth is the maximum depth of the decision tree. min_samples is the minimum number of training data samples that must occupy a decision tree node for the node to be eligible for further splitting; otherwise the node remains a leaf node and no further child nodes are created during training.
2. For each feature, sort the data points by their feature values.
3. For each of these features, pick threshold values and split the data points under consideration at the current node into a left child and a right child.
4. Compute the score function reduction between the current node and its left and right children.
5. Assign the feature and threshold value that achieve the greatest reduction in the score function to the node as its decision feature and decision threshold (a code sketch of this search follows the list).
6. Repeat steps 2 to 5 for the left child node and the right child node, until the node under consideration has fewer than min_samples data points or max_depth has been reached.
7. For the leaf nodes, assign class labels or prediction values. In classification, the label associated with a leaf node is the majority class label among the training data points that occupy it. For regression, the prediction value is the average of the target values among all the training data points that occupy the leaf node.
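To make steps 2 to 5 concrete, here is a minimal sketch of an exhaustive threshold search for a single numerical feature, using Gini impurity as the score function. The helper names (gini, best_split_for_feature), the midpoint choice of candidate thresholds, and the example values are purely illustrative.
Python
from collections import Counter

def gini(labels):
    # Gini impurity of a list of class labels
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split_for_feature(values, labels):
    # exhaustively search thresholds for one numerical feature;
    # returns (best_threshold, best_impurity_reduction)
    order = sorted(range(len(values)), key=lambda i: values[i])
    parent = gini(labels)
    best_threshold, best_gain = None, 0.0
    for k in range(1, len(order)):
        left, right = order[:k], order[k:]
        if values[left[-1]] == values[right[0]]:
            continue  # identical values cannot be separated by a threshold
        threshold = (values[left[-1]] + values[right[0]]) / 2
        weighted = (len(left) * gini([labels[i] for i in left])
                    + len(right) * gini([labels[i] for i in right])) / len(order)
        gain = parent - weighted
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain

# hypothetical feature values and binary labels
print(best_split_for_feature([2.5, 1.0, 3.0, 4.5], [0, 0, 1, 1]))  # (2.75, 0.5)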
Computing feature importances in a decision tree involves calculating the weighted reduction in impurity (entropy or Gini) attributable to a feature across all nodes where that feature was used for splitting the data. The weighting is typically based on the proportion of data points arriving at each node. Normalizing these per-feature reductions so that they sum to one yields feature importance values that can be expressed as percentages.
Example of computing feature importance. Consider a small dataset with two features, "Hours studied" and "Previous GPA", and a binary pass/fail label, containing 2 negative and 3 positive examples. A decision tree learned on this data will split on the number of hours studied < 3, which separates the two classes perfectly.

Gini impurity computation. At the root there are 2 negative labels and 3 positive labels, so the impurity is 1 - (2/5)^2 - (3/5)^2 = 0.48. At the left child: 1 - (2/2)^2 - (0/2)^2 = 0. At the right child: 1 - (0/3)^2 - (3/3)^2 = 0. The reduction in impurity is the parent impurity minus the weighted child impurities = 0.48 - (2/5)(0) - (3/5)(0) = 0.48, so the total impurity decrease is 0.48. The impurity decrease due to the feature "Hours studied" is 0.48, and that due to "Previous GPA" is 0. Therefore the feature importance of "Hours studied" is 0.48/0.48 = 100%, and that of "Previous GPA" is 0%.
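The arithmetic above can be checked in a few lines of Python; the helper gini below takes raw class counts and is only for illustration.
Python
def gini(neg, pos):
    # Gini impurity of a node with `neg` negative and `pos` positive examples
    n = neg + pos
    return 1.0 - (neg / n) ** 2 - (pos / n) ** 2

root = gini(2, 3)                     # 0.48
left, right = gini(2, 0), gini(0, 3)  # 0.0, 0.0
decrease_hours = root - (2/5) * left - (3/5) * right  # 0.48
decrease_gpa = 0.0                    # "Previous GPA" is never used to split
total = decrease_hours + decrease_gpa
print(decrease_hours / total)         # 1.0, i.e. 100% importance for "Hours studied"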
In the context of decision trees, “serving” refers to the process where the trained decision tree model is used to make predictions or classifications on new, unseen data. During serving:
- New data inputs are provided to the decision tree model.
- The input traverses from the root of the decision tree down through internal nodes based on feature values, following the decision rules learned during training.
- Once the input reaches a leaf node, the prediction or classification associated with that leaf node is returned as the output.
Although decision trees are computationally robust during serving, even for new data points with missing or null values, it’s important to note that learned decision trees can vary significantly between training sessions due to the selection of features for data splitting when growing nodes. This instability can be a concern from a training or learning perspective. Therefore, machine learning engineers and data scientists often turn to ensembles of trees, such as random forests, as more robust models. Random forests aggregate multiple independently trained decision trees. The term “random” signifies that each tree is trained on a random subsample of the data points and that, at each node, only a random subset of the features is considered when choosing the split.
During the training of a decision tree, no batching is performed, unless one intends to build an ensemble model like a random forest, which comprises multiple trees. All training data must be utilized during decision tree construction, which can impose size limitations on the tree or training data.
There are several decision tree algorithms, with CART and ID3 being popular ones. CART uses Gini impurity for classification, chosen largely for computational efficiency, while ID3 and its successor C4.5 use entropy-based information gain. Some decision tree variants allow multiple child nodes per split and are not limited to binary decisions. Others allow decision rules based on linear combinations of features, enabling inclined (oblique) decision boundaries rather than strictly axis-aligned ones.
Computing entropy at node $D$:
$$H(D) = -\sum_{i=1}^{k} p_i \log_2 p_i$$
Where
- $k$ is the number of classes
- $p_i$ is the proportion of data points at node $D$ that belong to class $i$.
Suppose there are two classes, Red and Blue, with 2 instances of Red and 3 instances of Blue at the current node. The entropy calculation for this node would be:
$$H = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} \approx 0.971$$
This value indicates relatively high entropy because the node contains a mixed distribution of Red and Blue instances. Entropy is highest when the data is evenly split among all classes, reflecting maximum uncertainty in classification.
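The same number can be reproduced in a couple of lines of Python, using the Red/Blue counts from the example above:
Python
from math import log2

counts = {"Red": 2, "Blue": 3}
total = sum(counts.values())
entropy = -sum((c / total) * log2(c / total) for c in counts.values())
print(round(entropy, 3))  # 0.971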

If entropy can be used to split data, why do many people prefer to use the Gini index in practice?
The Gini index is preferred over entropy in decision tree algorithms for two main reasons: computational efficiency and robustness to changes in class probabilities.
$$\text{Gini}(D) = 1 - \sum_{i=1}^{k} p_i^2$$
Where $k$ represents the number of classes, and $p_i$ is the proportion of cases belonging to class $i$ at node $D$.
Unlike entropy, which involves logarithmic calculations, the Gini index only requires straightforward arithmetic operations. This computational simplicity makes it faster to compute, especially beneficial for large datasets or applications requiring real-time decision making.

If we plot entropy and the Gini index against the class probability for a two-class problem, both curves rise from zero for a pure node to a maximum at an even 50/50 split and fall back to zero again. The Gini curve is flatter than the entropy curve (its maximum is 0.5, versus 1.0 for entropy), so small fluctuations in class probabilities produce smaller changes in the Gini index value. This relative stability makes splits chosen with the Gini index a little less sensitive to noise in the class proportions, and, combined with its lower computational cost, is why the Gini index is the default splitting criterion in many practical implementations of decision tree algorithms.
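A quick way to see the difference in shape is to evaluate both measures across class probabilities for a two-class problem; the probability grid below is arbitrary and no plotting library is assumed.
Python
from math import log2

def entropy(p):
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def gini(p):
    return 1.0 - p ** 2 - (1 - p) ** 2

for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(f"p={p:.1f}  entropy={entropy(p):.3f}  gini={gini(p):.3f}")
# Both peak at p = 0.5, where entropy reaches 1.0 and Gini reaches 0.5.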
Now, assume that a node containing 5 data points (2 of class A and 3 of class B) is split so that the left child receives 3 points (1 of class A and 2 of class B) and the right child receives 2 points (1 of each class).

Computing entropy reduction by the data split
Compute the entropy of the parent minus the weighted entropy of the children: $H(\text{parent}) - \left(\tfrac{3}{5}H(\text{left}) + \tfrac{2}{5}H(\text{right})\right) = 0.971 - \tfrac{3}{5}(0.918) - \tfrac{2}{5}(1.0) \approx 0.02$
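The same reduction can be computed programmatically; the per-class counts below (2/3 at the parent, 1/2 in the left child, 1/1 in the right child) are the ones implied by the entropy values above.
Python
from math import log2

def entropy(counts):
    # entropy of a node given its per-class counts
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

parent, left, right = [2, 3], [1, 2], [1, 1]
n = sum(parent)
gain = (entropy(parent)
        - (sum(left) / n) * entropy(left)
        - (sum(right) / n) * entropy(right))
print(round(gain, 4))  # ~0.02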
Computing variance of data at a node
- Compute the mean $\bar{y} = \frac{1}{r}\sum_{i=1}^{r} y_i$ for the $r$ data points that arrive at the node, where $y_i$ is the regression label for the $i$-th data point that arrives at the node.
- Compute the variance at the node, $\text{Var}(D) = \frac{1}{r}\sum_{i=1}^{r} (y_i - \bar{y})^2$.
Computing variance reduction
- Compute the variance at the parent node and at each of its child nodes.
- Weight the variances of the child nodes by the proportion of data points arriving at each node.
- The variance reduction is $\text{Var}(\text{parent}) - \left(\frac{n_L}{n}\text{Var}(\text{left}) + \frac{n_R}{n}\text{Var}(\text{right})\right)$, where $n_L$ and $n_R$ are the numbers of data points reaching the left and right children and $n = n_L + n_R$; a small numeric check follows.
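Here is a small numeric check of variance reduction with made-up regression targets (the values are purely illustrative):
Python
def variance(ys):
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

parent = [1.0, 2.0, 8.0, 9.0]
left, right = [1.0, 2.0], [8.0, 9.0]   # a split separating the low and high targets
n = len(parent)
reduction = (variance(parent)
             - (len(left) / n) * variance(left)
             - (len(right) / n) * variance(right))
print(variance(parent), reduction)  # 12.5 12.25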
Pseudocode
From the outlined algorithm, we can construct pseudocode for a decision tree as follows:
Python
from math import log2

class DecisionTreeNode:
    def __init__(self, indices, feature, threshold):
        self.indices = indices[:]     # indices of the training points that reach this node
        self.feature_ix = feature     # decision feature (unused if the node stays a leaf)
        self.threshold = threshold    # decision threshold (unused if the node stays a leaf)
        self.left = None
        self.right = None
        self.label = None             # majority class (classification) or mean target (regression)

    @property
    def is_leaf(self):
        return self.left is None and self.right is None

# stitch together a tree during training
def train(X, y, clf_or_reg, max_depth=10, min_samples=5):
    # inputs: features X, labels y, task type ('classifier' or 'regressor'),
    # and the hyperparameters max_depth and min_samples

    # multi-index: for each feature, the data indices sorted by that feature's value
    multi_indices = {...}

    # level-wise (breadth-first) traversal to create nodes and wire them together
    depth, root = 0, None
    queue = [(list(range(len(X))), None, 'L')]   # the root has no parent
    while queue:
        # deep enough already
        if depth == max_depth:
            break
        for _ in range(len(queue)):
            indices, parent, left_right = queue.pop(0)
            current, lxs, rxs = compute_node_lx_rx(indices)
            if parent is not None:
                if left_right == 'L':
                    parent.left = current
                else:
                    parent.right = current
            elif root is None:
                root = current
            # too few samples: keep this node as a leaf and do not split it further
            if len(indices) <= min_samples:
                continue
            queue.append((lxs, current, 'L'))
            queue.append((rxs, current, 'R'))
        depth += 1
    # finally, assign the leaf labels: the majority class for classification,
    # or the mean of the target values for regression (omitted in this sketch)
    return root

def compute_node_lx_rx(indices):
    # use the multi-index to pick the feature and threshold that yield the
    # best reduction in the score function (score_fn below);
    # left_ixs are the data indices that end up in the left child,
    # right_ixs are the data indices that end up in the right child
    ...
    return DecisionTreeNode(indices, feature, threshold), left_ixs, right_ixs

def score_fn(clf_or_reg, indices):
    # entropy if classification
    if clf_or_reg == 'classifier':
        return entropy(indices)
    # variance if regression
    return variance(indices)

def variance(indices):
    # y refers to the training labels (treated as module-level in this sketch)
    mean = sum(y[i] for i in indices) / len(indices)
    return sum((y[i] - mean) ** 2 for i in indices) / len(indices)

def weighted_variance(lxs, rxs):
    L, R = len(lxs), len(rxs)
    return L / (L + R) * variance(lxs) + R / (L + R) * variance(rxs)

def entropy(indices):
    # binary classification assumed in this sketch
    class1 = sum(y[i] == 1 for i in indices)
    p1 = class1 / len(indices)
    if p1 in (0.0, 1.0):              # a pure node has zero entropy
        return 0.0
    return -p1 * log2(p1) - (1 - p1) * log2(1 - p1)

# evaluate for a single data point
def predict_single(x, root):
    node = root
    while not node.is_leaf:
        if x[node.feature_ix] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.label

# evaluate for a batch of data
def predict(X, model):
    return [predict_single(x, model) for x in X]
Evaluation
Decision trees can be used for both classification and regression tasks. Check out our lesson on model evaluation to learn more about how to evaluate the performance of this model.
Limitations
- Brittleness: Decision trees are susceptible to changes in training data, which can lead to learning significantly different trees for similar datasets. This inherent brittleness is addressed by using ensemble methods such as random forests or gradient boosted decision trees, which combine multiple trees to improve robustness and predictive performance.
- Computational Complexity: Constructing a globally optimal decision tree is an NP-complete problem, so exhaustive search over all possible trees is computationally infeasible. To mitigate this, greedy heuristics are embedded in the algorithms: each split is chosen locally, using the best feature and threshold at that node. Random forests reduce the cost further by randomly selecting subsets of features and samples when determining thresholds within each tree, reducing computational demand.
- Overfitting: Decision trees have a tendency to overfit to the training data, capturing noise and specifics of the training set that may not generalize well to new data. Techniques like tree pruning (e.g., limiting the maximum depth of trees, setting minimum samples per leaf) are commonly used to prevent overfitting. These strategies help simplify decision trees and improve their ability to generalize to unseen data.
Employing decision trees
Here are a few prevalent methods and libraries for utilizing decision trees in code:
Using scikit-learn (Python) for Classification:
Python
from sklearn.tree import DecisionTreeClassifier

# create a new classifier
clf = DecisionTreeClassifier()

# labeled dataset
X = [[0, 0], [1, 1]]
y = [0, 1]

# learn
clf = clf.fit(X, y)

# predict for new data
y_pred = clf.predict([[-1, 0]])

# plot metrics, AUC-ROC curve etc.
Using scikit-learn (Python) for Regression:
Python
from sklearn.tree import DecisionTreeRegressor

# model
reg = DecisionTreeRegressor()

# dataset
X = [[0, 0], [2, 2]]
y = [0.5, 2.5]

# learn
reg = reg.fit(X, y)

# predict
y_pred = reg.predict([[1, 1]])

# plot eval / test metrics, MSE etc.
Note that, due to the uniform API contract supported by sklearn, the code for classification and regression is essentially identical.
Common questions
Q: Can we use batches of data to train a decision tree?
A: Training a vanilla decision tree requires the entire dataset, or its multi-indices, to be loaded into memory. This ensures effective selection of feature thresholds at each node. Therefore, using batches of training data is not practical for training a single decision tree. However, ensemble methods allow us to train multiple decision trees, each on a subset of the training data. During prediction or serving, the decisions from these individual trees are typically averaged out to provide the final prediction. This approach not only leverages batch-like training for efficiency but also enhances the model’s robustness and predictive accuracy through ensemble averaging.
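As an illustration, scikit-learn's random forest can be configured so that each tree is trained on a bootstrap sample containing only a fraction of the data; the dataset and parameter values below are arbitrary.
Python
from sklearn.ensemble import RandomForestClassifier

# each of the 50 trees is trained on a bootstrap sample containing
# roughly half of the training data
forest = RandomForestClassifier(n_estimators=50, max_samples=0.5, bootstrap=True)
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = [0, 1, 1, 0]
forest.fit(X, y)
print(forest.predict([[0.9, 0.9]]))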
Q: How do we train a decision tree model if the entire dataset cannot be loaded into memory during the training process?
A: In such cases, we employ ensemble methods where multiple decision trees are trained, each using a subset of the dataset. These individual trees collectively form an ensemble that is used for making predictions. Ensemble methods provide a robust approach to prediction, leveraging the diversity of multiple models to improve accuracy and generalization.
This ensemble approach also enhances efficiency during serving, as each tree can independently and concurrently compute predictions. The final prediction is typically obtained through aggregation, which may involve averaging the predictions or using majority voting, potentially with weighting to emphasize more accurate models within the ensemble.
There are other approaches for handling incremental updates or online learning with decision trees. One such method is the Hoeffding tree, which uses the Hoeffding bound to decide, with a specified confidence, when enough samples have been observed at a node to commit to a split. These trees can grow incrementally as new data arrives.
Additionally, feature preprocessing techniques such as histograms of features and feature binning can reduce data size, making it more manageable to load into memory. Vertical partitioning of data, where data is processed one feature at a time, also helps in reducing memory consumption, allowing a decision tree to be learned progressively.
It is noteworthy that LightGBM (a fast gradient-boosted decision tree) employs feature binning and histograms of features, which contribute to its ability to handle large datasets efficiently and support rapid learning and inference.
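As a related sketch, scikit-learn ships a histogram-based gradient boosting implementation inspired by LightGBM, where the number of feature bins can be capped; the dataset and parameter values below are illustrative only.
Python
from sklearn.ensemble import HistGradientBoostingClassifier

# features are bucketed into at most 63 bins before any splits are searched,
# which bounds memory use regardless of how many distinct values a feature has
model = HistGradientBoostingClassifier(max_bins=63, max_iter=50)
X = [[0.1, 1.0], [0.4, 0.8], [0.35, 0.1], [0.8, 0.3], [0.9, 0.9], [0.2, 0.2]]
y = [0, 0, 1, 1, 1, 0]
model.fit(X, y)
print(model.predict([[0.5, 0.5]]))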
Q: How do we update an existing decision tree model with new incremental data (akin to online training)? (This question is aimed at mid to senior candidates.)
A: Performing incremental updates or online learning directly with a single decision tree is not feasible. In practical scenarios, engineers typically utilize ensembles of decision trees. Each decision tree in the ensemble is weighted based on the number of examples it was trained on.
For updating with new incremental data, the approach involves mini-batching the new data points to train a new decision tree. This new tree’s weight in the ensemble is determined by factors such as the number of samples it has processed and the recency of the data. The newly trained decision tree is then incorporated into the existing ensemble, possibly replacing a previous decision tree to maintain ensemble size and efficiency.
This method enables the adaptation of decision tree models to evolving data streams, facilitating scenarios resembling online training while ensuring the ensemble remains effective and up-to-date.
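A minimal sketch of this idea follows. The class name IncrementalForest, the sample-count weighting, and the oldest-tree replacement policy are hypothetical choices for illustration, not a standard algorithm or library API.
Python
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

class IncrementalForest:
    # keep at most `max_trees` trees; each new mini-batch trains one new tree
    def __init__(self, max_trees=10):
        self.max_trees = max_trees
        self.trees = []    # list of (tree, weight) pairs

    def partial_fit(self, X_batch, y_batch):
        tree = DecisionTreeClassifier().fit(X_batch, y_batch)
        weight = len(X_batch)              # weight by the number of samples seen
        self.trees.append((tree, weight))
        if len(self.trees) > self.max_trees:
            self.trees.pop(0)              # drop the oldest tree
        return self

    def predict_one(self, x):
        votes = Counter()
        for tree, weight in self.trees:
            votes[tree.predict([x])[0]] += weight
        return votes.most_common(1)[0][0]  # weighted majority vote

# usage: feed mini-batches as they arrive
forest = IncrementalForest(max_trees=5)
forest.partial_fit([[0, 0], [1, 1]], [0, 1])
forest.partial_fit([[0, 1], [1, 0]], [0, 1])
print(forest.predict_one([0.9, 0.8]))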