Describe How Decision Trees are Created
Full prompt: Describe how decision trees are created.
First you need to define the features and the target of interest to decide what kind of model to run.
Then you want to think about the relevant evaluation metrics for this type of model based on real-life outcomes to the business.
Finally, based on those real-life outcomes, weigh in the trade-offs in trying to maximize or minimize certain metrics.
How does the model work? Is it providing a continuous or categorical output?
Do your evaluation metrics tie into real-life business outcomes?
What trade-offs are being evaluated based on the evaluation metrics?
Decision trees are a supervised learning technique that can be used for both classification and regression-based problems. In either case, “splits” are created to improve the homogeneity of data on each side of the split.
For example, consider we want to predict if a customer will convert on an email we send out. We send the email to 500 customers and 50 of them convert. We know how much time they spent on the email and whether they are located in the US or Canada.
The conversion rates for different splits are shown below:
By email time:
Greater than 10 minutes on the email: 30 convert, 20 no-convert
Less than or equal to 10 minutes on the email: 20 convert, 430 no-convert
By country:
US: 25 convert, 200 no-convert
Canada: 25 convert, 250 no-convert
We can calculate information gain to determine which split is better.
The initial distribution for conversion is 50/500 = 10%.
Using email time, we have 50 of the emails in a group with 30/50 = 60%, and 450 emails in a group with 20/450 = 4.44%.
Information gain by splitting on email time:
(Gini pre-split) - (weighted average of gini within each split)
= (0.1(1-0.1)) - [(50/500)(0.6(1-0.6)) + (450/500)(0.0444(1-0.0444))]
= (0.09) - (0.062) = 0.028
Information gain by splitting on country:
(Gini pre-split) - (weighted average of gini within each split)
= (0.1(1-0.1)) - [(225/500)(0.111(1-0.111)) + (275/500)(0.091(1-0.091))]
= (0.09) - (0.0899) = 0.0001
Since the first value is larger, there is more information gained by this split. This also makes sense, considering the higher homogeneity of the groups using the email time split, rather than the country split. The country appears to offer very little information about whether or not an individual will convert.
In a real-world problem, we can continue this splitting process within each of the individual splits using additional features. With a continuous feature, we could easily overfit our data by continuing to split until all the final splits are completely homogeneous, even if they only contain a single data point in the terminal node.
Additional context about decision trees:
- They are greedy because they choose the best split using the above approach without regard for if it will be better in the next split.
- They are common as a base method in some of the most popular ensemble methods, such as random forests and gradient-boosted trees (XGBoost)
- Individual trees are quite easy to interpret. You can look at each of the splits to understand which features were most important in determining the result. In the previous example, we would be able to say that an email time greater than 10 minutes indicates a much higher probability of conversion. The country doesn’t appear to have any meaningful impact on conversion.
- Trees with just a single split are called “stumps.”
What makes this answer effective
This answer shows a strong understanding of decision trees’ high-level usefulness and strong technical knowledge of how the algorithm works under the hood. The concrete example of data dives deeper into details, allowing the interviewee to directly show whether a feature is the “correct” or “truly the best” option to split on.
Other considerations
With more time or a different approach, you could provide the following:
- An example that shows how decision trees would work for regression, rather than only classification
- The efficiency of this algorithm
- The rationale behind arriving at the 10-minute cutoff vs. other amounts of time
- Additional information about how individual trees are used in ensemble algorithms