The next part is again divided into two. First I give a brief introduction to decision trees and show what a decision tree looks like; the last section explains how to train a decision tree.
The second part will cover overfitting a decision tree and how to control the size of the tree.
Decision Trees – Part 1/2
This time we want to use the ready-to-use data set from the last article to predict whether customers will default, and we want to use decision trees for that.
Introduction to Decision Trees
Decision trees are powerful tools in the field of machine learning and data analysis. They are a versatile and interpretable way to make decisions and predictions based on a set of input features. Imagine a tree-like structure where each internal node represents a feature or attribute, each branch signifies a decision or outcome, and each leaf node provides a final prediction or classification.
Decision trees are widely used for tasks such as classification and regression. They are known for their simplicity and ease of interpretation, making them a valuable resource for understanding and solving complex problems. In this blog post, we’ll delve into the world of decision trees, exploring how they work, and how to build them.
What a decision tree looks like
A decision tree is a data structure in which each node holds a condition. From a node, one branch goes to the left (condition = false) and one to the right (condition = true), leading either to the next condition or, eventually, to a leaf with the final decision: 'OK' or 'DEFAULT'.
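Sketched in plain text, the rule set we are going to implement below looks roughly like this:

records = yes?
├─ no  → assets > 6000?
│        ├─ no  → DEFAULT
│        └─ yes → OK
└─ yes → job = part-time?
         ├─ no  → OK
         └─ yes → DEFAULT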

Let’s write a function that implements this rule set, and test it with one sample client record.
def assess_risk(client):
    # note: the data set spells the part-time job category 'partime'
    if client['records'] == 'yes':
        if client['job'] == 'partime':
            return 'default'
        else:
            return 'ok'
    else:
        if client['assets'] > 6000:
            return 'ok'
        else:
            return 'default'
# just take one record to test
xi = df_train.iloc[0].to_dict()
xi
# Output:
# {'seniority': 10,
# 'home': 'owner',
# 'time': 36,
# 'age': 36,
# 'marital': 'married',
# 'records': 'no',
# 'job': 'freelance',
# 'expenses': 75,
# 'income': 0.0,
# 'assets': 10000.0,
# 'debt': 0.0,
# 'amount': 1000,
# 'price': 1400}
When we look at the decision tree, what would the result be for this client? The first condition is “RECORDS = YES.” Our client has no records, so we go to the left. The second condition is “ASSETS > 6000.” Our client’s assets are 10,000, so we go to the right. Now we reach a leaf node, which in this case is “OK.” Let’s use the function implemented in the previous snippet and compare the output.
assess_risk(xi)
# Output: 'ok'
Indeed, the output is also “ok”. Of course, we don’t want to implement the whole rule set by hand every time; these rules can be learned from the data using the decision tree algorithm.
Training a decision tree
Before we can train a decision tree, we first need to import the necessary packages. From Scikit-Learn, we import DecisionTreeClassifier. Because we have categorical variables, we also need DictVectorizer, as seen before, and roc_auc_score to evaluate the model later.
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import roc_auc_score
Now we need to turn our training dataframe into a list of dictionaries and then turn this list of dictionaries into the feature matrix. After that we can train the model.
# This would lead to an error later, because the data still contains missing values:
# train_dicts = df_train.to_dict(orient='records')
# ... so we fill the missing values with 0 first:
train_dicts = df_train.fillna(0).to_dict(orient='records')
train_dicts[:5]
# Output:
#[{'seniority': 10,
# 'home': 'owner',
# 'time': 36,
# 'age': 36,
# 'marital': 'married',
# 'records': 'no',
# 'job': 'freelance',
# 'expenses': 75,
# 'income': 0.0,
# 'assets': 10000.0,
# 'debt': 0.0,
# 'amount': 1000,
# 'price': 1400},
# {'seniority': 6,
# 'home': 'parents',
# 'time': 48,
# 'age': 32,
# 'marital': 'single',
# 'records': 'yes',
# 'job': 'fixed',
# 'expenses': 35,
# 'income': 85.0,
# 'assets': 0.0,
# 'debt': 0.0,
# 'amount': 1100,
# 'price': 1330},
# {'seniority': 1,
# 'home': 'parents',
# 'time': 48,
# 'age': 40,
# 'marital': 'married',
# 'records': 'no',
# 'job': 'fixed',
# 'expenses': 75,
# 'income': 121.0,
# 'assets': 0.0,
# 'debt': 0.0,
# 'amount': 1320,
# 'price': 1600},
# {'seniority': 1,
# 'home': 'parents',
# 'time': 48,
# 'age': 23,
# 'marital': 'single',
# 'records': 'no',
# 'job': 'partime',
# 'expenses': 35,
# 'income': 72.0,
# 'assets': 0.0,
# 'debt': 0.0,
# 'amount': 1078,
# 'price': 1079},
# {'seniority': 5,
# 'home': 'owner',
# 'time': 36,
# 'age': 46,
# 'marital': 'married',
# 'records': 'no',
# 'job': 'freelance',
# 'expenses': 60,
# 'income': 100.0,
# 'assets': 4000.0,
# 'debt': 0.0,
# 'amount': 1100,
# 'price': 1897}]
dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(train_dicts)
X_train
# Output:
# array([[3.60e+01, 1.00e+03, 1.00e+04, ..., 0.00e+00, 1.00e+01, 3.60e+01],
# [3.20e+01, 1.10e+03, 0.00e+00, ..., 1.00e+00, 6.00e+00, 4.80e+01],
# [4.00e+01, 1.32e+03, 0.00e+00, ..., 0.00e+00, 1.00e+00, 4.80e+01],
# ...,
# [1.90e+01, 4.00e+02, 0.00e+00, ..., 0.00e+00, 1.00e+00, 2.40e+01],
# [4.30e+01, 2.50e+03, 1.80e+04, ..., 0.00e+00, 1.50e+01, 4.80e+01],
# [2.70e+01, 4.50e+02, 5.00e+03, ..., 1.00e+00, 1.20e+01, 4.80e+01]])
All the numerical features remain unchanged, while the categorical features are one-hot encoded.
dv.get_feature_names_out()
# Output:
# array(['age', 'amount', 'assets', 'debt', 'expenses', 'home=ignore',
# 'home=other', 'home=owner', 'home=parents', 'home=private',
# 'home=rent', 'home=unk', 'income', 'job=fixed', 'job=freelance',
# 'job=others', 'job=partime', 'job=unk', 'marital=divorced',
# 'marital=married', 'marital=separated', 'marital=single',
# 'marital=unk', 'marital=widow', 'price', 'records=no',
# 'records=yes', 'seniority', 'time'], dtype=object)
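With the feature matrix in place, training the classifier takes only a few lines. Here is a minimal sketch, assuming y_train holds the binary default labels (default = 1) prepared in the last article:

# minimal sketch -- y_train is assumed to be the binary 'default' target from the last article
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

# probability of class 1 ('default') for each training record
y_pred = dt.predict_proba(X_train)[:, 1]
roc_auc_score(y_train, y_pred)

Evaluating on the training data only tells us how well the tree fits the data it has already seen; how this leads to overfitting, and how to control the size of the tree, is the topic of the second part.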