This is part 2 of Decision Trees. While part 1 briefly introduces the concept of a decision tree, this section is about overfitting a decision tree and how to control the size of a tree.
Decision Trees – Part 2/2
Let’s look back at the performance of our trained Decision Tree from part 1.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

# dv, X_train, y_train, df_val and y_val come from part 1
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
val_dicts = df_val.fillna(0).to_dict(orient='records')
X_val = dv.transform(val_dicts)
y_pred = dt.predict_proba(X_val)[:, 1]
roc_auc_score(y_val, y_pred)
# Output: 0.6547098641350415
Overfitting
An AUC of 0.65 is not a great value. Let's look at the training data and calculate the AUC score there as well.
y_pred = dt.predict_proba(X_train)[:, 1]
roc_auc_score(y_train, y_pred)
# Output: 1.0
This is called overfitting. Overfitting means that the model simply memorizes the data, but in such a way that when it sees a new example it doesn't know what to do with it. It memorizes the training data but fails to generalize. The reason this happens to decision trees is that the model creates a very specific rule for each example. That works fine on the training data, but it doesn't work for unseen examples. It can happen because we let the tree grow too deep. If we restrict the tree to only grow three levels deep, it will learn rules that are less specific.
dt = DecisionTreeClassifier(max_depth=3)
dt.fit(X_train, y_train)
y_pred = dt.predict_proba(X_train)[:, 1]
auc = roc_auc_score(y_train, y_pred)
print('train', auc)
# Output: train 0.7761016984958594
y_pred = dt.predict_proba(X_val)[:, 1]
auc = roc_auc_score(y_val, y_pred)
print('val', auc)
# Output: val 0.7389079944782155
Decision Stump
If we restrict the depth to 3, the performance on validation is significantly better: the AUC is now 0.74 compared to 0.65. By the way, a decision tree with a depth of 1 is called a decision stump. It's not really a tree, because it consists of only a single condition.
dt = DecisionTreeClassifier(max_depth=1)
dt.fit(X_train, y_train)
y_pred = dt.predict_proba(X_train)[:, 1]
auc = roc_auc_score(y_train, y_pred)
print('train', auc)
# Output: train 0.6282660131823559
y_pred = dt.predict_proba(X_val)[:, 1]
auc = roc_auc_score(y_val, y_pred)
print('val', auc)
# Output: val 0.6058644740984719
The validation AUC of this decision stump (0.61) is only a bit worse than that of the overfitted tree (0.65).
Visualizing Decision Stump
Let’s examine this tree to understand the rules it has learned. To do that, we can use a specialized function in Scikit-Learn for visualizing trees.
from sklearn.tree import export_text
print(export_text(dt))
# Output:
# |--- feature_25 <= 0.50
# | |--- class: 1
# |--- feature_25 > 0.50
# | |--- class: 0
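Besides the text dump from export_text, scikit-learn can also draw the tree graphically with plot_tree. A minimal sketch, assuming matplotlib is installed:
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# plot_tree renders the same rules as a diagram; feature_names can be passed
# once we know them (see below)
plt.figure(figsize=(6, 4))
plot_tree(dt, filled=True)
plt.show()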
To understand the meaning of ‘feature_25,’ we need to look up the feature names produced by the DictVectorizer.
# in the video Alexey uses the older scikit-learn API:
# print(export_text(dt, feature_names=dv.get_feature_names()))
names = dv.get_feature_names_out().tolist()
print(export_text(dt, feature_names=names))
# Output:
# |--- records=no <= 0.50
# | |--- class: 1
# |--- records=no > 0.50
# | |--- class: 0
This means that if the value of records is not ‘no’ (records=no <= 0.50), i.e. the client has records, the prediction is ‘DEFAULT.’ If records is ‘no’ (records=no > 0.50), i.e. the client has no records, the prediction is ‘OK.’ It’s important to note that one-hot encoding is applied here: there is a column named ‘records=no,’ which is encoded as 1 when the value is ‘no’ and 0 otherwise.
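To see how this encoding comes about, here is a small self-contained sketch of DictVectorizer on toy data (not the actual dataset):
from sklearn.feature_extraction import DictVectorizer

# each distinct string value of 'records' becomes its own 0/1 column
toy_dv = DictVectorizer(sparse=False)
print(toy_dv.fit_transform([{'records': 'no'}, {'records': 'yes'}]))
# [[1. 0.]
#  [0. 1.]]
print(toy_dv.get_feature_names_out())
# ['records=no' 'records=yes']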
Decision Tree with a Depth of 2
dt = DecisionTreeClassifier(max_depth=2)
dt.fit(X_train, y_train)
y_pred = dt.predict_proba(X_train)[:, 1]
auc = roc_auc_score(y_train, y_pred)
print('train', auc)
# Output: train 0.7054989859726213
y_pred = dt.predict_proba(X_val)[:, 1]
auc = roc_auc_score(y_val, y_pred)
print('val', auc)
# Output: val 0.6685264343319367
Even a decision tree with just two levels already performs better on validation (0.67) than the overfitted one (0.65).
Visualizing the Decision Tree
print(export_text(dt))
# Output:
# |--- feature_26 <= 0.50
# | |--- feature_16 <= 0.50
# | | |--- class: 0
# | |--- feature_16 > 0.50
# | | |--- class: 1
# |--- feature_26 > 0.50
# | |--- feature_27 <= 6.50
# | | |--- class: 1
# | |--- feature_27 > 6.50
# | | |--- class: 0
names = dv.get_feature_names_out().tolist()
print(export_text(dt, feature_names=names))
# Output:
# |--- records=yes <= 0.50
# | |--- job=partime <= 0.50
# | | |--- class: 0
# | |--- job=partime > 0.50
# | | |--- class: 1
# |--- records=yes > 0.50
# | |--- seniority <= 6.50
# | | |--- class: 1
# | |--- seniority > 6.50
# | | |--- class: 0
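Instead of trying depths one by one, we can sweep over several values of max_depth and compare the validation AUC. A rough sketch (the candidate depths are an arbitrary choice):
# compare validation AUC for a few depth limits
for depth in [1, 2, 3, 4, 5, 6, 10, None]:
    dt = DecisionTreeClassifier(max_depth=depth)
    dt.fit(X_train, y_train)
    y_pred = dt.predict_proba(X_val)[:, 1]
    auc = roc_auc_score(y_val, y_pred)
    print(depth, round(auc, 3))
Here max_depth=None means the tree is allowed to grow without a depth limit, which brings us back to the overfitted case from the beginning.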