ML Zoomcamp 2023 – Decision Trees and Ensemble Learning– Part 6

  1. Decision Tree Learning Algorithm – Part 1/2
    1. Finding the best split for one column
      1. Misclassification Rate
      2. Impurity

Decision Tree Learning Algorithm – Part 1/2

Before we dive deeper into parameter tuning, which is the topic of Part 8 of this chapter (Decision Trees and Ensemble Learning), let’s take a step back for a moment and explore how a decision tree can generate rules, as we’ve seen in the last article.

In this article, we simplify the problem by using a much smaller dataset with just one feature. Additionally, we employ a very small tree, known as a “decision stump.” This approach offers a clearer understanding of how a decision tree learns from data.

So, let’s begin with a decision stump, which is essentially a decision tree with only one level: a single condition node at the top. Conditions take the form “feature > T”, where T is a threshold. The condition “feature > T” is referred to as a “split”: it divides the data into records that satisfy the condition (TRUE) and those that do not (FALSE). At the top of the tree we have the condition node, and at the bottom are the leaves of the tree, called decision nodes. Decision nodes are where the tree goes no deeper, and this is where decisions are made. In this context, “decision making” means choosing the most frequently occurring status label within each group and using it as the final decision for that group.

 FEATURE > T 
   /      \                            
FALSE     TRUE
DEFAULT     OK
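The “majority vote” in a leaf can be sketched in a few lines of pandas (the labels below are purely illustrative):

```python
import pandas as pd

# Status labels of the records that ended up in one leaf (illustrative values)
leaf_statuses = pd.Series(['default', 'default', 'ok'])

# The decision for this leaf is the most frequent label
decision = leaf_statuses.value_counts().idxmax()
print(decision)  # 'default'
```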

To illustrate how the learning algorithm works, let’s use a simple dataset. This dataset contains 8 records, each with a value for ‘assets’ and a ‘status,’ which is our target variable.

data = [
    [8000, 'default'],
    [2000, 'default'],
    [   0, 'default'],
    [5000, 'ok'],
    [5000, 'ok'],
    [4000, 'ok'],
    [9000, 'ok'],
    [3000, 'default'],
]

df_example = pd.DataFrame(data, columns=['assets', 'status'])
df_example
   assets   status
0    8000  default
1    2000  default
2       0  default
3    5000       ok
4    5000       ok
5    4000       ok
6    9000       ok
7    3000  default
Simple dataset with 8 records

Finding the best split for one column

What we want to do now is use the numerical column ‘assets’ in our training: we want to train a decision tree using this column, based on the condition ASSETS > T. The question is: what is the best T? In other words, what is the best split? We want to split the dataset into two parts: one part where the condition holds true, and the other part where it does not.

 ASSETS > T 
   /      \                            
FALSE     TRUE
DF_LEFT     DF_RIGHT  

DF_RIGHT is where this condition is true, so for all records where ‘assets’ is greater than T. DF_LEFT is where this condition is false, so for all records where ‘assets’ is not greater than T.

This is called splitting, so we’re splitting the data set into two parts where it’s true and where it’s false.

What we can do now is sort the data by the ‘assets’ column.

df_example.sort_values('assets')
   assets   status
2       0  default
1    2000  default
7    3000  default
5    4000       ok
3    5000       ok
4    5000       ok
0    8000  default
6    9000       ok
Simple dataset sorted by ‘assets’ column

We are considering different threshold values “T” for the condition “ASSETS > T.” The possibilities for the threshold “T” are:

  • T = 2000 (ASSETS<=2,000 becomes LEFT, ASSETS>2,000 becomes RIGHT)
  • T = 3000 (ASSETS<=3,000 becomes LEFT, ASSETS>3,000 becomes RIGHT)
  • T = 4000 (ASSETS<=4,000 becomes LEFT, ASSETS>4,000 becomes RIGHT)
  • T = 5000 (ASSETS<=5,000 becomes LEFT, ASSETS>5,000 becomes RIGHT)
  • T = 8000 (ASSETS<=8,000 becomes LEFT, ASSETS>8,000 becomes RIGHT)

It’s important to note that values like T = 9000 and T = 0 don’t make sense for splitting because there would be nothing on the right or left side, respectively.

Ts = [0, 2000, 3000, 4000, 5000, 8000]
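The same list of candidate thresholds can also be derived programmatically: every distinct value of ‘assets’ except the maximum is a usable split point (splitting at the maximum would leave nothing on the right). A small sketch:

```python
import pandas as pd

data = [
    [8000, 'default'], [2000, 'default'], [0, 'default'], [5000, 'ok'],
    [5000, 'ok'], [4000, 'ok'], [9000, 'ok'], [3000, 'default'],
]
df_example = pd.DataFrame(data, columns=['assets', 'status'])

# All unique asset values in ascending order, dropping the maximum
Ts = [int(t) for t in sorted(df_example.assets.unique())[:-1]]
print(Ts)  # [0, 2000, 3000, 4000, 5000, 8000]
```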

We want to evaluate each of these threshold values “T” by splitting our dataset into the left (LEFT) and right (RIGHT) subsets and determine which split provides the best results.

from IPython.display import display

for T in Ts:
    print(T)
    df_left = df_example[df_example.assets <= T]
    df_right = df_example[df_example.assets > T]
    
    display(df_left)
    display(df_right)

    print()

0

   assets   status
2       0  default
LEFT for T=0

   assets   status
0    8000  default
1    2000  default
3    5000       ok
4    5000       ok
5    4000       ok
6    9000       ok
7    3000  default
RIGHT for T=0

2000

   assets   status
1    2000  default
2       0  default
LEFT for T=2000

   assets   status
0    8000  default
3    5000       ok
4    5000       ok
5    4000       ok
6    9000       ok
7    3000  default
RIGHT for T=2000

3000

   assets   status
1    2000  default
2       0  default
7    3000  default
LEFT for T=3000

   assets   status
0    8000  default
3    5000       ok
4    5000       ok
5    4000       ok
6    9000       ok
RIGHT for T=3000

4000

   assets   status
1    2000  default
2       0  default
5    4000       ok
7    3000  default
LEFT for T=4000

   assets   status
0    8000  default
3    5000       ok
4    5000       ok
6    9000       ok
RIGHT for T=4000

5000

   assets   status
1    2000  default
2       0  default
3    5000       ok
4    5000       ok
5    4000       ok
7    3000  default
LEFT for T=5000

   assets   status
0    8000  default
6    9000       ok
RIGHT for T=5000

8000

   assets   status
0    8000  default
1    2000  default
2       0  default
3    5000       ok
4    5000       ok
5    4000       ok
7    3000  default
LEFT for T=8000

   assets   status
6    9000       ok
RIGHT for T=8000

Now that we have generated many splits using different threshold values, we need to determine which split is the best. To evaluate this, we can use various split evaluation criteria. Let’s take the example where T=4000. In this case, on the LEFT side, we have 3 cases labeled as DEFAULT and 1 case labeled as OK. On the RIGHT side, we have 3 cases labeled as OK and 1 case labeled as DEFAULT. As mentioned earlier, we select the most frequently occurring status within each group and use it as the final decision. In the case of the LEFT group, the most frequent status is “default,” and for the RIGHT group, it’s “ok.”

T = 4000
df_left = df_example[df_example.assets <= T]
df_right = df_example[df_example.assets > T]

display(df_left)
display(df_right)
   assets   status
1    2000  default
2       0  default
5    4000       ok
7    3000  default
LEFT for T=4000

   assets   status
0    8000  default
3    5000       ok
4    5000       ok
6    9000       ok
RIGHT for T=4000
 ASSETS > 4000 
   /      \                            
FALSE     TRUE
DEFAULT     OK  

Misclassification Rate

To evaluate the accuracy of our predictions, we can calculate the misclassification rate. The misclassification rate measures the fraction of errors when we predict that everyone in the LEFT group is a DEFAULT, and similarly for the RIGHT group. It gives us insight into how well our model is performing.

For the LEFT group (3xDEFAULT, 1xOK), where we predict everything as DEFAULT, the misclassification rate is 1/4, which equals 25%.

For the RIGHT group (3xOK, 1xDEFAULT), where we predict everything as OK, the misclassification rate is also 1/4, or 25%.
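This per-group calculation can be written as a small helper (the function name is our own):

```python
import pandas as pd

def misclassification_rate(statuses):
    """Fraction of records that disagree with the group's majority label."""
    proportions = statuses.value_counts(normalize=True)
    return 1 - proportions.max()

# The two groups from the T=4000 split
left = pd.Series(['default', 'default', 'default', 'ok'])
right = pd.Series(['ok', 'ok', 'ok', 'default'])
print(misclassification_rate(left), misclassification_rate(right))  # 0.25 0.25
```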

This is how we evaluate the quality of our split. For T=4000, the average misclassification rate is 25%. We can also use a weighted average, which is particularly useful when there are significantly more examples on one side than on the other. You can print the normalized value counts as shown in the next snippet.

T = 4000
df_left = df_example[df_example.assets <= T]
df_right = df_example[df_example.assets > T]

display(df_left)
print(df_left.status.value_counts(normalize=True))
display(df_right)
print(df_right.status.value_counts(normalize=True))

   assets   status
1    2000  default
2       0  default
5    4000       ok
7    3000  default
LEFT for T=4000
default    0.75
ok         0.25
Name: status, dtype: float64

   assets   status
0    8000  default
3    5000       ok
4    5000       ok
6    9000       ok
RIGHT for T=4000
ok         0.75
default    0.25
Name: status, dtype: float64
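The weighted average mentioned above weights each side’s rate by its share of the records. A sketch, reusing a small per-group helper (the function name is our own):

```python
import pandas as pd

data = [
    [8000, 'default'], [2000, 'default'], [0, 'default'], [5000, 'ok'],
    [5000, 'ok'], [4000, 'ok'], [9000, 'ok'], [3000, 'default'],
]
df_example = pd.DataFrame(data, columns=['assets', 'status'])

def misclassification_rate(statuses):
    return 1 - statuses.value_counts(normalize=True).max()

T = 4000
df_left = df_example[df_example.assets <= T]
df_right = df_example[df_example.assets > T]

# Weight each side's misclassification rate by its fraction of the records
n = len(df_example)
weighted = (len(df_left) / n * misclassification_rate(df_left.status)
            + len(df_right) / n * misclassification_rate(df_right.status))
print(weighted)  # 0.25 (both sides hold 4 of the 8 records here)
```

With a 4/4 split the weighted average equals the plain average; the two only differ when the split is unbalanced.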

Let’s inspect the class distribution for each split; from the normalized value counts you can read off the misclassification rate of each side directly. You can use the code in the snippet below to calculate and display this information.

for T in Ts:
    print(T)
    df_left = df_example[df_example.assets <= T]
    df_right = df_example[df_example.assets > T]
    
    display(df_left)
    print(df_left.status.value_counts(normalize=True))
    display(df_right)
    print(df_right.status.value_counts(normalize=True))

    print()

0

   assets   status
2       0  default
LEFT for T=0
default    1.0
Name: status, dtype: float64

   assets   status
0    8000  default
1    2000  default
3    5000       ok
4    5000       ok
5    4000       ok
6    9000       ok
7    3000  default
RIGHT for T=0
ok         0.571429
default    0.428571
Name: status, dtype: float64

2000

   assets   status
1    2000  default
2       0  default
LEFT for T=2000
default    1.0
Name: status, dtype: float64

   assets   status
0    8000  default
3    5000       ok
4    5000       ok
5    4000       ok
6    9000       ok
7    3000  default
RIGHT for T=2000
ok         0.666667
default    0.333333
Name: status, dtype: float64

3000

   assets   status
1    2000  default
2       0  default
7    3000  default
LEFT for T=3000
default    1.0
Name: status, dtype: float64

   assets   status
0    8000  default
3    5000       ok
4    5000       ok
5    4000       ok
6    9000       ok
RIGHT for T=3000
ok         0.8
default    0.2
Name: status, dtype: float64

4000

   assets   status
1    2000  default
2       0  default
5    4000       ok
7    3000  default
LEFT for T=4000
default    0.75
ok         0.25
Name: status, dtype: float64

   assets   status
0    8000  default
3    5000       ok
4    5000       ok
6    9000       ok
RIGHT for T=4000
ok         0.75
default    0.25
Name: status, dtype: float64

5000

   assets   status
1    2000  default
2       0  default
3    5000       ok
4    5000       ok
5    4000       ok
7    3000  default
LEFT for T=5000
default    0.5
ok         0.5
Name: status, dtype: float64

   assets   status
0    8000  default
6    9000       ok
RIGHT for T=5000
default    0.5
ok         0.5
Name: status, dtype: float64

8000

   assets   status
0    8000  default
1    2000  default
2       0  default
3    5000       ok
4    5000       ok
5    4000       ok
7    3000  default
LEFT for T=8000
default    0.571429
ok         0.428571
Name: status, dtype: float64

   assets   status
6    9000       ok
RIGHT for T=8000
ok    1.0
Name: status, dtype: float64

Impurity

The misclassification rate is just one way of measuring impurity. We aim for our leaves to be as pure as possible, meaning that the groups resulting from the split should ideally contain only observations of one class, making them pure. The misclassification rate informs us of how impure these groups are by quantifying the error rate in classifying observations.

There are also alternative methods to measure impurity. Scikit-Learn offers more advanced criteria for splitting, such as entropy and Gini impurity. These criteria provide different ways to evaluate the quality of splits, with the goal of creating decision trees with more homogeneous leaves.
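As a sketch of one such criterion: the Gini impurity of a group is 1 minus the sum of squared class proportions, which mirrors scikit-learn’s default ‘gini’ criterion (the helper below is our own, not scikit-learn code):

```python
import pandas as pd

def gini_impurity(statuses):
    """Gini impurity: 1 - sum of squared class proportions."""
    p = statuses.value_counts(normalize=True)
    return 1 - (p ** 2).sum()

# A pure group has impurity 0; a 3-to-1 group has 1 - (0.75^2 + 0.25^2) = 0.375
print(gini_impurity(pd.Series(['default', 'default', 'default', 'ok'])))  # 0.375
```

Like the misclassification rate, Gini impurity is 0 for a pure group, but it penalizes mixed groups more smoothly, which tends to produce better splits in practice.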

   T     Decision LEFT   Impurity LEFT   Decision RIGHT   Impurity RIGHT   AVG
   0     DEFAULT         0%              OK               43%              21%
2000     DEFAULT         0%              OK               33%              16%
3000     DEFAULT         0%              OK               20%              10%
4000     DEFAULT         25%             OK               25%              25%
5000     DEFAULT         50%             OK               50%              50%
8000     DEFAULT         43%             OK               0%               21%
Average Impurity

We observe that the split with T = 3000 yields the lowest average misclassification rate, resulting in the best impurity of 10%. This is how we determine the best split when dealing with just one column. To summarize our approach:

  • We sort the dataset.
  • We identify all possible thresholds.
  • For each of the thresholds, we split the dataset.
  • For each split, we calculate the impurity on the left and the right.
  • We then compute the average impurity.
  • Among all the splits considered, we select the one with the lowest average impurity.
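The steps above can be sketched end to end for our toy dataset (helper names are our own):

```python
import pandas as pd

data = [
    [8000, 'default'], [2000, 'default'], [0, 'default'], [5000, 'ok'],
    [5000, 'ok'], [4000, 'ok'], [9000, 'ok'], [3000, 'default'],
]
df_example = pd.DataFrame(data, columns=['assets', 'status'])

def misclassification_rate(statuses):
    return 1 - statuses.value_counts(normalize=True).max()

# Candidate thresholds: every distinct value except the maximum
Ts = [int(t) for t in sorted(df_example.assets.unique())[:-1]]

best_T, best_impurity = None, float('inf')
for T in Ts:
    # Split the dataset at T
    df_left = df_example[df_example.assets <= T]
    df_right = df_example[df_example.assets > T]
    # Average the impurity of the two sides
    avg = (misclassification_rate(df_left.status)
           + misclassification_rate(df_right.status)) / 2
    # Keep the split with the lowest average impurity
    if avg < best_impurity:
        best_T, best_impurity = T, avg

print(best_T, best_impurity)  # best split at T=3000 with average impurity 0.1
```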

The decision tree we’ve learned based on this split is as follows:

 ASSETS > 3000 
   /      \                            
FALSE     TRUE
DEFAULT     OK
