ML Zoomcamp 2023 – Decision Trees and Ensemble Learning – Part 8

  1. Decision trees parameter tuning
    1. Selecting max_depth
    2. Selecting min_samples_leaf

Decision trees parameter tuning

Part 8 of the ‘Decision Trees and Ensemble Learning’ section is dedicated to tuning decision tree parameters. In this context, ‘tuning’ means choosing parameter values that maximize or minimize a chosen performance metric (such as AUC or RMSE) on the validation set. In the case of AUC, our goal is to maximize it on the validation set by finding the parameter values that yield the highest score.

Let’s take a closer look at the parameters available for the DecisionTreeClassifier. For a more comprehensive list of parameters, you can refer to the official scikit-learn documentation. Some of the key parameters include:

  • criterion: This parameter determines the impurity measure used for splitting. You can choose between ‘gini’ for Gini impurity and ‘entropy’ for information gain. The choice of criterion can significantly impact the quality of the splits in the decision tree.
  • max_depth: This parameter controls the maximum depth of the decision tree. It plays a crucial role in preventing overfitting by limiting the complexity of the tree. Selecting an appropriate value for max_depth helps strike a balance between model simplicity and complexity.
  • min_samples_leaf: This parameter specifies the minimum number of samples required in a leaf node. It influences the granularity of the splits. Smaller values can result in finer splits and a more complex tree, while larger values lead to coarser splits and a simpler tree.

By carefully tuning these parameters, you can find the right configuration that optimizes your model’s performance, ensuring it’s well-suited for your specific machine learning task.
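
For orientation, here is a minimal sketch of how these parameters are passed to the classifier; the values below are placeholders for illustration, not the tuned ones we arrive at later:

from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(
    criterion='gini',       # impurity measure: 'gini' or 'entropy'
    max_depth=5,            # maximum depth of the tree
    min_samples_leaf=10     # minimum number of samples required in a leaf node
)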

Selecting max_depth

To start the parameter tuning process, our initial focus will be on the ‘max_depth’ parameter. Our goal is to identify the optimal ‘max_depth’ value before proceeding to fine-tune other parameters. ‘max_depth’ governs the maximum depth of the decision tree. When set to ‘None,’ it imposes no restrictions, allowing the tree to grow as deeply as possible, potentially resulting in numerous layers.

We will conduct experiments using various values for ‘max_depth,’ including the ‘None’ setting, which serves as a baseline for comparison and enables us to understand the consequences of not constraining the tree’s depth.

depths = [1, 2, 3, 4, 5, 6, 10, 15, 20, None]

for depth in depths: 
    dt = DecisionTreeClassifier(max_depth=depth)
    dt.fit(X_train, y_train)
    
    # we need the probability of the positive class (the second column)
    y_pred = dt.predict_proba(X_val)[:, 1]
    auc = roc_auc_score(y_val, y_pred)
    
    print('%4s -> %.3f' % (depth, auc))

# Output:
# 1 -> 0.606
# 2 -> 0.669
# 3 -> 0.739
# 4 -> 0.761
# 5 -> 0.767
# 6 -> 0.760
# 10 -> 0.706
# 15 -> 0.663
# 20 -> 0.654
# None -> 0.657

What we can observe here is that the best AUC values, around 0.76, occur for ‘max_depth’ values of 4, 5, and 6. This indicates that our best-performing tree should have between 4 and 6 layers. If there were no other parameters to consider, we could choose a depth of 4 to keep the tree simpler, with only 4 layers instead of 5 or 6. A simpler tree is generally easier to read and understand, making it more transparent in terms of what’s happening.

Selecting min_samples_leaf

But ‘max_depth’ is not the only parameter; there is another one called ‘min_samples_leaf.’ We have already determined that the optimal depth falls between 4 and 6. For each of these values, we can experiment with different ‘min_samples_leaf’ values to observe their effects.

scores = []

for d in [4, 5, 6]:
    for s in [1, 2, 5, 10, 15, 20, 100, 200, 500]:
        dt = DecisionTreeClassifier(max_depth=d, min_samples_leaf=s)
        dt.fit(X_train, y_train)

        y_pred = dt.predict_proba(X_val)[:, 1]
        auc = roc_auc_score(y_val, y_pred)
        
        scores.append((d, s, auc))

columns = ['max_depth', 'min_samples_leaf', 'auc']
df_scores = pd.DataFrame(scores, columns=columns)
df_scores.head()
   max_depth  min_samples_leaf       auc
0          4                 1  0.655327
1          4                 2  0.697389
2          4                 5  0.712749
3          4                10  0.762578
4          4                15  0.786521
df_scores.sort_values(by='auc', ascending=False).head()
    max_depth  min_samples_leaf       auc
22          6                15  0.786997
4           4                15  0.786521
13          5                15  0.785398
14          5                20  0.782159
23          6                20  0.782120

DataFrame sorted by the AUC column

This information can be presented differently by pivoting the DataFrame: ‘min_samples_leaf’ goes on the rows, ‘max_depth’ on the columns, and the cell values are the ‘auc’ scores. To achieve this, we can use the ‘pivot’ function. This tabular format is easier to read, and it’s evident that 0.787 is the highest value.

# index -> rows, columns -> columns, values -> cell values
df_scores_pivot = df_scores.pivot(index='min_samples_leaf', columns=['max_depth'], values=['auc'])
df_scores_pivot.round(3)
max_depth             4      5      6
min_samples_leaf
1                 0.655  0.655  0.655
2                 0.697  0.701  0.697
5                 0.713  0.709  0.719
10                0.763  0.762  0.762
15                0.787  0.785  0.787
20                0.781  0.782  0.782
100               0.779  0.780  0.779
200               0.768  0.768  0.768
500               0.680  0.680  0.680

AUC scores for different values of max_depth and min_samples_leaf

Another visualization option is to create a heatmap.

sns.heatmap(df_scores_pivot, annot=True, fmt=".3f")

In this heatmap, it’s easy to identify the highest value as it appears the lightest, while the darkest shade represents the lowest or poorest value. However, it’s important to note that this method of selecting the best parameter might not always be optimal. There’s a possibility that a ‘max_depth’ of 7, 10, or another value works better, but we haven’t explored those possibilities. This is because we initially tuned the ‘max_depth’ parameter and then selected the best ‘min_samples_leaf.’ For small datasets, it’s feasible to try a variety of values, but with larger datasets, we need to constrain our search space to be more efficient. Therefore, it’s often a good practice to first optimize the ‘max_depth’ parameter and then fine-tune the other parameters.
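
As an aside, this kind of combined search can also be automated with scikit-learn’s GridSearchCV, which tries every combination of parameters and scores each one with cross-validation rather than our single validation split. The sketch below is one possible configuration (the grid values and cv=5 are illustrative choices), assuming the same X_train and y_train as above:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# try every combination of the two parameters, scored by AUC with 5-fold cross-validation
param_grid = {
    'max_depth': [4, 5, 6, 7, 10, 15, 20, None],
    'min_samples_leaf': [1, 2, 5, 10, 15, 20, 100, 200, 500],
}

grid = GridSearchCV(DecisionTreeClassifier(), param_grid, scoring='roc_auc', cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_, grid.best_score_)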

Nevertheless, given the small size of this dataset, training is fast, allowing us to experiment with different combinations. Let’s explore a few more combinations.

scores = []

for d in [4, 5, 6, 7, 10, 15, 20, None]:
    for s in [1, 2, 5, 10, 15, 20, 100, 200, 500]:
        dt = DecisionTreeClassifier(max_depth=d, min_samples_leaf=s)
        dt.fit(X_train, y_train)

        y_pred = dt.predict_proba(X_val)[:, 1]
        auc = roc_auc_score(y_val, y_pred)
        
        scores.append((d, s, auc))

columns = ['max_depth', 'min_samples_leaf', 'auc']
df_scores = pd.DataFrame(scores, columns=columns)

df_scores.sort_values(by='auc', ascending=False).head()
    max_depth  min_samples_leaf       auc
22        6.0                15  0.787878
31        7.0                15  0.787762
58       20.0                15  0.787711
49       15.0                15  0.787293
13        5.0                15  0.786939

AUC scores for different values of max_depth and min_samples_leaf, sorted by the AUC column

The preceding table demonstrates that the AUC scores of the top five options are quite close. Interestingly, in each of these top five cases, ‘min_samples_leaf’ is set to 15. Let’s explore alternative visualizations for deeper insights.

df_scores_pivot = df_scores.pivot(index='min_samples_leaf', columns=['max_depth'], values=['auc'])
df_scores_pivot.round(3)
max_depth           NaN    4.0    5.0    6.0    7.0   10.0   15.0   20.0
min_samples_leaf
1                 0.646  0.667  0.657  0.646  0.665  0.660  0.650  0.659
2                 0.691  0.687  0.692  0.695  0.694  0.690  0.696  0.686
5                 0.710  0.719  0.720  0.714  0.716  0.719  0.719  0.721
10                0.762  0.761  0.766  0.762  0.763  0.763  0.764  0.763
15                0.786  0.786  0.787  0.788  0.788  0.785  0.787  0.788
20                0.781  0.783  0.782  0.783  0.783  0.783  0.782  0.784
100               0.779  0.780  0.780  0.780  0.780  0.779  0.779  0.779
200               0.768  0.768  0.768  0.768  0.768  0.768  0.768  0.768
500               0.680  0.680  0.680  0.680  0.680  0.680  0.680  0.680

AUC scores for different values of max_depth and min_samples_leaf

sns.heatmap(df_scores_pivot, annot=True, fmt=".3f")

In the final snippet, we train our DecisionTreeClassifier with the tuned parameters: ‘max_depth’ set to 6 and ‘min_samples_leaf’ set to 15.

dt = DecisionTreeClassifier(max_depth=6, min_samples_leaf=15)
dt.fit(X_train, y_train)

# Output: DecisionTreeClassifier(max_depth=6, min_samples_leaf=15)

# Optionally, inspect the structure of the trained tree as text:
# print(export_text(dt, feature_names=list(dv.get_feature_names_out())))
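
As a quick sanity check, we can score this final model on the validation set once more (assuming the same X_val and y_val as above); the AUC should be close to the roughly 0.787 seen in the tables:

from sklearn.metrics import roc_auc_score

# validation AUC of the final tuned tree
y_pred = dt.predict_proba(X_val)[:, 1]
print('validation AUC: %.3f' % roc_auc_score(y_val, y_pred))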
