ML Zoomcamp 2023 – Decision Trees and Ensemble Learning – Part 8

  1. Decision trees parameter tuning
    1. Selecting max_depth
    2. Selecting min_samples_leaf

Decision trees parameter tuning

Part 8 of the ‘Decision Trees and Ensemble Learning’ section is dedicated to tuning decision tree parameters. In this context, ‘tuning’ means choosing parameter values that maximize or minimize a chosen performance metric (such as AUC or RMSE) on the validation set. In the case of AUC, our goal is to maximize it on the validation set by finding the parameter values that yield the highest score.

Let’s take a closer look at the parameters available for the DecisionTreeClassifier. For a more comprehensive list of parameters, you can refer to the official scikit-learn documentation. Some of the key parameters include:

  • criterion: This parameter determines the impurity measure used for splitting. You can choose between ‘gini’ for Gini impurity and ‘entropy’ for information gain. The choice of criterion can significantly impact the quality of the splits in the decision tree.
  • max_depth: This parameter controls the maximum depth of the decision tree. It plays a crucial role in preventing overfitting by limiting the complexity of the tree. Selecting an appropriate value for max_depth helps strike a balance between model simplicity and complexity.
  • min_samples_leaf: This parameter specifies the minimum number of samples required in a leaf node. It influences the granularity of the splits. Smaller values can result in finer splits and a more complex tree, while larger values lead to coarser splits and a simpler tree.

By carefully tuning these parameters, you can find the right configuration that optimizes your model’s performance, ensuring it’s well-suited for your specific machine learning task.
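
For orientation, here is a minimal sketch of how these parameters are passed to the classifier; the values below are placeholders for illustration, not the tuned ones we arrive at later:

from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(
    criterion='gini',       # impurity measure: 'gini' or 'entropy'
    max_depth=5,            # maximum depth of the tree
    min_samples_leaf=10     # minimum number of samples required in a leaf node
)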

Selecting max_depth

To start the parameter tuning process, our initial focus will be on the ‘max_depth’ parameter. Our goal is to identify the optimal ‘max_depth’ value before proceeding to fine-tune other parameters. ‘max_depth’ governs the maximum depth of the decision tree. When set to ‘None,’ it imposes no restrictions, allowing the tree to grow as deeply as possible, potentially resulting in numerous layers.

We will conduct experiments using various values for ‘max_depth,’ including the ‘None’ setting, which serves as a baseline for comparison and enables us to understand the consequences of not constraining the tree’s depth.

depths = [1, 2, 3, 4, 5, 6, 10, 15, 20, None]

for depth in depths: 
    dt = DecisionTreeClassifier(max_depth=depth)
    dt.fit(X_train, y_train)
    
    # we need the probability of the positive class (the second column)
    y_pred = dt.predict_proba(X_val)[:, 1]
    auc = roc_auc_score(y_val, y_pred)
    
    print('%4s -> %.3f' % (depth, auc))

# Output:
# 1 -> 0.606
# 2 -> 0.669
# 3 -> 0.739
# 4 -> 0.761
# 5 -> 0.767
# 6 -> 0.760
# 10 -> 0.706
# 15 -> 0.663
# 20 -> 0.654
# None -> 0.657

What we can observe here is that the best AUC values, around 0.76, occur for ‘max_depth’ values of 4, 5, and 6. This indicates that our best-performing tree should have between 4 and 6 layers. If there were no other parameters to consider, we could choose a depth of 4 to keep the tree simpler, with only 4 layers instead of 5 or 6. A simpler tree is generally easier to read and understand, making it more transparent in terms of what’s happening.

Selecting min_samples_leaf

But ‘max_depth’ is not the only parameter; there is another one called ‘min_samples_leaf.’ We have already determined that the optimal depth falls between 4 and 6. For each of these values, we can experiment with different ‘min_samples_leaf’ values to observe their effects.

scores = []

for d in [4, 5, 6]:
    for s in [1, 2, 5, 10, 15, 20, 100, 200, 500]:
        dt = DecisionTreeClassifier(max_depth=d, min_samples_leaf=s)
        dt.fit(X_train, y_train)

        y_pred = dt.predict_proba(X_val)[:, 1]
        auc = roc_auc_score(y_val, y_pred)
        
        scores.append((d, s, auc))

columns = ['max_depth', 'min_samples_leaf', 'auc']
df_scores = pd.DataFrame(scores, columns=columns)
df_scores.head()
   max_depth  min_samples_leaf       auc
0          4                 1  0.655327
1          4                 2  0.697389
2          4                 5  0.712749
3          4                10  0.762578
4          4                15  0.786521
df_scores.sort_values(by='auc', ascending=False).head()
    max_depth  min_samples_leaf       auc
22          6                15  0.786997
4           4                15  0.786521
13          5                15  0.785398
14          5                20  0.782159
23          6                20  0.782120

DataFrame sorted by the AUC column

This information can be presented differently by pivoting the DataFrame: ‘min_samples_leaf’ goes on the rows, ‘max_depth’ on the columns, and the cell values are the ‘auc’ scores. To achieve this, we can use the ‘pivot’ function. This tabular format is easier to read, and it’s evident that 0.787 is the highest value.

# index -> rows, columns -> columns, values -> cell values
df_scores_pivot = df_scores.pivot(index='min_samples_leaf', columns=['max_depth'], values=['auc'])
df_scores_pivot.round(3)
max_depth             4      5      6
min_samples_leaf
1                 0.655  0.655  0.655
2                 0.697  0.701  0.697
5                 0.713  0.709  0.719
10                0.763  0.762  0.762
15                0.787  0.785  0.787
20                0.781  0.782  0.782
100               0.779  0.780  0.779
200               0.768  0.768  0.768
500               0.680  0.680  0.680

AUC scores for different values of max_depth and min_samples_leaf

Another visualization option is to create a heatmap.

sns.heatmap(df_scores_pivot, annot=True, fmt=".3f")

In this heatmap, it’s easy to identify the highest value as it appears the lightest, while the darkest shade represents the lowest or poorest value. However, it’s important to note that this method of selecting the best parameter might not always be optimal. There’s a possibility that a ‘max_depth’ of 7, 10, or another value works better, but we haven’t explored those possibilities. This is because we initially tuned the ‘max_depth’ parameter and then selected the best ‘min_samples_leaf.’ For small datasets, it’s feasible to try a variety of values, but with larger datasets, we need to constrain our search space to be more efficient. Therefore, it’s often a good practice to first optimize the ‘max_depth’ parameter and then fine-tune the other parameters.
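
As an aside, this kind of combined search can also be automated with scikit-learn’s GridSearchCV, which tries every combination of parameters and scores each one with cross-validation rather than our single validation split. The sketch below is one possible configuration (the grid values and cv=5 are illustrative choices), assuming the same X_train and y_train as above:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# try every combination of the two parameters, scored by AUC with 5-fold cross-validation
param_grid = {
    'max_depth': [4, 5, 6, 7, 10, 15, 20, None],
    'min_samples_leaf': [1, 2, 5, 10, 15, 20, 100, 200, 500],
}

grid = GridSearchCV(DecisionTreeClassifier(), param_grid, scoring='roc_auc', cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_, grid.best_score_)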

Nevertheless, given the small size of this dataset, training is fast, allowing us to experiment with different combinations. Let’s explore a few more combinations.

scores = []

for d in [4, 5, 6, 7, 10, 15, 20, None]:
    for s in [1, 2, 5, 10, 15, 20, 100, 200, 500]:
        dt = DecisionTreeClassifier(max_depth=d, min_samples_leaf=s)
        dt.fit(X_train, y_train)

        y_pred = dt.predict_proba(X_val)[:, 1]
        auc = roc_auc_score(y_val, y_pred)
        
        scores.append((d, s, auc))

columns = ['max_depth', 'min_samples_leaf', 'auc']
df_scores = pd.DataFrame(scores, columns=columns)

df_scores.sort_values(by='auc', ascending=False).head()
    max_depth  min_samples_leaf       auc
22        6.0                15  0.787878
31        7.0                15  0.787762
58       20.0                15  0.787711
49       15.0                15  0.787293
13        5.0                15  0.786939

AUC scores for different values of max_depth and min_samples_leaf, sorted by the AUC column

The preceding table demonstrates that the AUC scores of the top five options are quite close. Interestingly, in each of these top five cases, ‘min_samples_leaf’ is set to 15. Let’s explore alternative visualizations for deeper insights.

df_scores_pivot = df_scores.pivot(index='min_samples_leaf', columns=['max_depth'], values=['auc'])
df_scores_pivot.round(3)
max_depth           NaN    4.0    5.0    6.0    7.0   10.0   15.0   20.0
min_samples_leaf
1                 0.646  0.667  0.657  0.646  0.665  0.660  0.650  0.659
2                 0.691  0.687  0.692  0.695  0.694  0.690  0.696  0.686
5                 0.710  0.719  0.720  0.714  0.716  0.719  0.719  0.721
10                0.762  0.761  0.766  0.762  0.763  0.763  0.764  0.763
15                0.786  0.786  0.787  0.788  0.788  0.785  0.787  0.788
20                0.781  0.783  0.782  0.783  0.783  0.783  0.782  0.784
100               0.779  0.780  0.780  0.780  0.780  0.779  0.779  0.779
200               0.768  0.768  0.768  0.768  0.768  0.768  0.768  0.768
500               0.680  0.680  0.680  0.680  0.680  0.680  0.680  0.680

AUC scores for different values of max_depth and min_samples_leaf

sns.heatmap(df_scores_pivot, annot=True, fmt=".3f")

In the final snippet, we train our DecisionTreeClassifier with the tuned parameters: ‘max_depth’ set to 6 and ‘min_samples_leaf’ set to 15.

dt = DecisionTreeClassifier(max_depth=6, min_samples_leaf=15)
dt.fit(X_train, y_train)

# Output: DecisionTreeClassifier(max_depth=6, min_samples_leaf=15)

# Optionally, inspect the structure of the trained tree as text:
# print(export_text(dt, feature_names=list(dv.get_feature_names_out())))
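
As a quick sanity check, we can score this final model on the validation set once more (assuming the same X_val and y_val as above); the AUC should be close to the roughly 0.787 seen in the tables:

from sklearn.metrics import roc_auc_score

# validation AUC of the final tuned tree
y_pred = dt.predict_proba(X_val)[:, 1]
print('validation AUC: %.3f' % roc_auc_score(y_val, y_pred))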
