ML Zoomcamp 2023 – Decision Trees and Ensemble Learning – Part 12

  1. XGBoost parameter tuning – Part 1/2
    1. Tuning Eta
      1. Eta = 0.3
      2. Eta = 1.0
      3. Eta = 0.1
      4. Eta = 0.05
      5. Eta = 0.01
    2. Plotting Eta

XGBoost parameter tuning – Part 1/2

This part is about XGBoost parameter tuning. It is the first half of a two-part series: here we tune the first parameter, ‘eta’. The follow-up article will tune ‘max_depth’ and ‘min_child_weight’, and then we’ll train the final model. Let’s start with the first parameter.

Tuning Eta

Eta, also known as the learning rate, determines how much weight each new tree gets when it corrects the errors of the ensemble built so far. If eta is set to 1.0, the new tree’s correction is applied in full; if eta is 0.3, only 30% of that correction is applied. In essence, eta governs the size of the steps taken during the learning process.
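As a toy illustration of this scaling (this is just the arithmetic of the update, not how xgboost is implemented internally), suppose the current ensemble predicts 0.40 for some example and the next tree proposes a correction of 0.20:

# Toy example: eta scales how much of the new tree's correction is applied
current = 0.40      # prediction of the ensemble built so far
correction = 0.20   # correction proposed by the next tree

for eta in [1.0, 0.3, 0.1]:
    updated = current + eta * correction
    print('eta=%s: prediction moves from %.2f to %.2f' % (eta, current, updated))

# Output:
# eta=1.0: prediction moves from 0.40 to 0.60
# eta=0.3: prediction moves from 0.40 to 0.46
# eta=0.1: prediction moves from 0.40 to 0.42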

Now, let’s explore how different values of eta impact model performance. To facilitate this, we’ll create a dictionary called ‘scores’ to store the performance scores for each value of eta, and we’ll use the ‘%%capture output’ cell magic to capture the training log of each run so that we can parse it afterwards.

Eta = 0.3

scores = {}
%%capture output

xgb_params = {
    'eta': 0.3,
    'max_depth': 6,
    'min_child_weight': 1,
    
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    
    'nthread': 8,
    'seed': 1,
    'verbosity': 1
}

model = xgb.train(xgb_params, dtrain, num_boost_round=200,
                  verbose_eval=5,   # log train/val AUC every 5 rounds
                  evals=watchlist)
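The ‘dtrain’ matrix and the ‘watchlist’ were created in the previous article. As a rough reminder of what that setup looks like (the names X_train, y_train, X_val, y_val, and features below are assumptions for illustration, not definitions from this post):

import xgboost as xgb

# Assumed setup from the previous part: DMatrix objects for train/validation
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=features)
dval = xgb.DMatrix(X_val, label=y_val, feature_names=features)

# Each (DMatrix, name) pair is evaluated every boosting round; the names
# 'train' and 'val' are what appear in the captured training log
watchlist = [(dtrain, 'train'), (dval, 'val')]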

We aim to structure keys in the format ‘eta=0.3’ to serve as identifiers in the scores dictionary.

'eta=%s' % (xgb_params['eta'])
# Output: 'eta=0.3'

In the next snippet, we once again employ the ‘parse_xgb_output’ function, which we defined in the previous article. It parses the captured training log and returns a dataframe with the train_auc and val_auc values for each num_iter.

key = 'eta=%s' % (xgb_params['eta'])
scores[key] = parse_xgb_output(output)
key

# Output: 'eta=0.3'
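For reference, here is a minimal sketch of what ‘parse_xgb_output’ might look like (the actual definition is in the previous article). It assumes the captured log contains xgboost’s default evaluation lines, e.g. ‘[0] train-auc:0.86730 val-auc:0.77938’, separated by tabs:

import pandas as pd

def parse_xgb_output(output):
    results = []

    # output is the object produced by %%capture; its stdout holds the log
    for line in output.stdout.strip().split('\n'):
        it_line, train_line, val_line = line.split('\t')

        it = int(it_line.strip('[]'))
        train = float(train_line.split(':')[1])
        val = float(val_line.split(':')[1])

        results.append((it, train, val))

    columns = ['num_iter', 'train_auc', 'val_auc']
    return pd.DataFrame(results, columns=columns)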

Now the dictionary scores contains the dataframe for this eta.

scores

# Output:
# {'eta=0.3':     num_iter  train_auc  val_auc
#  0          0    0.86730  0.77938
#  1          5    0.93086  0.80858
#  2         10    0.95447  0.80851
#  3         15    0.96554  0.81334
#  4         20    0.97464  0.81729
#  5         25    0.97953  0.81686
#   ...
#  36       180    1.00000  0.80723
#  37       185    1.00000  0.80678
#  38       190    1.00000  0.80672
#  39       195    1.00000  0.80708
#  40       199    1.00000  0.80725}
scores['eta=0.3']
    num_iter  train_auc  val_auc
0          0    0.86730  0.77938
1          5    0.93086  0.80858
2         10    0.95447  0.80851
3         15    0.96554  0.81334
4         20    0.97464  0.81729
5         25    0.97953  0.81686
6         30    0.98579  0.81543
7         35    0.99011  0.81206
8         40    0.99421  0.80922
9         45    0.99548  0.80842
10        50    0.99653  0.80918
11        55    0.99765  0.81114
12        60    0.99817  0.81172
13        65    0.99887  0.80798
14        70    0.99934  0.80870
15        75    0.99965  0.80555
16        80    0.99979  0.80549
17        85    0.99988  0.80374
18        90    0.99993  0.80409
19        95    0.99996  0.80548
20       100    0.99998  0.80509
21       105    0.99999  0.80629
22       110    1.00000  0.80637
23       115    1.00000  0.80494
24       120    1.00000  0.80574
25       125    1.00000  0.80727
26       130    1.00000  0.80746
27       135    1.00000  0.80753
28       140    1.00000  0.80899
29       145    1.00000  0.80733
30       150    1.00000  0.80841
31       155    1.00000  0.80734
32       160    1.00000  0.80711
33       165    1.00000  0.80707
34       170    1.00000  0.80734
35       175    1.00000  0.80704
36       180    1.00000  0.80723
37       185    1.00000  0.80678
38       190    1.00000  0.80672
39       195    1.00000  0.80708
40       199    1.00000  0.80725
Dataframe for ‘eta=0.3’

Eta = 1.0

Now, let’s set the eta value to its maximum, which is 1.0.

%%capture output

xgb_params = {
    'eta': 1.0, 
    'max_depth': 6,
    'min_child_weight': 1,
    
    'objective': 'binary:logistic',
    'eval_metric': 'auc',

    'nthread': 8,
    'seed': 1,
    'verbosity': 1,
}

model = xgb.train(xgb_params, dtrain, num_boost_round=200,
                  verbose_eval=5,
                  evals=watchlist)
key = 'eta=%s' % (xgb_params['eta'])
scores[key] = parse_xgb_output(output)
key
# Output: 'eta=1.0'

With this second run recorded, the scores dictionary now contains two keys, each mapping to its own dataframe.

scores['eta=1.0']
    num_iter  train_auc  val_auc
0          0    0.86730  0.77938
1          5    0.95857  0.79136
2         10    0.98061  0.78355
3         15    0.99549  0.78050
4         20    0.99894  0.78591
5         25    0.99989  0.78401
6         30    1.00000  0.78371
7         35    1.00000  0.78234
8         40    1.00000  0.78184
9         45    1.00000  0.77963
10        50    1.00000  0.78645
11        55    1.00000  0.78644
12        60    1.00000  0.78545
13        65    1.00000  0.78612
14        70    1.00000  0.78515
15        75    1.00000  0.78516
16        80    1.00000  0.78420
17        85    1.00000  0.78570
18        90    1.00000  0.78793
19        95    1.00000  0.78865
20       100    1.00000  0.79075
21       105    1.00000  0.79107
22       110    1.00000  0.79022
23       115    1.00000  0.79036
24       120    1.00000  0.79021
25       125    1.00000  0.79025
26       130    1.00000  0.78994
27       135    1.00000  0.79084
28       140    1.00000  0.79048
29       145    1.00000  0.78967
30       150    1.00000  0.78969
31       155    1.00000  0.78992
32       160    1.00000  0.79064
33       165    1.00000  0.79067
34       170    1.00000  0.79115
35       175    1.00000  0.79126
36       180    1.00000  0.79199
37       185    1.00000  0.79179
38       190    1.00000  0.79165
39       195    1.00000  0.79201
40       199    1.00000  0.79199
Dataframe for ‘eta=1.0’

Eta = 0.1

Let’s go through the process once more for ‘eta=0.1’ and subsequently print out the dataframe.

%%capture output

xgb_params = {
    'eta': 0.1, 
    'max_depth': 6,
    'min_child_weight': 1,
    
    'objective': 'binary:logistic',
    'eval_metric': 'auc',

    'nthread': 8,
    'seed': 1,
    'verbosity': 1,
}

model = xgb.train(xgb_params, dtrain, num_boost_round=200,
                  verbose_eval=5,
                  evals=watchlist)
key = 'eta=%s' % (xgb_params['eta'])
scores[key] = parse_xgb_output(output)
key

# Output: 'eta=0.1'
scores['eta=0.1']
    num_iter  train_auc  val_auc
0          0    0.86730  0.77938
1          5    0.90325  0.79290
2         10    0.91874  0.80510
3         15    0.93126  0.81380
4         20    0.93873  0.81804
5         25    0.94638  0.82065
6         30    0.95338  0.82063
7         35    0.95874  0.82404
8         40    0.96325  0.82644
9         45    0.96694  0.82602
10        50    0.97195  0.82549
11        55    0.97475  0.82648
12        60    0.97708  0.82781
13        65    0.97937  0.82775
14        70    0.98214  0.82681
15        75    0.98315  0.82728
16        80    0.98517  0.82560
17        85    0.98721  0.82503
18        90    0.98840  0.82443
19        95    0.98972  0.82389
20       100    0.99061  0.82456
21       105    0.99157  0.82359
22       110    0.99224  0.82274
23       115    0.99288  0.82147
24       120    0.99378  0.82154
25       125    0.99481  0.82195
26       130    0.99541  0.82252
27       135    0.99564  0.82190
28       140    0.99630  0.82219
29       145    0.99673  0.82177
30       150    0.99711  0.82136
31       155    0.99750  0.82154
32       160    0.99774  0.82102
33       165    0.99821  0.82060
34       170    0.99838  0.82060
35       175    0.99861  0.82012
36       180    0.99882  0.82053
37       185    0.99898  0.82028
38       190    0.99904  0.81973
39       195    0.99920  0.81909
40       199    0.99927  0.81864
Dataframe for ‘eta=0.1’

Eta = 0.05

Let’s do it again for ‘eta=0.05’ and print out the dataframe.

%%capture output

xgb_params = {
    'eta': 0.05, 
    'max_depth': 6,
    'min_child_weight': 1,
    
    'objective': 'binary:logistic',
    'eval_metric': 'auc',

    'nthread': 8,
    'seed': 1,
    'verbosity': 1,
}

model = xgb.train(xgb_params, dtrain, num_boost_round=200,
                  verbose_eval=5,
                  evals=watchlist)
key = 'eta=%s' % (xgb_params['eta'])
scores[key] = parse_xgb_output(output)
key

# Output: 'eta=0.05'
scores['eta=0.05']
    num_iter  train_auc  val_auc
0          0    0.86730  0.77938
1          5    0.88650  0.79584
2         10    0.90368  0.79623
3         15    0.91072  0.79938
4         20    0.91774  0.80510
5         25    0.92385  0.80895
6         30    0.92987  0.81175
7         35    0.93379  0.81480
8         40    0.93856  0.81547
9         45    0.94316  0.81807
10        50    0.94753  0.81793
11        55    0.95028  0.81926
12        60    0.95324  0.81998
13        65    0.95581  0.82159
14        70    0.95762  0.82299
15        75    0.95944  0.82368
16        80    0.96100  0.82524
17        85    0.96308  0.82604
18        90    0.96572  0.82666
19        95    0.96798  0.82667
20       100    0.96955  0.82719
21       105    0.97133  0.82745
22       110    0.97288  0.82819
23       115    0.97426  0.82822
24       120    0.97578  0.82768
25       125    0.97702  0.82790
26       130    0.97788  0.82760
27       135    0.97923  0.82764
28       140    0.98012  0.82725
29       145    0.98113  0.82665
30       150    0.98190  0.82575
31       155    0.98285  0.82581
32       160    0.98381  0.82560
33       165    0.98457  0.82576
34       170    0.98541  0.82591
35       175    0.98652  0.82581
36       180    0.98711  0.82526
37       185    0.98789  0.82525
38       190    0.98876  0.82535
39       195    0.98932  0.82538
40       199    0.98977  0.82522
Dataframe for ‘eta=0.05’

Eta = 0.01

Once more, let’s assess the performance for ‘eta=0.01’.

%%capture output

xgb_params = {
    'eta': 0.01, 
    'max_depth': 6,
    'min_child_weight': 1,
    
    'objective': 'binary:logistic',
    'eval_metric': 'auc',

    'nthread': 8,
    'seed': 1,
    'verbosity': 1,
}

model = xgb.train(xgb_params, dtrain, num_boost_round=200,
                  verbose_eval=5,
                  evals=watchlist)
key = 'eta=%s' % (xgb_params['eta'])
scores[key] = parse_xgb_output(output)
key

# Output: 'eta=0.01'
scores['eta=0.01']
    num_iter  train_auc  val_auc
0          0    0.86730  0.77938
1          5    0.87157  0.77925
2         10    0.87247  0.78051
3         15    0.87541  0.78302
4         20    0.87584  0.78707
5         25    0.88406  0.79331
6         30    0.89027  0.79763
7         35    0.89559  0.79914
8         40    0.89782  0.79883
9         45    0.89983  0.79845
10        50    0.90182  0.79697
11        55    0.90394  0.79775
12        60    0.90531  0.79684
13        65    0.90630  0.79616
14        70    0.90796  0.79672
15        75    0.90955  0.79807
16        80    0.91116  0.79976
17        85    0.91227  0.80130
18        90    0.91368  0.80285
19        95    0.91515  0.80390
20       100    0.91654  0.80499
21       105    0.91791  0.80534
22       110    0.91902  0.80523
23       115    0.92032  0.80515
24       120    0.92135  0.80497
25       125    0.92262  0.80520
26       130    0.92405  0.80605
27       135    0.92547  0.80692
28       140    0.92667  0.80730
29       145    0.92771  0.80818
30       150    0.92919  0.80928
31       155    0.92990  0.81039
32       160    0.93077  0.81125
33       165    0.93155  0.81160
34       170    0.93246  0.81189
35       175    0.93348  0.81257
36       180    0.93466  0.81340
37       185    0.93585  0.81429
38       190    0.93685  0.81481
39       195    0.93777  0.81554
40       199    0.93862  0.81575
Dataframe for ‘eta=0.01’

Plotting Eta

Now that we’ve stored the results of the different runs as key-value pairs, we can examine the keys in the dictionary and then compare the runs for ‘eta=0.3’, ‘eta=1.0’, ‘eta=0.1’, ‘eta=0.05’, and ‘eta=0.01’.

scores.keys()
# Output: dict_keys(['eta=0.3', 'eta=1.0', 'eta=0.1', 'eta=0.05', 'eta=0.01'])

Let’s plot the validation AUC of all our runs.

for key, df_score in scores.items():
    plt.plot(df_score.num_iter, df_score.val_auc, label=key)
plt.legend()

Let’s concentrate on a few graphs for the initial analysis. We will plot three of them: ‘eta=1.0’, ‘eta=0.3’, and ‘eta=0.1’.

etas = ['eta=1.0', 'eta=0.3', 'eta=0.1']
for eta in etas:
    df_score = scores[eta]
    plt.plot(df_score.num_iter, df_score.val_auc, label=eta)
plt.legend()

This plot provides a clearer view of the results. Notably, ‘eta=1.0’ exhibits the worst performance. It quickly reaches peak performance but then experiences a sharp decline, maintaining a consistently poor level. ‘eta=0.3’ performs reasonably well until around iteration 25, after which it steadily deteriorates. On the other hand, ‘eta=0.1’ demonstrates a slower growth rate, reaching its peak at a later stage before descending. This pattern is a direct reflection of the learning rate’s influence.

The learning rate controls both the speed at which the model learns and the size of the steps it takes during each iteration. If the steps are too large, the model learns rapidly but eventually starts to degrade due to the excessive step size, resulting in overfitting. Conversely, a smaller learning rate signifies slower but more stable learning. Such models tend to degrade more gradually, and their overfitting tendencies are less pronounced compared to models with higher learning rates.

Next, let’s look at eta=0.3, eta=0.1, and eta=0.01.

etas = ['eta=0.3', 'eta=0.1', 'eta=0.01']
for eta in etas:
    df_score = scores[eta]
    plt.plot(df_score.num_iter, df_score.val_auc, label=eta)
plt.legend()

‘eta=0.01’ displays an extremely slow learning rate, making it challenging to estimate how long it might take to outperform the other model (represented by the orange curve). This model’s progress is painstakingly slow, as the steps it takes are exceedingly tiny.

On the other hand, ‘eta=0.3’ takes a few significant steps initially but succumbs to overfitting more rapidly. In this plot, ‘eta=0.1’ seems to strike the ideal balance, particularly between 50 and 75 iterations. It may take a bit longer to reach its peak performance, but the resulting performance improvement justifies the wait.

There was also eta=0.05, so let’s finally look at that plot as well.

etas = ['eta=0.1', 'eta=0.05', 'eta=0.01']
for eta in etas:
    df_score = scores[eta]
    plt.plot(df_score.num_iter, df_score.val_auc, label=eta)
plt.legend()

The ‘eta=0.05’ model requires approximately twice as many iterations to converge when compared to the blue model (‘eta=0.1’). Although it takes smaller steps and requires more time, the end result is still inferior to the blue model. Thus, it’s evident that the ‘eta=0.1’ model stands out as the best option, as it achieves better performance with fewer steps.
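One way to confirm this numerically is to look up, for each stored run, the boosting round with the highest validation AUC; a small sketch using only the scores dictionary we built above:

for key, df_score in scores.items():
    # row with the best validation AUC for this eta
    best = df_score.loc[df_score.val_auc.idxmax()]
    print('%s: best val_auc=%.5f at num_iter=%d' % (key, best.val_auc, best.num_iter))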

Now that we’ve found the best value for ‘eta’, the second part of XGBoost parameter tuning will focus on two more parameters, ‘max_depth’ and ‘min_child_weight’, before we train the final model.
