ML Zoomcamp 2023 – Decision Trees and Ensemble Learning – Part 14

  1. Selecting the final model
    1. Choosing between XGBoost, random forest and decision tree
      1. Retrain the best model of each type
      2. Evaluate the best models on validation data
    2. Training the final model
    3. Evaluate the final model

Selecting the final model

This is the final part of the ‘Decision Trees and Ensemble Learning’ module. In this part, we revisit the best model of each type and evaluate its performance on the validation data. Based on these evaluations, we select the overall best model, train it on the full training dataset, and evaluate the final model on the test set.
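The code in this part relies on the setup carried over from the earlier parts of the module: the vectorized training and validation matrices (X_train, y_train, X_val, y_val), the XGBoost DMatrices (dtrain, dval), and the held-out df_full_train, df_test and y_test. As a minimal sketch, the imports assumed below are:

import xgboost as xgb

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import roc_auc_score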

Choosing between XGBoost, random forest and decision tree

Retrain the best model of each type

Let’s retrain the best Decision Tree model we had.

dt = DecisionTreeClassifier(max_depth=6, min_samples_leaf=15)
dt.fit(X_train, y_train)

# Output: 
# DecisionTreeClassifier(max_depth=6, min_samples_leaf=15)

Let’s retrain the best Random Forest model we had.

rf = RandomForestClassifier(n_estimators=200,
                            max_depth=10,
                            min_samples_leaf=3,
                            random_state=1)
rf.fit(X_train, y_train)

# Output: 
# RandomForestClassifier(max_depth=10, min_samples_leaf=3, n_estimators=200,
#                        random_state=1)

Let’s retrain the best XGBoost model we had.
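The training call below reuses the dtrain DMatrix prepared in the earlier XGBoost part. As a reminder (a sketch, assuming dv is the DictVectorizer fitted on the training data in that part), dtrain and dval would have been built roughly like this:

# Assumed from the earlier part: XGBoost trains on DMatrix objects
# built from the vectorized features and labels.
features = list(dv.get_feature_names_out())
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=features)
dval = xgb.DMatrix(X_val, label=y_val, feature_names=features)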

xgb_params = {
    'eta': 0.1, 
    'max_depth': 3,
    'min_child_weight': 1,

    'objective': 'binary:logistic',
    'eval_metric': 'auc',

    'nthread': 8,
    'seed': 1,
    'verbosity': 1,
}

model = xgb.train(xgb_params, dtrain, num_boost_round=175)

Evaluate the best models on validation data

# Decision Tree
y_pred = dt.predict_proba(X_val)[:, 1]
roc_auc_score(y_val, y_pred)

# Output: 0.7850954203095104

# Random Forest
y_pred = rf.predict_proba(X_val)[:, 1]
roc_auc_score(y_val, y_pred)

# Output: 0.8246258264512848

# XGBoost Model
y_pred = model.predict(dval)
roc_auc_score(y_val, y_pred)

# Output: 0.8309347073212081

We see that the XGBoost model has the best AUC score, so we’ll use XGBoost for the final model.
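To make the comparison easier to scan, the three validation scores can also be collected and printed together (an optional snippet that reuses the variables defined above):

# Optional: print the validation AUC of each model, best first.
scores = {
    'decision tree': roc_auc_score(y_val, dt.predict_proba(X_val)[:, 1]),
    'random forest': roc_auc_score(y_val, rf.predict_proba(X_val)[:, 1]),
    'xgboost': roc_auc_score(y_val, model.predict(dval)),
}

for name, auc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {auc:.3f}')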

Training the final model

To train the final model, we will use the full training dataset df_full_train (the training and validation data combined). After training, we will evaluate the final model on the test dataset.

df_full_train

# Output:
#        status  seniority     home  time  age  marital  records        job  expenses  income   assets   debt  amount  price
# 3668       ok         22    owner    48   48  married       no      fixed        60   110.0   3000.0    0.0    1000   1460
# 2540  default          8    other    60   41  married       no  freelance        45    62.0      0.0    0.0    1800   2101
# 279        ok          2  parents    36   19  married       no      fixed        35   162.0   4000.0  100.0     400    570
# 3536       ok          1    owner    12   61  married       no     others        45   103.0  20000.0    0.0     300    650
# 3866       ok         13    owner    60   27  married       no      fixed        35   253.0   7000.0    0.0    1060   1750
# 332   default          4    owner    60   47  married       no  freelance        75     0.0  13500.0    0.0    1900   1976
# 1293       ok          2     rent    60   28   single       no      fixed        45   101.0      0.0    0.0    1300   1333
# 4023       ok          2  parents    36   25   single       no      fixed        35   110.0      0.0    0.0     500   1200
# 3775       ok          4    other    60   25   single       no      fixed        35   162.0      0.0    0.0    1800   2999
# 1945  default          1  parents    48   25   single       no  freelance        35     0.0      0.0    0.0    1800   1809
#
# 3563 rows × 14 columns

Upon reviewing the previous output, we can observe that the index is not ordered. To address this, we will start by resetting the index.

df_full_train = df_full_train.reset_index(drop=True)

Next, we set the target variable y_full_train and remove the ‘status’ column from the training dataframe to prevent it from accidentally being used as a feature during training.

y_full_train = (df_full_train.status == 'default').astype(int).values
y_full_train
# Output: array([0, 1, 0, ..., 0, 0, 1])
del df_full_train['status']

We can create dictionaries for the DictVectorizer and then use the fit_transform method to obtain X_full_train. For X_test, we only need to call the transform method since the vectorizer has already been fitted.

dicts_full_train = df_full_train.to_dict(orient='records')

dv = DictVectorizer(sparse=False)
X_full_train = dv.fit_transform(dicts_full_train)

dicts_test = df_test.to_dict(orient='records')
X_test = dv.transform(dicts_test)

Inspecting the ‘feature_names’ reveals the one-hot-encoded columns and confirms that there is no ‘status’ column, indicating that our data preparation is complete and we are ready to train the final model.

feature_names = list(dv.get_feature_names_out())
feature_names

# Output: 
# ['age',
#  'amount',
#  'assets',
#  'debt',
#  'expenses',
#  'home=ignore',
#  'home=other',
#  'home=owner',
#  'home=parents',
#  'home=private',
#  'home=rent',
#  'home=unk',
#  'income',
#  'job=fixed',
#  'job=freelance',
#  'job=others',
#  'job=partime',
#  'job=unk',
#  'marital=divorced',
#  'marital=married',
#  'marital=separated',
#  'marital=single',
#  'marital=unk',
#  'marital=widow',
#  'price',
#  'records=no',
#  'records=yes',
#  'seniority',
#  'time']

XGBoost requires the data to be wrapped in a DMatrix for training. We also prepare the test data; it doesn’t need labels, because we’ll compute the evaluation metric with Scikit-Learn’s roc_auc_score against y_test.

dfulltrain = xgb.DMatrix(X_full_train, label=y_full_train,
                         feature_names=feature_names)

dtest = xgb.DMatrix(X_test, feature_names=feature_names)

Now let’s set the parameters and train the final model.

xgb_params = {
    'eta': 0.1, 
    'max_depth': 3,
    'min_child_weight': 1,

    'objective': 'binary:logistic',
    'eval_metric': 'auc',

    'nthread': 8,
    'seed': 1,
    'verbosity': 1,
}

model = xgb.train(xgb_params, dfulltrain, num_boost_round=175)

Evaluate the final model

y_pred = model.predict(dtest)
roc_auc_score(y_test, y_pred)

# Output: 0.8289367577342261

The final model’s AUC on the test set (0.829) is slightly lower than the validation AUC of the best XGBoost model (0.831), but the difference is only about 0.002, so this is fine. We can conclude that the model didn’t overfit: the final model generalizes well to unseen data. XGBoost is often among the best-performing models for tabular data (dataframes with features). The downside is that XGBoost models are more complex: they have more parameters, are harder to tune, and are easier to overfit. In exchange, they usually deliver better performance.
