Selecting the final model
This is the final part of the ‘Decision Trees and Ensemble Learning’ module (Part 14). This time, we revisit the best model of each type and evaluate its performance on the validation data. Based on these evaluations, we will select the overall best model and train it on the full training dataset. The final model will then be evaluated on the test set.
Choosing between XGBoost, random forest and decision tree
Retrain the best model of each type
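The code below assumes the imports and the train/validation split (X_train, y_train, X_val, y_val, as well as the dtrain and dval DMatrix objects) prepared in the earlier parts of this module. For reference, a minimal set of imports along these lines is assumed:
# Assumed imports carried over from the earlier parts of the module.
import xgboost as xgb
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import roc_auc_score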
Let’s retrain the best Decision Tree model we had.
dt = DecisionTreeClassifier(max_depth=6, min_samples_leaf=15)
dt.fit(X_train, y_train)
# Output:
# DecisionTreeClassifier(max_depth=6, min_samples_leaf=15)
Let’s retrain the best Random Forest model we had.
rf = RandomForestClassifier(n_estimators=200,
                            max_depth=10,
                            min_samples_leaf=3,
                            random_state=1)
rf.fit(X_train, y_train)
# Output:
# RandomForestClassifier(max_depth=10, min_samples_leaf=3, n_estimators=200,
#                        random_state=1)
Let’s retrain the best XGBoost model we had.
xgb_params = {
    'eta': 0.1,
    'max_depth': 3,
    'min_child_weight': 1,
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'nthread': 8,
    'seed': 1,
    'verbosity': 1,
}
model = xgb.train(xgb_params, dtrain, num_boost_round=175)
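Here dtrain and dval are the DMatrix wrappers built in the earlier parts of the module; as a rough reminder (the exact construction is an assumption here, not shown in this part), they look something like this:
# Assumed from earlier parts: feature_names comes from the DictVectorizer fitted on the training dicts.
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=feature_names)
dval = xgb.DMatrix(X_val, label=y_val, feature_names=feature_names)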
Evaluate the best model of each type on the validation data
# Decision Tree
y_pred = dt.predict_proba(X_val)[:, 1]
roc_auc_score(y_val, y_pred)
# Output: 0.7850954203095104
# Random Forest
y_pred = rf.predict_proba(X_val)[:, 1]
roc_auc_score(y_val, y_pred)
# Output: 0.8246258264512848
# XGBoost Model
y_pred = model.predict(dval)
roc_auc_score(y_val, y_pred)
# Output: 0.8309347073212081
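To make the comparison easier to read, we can also collect the three validation AUC scores in one place; a small optional sketch:
# Compare the validation AUC of the three best models side by side.
scores = {
    'decision tree': roc_auc_score(y_val, dt.predict_proba(X_val)[:, 1]),
    'random forest': roc_auc_score(y_val, rf.predict_proba(X_val)[:, 1]),
    'xgboost': roc_auc_score(y_val, model.predict(dval)),
}
for name, auc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {auc:.3f}')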
We see that the XGBoost model has the best AUC score, so we’ll use XGBoost to train the final model.
Training the final model
To train the final model, we will use the full training dataset, that is, the training and validation data combined. Following the training, we will evaluate the final model on our test dataset.
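As a reminder, df_full_train and df_test come from the split performed in the earlier parts of the module; a sketch of that split (the exact test_size and random_state values here are assumptions, not necessarily the original ones) looks roughly like this:
# Assumed sketch of the original split; df is the full, cleaned dataframe from the earlier parts.
from sklearn.model_selection import train_test_split
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=11)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=11)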
df_full_train
| | status | seniority | home | time | age | marital | records | job | expenses | income | assets | debt | amount | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3668 | ok | 22 | owner | 48 | 48 | married | no | fixed | 60 | 110.0 | 3000.0 | 0.0 | 1000 | 1460 |
| 2540 | default | 8 | other | 60 | 41 | married | no | freelance | 45 | 62.0 | 0.0 | 0.0 | 1800 | 2101 |
| 279 | ok | 2 | parents | 36 | 19 | married | no | fixed | 35 | 162.0 | 4000.0 | 100.0 | 400 | 570 |
| 3536 | ok | 1 | owner | 12 | 61 | married | no | others | 45 | 103.0 | 20000.0 | 0.0 | 300 | 650 |
| 3866 | ok | 13 | owner | 60 | 27 | married | no | fixed | 35 | 253.0 | 7000.0 | 0.0 | 1060 | 1750 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 332 | default | 4 | owner | 60 | 47 | married | no | freelance | 75 | 0.0 | 13500.0 | 0.0 | 1900 | 1976 |
| 1293 | ok | 2 | rent | 60 | 28 | single | no | fixed | 45 | 101.0 | 0.0 | 0.0 | 1300 | 1333 |
| 4023 | ok | 2 | parents | 36 | 25 | single | no | fixed | 35 | 110.0 | 0.0 | 0.0 | 500 | 1200 |
| 3775 | ok | 4 | other | 60 | 25 | single | no | fixed | 35 | 162.0 | 0.0 | 0.0 | 1800 | 2999 |
| 1945 | default | 1 | parents | 48 | 25 | single | no | freelance | 35 | 0.0 | 0.0 | 0.0 | 1800 | 1809 |
Upon reviewing the previous output, we can observe that the index is not ordered. To address this, we will start by resetting the index.
df_full_train = df_full_train.reset_index(drop=True)
The next steps are to create the target variable ‘y_full_train’ and to remove the ‘status’ column from the dataframe, so that it cannot accidentally be used as a feature during training.
y_full_train = (df_full_train.status == 'default').astype(int).values
y_full_train
# Output: array([0, 1, 0, ..., 0, 0, 1])
del df_full_train['status']
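As an optional sanity check, we can assert that the target column is really gone before vectorizing:
# Optional check: 'status' must no longer be among the feature columns.
assert 'status' not in df_full_train.columns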
We can create dictionaries for the DictVectorizer and then use the fit_transform method to obtain X_full_train. For X_test, we only need to call the transform method since the vectorizer has already been fitted.
dicts_full_train = df_full_train.to_dict(orient='records')
dv = DictVectorizer(sparse=False)
X_full_train = dv.fit_transform(dicts_full_train)
dicts_test = df_test.to_dict(orient='records')
X_test = dv.transform(dicts_test)
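It’s worth verifying that the train and test matrices ended up with the same number of feature columns; a quick optional check:
# Both matrices should have the same number of feature columns.
print(X_full_train.shape, X_test.shape)
assert X_full_train.shape[1] == X_test.shape[1]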
Inspecting the ‘feature_names’ reveals the one-hot-encoded columns and confirms that there is no ‘status’ column, indicating that our data preparation is complete and we are ready to train the final model.
feature_names = list(dv.get_feature_names_out())
feature_names
# Output:
# ['age',
# 'amount',
# 'assets',
# 'debt',
# 'expenses',
# 'home=ignore',
# 'home=other',
# 'home=owner',
# 'home=parents',
# 'home=private',
# 'home=rent',
# 'home=unk',
# 'income',
# 'job=fixed',
# 'job=freelance',
# 'job=others',
# 'job=partime',
# 'job=unk',
# 'marital=divorced',
# 'marital=married',
# 'marital=separated',
# 'marital=single',
# 'marital=unk',
# 'marital=widow',
# 'price',
# 'records=no',
# 'records=yes',
# 'seniority',
# 'time']
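If you prefer to verify this programmatically rather than by eye, a small optional check:
# Confirm that no feature derived from the target column slipped into the feature matrix.
assert not any(name.startswith('status') for name in feature_names)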
XGBoost requires the data in the form of a DMatrix for training. We also prepare the test data; it doesn’t need labels, because we’ll compute the AUC with Scikit-Learn.
feature_names = list(dv.get_feature_names_out())
dfulltrain = xgb.DMatrix(X_full_train, label=y_full_train,
                         feature_names=feature_names)
dtest = xgb.DMatrix(X_test, feature_names=feature_names)
Now let’s set the parameters and train the final model.
xgb_params = {
    'eta': 0.1,
    'max_depth': 3,
    'min_child_weight': 1,
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'nthread': 8,
    'seed': 1,
    'verbosity': 1,
}
model = xgb.train(xgb_params, dfulltrain, num_boost_round=175)
Evaluate the final model
y_pred = model.predict(dtest)
roc_auc_score(y_test, y_pred)
# Output: 0.8289367577342261
The final model’s AUC (0.829) is slightly lower than the best XGBoost model’s score on the validation set (0.831), but the difference is only a fraction of a percent, which is fine. We can conclude that the model didn’t overfit and generalizes well to unseen data. XGBoost is often among the best-performing models for tabular data (a dataframe with features). The downside is that XGBoost models are more complex: they have more parameters, they are harder to tune, and it’s easier to overfit with them. In exchange, you can usually get better performance out of them.