Selecting the final model
This is the final part of the ‘Decision Trees and Ensemble Learning’ module (Part 14). This time, we revisit the best model of each type and evaluate its performance on the validation data. Based on these evaluations, we will select the overall best model and train it on the full training dataset. The final model will then be evaluated on the test set.
Choosing between XGBoost, random forest and decision tree
Retrain the best model of each type
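The code below assumes the imports and the train/validation split (X_train, y_train, X_val, y_val, as well as the dtrain and dval DMatrix objects) prepared in the earlier parts of this module. For reference, a minimal set of imports along these lines is assumed:
# Assumed imports carried over from the earlier parts of the module.
import xgboost as xgb
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import roc_auc_score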
Let’s retrain the best Decision Tree model we had.
dt = DecisionTreeClassifier(max_depth=6, min_samples_leaf=15)
dt.fit(X_train, y_train)
# Output:
# DecisionTreeClassifier(max_depth=6, min_samples_leaf=15)
Let’s retrain the best Random Forest model we had.
rf = RandomForestClassifier(n_estimators=200,
                            max_depth=10,
                            min_samples_leaf=3,
                            random_state=1)
rf.fit(X_train, y_train)
# Output:
# RandomForestClassifier(max_depth=10, min_samples_leaf=3, n_estimators=200,
#                        random_state=1)
Let’s retrain the best XGBoost model we had.
xgb_params = {
    'eta': 0.1,
    'max_depth': 3,
    'min_child_weight': 1,
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'nthread': 8,
    'seed': 1,
    'verbosity': 1,
}
model = xgb.train(xgb_params, dtrain, num_boost_round=175)
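Here dtrain and dval are the DMatrix wrappers built in the earlier parts of the module; as a rough reminder (the exact construction is an assumption here, not shown in this part), they look something like this:
# Assumed from earlier parts: feature_names comes from the DictVectorizer fitted on the training dicts.
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=feature_names)
dval = xgb.DMatrix(X_val, label=y_val, feature_names=feature_names)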
Evaluate the best model of each type on the validation data
# Decision Tree
y_pred = dt.predict_proba(X_val)[:, 1]
roc_auc_score(y_val, y_pred)
# Output: 0.7850954203095104
# Random Forest
y_pred = rf.predict_proba(X_val)[:, 1]
roc_auc_score(y_val, y_pred)
# Output: 0.8246258264512848
# XGBoost Model
y_pred = model.predict(dval)
roc_auc_score(y_val, y_pred)
# Output: 0.8309347073212081
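To make the comparison easier to read, we can also collect the three validation AUC scores in one place; a small optional sketch:
# Compare the validation AUC of the three best models side by side.
scores = {
    'decision tree': roc_auc_score(y_val, dt.predict_proba(X_val)[:, 1]),
    'random forest': roc_auc_score(y_val, rf.predict_proba(X_val)[:, 1]),
    'xgboost': roc_auc_score(y_val, model.predict(dval)),
}
for name, auc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name}: {auc:.3f}')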
We see that the XGBoost model has the best AUC score, so we’ll use XGBoost to train the final model.
Training the final model
To train the final model, we will use the full training dataset, that is, the training and validation data combined. Following the training, we will evaluate the final model on our test dataset.
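As a reminder, df_full_train and df_test come from the split performed in the earlier parts of the module; a sketch of that split (the exact test_size and random_state values here are assumptions, not necessarily the original ones) looks roughly like this:
# Assumed sketch of the original split; df is the full, cleaned dataframe from the earlier parts.
from sklearn.model_selection import train_test_split
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=11)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=11)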
df_full_train
| | status | seniority | home | time | age | marital | records | job | expenses | income | assets | debt | amount | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3668 | ok | 22 | owner | 48 | 48 | married | no | fixed | 60 | 110.0 | 3000.0 | 0.0 | 1000 | 1460 |
| 2540 | default | 8 | other | 60 | 41 | married | no | freelance | 45 | 62.0 | 0.0 | 0.0 | 1800 | 2101 |
| 279 | ok | 2 | parents | 36 | 19 | married | no | fixed | 35 | 162.0 | 4000.0 | 100.0 | 400 | 570 |
| 3536 | ok | 1 | owner | 12 | 61 | married | no | others | 45 | 103.0 | 20000.0 | 0.0 | 300 | 650 |
| 3866 | ok | 13 | owner | 60 | 27 | married | no | fixed | 35 | 253.0 | 7000.0 | 0.0 | 1060 | 1750 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
| 332 | default | 4 | owner | 60 | 47 | married | no | freelance | 75 | 0.0 | 13500.0 | 0.0 | 1900 | 1976 |
| 1293 | ok | 2 | rent | 60 | 28 | single | no | fixed | 45 | 101.0 | 0.0 | 0.0 | 1300 | 1333 |
| 4023 | ok | 2 | parents | 36 | 25 | single | no | fixed | 35 | 110.0 | 0.0 | 0.0 | 500 | 1200 |
| 3775 | ok | 4 | other | 60 | 25 | single | no | fixed | 35 | 162.0 | 0.0 | 0.0 | 1800 | 2999 |
| 1945 | default | 1 | parents | 48 | 25 | single | no | freelance | 35 | 0.0 | 0.0 | 0.0 | 1800 | 1809 |
Upon reviewing the previous output, we can observe that the index is not ordered. To address this, we will start by resetting the index.
df_full_train = df_full_train.reset_index(drop=True)
The next steps are to create the target variable ‘y_full_train’ and to remove the ‘status’ column from the dataframe, so that it cannot accidentally be used as a feature during training.
y_full_train = (df_full_train.status == 'default').astype(int).values
y_full_train
# Output: array([0, 1, 0, ..., 0, 0, 1])
del df_full_train['status']
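As an optional sanity check, we can assert that the target column is really gone before vectorizing:
# Optional check: 'status' must no longer be among the feature columns.
assert 'status' not in df_full_train.columns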
We can create dictionaries for the DictVectorizer and then use the fit_transform method to obtain X_full_train. For X_test, we only need to call the transform method since the vectorizer has already been fitted.
dicts_full_train = df_full_train.to_dict(orient='records')
dv = DictVectorizer(sparse=False)
X_full_train = dv.fit_transform(dicts_full_train)
dicts_test = df_test.to_dict(orient='records')
X_test = dv.transform(dicts_test)
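It’s worth verifying that the train and test matrices ended up with the same number of feature columns; a quick optional check:
# Both matrices should have the same number of feature columns.
print(X_full_train.shape, X_test.shape)
assert X_full_train.shape[1] == X_test.shape[1]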
Inspecting the ‘feature_names’ reveals the one-hot-encoded columns and confirms that there is no ‘status’ column, indicating that our data preparation is complete and we are ready to train the final model.
feature_names = list(dv.get_feature_names_out())
feature_names
# Output:
# ['age',
# 'amount',
# 'assets',
# 'debt',
# 'expenses',
# 'home=ignore',
# 'home=other',
# 'home=owner',
# 'home=parents',
# 'home=private',
# 'home=rent',
# 'home=unk',
# 'income',
# 'job=fixed',
# 'job=freelance',
# 'job=others',
# 'job=partime',
# 'job=unk',
# 'marital=divorced',
# 'marital=married',
# 'marital=separated',
# 'marital=single',
# 'marital=unk',
# 'marital=widow',
# 'price',
# 'records=no',
# 'records=yes',
# 'seniority',
# 'time']
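If you prefer to verify this programmatically rather than by eye, a small optional check:
# Confirm that no feature derived from the target column slipped into the feature matrix.
assert not any(name.startswith('status') for name in feature_names)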
XGBoost requires the data in the form of a DMatrix for training. We also prepare the test data; it doesn’t need labels, because we’ll compute the AUC with Scikit-Learn.
feature_names = list(dv.get_feature_names_out())
dfulltrain = xgb.DMatrix(X_full_train, label=y_full_train,
                         feature_names=feature_names)
dtest = xgb.DMatrix(X_test, feature_names=feature_names)
Now let’s set the parameters and train the final model.
xgb_params = {
    'eta': 0.1,
    'max_depth': 3,
    'min_child_weight': 1,
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    'nthread': 8,
    'seed': 1,
    'verbosity': 1,
}
model = xgb.train(xgb_params, dfulltrain, num_boost_round=175)
Evaluate the final model
y_pred = model.predict(dtest)
roc_auc_score(y_test, y_pred)
# Output: 0.8289367577342261
The final model’s AUC (0.829) is slightly lower than the best XGBoost model’s score on the validation set (0.831), but the difference is only a fraction of a percent, which is fine. We can conclude that the model didn’t overfit and generalizes well to unseen data. XGBoost is often among the best-performing models for tabular data (a dataframe with features). The downside is that XGBoost models are more complex: they have more parameters, they are harder to tune, and it’s easier to overfit with them. In exchange, you can usually get better performance out of them.