Overview:
In my last post (Introduction to Machine Learning – Part 4) I wrote about the CRISP-DM ML process and its six steps. This post is about the 4th step of that process, the modeling step, which is where the model selection process comes in.
Model Selection Process
Imagine there is a model g and a dataset X with target values y. The model performs quite well on that data. But later new data arrives and you would like to know how the model performs on it. What you can do is take all the data and use a train-validation split: for example, 80% used for training (the "old" dataset) and 20% used for validation (the "new" dataset).
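A minimal sketch of such a split, assuming scikit-learn is available; X and y here are just random placeholder data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples with 3 features and binary targets.
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)

# 80% for training (the "old" data), 20% for validation (the "new" data);
# random_state fixes the shuffle so the split is reproducible.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```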
Steps to get model performance
- Extract the feature matrix Xtrain from the training dataset
- Take the target values ytrain from the training dataset
- Use Xtrain and ytrain to train g
- Extract Xv and yv from the validation dataset in the same way
- Apply g to Xv to get the predicted values: g(Xv) = ŷv
- Compare the predicted values ŷv with the actual values yv to measure model performance (see the sketch after the table)
| ŷv (prediction as likelihood) | ŷv (prediction) | yv (target) |
|---|---|---|
| 0.8 | 1 | 1 |
| 0.7 | 1 | 0 |
| 0.6 | 1 | 1 |
| 0.1 | 0 | 0 |
| 0.9 | 1 | 1 |
| 0.6 | 1 | 0 |

Here a likelihood of 0.5 or more counts as class 1.
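A small sketch of this comparison in plain NumPy, using the values from the table:

```python
import numpy as np

# Values from the table above.
y_prob = np.array([0.8, 0.7, 0.6, 0.1, 0.9, 0.6])  # g(Xv): predictions as likelihood
y_val = np.array([1, 0, 1, 0, 1, 0])                # yv: actual targets

# Threshold the likelihoods at 0.5 to get hard predictions: [1, 1, 1, 0, 1, 1]
y_pred = (y_prob >= 0.5).astype(int)

# Accuracy = fraction of predictions that match the targets.
accuracy = (y_pred == y_val).mean()
print(accuracy)  # 0.666... (4 of 6 correct)
```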
- Try different models and select the best one based on validation accuracy, for example (a code sketch follows the table):
| model | type | validation accuracy |
|---|---|---|
| g1 | linear regression | 66% |
| g2 | decision tree | 60% |
| g3 | random forest | 67% |
| g4 | neural network | 80% |
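A hedged sketch of such a comparison with scikit-learn; the dataset is synthetic, and since the targets are classes, logistic regression stands in for the linear model:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for Xtrain/ytrain and Xv/yv.
X, y = make_classification(n_samples=500, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate models g1..g4.
models = {
    "g1 logistic regression": LogisticRegression(max_iter=1000),
    "g2 decision tree": DecisionTreeClassifier(random_state=42),
    "g3 random forest": RandomForestClassifier(random_state=42),
    "g4 neural network": MLPClassifier(max_iter=1000, random_state=42),
}

# Train each model and report its validation accuracy.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_val, y_val))
```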
Multiple comparison problem
The last table visualizes a problem that can arise when comparing different models on a single validation dataset: the winning model could simply have gotten lucky (like a coin flip) at predicting the validation data. When testing the models on a completely different dataset, the winner could be a different model.
Training-validation-test split
To guard against cases like this, use three datasets: training, validation, and test (60%-20%-20%). Hide the 20% for testing and perform the same steps as before. To select the best model based on the training and validation sets we use the same model selection process as described above. But now there is an additional step: to ensure that the winning model didn’t just get lucky on the validation dataset, we also apply it to the test dataset: g(XT) = ŷT.
| model | type | accVal | accTest |
|---|---|---|---|
| g1 | linear regression | 66% | |
| g2 | decision tree | 60% | |
| g3 | random forest | 67% | |
| g4 | neural network | 80% | 79% |
The test accuracy (79%) is close to the validation accuracy (80%), so we can conclude that g4 generalizes well and its win was not just luck.
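A sketch of the whole train-validation-test flow under the same assumptions as above (synthetic data, the neural network as winning model g4):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, random_state=42)

# Two successive splits produce 60% train, 20% validation, 20% test:
# first hide 20% for testing, then split the remaining 80% into 75/25
# (0.25 * 0.8 = 0.2 of the original data).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

# Suppose g4 (the neural network) won the validation comparison.
best_model = MLPClassifier(max_iter=1000, random_state=42).fit(X_train, y_train)

# accVal and accTest should be close if g4 didn't just get lucky
# on the validation set.
print("accVal:", best_model.score(X_val, y_val))
print("accTest:", best_model.score(X_test, y_test))
```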
Summary
This process is called the model selection process and is one of the most important steps in Machine Learning.
The 6 steps:
1. Split the dataset (60%-20%-20%)
2. Train the model
3. Apply the model to the validation dataset
4. Repeat steps 2 and 3 a few times and select the best model
5. Apply the best model to the test dataset
6. Check that everything is good (compare the accuracy on the validation and test datasets)
Alternative Approach
To not waste the validation dataset, you can reuse it. As before, you train the candidate models on the training dataset, apply them to the validation dataset, and choose the best one. But then you combine the training and validation datasets into one larger dataset and retrain the selected model on both, before applying this new model to the test dataset. Training on the larger combined dataset can help improve the performance and generalization of the selected model.
Here are the steps for the alternative approach (a code sketch follows the list):
- Split the original dataset into training, validation, and test sets with a ratio of 60%-20%-20%.
- Train the initial models using the training dataset.
- Apply the initial models to the validation dataset and evaluate their performance.
- Select the best-performing model based on the validation results.
- Combine the training and validation datasets to create a new combined dataset.
- Retrain the selected model using the new combined dataset.
- Apply the newly trained model to the test dataset to assess its performance on unseen data.
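A sketch of these steps, again with synthetic data and assuming the neural network was the selected model:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# 60%-20%-20% split as before.
X, y = make_classification(n_samples=500, random_state=42)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

# After model selection, merge training and validation data
# into one larger combined dataset.
X_full = np.concatenate([X_train, X_val])
y_full = np.concatenate([y_train, y_val])

# Retrain the selected model on the combined data, then evaluate
# it once on the untouched test set.
final_model = MLPClassifier(max_iter=1000, random_state=42).fit(X_full, y_full)
print("accTest:", final_model.score(X_test, y_test))
```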
By training the model on a larger combined dataset, we can potentially capture more patterns and improve the model’s ability to generalize to new data. The final evaluation on the test dataset provides a more reliable measure of the model’s performance and gives us confidence in its ability to make accurate predictions.
It’s important to note that the alternative approach may not always yield better results compared to the original model selection process. Its effectiveness depends on the specific characteristics of the dataset and the performance of the initial models. Experimentation and careful evaluation are key to determining the most suitable approach for your machine learning task.
In summary, the model selection process is a crucial step in machine learning: it involves thoroughly assessing different models and selecting the one that performs best on unseen data. The alternative approach of combining the training and validation datasets can be an effective strategy to enhance model performance and generalization.