Overview:
In my last post (Introduction to Machine Learning – Part 4) I wrote about the CRISP-DM ML process and its six steps. This post is about the 4th step of that process, the modeling step, which is where the model selection process comes in.
Model Selection Process
Imagine there is a model g and a dataset X with target values y. The model performs quite well on that data. But later new data arrives and you would like to know how the model performs on it. What you can do is take all the data and use a train-validation split: for example, 80% used for training (the "old" dataset) and 20% used for validation (the "new" dataset).
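A minimal sketch of such a split, assuming scikit-learn is available; X and y here are just random placeholder data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples with 3 features and binary targets.
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)

# 80% for training (the "old" data), 20% for validation (the "new" data);
# random_state fixes the shuffle so the split is reproducible.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```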
Steps to get model performance
- Extract the feature matrix Xtrain from the training dataset
- Take the target values ytrain from the training dataset
- Use Xtrain and ytrain to train g
- Extract Xv and yv from the validation dataset in the same way
- Apply g to Xv to get the predicted values: g(Xv) = ŷv
- Compare the predicted values ŷv with the actual values yv to measure model performance (see the sketch after the table)
| ŷv (prediction as likelihood) | ŷv (prediction) | yv (target) |
|---|---|---|
| 0.8 | 1 | 1 |
| 0.7 | 1 | 0 |
| 0.6 | 1 | 1 |
| 0.1 | 0 | 0 |
| 0.9 | 1 | 1 |
| 0.6 | 1 | 0 |

Here a likelihood of 0.5 or more counts as class 1.
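A small sketch of this comparison in plain NumPy, using the values from the table:

```python
import numpy as np

# Values from the table above.
y_prob = np.array([0.8, 0.7, 0.6, 0.1, 0.9, 0.6])  # g(Xv): predictions as likelihood
y_val = np.array([1, 0, 1, 0, 1, 0])                # yv: actual targets

# Threshold the likelihoods at 0.5 to get hard predictions: [1, 1, 1, 0, 1, 1]
y_pred = (y_prob >= 0.5).astype(int)

# Accuracy = fraction of predictions that match the targets.
accuracy = (y_pred == y_val).mean()
print(accuracy)  # 0.666... (4 of 6 correct)
```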
- Try different models and select the best one based on validation accuracy, for example (a code sketch follows the table):
| model | type | validation accuracy |
|---|---|---|
| g1 | linear regression | 66% |
| g2 | decision tree | 60% |
| g3 | random forest | 67% |
| g4 | neural network | 80% |
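A hedged sketch of such a comparison with scikit-learn; the dataset is synthetic, and since the targets are classes, logistic regression stands in for the linear model:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for Xtrain/ytrain and Xv/yv.
X, y = make_classification(n_samples=500, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate models g1..g4.
models = {
    "g1 logistic regression": LogisticRegression(max_iter=1000),
    "g2 decision tree": DecisionTreeClassifier(random_state=42),
    "g3 random forest": RandomForestClassifier(random_state=42),
    "g4 neural network": MLPClassifier(max_iter=1000, random_state=42),
}

# Train each model and report its validation accuracy.
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_val, y_val))
```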
Multiple comparison problem
The last table visualizes a problem that can arise when comparing different models on a single validation dataset: the winning model could simply have gotten lucky (like a coin flip) at predicting the validation data. When testing the models on a completely different dataset, the winner could be a different model.
Training-validation-test split
To guard against cases like this, use three datasets: training, validation, and test (60%-20%-20%). Hide the 20% for testing and perform the same steps as before. To select the best model based on the training and validation sets we use the same model selection process as described above. But now there is an additional step: to ensure that the winning model didn’t just get lucky on the validation dataset, we also apply it to the test dataset: g(XT) = ŷT.
| model | type | accVal | accTest |
|---|---|---|---|
| g1 | linear regression | 66% | |
| g2 | decision tree | 60% | |
| g3 | random forest | 67% | |
| g4 | neural network | 80% | 79% |
The test accuracy (79%) is close to the validation accuracy (80%), so we can conclude that g4 generalizes well and its win was not just luck.
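A sketch of the whole train-validation-test flow under the same assumptions as above (synthetic data, the neural network as winning model g4):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, random_state=42)

# Two successive splits produce 60% train, 20% validation, 20% test:
# first hide 20% for testing, then split the remaining 80% into 75/25
# (0.25 * 0.8 = 0.2 of the original data).
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

# Suppose g4 (the neural network) won the validation comparison.
best_model = MLPClassifier(max_iter=1000, random_state=42).fit(X_train, y_train)

# accVal and accTest should be close if g4 didn't just get lucky
# on the validation set.
print("accVal:", best_model.score(X_val, y_val))
print("accTest:", best_model.score(X_test, y_test))
```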
Summary
This process is called the model selection process and is one of the most important steps in Machine Learning.
The 6 steps:
1. Split the dataset (60%-20%-20%)
2. Train the model
3. Apply the model to the validation dataset
4. Repeat steps 2 and 3 a few times and select the best model
5. Apply the best model to the test dataset
6. Check that everything is good (compare the accuracy on the validation and test datasets)
Alternative Approach
To not waste the validation dataset, you can reuse it. As before, you train the candidate models on the training dataset, apply them to the validation dataset, and choose the best one. But then you combine the training and validation datasets into one larger dataset and retrain the selected model on both, before applying this new model to the test dataset. Training on the larger combined dataset can help improve the performance and generalization of the selected model.
Here are the steps for the alternative approach (a code sketch follows the list):
- Split the original dataset into training, validation, and test sets with a ratio of 60%-20%-20%.
- Train the initial models using the training dataset.
- Apply the initial models to the validation dataset and evaluate their performance.
- Select the best-performing model based on the validation results.
- Combine the training and validation datasets to create a new combined dataset.
- Retrain the selected model using the new combined dataset.
- Apply the newly trained model to the test dataset to assess its performance on unseen data.
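A sketch of these steps, again with synthetic data and assuming the neural network was the selected model:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# 60%-20%-20% split as before.
X, y = make_classification(n_samples=500, random_state=42)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

# After model selection, merge training and validation data
# into one larger combined dataset.
X_full = np.concatenate([X_train, X_val])
y_full = np.concatenate([y_train, y_val])

# Retrain the selected model on the combined data, then evaluate
# it once on the untouched test set.
final_model = MLPClassifier(max_iter=1000, random_state=42).fit(X_full, y_full)
print("accTest:", final_model.score(X_test, y_test))
```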
By training the model on a larger combined dataset, we can potentially capture more patterns and improve the model’s ability to generalize to new data. The final evaluation on the test dataset provides a more reliable measure of the model’s performance and gives us confidence in its ability to make accurate predictions.
It’s important to note that the alternative approach may not always yield better results compared to the original model selection process. Its effectiveness depends on the specific characteristics of the dataset and the performance of the initial models. Experimentation and careful evaluation are key to determining the most suitable approach for your machine learning task.
In summary, the model selection process is a crucial step in machine learning: it involves thoroughly assessing different models and selecting the one that performs best on unseen data. The alternative approach of combining the training and validation datasets can be an effective strategy to enhance model performance and generalization.