Gradient boosting and XGBoost – Part 1/2
This time, we look at a different way of combining decision trees: models are trained sequentially, and each new model tries to correct the errors of the previous one. This way of combining models is known as boosting. We will focus on gradient boosting and use the XGBoost library, which implements the gradient boosted tree algorithm.
Gradient boosting vs. random forest
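In a random forest, multiple decision trees are trained independently of one another on the same dataset, with each tree typically seeing a different random sample of the rows and features. The final prediction is obtained by aggregating the predictions of the individual trees, typically by averaging them: (1/n) * Σ p_i, where p_i is the prediction of the i-th tree and n is the number of trees.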
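As a small illustration of the averaging step, here is a sketch in plain Python. The three lists of probabilities are invented for the example and do not come from a trained forest.

```python
# Hypothetical probabilities predicted by three independent trees
# for the same four clients (values invented for illustration).
tree_predictions = [
    [0.9, 0.2, 0.6, 0.1],  # tree 1
    [0.8, 0.3, 0.4, 0.2],  # tree 2
    [0.7, 0.1, 0.5, 0.0],  # tree 3
]

n_trees = len(tree_predictions)

# Average column-wise: one ensemble prediction per client, (1/n) * sum(p_i).
ensemble = [sum(column) / n_trees for column in zip(*tree_predictions)]
print([round(p, 2) for p in ensemble])
```

Each tree contributes equally and none of them depends on what the others predicted, which is the main contrast with boosting.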
Boosting, on the other hand, combines models in a different way. We start with the dataset and train the first model. It makes predictions, and we look at the errors it makes. We then train a second model on those errors; it produces its own predictions and, in the process, its own errors. A third model is trained to correct the errors of the second, and so on for as many iterations as we like. At the end, the predictions of all these models are combined into the final prediction.
The core idea behind boosting is the sequential training of multiple models, where each subsequent model corrects the mistakes of the previous one.
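The sequential idea can be sketched in a few lines of plain Python. This is a toy version under stated assumptions, not how XGBoost is implemented: it uses squared-error regression on a single feature, one-split trees ("stumps"), and invented data. For squared error, the errors (residuals) are the negative gradient of the loss, which is where the "gradient" in gradient boosting comes from. The names fit_stump, boost and mse are hypothetical helpers.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (a stump) to the residuals."""
    best = None
    # Try every split point except the largest x, so both sides are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds, learning_rate=0.3):
    """Train stumps sequentially; each one is fit to the current errors."""
    base = sum(ys) / len(ys)  # starting point: predict the mean
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # errors so far
        stump = fit_stump(xs, residuals)                # model those errors
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    # Final prediction: base value plus the scaled sum of all corrections.
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    """Mean squared error of a model on (xs, ys)."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Invented toy data: one feature, a roughly step-shaped target.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8, 5.2, 5.0]

mse_0 = mse(boost(xs, ys, n_rounds=0), xs, ys)
mse_1 = mse(boost(xs, ys, n_rounds=1), xs, ys)
mse_20 = mse(boost(xs, ys, n_rounds=20), xs, ys)
print(mse_0, mse_1, mse_20)
```

The training error shrinks as rounds are added, because each stump removes part of what the previous ones got wrong. The learning_rate argument plays the role that eta plays in XGBoost: it scales down each correction.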
Installing XGBoost
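XGBoost is a library known for its highly effective implementation of gradient boosting. The leading ! below runs pip from inside a Jupyter notebook cell; drop it when installing from a terminal.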
!pip install xgboost
import xgboost as xgb
Training the first model
The first step is to wrap the training data in a DMatrix, XGBoost's own data structure. This format is optimized for training XGBoost models and makes training faster. The validation data needs the same treatment, because the model's predict function also expects a DMatrix.
XGBoost Parameters – Some of the most crucial parameters include:
- eta: This is the learning rate. The contribution of each new tree is scaled by eta, so smaller values make each tree correct a smaller part of the remaining error and the model learns more slowly.
- max_depth: As in random forests and decision trees, ‘max_depth’ sets the maximum depth of each tree.
- min_child_weight: This is the minimum sum of instance weights required in a child node. It plays a role similar to ‘min_samples_leaf’ in decision trees: larger values prevent the trees from creating very small leaves.
- objective: Since we have a binary classification task, where we aim to classify clients into ‘defaulting’ or ‘non-defaulting,’ we need to specify the ‘objective.’ There are various objectives available for different types of problems, including regression and classification.
- nthread: XGBoost has the capability to parallelize training, and here, we specify how many threads to utilize.
- seed: This sets the random seed, which makes the results reproducible.
- verbosity: This controls how many messages are printed during training: 0 (silent), 1 (warnings), 2 (info), 3 (debug).
xgb_params = {
    'eta': 0.3,
    'max_depth': 6,
    'min_child_weight': 1,
    'objective': 'binary:logistic',
    'nthread': 8,
    'seed': 1,
    'verbosity': 1,
}
model = xgb.train(xgb_params, dtrain, num_boost_round=10)
Now we’re ready to test the model. To do this, we use the predict function of the XGBoost model. With the ‘binary:logistic’ objective, it returns a one-dimensional array containing the predicted probability of the positive class for each row of the validation DMatrix.
y_pred = model.predict(dval)
Computing the AUC, we see that this model reaches almost 81%. That is a good result given that we haven’t done any parameter tuning: the values of eta, max_depth and min_child_weight used above are simply XGBoost’s defaults. We still need to be careful with the number of trees and their size, because XGBoost models can overfit as well, a topic we’ll explore in more depth later on. In this case, the performance with num_boost_round=10 is comparable to that with num_boost_round=200.
from sklearn.metrics import roc_auc_score
roc_auc_score(y_val, y_pred)
# Output: 0.8065256351262986
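To make the AUC number easier to interpret, here is a small plain-Python sketch of what it measures: the share of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The helper pairwise_auc and the four example scores are illustrative only; in practice roc_auc_score does this for us far more efficiently.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))


# Invented toy example: two negatives, two positives.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = pairwise_auc(y_true, y_score)
print(auc)  # 0.75: three of the four pairs are ordered correctly
```

Read this way, an AUC of about 0.81 means that for a randomly chosen defaulting client and a randomly chosen non-defaulting client, the model gives the defaulting one the higher score roughly 81% of the time.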