Optimizing Gradient Boosting Models

Gradient Boosting Models

Gradient boosting classifier models are a powerful type of machine learning algorithm that outperform many other types of classifiers. In simplest terms, gradient boosting algorithms learn from the mistakes they make by optmizing on gradient descent. A gradient boosting model values the gradient descent, or the direction of the steepest increase of a function, to make adjustments so that the function can increase rapidly over each iteration. Gradient boosting models can be used for classfication or regression. This post will concern itself with gradient boosting classification. In particular, we will study and optimize three gradient boosting models, AdaBoost (sklearn), XGBoost, and Gradient Boosting Classifier (sklearn). We will study these classifiers with the Kaggle data set, Kaggle: Diabetes Health Indicators Dataset. In particular, we will use the binary data set. I will not be doing any cleansing or engineering of the data. I am strictly experimenting with the effects of the learning rate and number of estimators hyperparameters on gradient boosting model performance.

Learning Rate and Number of Estimators

The two hyperparameters I am concerning myself with are learning_rate and n_estimators. The learning rate shrinks the contribution of each tree by the number provided. This means that each iterative estimation the model provides will carry a lower rate the lower you make the learning rate. I will test values between 0.0001 and 1.

The n_estimators parameter is the number of boosting stages or number of trees in the boosting ensemble. I will test values between 100 and 100000. The higher the number of estimators, the more granular the model can be, but it also increases the chances of overfitting. I want to find the balance between complexity and accuracy. Gradient boosting models have a number of benefits, including handle missing data well, and being very robust against overfitting.

The Code

I am using a simple code structure to treat each of the three models in the same way: train and fit and print the Accuracy and F1-Score for each trial of the learning rate and n_estimators combinations. I will keep track of the best performing models in a dictionary to compare later. Please see the code here.

The Result

Comparing the output results for Accuracy and F1-Score, it appears that the XGBoost model had both the best and worst scores given the constraints. With a learning_rate = 1.0 and n_estimators = 100000, the accuracy = 0.6659 and the f1 score = 0.6709. With a learning_rate = 0.0001 and n_estimators = 100000, the accuracy = 0.7441 and the f1 score = 0.7588. Given that the default values for the XGBoost model are learning_rate = 0.1, and n_estimators = 100, it seems clear that we can squeeze some extra performance gains from gradient boosting momdels by optmizing the learning rate and number of estimators. It is also worth noting that these gains from the default settings are minimal without additional feature engineering or model tuning. However, we set out to see the possible performance gains of tweaking only these two hyperparamters on a standard data set. We have shown that suboptimal hyperparameters can certainly negatively affect the performance of gradient boosting models. I hope this analysis provided some useful insights into optimizing gradient boosting models.

What’s going on right now

What I’m building right now:

🚧 Machine learning optmization frameworks.

What I’m drinking right now:

🍺 Devil’s Backbone Grapefruit Smash

What I’m reading right now:

📚 All The Pretty Horses by Cormac McCarthy

What I’m watching right now:

📺 Frasier

What I’m playing right now:

🎮 Diablo IV

What I’m learning right now:

🎓 🎸 Learning Never Let You Go by Third Eye Blind.
🎓 Optmizing machine learning models.