The aim was to select the best machine learning algorithm to predict whether an individual makes more than $50,000 a year, using a census income dataset. After initial considerations about variance and bias for several scikit-learn models, I selected three of them, trained and tested them, and assessed their performance based on F-score. GradientBoosting proved the best for this prediction task.
The dataset originates from the UCI Machine Learning Repository, where a full description of its features is available. The data I investigated included small changes, such as removing the ‘fnlwgt’ feature and dropping records with missing or ill-formatted entries. The resulting dataset comprised approximately 32,000 data points, each with 13 features. The target feature was ‘income’, which is either <=50K or >50K.
First, I explored the data, then transformed skewed continuous features and normalized the numerical features. I dedicated 80% of the data to training and 20% to testing.
I started my investigation with the supervised learning models available in scikit-learn.
Decision Trees are weak learners: deep trees tend to suffer from high variance, while shallow trees suffer from high bias. Addressing both variance and bias calls for ensemble methods such as Bagging, AdaBoost, Random Forest, and Gradient Boosting. I therefore tested bagging to mitigate the variance and boosting to mitigate the bias. For bagging, I chose Random Forest, which is an improvement over plain Bagging. For boosting, AdaBoost and GradientBoosting work on the weak learners in two different ways, so I tried both of them.
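In scikit-learn, the three candidates take only a few lines to set up (a sketch with default hyperparameters plus a fixed `random_state`; any tuning would come later):

```python
from sklearn.ensemble import (AdaBoostClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)

# Bagging-style ensemble, aimed at reducing variance.
rf_clf = RandomForestClassifier(random_state=42)

# Two boosting ensembles, aimed at reducing bias,
# each handling the weak learners differently.
ada_clf = AdaBoostClassifier(random_state=42)
gb_clf = GradientBoostingClassifier(random_state=42)

models = [rf_clf, ada_clf, gb_clf]
```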
Here is my assessment of the three chosen models based on their strengths, weaknesses and the reasons they were good candidates for the problem:
Random Forest
- Strengths: it performs well even with few observations and with thousands of input variables.
- Weaknesses: it performs poorly when the data includes categorical variables with different numbers of levels: the model is biased in favor of attributes with more levels. When the data contains groups of correlated features of similar relevance for the output, the model favors the smaller groups over the larger ones.
- What makes this model a good candidate for the problem: it performs well with few observations, and it helps mitigate the variance of a basic Decision Trees model.
AdaBoost
- Strengths: it performs well because it is resistant to overfitting.
- Weaknesses: it performs poorly with noisy data and outliers.
- What makes this model a good candidate for the problem: as a boosting method, it mitigates the high bias of a weak learner such as a basic Decision Trees model.
GradientBoosting
- Strengths: it performs well because it can take advantage of regularization methods, which penalize various parts of the algorithm and improve performance by reducing overfitting.
- Weaknesses: it can overfit a training dataset quickly, and training can take longer.
- What makes this model a good candidate for the problem: as a boosting model, it offers another opportunity to mitigate the bias. It is an alternative to AdaBoost because it manages the weak learners differently while remaining a boosting algorithm: AdaBoost identifies shortcomings through highly weighted data points, whereas GradientBoosting does so through gradients of a loss function.
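The comparison of the three models can be sketched as a single loop, here on synthetic data as a stand-in for the preprocessed census features (the choice of `beta=0.5`, which weights precision over recall, is an assumption; substitute whichever F-score variant you use):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import fbeta_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed census data (13 features).
X, y = make_classification(n_samples=2000, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = [RandomForestClassifier(random_state=42),
          AdaBoostClassifier(random_state=42),
          GradientBoostingClassifier(random_state=42)]

for model in models:
    model.fit(X_train, y_train)
    score = fbeta_score(y_test, model.predict(X_test), beta=0.5)
    print(f"{type(model).__name__}: F-score = {score:.4f}")
```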
Choosing the best model
Based on the models’ performance metrics above, GradientBoosting is the most appropriate model for the task of identifying individuals who make more than $50,000:
- It has the best F-score (0.7395) on the testing set, regardless of the training set size.
- It has the lowest prediction time of the three models tested: around 0.02 seconds when trained on 100% of the training set.
- It has the worst training time, especially on the full training set, but that is not an issue from a business point of view: training happens rarely, while prediction runs often. The tradeoff is a shorter prediction time, which GradientBoosting delivers. Regardless of the training set size, prediction time is a huge differentiator in favor of the GradientBoosting classifier.
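Training and prediction times can be measured with a simple wall-clock wrapper (a sketch using `time.perf_counter` on synthetic data, so the absolute numbers will differ from those reported above):

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for the census data.
X, y = make_classification(n_samples=2000, n_features=13, random_state=42)
clf = GradientBoostingClassifier(random_state=42)

# Time the (expensive, infrequent) training step.
start = time.perf_counter()
clf.fit(X, y)
train_time = time.perf_counter() - start

# Time the (cheap, frequent) prediction step.
start = time.perf_counter()
clf.predict(X)
predict_time = time.perf_counter() - start

# Prediction is typically orders of magnitude faster than training.
print(f"train: {train_time:.3f}s, predict: {predict_time:.3f}s")
```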
This analysis was part of my coursework related to Machine Learning. The source code, including the visualizations, is available in the finding_donors.ipynb notebook file.