Random Forest Classifier or Regressor - Which to Choose?
A basic comparison of predictions from an off-the-shelf sklearn classifier versus a regressor, with different criteria for splitting the decision trees. The data is taken from the Homesite Competition on Kaggle.
- Setup Working Environment
- Kaggle Dataset
- Minimal Data Exploration
- Sample Training and Validation Sets from Train DF
- Random Forest Classifier
- Create Random Forest Regressor
- Test Set Predictions
This notebook compares two basic Random Forest models, each with two different criteria for splitting decision trees. Distributions of predictions are plotted at the end. Rather than provide any concrete conclusions, I leave this here as food for thought.
import os
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt
from fastai.tabular.all import *  # add_datepart, cont_cat_split, TabularPandas, etc.
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import roc_auc_score, plot_confusion_matrix
from google.colab import drive

drive.mount('/content/drive')
os.chdir('/content/drive/MyDrive/Colab Notebooks/GroupProject')  # must match the mount point above
path = Path.cwd()
path
We should now have all the functions we require to run the notebook. The next thing we need is our dataset...
I have previously downloaded the dataset from Kaggle and extracted the files. My teammate Nissan has already described how to download Kaggle data. So let's check the files listed in the directory:
path.ls()
train = pd.read_csv(path/'_data/train.csv',low_memory=False)
test = pd.read_csv(path/'_data/test.csv',low_memory=False)
I'm only presenting the basic steps here; my teammates have provided other notebooks with more detailed EDA.
train.shape,train.columns
train.shape tells us that there are 260,753 rows, and 299 columns of data in the training set. To look at the column names specifically we can use train.columns, which also confirms there are 299 columns.
test.shape,test.columns
test.shape tells us that there are fewer records in the test set than in the training set (260,753 vs. 173,753), and one fewer column (299 vs. 298). The missing column is our dependent variable, "QuoteConversion_Flag".
The output from .columns also shows that we have a variable containing dates in the second column, "Original_Quote_Date". Numerical coding of the date, although easy to read, hides potentially valuable information: which day of the week was it? Was it a holiday? Is this date closer to the start or end of the calendar/financial year? This information could influence what we are trying to predict. Thankfully, fast.ai provides a useful function, add_datepart, that takes the date column and generates extra columns holding this sort of information. Let's give it a go and have a look at the output.
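The idea behind this kind of date expansion can be sketched with the standard library (a minimal stand-in for illustration, not fast.ai's actual implementation; the feature subset shown here is just an example):

```python
import calendar
from datetime import date

def datepart(d: date) -> dict:
    """Toy stand-in for fastai's add_datepart: expand one date into
    several features a decision tree can split on (illustrative subset)."""
    return {
        'Year': d.year,
        'Month': d.month,
        'Day': d.day,
        'Dayofweek': d.weekday(),                 # 0 = Monday ... 6 = Sunday
        'Dayofyear': d.timetuple().tm_yday,
        'Is_month_end': d.day == calendar.monthrange(d.year, d.month)[1],
    }

print(datepart(date(2014, 8, 16)))
```

Each derived column turns an opaque timestamp into something a tree can compare against a threshold (e.g. "is Dayofweek >= 5?" captures weekends).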
train = add_datepart(train, 'Original_Quote_Date')
train.columns
test = add_datepart(test, 'Original_Quote_Date') #Run once
test.columns
By calling .columns once again, we can see we now have 311 (training) and 310 (test) columns with info about the day of week, day of year, holiday, etc.
train['QuoteConversion_Flag'].unique()
train['QuoteConversion_Flag'].describe()
dep_var = 'QuoteConversion_Flag'
Before we run our models, we need to split our training dataframe into a training set and a validation set. The validation set will not be passed to the model for training, and will give us metrics for how well our model generalises to 'unseen' data. For a first attempt, let's randomly assign 80% of the records to the training set and 20% to the validation set.
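The split described above can be sketched in plain Python (a minimal illustration of the idea, not fastai's RandomSplitter itself):

```python
import random

def random_split(n_rows: int, valid_pct: float = 0.2, seed: int = 42):
    """Shuffle the row indices and carve off valid_pct of them as the
    validation set; the rest become the training set."""
    rng = random.Random(seed)
    idxs = list(range(n_rows))
    rng.shuffle(idxs)
    cut = int(n_rows * valid_pct)
    return idxs[cut:], idxs[:cut]   # (train_idxs, valid_idxs)

train_idx, valid_idx = random_split(260_753)
print(len(train_idx), len(valid_idx))  # -> 208603 52150
```

Fixing the seed makes the split reproducible, so re-running the notebook trains on the same 80% each time.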
cont,cat = cont_cat_split(train, 1, dep_var=dep_var) #Specify continuous & categorical columns in the dataset
#cont,cat,dep_var
procs = [Categorify,FillMissing]
splits = RandomSplitter(valid_pct=0.2, seed=42)(range_of(train)) #RandomSplitter takes its own seed; random.seed would not affect it
to = TabularPandas(train,procs,cat,cont,y_names=dep_var,splits=splits)
len(to.train),len(to.valid)
xs,y = to.train.xs,to.train.y
valid_xs,valid_y = to.valid.xs,to.valid.y
Looking at the dependent variable, "QuoteConversion_Flag", we can see that this is a binary outcome, i.e. 0 or 1. The obvious choice when using decision trees with binary data is a Random Forest Classifier: each tree votes, and the most popular class is chosen as the final result.
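The voting scheme can be sketched as follows (a toy illustration of the idea, not sklearn's internals):

```python
from collections import Counter

def classifier_vote(tree_preds):
    """Sketch of how a Random Forest Classifier combines its trees:
    each tree casts one vote and the most common class wins."""
    return Counter(tree_preds).most_common(1)[0][0]

# Five hypothetical trees: three vote 0, two vote 1 -> majority class is 0
print(classifier_vote([0, 0, 1, 0, 1]))  # -> 0
```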
rfclass_gini = RandomForestClassifier(n_jobs=-1, random_state=42, criterion = "gini",oob_score=False).fit(xs, y)
train_classifier_gini_predictions = rfclass_gini.predict(xs)
valid_classifier_gini_predictions = rfclass_gini.predict(valid_xs)
The output from the classifier is a set of binary predictions. See below:
plt.hist(valid_classifier_gini_predictions, bins=10)
To get the probabilities of each outcome, i.e. whether a customer purchases insurance (1) or not (0), we need to call a different function:
rfclass_gini.predict_proba(valid_xs)
plt.hist(rfclass_gini.predict_proba(valid_xs), bins=10)
We can see the output is a probability for each class, so in this case we have two probability estimates for each record. However, the Kaggle competition scores entries on the area under the ROC curve (AUC), so let's see how the classifier performs on this metric.
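For intuition, ROC AUC has a rank interpretation: it is the probability that a randomly chosen positive record is scored above a randomly chosen negative one (ties count half). A toy stand-in for sklearn's roc_auc_score, written from that definition:

```python
def roc_auc(y_true, y_score):
    """ROC AUC via its rank interpretation: fraction of (positive, negative)
    pairs where the positive gets the higher score (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # -> 0.75
```

This O(n²) version is only for intuition; sklearn computes the same quantity efficiently from the ROC curve.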
train_classifier_auc = roc_auc_score(y,train_classifier_gini_predictions)
valid_classifier_auc = roc_auc_score(valid_y,valid_classifier_gini_predictions)
train_classifier_auc,valid_classifier_auc
For a first model, without any feature selection or data cleaning, an ROC AUC of 0.81 on the validation set isn't bad. A no-skill model that predicts the same class for every record produces an ROC AUC of 0.5. The code below shows this:
random_probs = [0 for _ in range(len(valid_y))]
random_lr_auc = roc_auc_score(valid_y, random_probs)
random_lr_auc
We can see that the random forest classifier is doing a lot better than this baseline, which is encouraging.
Another way of visualising the results of a classification task is a confusion matrix, which plots the counts of predicted vs. true classes. This classifier seems to do a relatively good job of predicting records for customers who did not purchase home insurance (True = 0), but struggles a lot more with records for customers who did go on to purchase home insurance (True = 1). Another observation is that there is a skew in the number of records towards customers who did NOT go on to purchase home insurance.
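Behind the plot, a confusion matrix is just a table of counts; a minimal sketch with hypothetical predictions:

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Counts of predicted vs true classes for a binary task.
    Rows = true label (0, 1); columns = predicted label (0, 1)."""
    m = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

# Hypothetical labels, skewed toward class 0 like the Homesite data
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_pred = [0, 0, 0, 1, 1, 0, 0, 1]
print(confusion_matrix_2x2(y_true, y_pred))  # -> [[4, 1], [1, 2]]
```

The diagonal holds correct predictions; the off-diagonal cells are the two kinds of error (false positives and false negatives).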
plot_confusion_matrix(rfclass_gini,valid_xs, valid_y,values_format='d')
plt.show()
The Random Forest Classifier has a second option for the criterion used to split the decision trees. Let's try the same process with the 'entropy' criterion and compare the results.
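For reference, the two criteria measure node impurity differently: Gini impurity is 1 - Σ pᵢ², while entropy is -Σ pᵢ log₂ pᵢ. A quick sketch of both:

```python
import math

def gini(p):
    """Gini impurity for class probabilities p: 1 - sum(p_i^2)."""
    return 1 - sum(pi ** 2 for pi in p)

def entropy(p):
    """Shannon entropy in bits: -sum(p_i * log2(p_i)), skipping p_i = 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Both are zero for a pure node and maximal for a 50/50 split
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # -> 0.5 1.0
```

Both measures rank candidate splits very similarly in practice, which is consistent with the small differences seen below.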
rfclass_entropy = RandomForestClassifier(n_jobs=-1, random_state=42, criterion = "entropy",oob_score=False).fit(xs, y)
train_classifier_entropy_predictions = rfclass_entropy.predict(xs)
valid_classifier_entropy_predictions = rfclass_entropy.predict(valid_xs)
roc_auc_score(y,train_classifier_entropy_predictions),roc_auc_score(valid_y,valid_classifier_entropy_predictions)
plot_confusion_matrix(rfclass_entropy,valid_xs, valid_y,values_format='d')
plt.show()
Comparing the confusion matrices gives us an indication of how the different criterion influenced the predictions from the classifiers.
The 'entropy' criterion looks as though it did very slightly better at classifying records in which customers did not purchase home insurance (True Label = 0), and slightly worse than 'gini impurity' for records in which customers did go on to purchase home insurance (True Label = 1).
Given the task is to predict the probability that a customer went on to purchase home insurance, using the 'gini impurity' criterion seems more appropriate. That said, with such a small difference between the two criteria, the choice is likely to have very little influence on the results of the model overall. Time spent investigating the effect of removing redundant or highly correlated variables may be more fruitful.
Although I said a Random Forest Classifier is the most obvious choice for binary data, we could try something else. The Kaggle competition task is to predict the probability that a customer purchased home insurance, so why not use a method that directly outputs probabilities?
In short:
A classifier takes the most common prediction from all of the decision trees. So if we run 100 trees, and 51 predict that the customer purchased home insurance (1) while 49 predict that they did not (0), the overall prediction is that the customer purchased home insurance (1). Since we just want a probability for this task, not a classification, why force the model to classify? This could have a particularly big impact in intermediate cases like the one above.
Why not try a regressor? A Random Forest Regressor takes the average of the predictions from all of the decision trees, so in the example above the overall prediction would be 0.51. This prediction can be passed directly to Kaggle to compute the ROC AUC score. This more conservative approach could provide improvements for the intermediate cases, so let's try it!
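The two aggregation rules side by side (a toy sketch of the idea, not sklearn's internals):

```python
def classifier_prediction(tree_preds):
    """Majority vote: the class predicted by more than half the trees."""
    return int(sum(tree_preds) > len(tree_preds) / 2)

def regressor_prediction(tree_preds):
    """Mean of the per-tree predictions: already a probability-like score."""
    return sum(tree_preds) / len(tree_preds)

# The borderline case from the text: 51 of 100 trees say "purchased" (1)
trees = [1] * 51 + [0] * 49
print(classifier_prediction(trees))  # -> 1     (the 51/49 nuance is discarded)
print(regressor_prediction(trees))   # -> 0.51  (usable directly for AUC)
```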
rfregress_mse = RandomForestRegressor(n_jobs=-1,random_state=42,criterion="mse",oob_score=False).fit(xs, y)
train_rfregress_mse_predictions = rfregress_mse.predict(xs)
valid_rfregress_mse_predictions = rfregress_mse.predict(valid_xs)
train_lr_auc = roc_auc_score(y, train_rfregress_mse_predictions)
valid_lr_auc = roc_auc_score(valid_y, valid_rfregress_mse_predictions)
train_lr_auc,valid_lr_auc
With the regressor, using default parameters and splitting trees based on the mean squared error ('mse') criterion, we get over a 10% increase in ROC AUC on the validation set! (Caveat: this difference may shrink after more detailed EDA and QC.)
Let's re-check the classifier's ROC AUC to be sure:
train_classifier_auc,valid_classifier_auc
I will leave investigating the reasons for these differences to another notebook; that could involve writing our own classifiers and regressors so we can compare them with the same or different loss functions. But to show the differences in the outputs of the two models, histograms of the predictions from the Random Forest Classifier and the Random Forest Regressor are shown below:
plt.hist(rfclass_gini.predict_proba(valid_xs)[:,1], bins=100,range=[0, 1])
plt.hist(valid_rfregress_mse_predictions, bins=100,range=[0, 1])
cont,cat = cont_cat_split(test, 1) #Specify continuous & categorical columns, using the same max_card as for the training set
procs = [Categorify,FillMissing]
to.test = TabularPandas(test,procs,cat,cont,splits=None)
to.test.show()
test_xs = to.test.xs
test_classifier_gini_predictions = rfclass_gini.predict_proba(test_xs)
#Probability of Insurance Policy Purchased - Regressor
test_rfregress_mse_predictions = rfregress_mse.predict(test_xs)
classifier_submission = pd.DataFrame(zip(test.QuoteNumber,test_classifier_gini_predictions[:,1]), columns = ['QuoteNumber','QuoteConversion_Flag'])
classifier_submission.to_csv(path/'classifier_submission.csv',index=False)
regressor_submission = pd.DataFrame(zip(test.QuoteNumber,test_rfregress_mse_predictions), columns = ['QuoteNumber','QuoteConversion_Flag'])
regressor_submission.to_csv(path/'regressor_submission.csv',index=False)