This notebook compares two basic Random Forest models (a classifier and a regressor), including two different criteria for splitting the classifier's decision trees. Distributions of the predictions are plotted at the end. Rather than provide any concrete conclusions, I leave this here as food for thought.

Setup Working Environment
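The cells in this section assume an earlier setup cell (not shown here) has already imported the libraries used throughout the notebook. Purely as a sketch, under that assumption, such a cell might look like this:

# Assumed setup cell (a sketch, not the original): the libraries the rest of
# the notebook appears to rely on.
import os
from pathlib import Path

import pandas as pd
import matplotlib.pyplot as plt

# fastai tabular exports Path.ls, add_datepart, cont_cat_split, Categorify,
# FillMissing, RandomSplitter, range_of and TabularPandas used below.
from fastai.tabular.all import *

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import roc_auc_score, plot_confusion_matrix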

from google.colab import drive
drive.mount('/content/gdrive')
Mounted at /content/gdrive
Path.cwd()
Path('/content')
os.chdir('/content/gdrive/MyDrive/Colab Notebooks/GroupProject')
path = Path.cwd()
path
Path('/content/gdrive/MyDrive/Colab Notebooks/GroupProject')

We should now have all the functions we require to run the notebook. The next thing we need is our dataset...

Kaggle Dataset

I have previously downloaded the dataset from Kaggle and extracted the files. My teammate Nissan has already described how to download Kaggle data. So let's check the files listed in the directory:

path.ls()
(#1) [Path('/content/gdrive/MyDrive/Colab Notebooks/GroupProject/_data')]
train = pd.read_csv(path/'_data/train.csv',low_memory=False)
test = pd.read_csv(path/'_data/test.csv',low_memory=False)

Minimal Data Exploration

Only basic steps are presented here; my teammates have provided other notebooks with more detailed EDA.

train.shape,train.columns
((260753, 299),
 Index(['QuoteNumber', 'Original_Quote_Date', 'QuoteConversion_Flag', 'Field6',
        'Field7', 'Field8', 'Field9', 'Field10', 'Field11', 'Field12',
        ...
        'GeographicField59A', 'GeographicField59B', 'GeographicField60A',
        'GeographicField60B', 'GeographicField61A', 'GeographicField61B',
        'GeographicField62A', 'GeographicField62B', 'GeographicField63',
        'GeographicField64'],
       dtype='object', length=299))

train.shape tells us that there are 260,753 rows, and 299 columns of data in the training set. To look at the column names specifically we can use train.columns, which also confirms there are 299 columns.

test.shape,test.columns
((173836, 298),
 Index(['QuoteNumber', 'Original_Quote_Date', 'Field6', 'Field7', 'Field8',
        'Field9', 'Field10', 'Field11', 'Field12', 'CoverageField1A',
        ...
        'GeographicField59A', 'GeographicField59B', 'GeographicField60A',
        'GeographicField60B', 'GeographicField61A', 'GeographicField61B',
        'GeographicField62A', 'GeographicField62B', 'GeographicField63',
        'GeographicField64'],
       dtype='object', length=298))

test.shape tells us that there are fewer records in the test set than in the training set (173,836 vs. 260,753), and one fewer column (298 vs. 299). The missing column is our dependent variable, "QuoteConversion_Flag".

The output from .columns also shows that we have a variable containing dates in the second column, "Original_Quote_Date". A single encoded date, although easy to read, hides potentially valuable information: which day of the week was it? Was it a holiday? Is this date closer to the start or end of the calendar/financial year? This information could have an influence on what we are trying to predict. Thankfully, fast.ai provides a useful function, add_datepart, that takes the date column and generates extra columns holding this sort of information. Let's give it a go and have a look at the output.

train = add_datepart(train, 'Original_Quote_Date')
train.columns
Index(['QuoteNumber', 'QuoteConversion_Flag', 'Field6', 'Field7', 'Field8',
       'Field9', 'Field10', 'Field11', 'Field12', 'CoverageField1A',
       ...
       'Original_Quote_Day', 'Original_Quote_Dayofweek',
       'Original_Quote_Dayofyear', 'Original_Quote_Is_month_end',
       'Original_Quote_Is_month_start', 'Original_Quote_Is_quarter_end',
       'Original_Quote_Is_quarter_start', 'Original_Quote_Is_year_end',
       'Original_Quote_Is_year_start', 'Original_Quote_Elapsed'],
      dtype='object', length=311)
test = add_datepart(test, 'Original_Quote_Date') #Run once
test.columns
Index(['QuoteNumber', 'Field6', 'Field7', 'Field8', 'Field9', 'Field10',
       'Field11', 'Field12', 'CoverageField1A', 'CoverageField1B',
       ...
       'Original_Quote_Day', 'Original_Quote_Dayofweek',
       'Original_Quote_Dayofyear', 'Original_Quote_Is_month_end',
       'Original_Quote_Is_month_start', 'Original_Quote_Is_quarter_end',
       'Original_Quote_Is_quarter_start', 'Original_Quote_Is_year_end',
       'Original_Quote_Is_year_start', 'Original_Quote_Elapsed'],
      dtype='object', length=310)

By calling .columns once again, we can see we now have 311 (training) and 310 (test) columns, with new columns such as day of week, day of year, month/quarter/year start and end flags, and elapsed time.

train['QuoteConversion_Flag'].unique()
train['QuoteConversion_Flag'].describe()
dep_var = 'QuoteConversion_Flag'

Sample Training and Validation Sets from Train DF

Before we run our models, we need to split our training dataframe into a training set and a validation set. The validation set will not be passed to the model for training, and will provide metrics for how well our model generalises to 'unseen' data. For a first attempt, let's randomly assign 80% of our records to the training set and 20% to the validation set.

dep_var = 'QuoteConversion_Flag'
cont,cat = cont_cat_split(train, 1, dep_var=dep_var) #Specify Continuous & Categorical Columns in the DataSet (max_card=1)
#cont,cat,dep_var
procs = [Categorify,FillMissing]
splits = RandomSplitter(valid_pct=0.2, seed=42)(range_of(train)) #Seed the splitter directly so the split is reproducible
to = TabularPandas(train,procs,cat,cont,y_names=dep_var,splits=splits)
len(to.train),len(to.valid)
(208603, 52150)
xs,y = to.train.xs,to.train.y
valid_xs,valid_y = to.valid.xs,to.valid.y

Random Forest Classifier

Looking at the dependent variable, "QuoteConversion_Flag", we can see that this is a binary outcome, i.e. 0 or 1. The obvious choice when using decision trees with binary data is a Random Forest Classifier. In the classifier, each tree votes and the most popular class is chosen as the final result.

Split Trees by Gini Impurity

rfclass_gini = RandomForestClassifier(n_jobs=-1, random_state=42, criterion = "gini",oob_score=False).fit(xs, y)
train_classifier_gini_predictions = rfclass_gini.predict(xs)
valid_classifier_gini_predictions = rfclass_gini.predict(valid_xs)

The output from the Classifier is a set of binary predictions, as shown below:

plt.hist(valid_classifier_gini_predictions, bins=10)
(array([44760.,     0.,     0.,     0.,     0.,     0.,     0.,     0.,     0.,  7390.]),
 array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
 <a list of 10 Patch objects>)

To get the probability of each outcome, i.e. whether a customer purchases insurance (1) or not (0), we need to call a different function:

rfclass_gini.predict_proba(valid_xs)
plt.hist(rfclass_gini.predict_proba(valid_xs), bins=10)
(array([[  974.,  1627.,  1470.,  1362.,  1957.,  2984.,  3141.,  3236.,  5677., 29722.],
        [28868.,  6131.,  3959.,  2818.,  2766.,  2175.,  1489.,  1214.,  1596.,  1134.]]),
 array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. ]),
 <a list of 2 Lists of Patches objects>)

We can see the output is a probability for each class, so in this case we have two probability estimates for each record. However, the Kaggle competition scores entries on the area under the ROC curve (ROC AUC). Therefore, let's see how the classifier performs on this metric.

train_classifier_auc = roc_auc_score(y,train_classifier_gini_predictions)
valid_classifier_auc = roc_auc_score(valid_y,valid_classifier_gini_predictions)
train_classifier_auc,valid_classifier_auc
(1.0, 0.8121955713748932)

For a first model, without any feature selection or data cleaning, an ROC AUC of 0.81 on the validation set isn't bad. A model predicting outcomes randomly would produce an ROC AUC of 0.5. The code below shows this:

random_probs = [0 for _ in range(len(valid_y))] #A constant (no-skill) prediction for every record
random_lr_auc = roc_auc_score(valid_y, random_probs)
random_lr_auc
0.5

We can see that the random forest classifier is doing a lot better than a model making random predictions, which is encouraging.

Another way of visualising the results of a classification task is a confusion matrix, which plots the counts of predicted vs. true classes. This Classifier does a relatively good job of predicting records for customers who did not purchase home insurance (True label = 0), but struggles much more with records for customers who did go on to purchase home insurance (True label = 1). Another observation is that there is a skew in the number of records towards customers who did NOT go on to purchase home insurance.

plot_confusion_matrix(rfclass_gini,valid_xs, valid_y,values_format='d')
plt.show()

Split Trees by Entropy

The Random Forest Classifier has a second option for the criterion used to split the decision trees. Let's try the same process with the 'entropy' criterion and compare the results.

rfclass_entropy = RandomForestClassifier(n_jobs=-1, random_state=42, criterion = "entropy",oob_score=False).fit(xs, y)
train_classifier_entropy_predictions = rfclass_entropy.predict(xs)
valid_classifier_entropy_predictions = rfclass_entropy.predict(valid_xs)
roc_auc_score(y,train_classifier_entropy_predictions),roc_auc_score(valid_y,valid_classifier_entropy_predictions)
(0.9999743412105817, 0.8091649897936626)
plot_confusion_matrix(rfclass_entropy,valid_xs, valid_y,values_format='d')
plt.show()

Comparing the confusion matrices gives us an indication of how the different criteria influenced the predictions from the classifiers.

The 'entropy' criterion looks as though it did very slightly better at classifying records in which customers did not purchase home insurance (True label = 0), and slightly worse than 'gini impurity' for records in which customers did go on to purchase home insurance (True label = 1).

Given the task is to predict the probability that a customer went on to purchase home insurance, using the 'gini impurity' criterion seems more appropriate. Although, given such a small difference, the choice between these two criteria is likely to have very little influence on the results of the model overall. Therefore, time spent investigating the effect of removing redundant or highly correlated variables may be more fruitful.
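For reference, one way such an investigation could start (a sketch only, not run here) is to rank the columns by the fitted gini forest's impurity-based importances and look for near-zero contributors:

# Sketch: rank features of the already-fitted gini forest by importance.
# Low-importance columns are candidates for removal in a later notebook.
fi = pd.DataFrame({'feature': xs.columns, 'importance': rfclass_gini.feature_importances_})
fi.sort_values('importance', ascending=False).head(20)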

Create Random Forest Regressor

Although I said a Random Forest Classifier is the most obvious choice for binary data, we could try something else. The Kaggle competition task is to predict the probability that a customer purchased home insurance, so why not use a method that directly outputs probabilities?

In short:

A classifier will take the most common prediction from all of the decision trees. So if we run 100 trees, and 51 predict that the customer purchased home insurance (1) while 49 predict that the customer did NOT (0), the overall prediction is that the customer purchased home insurance (1). Since we just want a probability, not a classification, why force the model to make a classification? This could have a particularly big impact in intermediate cases like the one above.

Why not try a regressor? A Random Forest Regressor takes the average of the predictions from all of the decision trees, so in the example above the overall prediction would be 0.51. This prediction could be submitted directly to Kaggle to compute the ROC AUC score. Using this more conservative approach could provide improvements for these intermediate cases. A toy sketch of the vote-vs-average distinction is shown below; then let's try it!
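To make the distinction concrete, here is a toy sketch with made-up numbers (100 hypothetical trees, nothing to do with the fitted models in this notebook):

import numpy as np

# 100 hypothetical per-tree predictions: 51 trees say 'purchased' (1), 49 say 'not purchased' (0)
tree_preds = np.array([1]*51 + [0]*49)
majority_vote = int(tree_preds.mean() > 0.5)  # classifier-style output -> 1
average_score = tree_preds.mean()             # regressor-style output -> 0.51
majority_vote, average_score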

rfregress_mse = RandomForestRegressor(n_jobs=-1,random_state=42,criterion="mse",oob_score=False).fit(xs, y)
train_rfregress_mse_predictions = rfregress_mse.predict(xs)
valid_rfregress_mse_predictions = rfregress_mse.predict(valid_xs)
train_lr_auc = roc_auc_score(y, train_rfregress_mse_predictions)
valid_lr_auc = roc_auc_score(valid_y, valid_rfregress_mse_predictions)
train_lr_auc,valid_lr_auc
(1.0, 0.9587383322397882)

With the Regressor, using default parameters and splitting trees based on the mean squared error (mse), we get over a 10% increase in the ROC AUC for the validation set! (Caveat: this difference may shrink after more detailed EDA and QC.)

Let's re-check the Classifier ROC AUC to be sure:

train_classifier_auc,valid_classifier_auc
(1.0, 0.8121955713748932)
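One caveat when comparing these two numbers: the classifier's AUC above was computed from its hard 0/1 predictions, whereas the regressor's AUC was computed from continuous scores. Since roc_auc_score also accepts scores, a like-for-like check (a sketch only, result not shown here) would be to score the classifier's class-1 probabilities directly:

# Sketch: score the classifier on its predicted probability of class 1
# rather than on its hard 0/1 labels.
valid_classifier_proba = rfclass_gini.predict_proba(valid_xs)[:, 1]
roc_auc_score(valid_y, valid_classifier_proba)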

I will leave investigating the reasons for the differences for another notebook; that could involve writing our own classifiers and regressors so we can compare them with the same or different loss functions. But to show the differences in the outputs of the two models, histograms of the predictions from the Random Forest Classifier and the Random Forest Regressor are shown below:

Classifier

plt.hist(rfclass_gini.predict_proba(valid_xs)[:,1], bins=100,range=[0, 1])
(array([12171.,  4645.,  2879.,  2036.,  1653.,  1305.,  1154.,  1094.,  1000.,   931.,   854.,   756.,   733.,   667.,   597.,   587.,   532.,   452.,   496.,   457.,   400.,   399.,   360.,   398.,
          371.,   352.,   323.,   347.,   340.,   346.,   323.,   326.,   319.,   317.,   612.,     0.,   302.,   311.,   317.,   314.,   634.,     0.,   282.,   271.,   263.,   276.,   532.,     0.,
          257.,   251.,   218.,   227.,   229.,   212.,   203.,   181.,   396.,     0.,   187.,   175.,   147.,   184.,   159.,   147.,   159.,   167.,   129.,   131.,   286.,   127.,     0.,   136.,
          140.,   125.,   115.,   128.,   128.,   130.,   154.,   158.,   129.,   325.,   157.,     0.,   147.,   175.,   168.,   158.,   156.,   181.,   160.,   178.,   149.,   302.,   106.,     0.,
           93.,    73.,    35.,    38.]),
 array([0.  , 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 , 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 , 0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3 , 0.31,
        0.32, 0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4 , 0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5 , 0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59, 0.6 , 0.61, 0.62, 0.63,
        0.64, 0.65, 0.66, 0.67, 0.68, 0.69, 0.7 , 0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77, 0.78, 0.79, 0.8 , 0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89, 0.9 , 0.91, 0.92, 0.93, 0.94, 0.95,
        0.96, 0.97, 0.98, 0.99, 1.  ]),
 <a list of 100 Patch objects>)

Regressor

plt.hist(valid_rfregress_mse_predictions, bins=100,range=[0, 1])
(array([21304.,  2296.,  1756.,  1427.,  1255.,  1147.,   984.,   918.,   748.,   744.,   646.,   594.,   565.,   473.,   456.,   434.,   441.,   379.,   363.,   349.,   366.,   324.,   312.,   329.,
          283.,   327.,   286.,   274.,   240.,   237.,   242.,   210.,   276.,   239.,   519.,     0.,   228.,   223.,   228.,   179.,   387.,     0.,   210.,   214.,   191.,   167.,   314.,     0.,
          159.,   186.,   139.,   148.,   131.,   127.,   122.,   136.,   195.,     0.,   103.,    94.,    93.,    75.,    76.,    73.,    64.,    59.,    61.,    57.,    87.,    47.,     0.,    45.,
           44.,    32.,    33.,    32.,    31.,    35.,    32.,    36.,    33.,    45.,    26.,     0.,    22.,    25.,    26.,    34.,    31.,    28.,    39.,    44.,    42.,   117.,    83.,     0.,
           91.,   147.,   232.,  4449.]),
 array([0.  , 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1 , 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2 , 0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3 , 0.31,
        0.32, 0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4 , 0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5 , 0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59, 0.6 , 0.61, 0.62, 0.63,
        0.64, 0.65, 0.66, 0.67, 0.68, 0.69, 0.7 , 0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77, 0.78, 0.79, 0.8 , 0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89, 0.9 , 0.91, 0.92, 0.93, 0.94, 0.95,
        0.96, 0.97, 0.98, 0.99, 1.  ]),
 <a list of 100 Patch objects>)

Test Set Predictions

Data Prep

cont,cat = cont_cat_split(test, max_card=1000) #Specify Continuous & Categorical Columns in the DataSet
procs = [Categorify,FillMissing]
to.test = TabularPandas(test,procs,cat,cont,splits=None)
to.test.show()
[Output of to.test.show(): the first 10 rows of the processed test set, displaying all ~300 processed columns from Field6 through Original_Quote_Elapsed; truncated here for readability.]
test_xs = to.test.xs
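Note that the test set here is processed with a fresh cont_cat_split (max_card=1000, rather than the 1 used for the training data) and with newly fitted Categorify/FillMissing procs, so the categorical encodings are not guaranteed to match those the forests were trained on. One alternative sketch, assuming fastai's Tabular.new/process methods behave as I expect (an assumption I have not verified here), is to reuse the processed training object:

# Sketch (unverified assumption): reuse the procs and column split fitted on
# the training TabularPandas so test categories are encoded consistently.
to_test = to.new(test)   # shares procs/cat_names/cont_names from the training object
to_test.process()        # applies (rather than re-fits) Categorify and FillMissing
test_xs_consistent = to_test.xs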

Predictions

#Probability of Insurance Policy Purchased - Classifier
test_classifier_gini_predictions = rfclass_gini.predict_proba(test_xs)
#Probability of Insurance Policy Purchased - Regressor 
test_rfregress_mse_predictions  = rfregress_mse.predict(test_xs)
classifier_submission = pd.DataFrame(zip(test.QuoteNumber,test_classifier_gini_predictions[:,1]), columns = ['QuoteNumber','QuoteConversion_Flag'])
classifier_submission.to_csv(path/'classifier_submission.csv',index=False)

regressor_submission = pd.DataFrame(zip(test.QuoteNumber,test_rfregress_mse_predictions), columns = ['QuoteNumber','QuoteConversion_Flag'])
regressor_submission.to_csv(path/'regressor_submission.csv',index=False)