Random Forest Classifier or Regressor - Which to Choose?
A basic comparison of predictions from an off-the-shelf sklearn classifier versus a regressor, with different criteria for splitting the decision trees. The data is taken from the Homesite Competition on Kaggle.
- Setup Working Environment
- Kaggle Dataset
- Minimal Data Exploration
- Sample Training and Validation Sets from Train DF
- Random Forest Classifier
- Create Random Forest Regressor
- Test Set Predictions
This notebook compares two basic Random Forest models, each with two different criteria for splitting decision trees. Distributions of predictions are plotted at the end. Rather than provide any concrete conclusions, I leave this here as food for thought.
import os
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt
from fastai.tabular.all import *  # add_datepart, cont_cat_split, TabularPandas, etc.
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import roc_auc_score, plot_confusion_matrix
from google.colab import drive

drive.mount('/content/drive')
os.chdir('/content/drive/MyDrive/Colab Notebooks/GroupProject')  # must match the mount point above
path = Path.cwd()
path
We should now have all the functions we require to run the notebook. The next thing we need is our dataset...
I have previously downloaded the dataset from Kaggle and extracted the files. My teammate Nissan has already described how to download Kaggle data. So let's check the files listed in the directory:
path.ls()
train = pd.read_csv(path/'_data/train.csv',low_memory=False)
test = pd.read_csv(path/'_data/test.csv',low_memory=False)
I'm only presenting the basic steps here; my teammates have provided other notebooks with more detailed EDA.
train.shape,train.columns
train.shape tells us that there are 260,753 rows, and 299 columns of data in the training set. To look at the column names specifically we can use train.columns, which also confirms there are 299 columns.
test.shape,test.columns
test.shape tells us that there are fewer records in the test set than in the training set (260,753 vs. 173,753), and one fewer column (299 vs. 298). The missing column is our dependent variable, "QuoteConversion_Flag".
The output from .columns also shows that we have a variable containing dates in the second column, "Original_Quote_Date". Numerical coding of the date, although easy to read, hides potentially valuable information: which day of the week was it? Was it a holiday? Is this date closer to the start or end of the calendar/financial year? This information could influence what we are trying to predict. Thankfully, fast.ai provides a useful function, add_datepart, that takes the date column and generates extra columns holding this sort of information. Let's give it a go and have a look at the output.
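The idea behind this kind of date expansion can be sketched with the standard library (a minimal stand-in for illustration, not fast.ai's actual implementation; the feature subset shown here is just an example):

```python
import calendar
from datetime import date

def datepart(d: date) -> dict:
    """Toy stand-in for fastai's add_datepart: expand one date into
    several features a decision tree can split on (illustrative subset)."""
    return {
        'Year': d.year,
        'Month': d.month,
        'Day': d.day,
        'Dayofweek': d.weekday(),                 # 0 = Monday ... 6 = Sunday
        'Dayofyear': d.timetuple().tm_yday,
        'Is_month_end': d.day == calendar.monthrange(d.year, d.month)[1],
    }

print(datepart(date(2014, 8, 16)))
```

Each derived column turns an opaque timestamp into something a tree can compare against a threshold (e.g. "is Dayofweek >= 5?" captures weekends).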
train = add_datepart(train, 'Original_Quote_Date')
train.columns
test = add_datepart(test, 'Original_Quote_Date') #Run once
test.columns
By calling .columns once again, we can see we now have 311 (training) and 310 (test) columns with info about the day of week, day of year, holiday, etc.
train['QuoteConversion_Flag'].unique()
train['QuoteConversion_Flag'].describe()
dep_var = 'QuoteConversion_Flag'
Before we run our models, we need to split our training dataframe into a training set and a validation set. The validation set will not be passed to the model for training, and will give us metrics for how well our model generalises to 'unseen' data. For a first attempt, let's randomly assign 80% of the records to the training set and 20% to the validation set.
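The split described above can be sketched in plain Python (a minimal illustration of the idea, not fastai's RandomSplitter itself):

```python
import random

def random_split(n_rows: int, valid_pct: float = 0.2, seed: int = 42):
    """Shuffle the row indices and carve off valid_pct of them as the
    validation set; the rest become the training set."""
    rng = random.Random(seed)
    idxs = list(range(n_rows))
    rng.shuffle(idxs)
    cut = int(n_rows * valid_pct)
    return idxs[cut:], idxs[:cut]   # (train_idxs, valid_idxs)

train_idx, valid_idx = random_split(260_753)
print(len(train_idx), len(valid_idx))  # -> 208603 52150
```

Fixing the seed makes the split reproducible, so re-running the notebook trains on the same 80% each time.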
cont,cat = cont_cat_split(train, 1, dep_var=dep_var) #Specify continuous & categorical columns in the dataset
#cont,cat,dep_var
procs = [Categorify,FillMissing]
splits = RandomSplitter(valid_pct=0.2, seed=42)(range_of(train)) #RandomSplitter takes its own seed; random.seed would not affect it
to = TabularPandas(train,procs,cat,cont,y_names=dep_var,splits=splits)
len(to.train),len(to.valid)
xs,y = to.train.xs,to.train.y
valid_xs,valid_y = to.valid.xs,to.valid.y
Looking at the dependent variable, "QuoteConversion_Flag", we can see that this is a binary outcome, i.e. 0 or 1. The obvious choice when using decision trees with binary data is a Random Forest Classifier: each tree votes, and the most popular class is chosen as the final result.
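The voting scheme can be sketched as follows (a toy illustration of the idea, not sklearn's internals):

```python
from collections import Counter

def classifier_vote(tree_preds):
    """Sketch of how a Random Forest Classifier combines its trees:
    each tree casts one vote and the most common class wins."""
    return Counter(tree_preds).most_common(1)[0][0]

# Five hypothetical trees: three vote 0, two vote 1 -> majority class is 0
print(classifier_vote([0, 0, 1, 0, 1]))  # -> 0
```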
rfclass_gini = RandomForestClassifier(n_jobs=-1, random_state=42, criterion = "gini",oob_score=False).fit(xs, y)
train_classifier_gini_predictions = rfclass_gini.predict(xs)
valid_classifier_gini_predictions = rfclass_gini.predict(valid_xs)
The output from the classifier is a set of binary predictions. See below:
plt.hist(valid_classifier_gini_predictions, bins=10)
To get the probabilities of each outcome, i.e. whether a customer purchases insurance (1) or not (0), we need to call a different function:
rfclass_gini.predict_proba(valid_xs)
plt.hist(rfclass_gini.predict_proba(valid_xs), bins=10)
We can see the output is a probability for each class, so in this case we have two probability estimates for each record. However, the Kaggle competition scores entries on the area under the ROC curve (AUC), so let's see how the classifier performs on this metric.
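For intuition, ROC AUC has a rank interpretation: it is the probability that a randomly chosen positive record is scored above a randomly chosen negative one (ties count half). A toy stand-in for sklearn's roc_auc_score, written from that definition:

```python
def roc_auc(y_true, y_score):
    """ROC AUC via its rank interpretation: fraction of (positive, negative)
    pairs where the positive gets the higher score (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # -> 0.75
```

This O(n²) version is only for intuition; sklearn computes the same quantity efficiently from the ROC curve.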
train_classifier_auc = roc_auc_score(y,train_classifier_gini_predictions)
valid_classifier_auc = roc_auc_score(valid_y,valid_classifier_gini_predictions)
train_classifier_auc,valid_classifier_auc
For a first model, without any feature selection or data cleaning, an ROC AUC of 0.81 on the validation set isn't bad. A no-skill model that predicts the same class for every record produces an ROC AUC of 0.5. The code below shows this:
random_probs = [0 for _ in range(len(valid_y))]
random_lr_auc = roc_auc_score(valid_y, random_probs)
random_lr_auc
We can see that the random forest classifier is doing a lot better than this baseline, which is encouraging.
Another way of visualising the results of a classification task is a confusion matrix, which plots the counts of predicted vs. true classes. This classifier seems to do a relatively good job of predicting records for customers who did not purchase home insurance (True = 0), but struggles a lot more with records for customers who did go on to purchase home insurance (True = 1). Another observation is that there is a skew in the number of records towards customers who did NOT go on to purchase home insurance.
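Behind the plot, a confusion matrix is just a table of counts; a minimal sketch with hypothetical predictions:

```python
def confusion_matrix_2x2(y_true, y_pred):
    """Counts of predicted vs true classes for a binary task.
    Rows = true label (0, 1); columns = predicted label (0, 1)."""
    m = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m

# Hypothetical labels, skewed toward class 0 like the Homesite data
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_pred = [0, 0, 0, 1, 1, 0, 0, 1]
print(confusion_matrix_2x2(y_true, y_pred))  # -> [[4, 1], [1, 2]]
```

The diagonal holds correct predictions; the off-diagonal cells are the two kinds of error (false positives and false negatives).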
plot_confusion_matrix(rfclass_gini,valid_xs, valid_y,values_format='d')
plt.show()
The Random Forest Classifier has a second option for the criterion used to split the decision trees. Let's try the same process with the 'entropy' criterion and compare the results.
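For reference, the two criteria measure node impurity differently: Gini impurity is 1 - Σ pᵢ², while entropy is -Σ pᵢ log₂ pᵢ. A quick sketch of both:

```python
import math

def gini(p):
    """Gini impurity for class probabilities p: 1 - sum(p_i^2)."""
    return 1 - sum(pi ** 2 for pi in p)

def entropy(p):
    """Shannon entropy in bits: -sum(p_i * log2(p_i)), skipping p_i = 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Both are zero for a pure node and maximal for a 50/50 split
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # -> 0.5 1.0
```

Both measures rank candidate splits very similarly in practice, which is consistent with the small differences seen below.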
rfclass_entropy = RandomForestClassifier(n_jobs=-1, random_state=42, criterion = "entropy",oob_score=False).fit(xs, y)
train_classifier_entropy_predictions = rfclass_entropy.predict(xs)
valid_classifier_entropy_predictions = rfclass_entropy.predict(valid_xs)
roc_auc_score(y,train_classifier_entropy_predictions),roc_auc_score(valid_y,valid_classifier_entropy_predictions)
plot_confusion_matrix(rfclass_entropy,valid_xs, valid_y,values_format='d')
plt.show()
Comparing the confusion matrices gives us an indication of how the different criterion influenced the predictions from the classifiers.
The 'entropy' criterion looks as though it did very slightly better at classifying records in which customers did not purchase home insurance (True Label = 0), and slightly worse than 'gini impurity' for records in which customers did go on to purchase home insurance (True Label = 1).
Given the task is to predict the probability that a customer went on to purchase home insurance, using the 'gini impurity' criterion seems more appropriate. That said, with such a small difference between the two criteria, the choice is likely to have very little influence on the results of the model overall. Time spent investigating the effect of removing redundant or highly correlated variables may be more fruitful.
Although I said a Random Forest Classifier is the most obvious choice for binary data, we could try something else. The Kaggle competition task is to predict the probability that a customer purchased home insurance, so why not use a method that directly outputs probabilities?
In short:
A classifier takes the most common prediction from all of the decision trees. So if we run 100 trees, and 51 predict that the customer purchased home insurance (1) while 49 predict that they did not (0), the overall prediction is that the customer purchased home insurance (1). Since we just want a probability for this task, not a classification, why force the model to classify? This could have a particularly big impact in intermediate cases like the one above.
Why not try a regressor? A Random Forest Regressor takes the average of the predictions from all of the decision trees, so in the example above the overall prediction would be 0.51. This prediction can be passed directly to Kaggle to compute the ROC AUC score. This more conservative approach could provide improvements for the intermediate cases, so let's try it!
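The two aggregation rules side by side (a toy sketch of the idea, not sklearn's internals):

```python
def classifier_prediction(tree_preds):
    """Majority vote: the class predicted by more than half the trees."""
    return int(sum(tree_preds) > len(tree_preds) / 2)

def regressor_prediction(tree_preds):
    """Mean of the per-tree predictions: already a probability-like score."""
    return sum(tree_preds) / len(tree_preds)

# The borderline case from the text: 51 of 100 trees say "purchased" (1)
trees = [1] * 51 + [0] * 49
print(classifier_prediction(trees))  # -> 1     (the 51/49 nuance is discarded)
print(regressor_prediction(trees))   # -> 0.51  (usable directly for AUC)
```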
rfregress_mse = RandomForestRegressor(n_jobs=-1,random_state=42,criterion="mse",oob_score=False).fit(xs, y)
train_rfregress_mse_predictions = rfregress_mse.predict(xs)
valid_rfregress_mse_predictions = rfregress_mse.predict(valid_xs)
train_lr_auc = roc_auc_score(y, train_rfregress_mse_predictions)
valid_lr_auc = roc_auc_score(valid_y, valid_rfregress_mse_predictions)
train_lr_auc,valid_lr_auc
With the regressor, using default parameters and splitting trees based on the mean squared error ('mse') criterion, we get over a 10% increase in ROC AUC on the validation set! (Caveat: this difference may shrink after more detailed EDA and QC.)
Let's re-check the classifier's ROC AUC to be sure:
train_classifier_auc,valid_classifier_auc
I will leave investigating the reasons for these differences to another notebook; that could involve writing our own classifiers and regressors so we can compare them with the same or different loss functions. But to show the differences in the outputs of the two models, histograms of the predictions from the Random Forest Classifier and the Random Forest Regressor are shown below:
plt.hist(rfclass_gini.predict_proba(valid_xs)[:,1], bins=100,range=[0, 1])
plt.hist(valid_rfregress_mse_predictions, bins=100,range=[0, 1])
cont,cat = cont_cat_split(test, 1) #Specify continuous & categorical columns, using the same max_card as for the training set
procs = [Categorify,FillMissing]
to.test = TabularPandas(test,procs,cat,cont,splits=None)
to.test.show()
test_xs = to.test.xs
test_classifier_gini_predictions = rfclass_gini.predict_proba(test_xs)
#Probability of Insurance Policy Purchased - Regressor
test_rfregress_mse_predictions = rfregress_mse.predict(test_xs)
classifier_submission = pd.DataFrame(zip(test.QuoteNumber,test_classifier_gini_predictions[:,1]), columns = ['QuoteNumber','QuoteConversion_Flag'])
classifier_submission.to_csv(path/'classifier_submission.csv',index=False)
regressor_submission = pd.DataFrame(zip(test.QuoteNumber,test_rfregress_mse_predictions), columns = ['QuoteNumber','QuoteConversion_Flag'])
regressor_submission.to_csv(path/'regressor_submission.csv',index=False)