Introduction

This is a modification of my "first pass" submission to the Homesite Quote Conversion competition on Kaggle, run in Google Colab. Here I modify some of the default parameters and apply learnings from the initial exploratory data analysis, to see whether that improves (or not) the baseline submission.

Changes made:

  • Changed from RandomSplitter() to TrainTestSplitter() so the training and validation sets are stratified, given the input data's bias towards negative results (see the sketch after this list)
  • Increased the batch size to 1024 to shorten training while hopefully still getting a good predictor, and set a separate validation batch size of 128
  • Increased the validation percentage to 0.25
  • Fixed the learning rate at 1e-2
  • Increased epochs to 7, to see if it overfits and what effect that has
  • Modified the cat_names and cont_names arrays with the initial insights from the EDA notebook post
  • Added date parts for the quote date
  • Added weight decay of 0.2
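
The splitter change deserves a quick note: fastai's TrainTestSplitter() wraps sklearn's train_test_split, so passing the dependent column as stratify keeps the positive/negative ratio the same in both the training and validation sets. A minimal sketch of the idea, assuming a DataFrame df that contains the target column:

from fastai.tabular.all import TrainTestSplitter

# Stratify on the target so both splits keep the full data's class balance
splits = TrainTestSplitter(test_size=0.25, stratify=df['QuoteConversion_Flag'])(df)
train_idx, valid_idx = splits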

Set up fastai and Google Drive

!pip install -Uqq fastai
from fastai.tabular.all import *

The snippet below is only useful in Colab for accessing my Google Drive and is straight out of the fastbook source code on GitHub

global gdrive
gdrive = Path('/content/gdrive/My Drive')
from google.colab import drive
# Mount Google Drive at the parent path if it isn't already mounted
if not gdrive.exists(): drive.mount(str(gdrive.parent))

The Kaggle bits below only need adding if I'm running locally; in Colab they're already available
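
For a local run, the Kaggle client itself would need installing first (an assumption about the local environment; Colab preinstalls the kaggle package):

!pip install -Uqq kaggle  # only needed outside Colab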

 
!ls /content/gdrive/MyDrive/Kaggle/kaggle.json
/content/gdrive/MyDrive/Kaggle/kaggle.json

Check the current working directory:

Path.cwd()
Path('/content')

Set up Kaggle environment parameters

!mkdir -p ~/.kaggle
!cp /content/gdrive/MyDrive/Kaggle/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
from kaggle import api
path = Path.cwd()
path.ls()
(#4) [Path('/content/.config'),Path('/content/models'),Path('/content/gdrive'),Path('/content/sample_data')]
path = path/"gdrive/MyDrive/Kaggle/homesite_data"
path.mkdir(exist_ok=True)
Path.BASE_PATH = path
api.competition_download_cli('homesite-quote-conversion', path=path)
file_extract(path/"homesite-quote-conversion.zip")
file_extract(path/"train.csv.zip")
file_extract(path/"test.csv.zip")
homesite-quote-conversion.zip: Skipping, found more recently modified local copy (use --force to force download)
path
Path('.')
path.ls()
(#18) [Path('homesite-quote-conversion.zip'),Path('models'),Path('sample_submission.csv.zip'),Path('test.csv.zip'),Path('train.csv.zip'),Path('train.csv'),Path('test.csv'),Path('sample_submission.csv'),Path('submission.csv'),Path('submission2.csv')...]

Exploring the Homesite data

First set the random seed so that the results are reproducible

set_seed(42) 
bs = 1024
val_bs = 128
test_size = 0.25
epochs = 7
lr = 1e-2
wd = 0.2
df_train = pd.read_csv(path/"train.csv", low_memory=False)
df_train.head()
df_train.head() output, truncated for readability: 5 rows × 299 columns, running from QuoteNumber, Original_Quote_Date and QuoteConversion_Flag through the Field*, CoverageField*, SalesField*, PersonalField*, PropertyField* and GeographicField* groups to GeographicField63 and GeographicField64.

df_train.shape
(260753, 299)
df_test = pd.read_csv(path/"test.csv", low_memory=False)
df_test.head()
df_test.head() output, truncated for readability: 5 rows × 298 columns, the same fields as the training set minus QuoteConversion_Flag.

df_test.shape
(173836, 298)
y_column = df_train.columns.difference(df_test.columns)
y_column
Index(['QuoteConversion_Flag'], dtype='object')

From this it looks like QuoteConversion_Flag is the value we want to predict. Let's take a look at this

type(df_train.QuoteConversion_Flag)
pandas.core.series.Series
df_train.QuoteConversion_Flag.unique()
array([0, 1])
type(df_train.QuoteConversion_Flag.unique()[0])
numpy.int64

Make this a boolean so the model treats it as a binary classification target

df_train.QuoteConversion_Flag = df_train.QuoteConversion_Flag.astype(dtype='boolean')

Let's see how the training data outcomes are balanced

df_train.QuoteConversion_Flag.describe()
count     260753
unique         2
top        False
freq      211859
Name: QuoteConversion_Flag, dtype: object
train_data_balance = pd.DataFrame(df_train["QuoteConversion_Flag"]).groupby("QuoteConversion_Flag")
train_data_balance["QuoteConversion_Flag"].describe()
                       count  unique    top    freq
QuoteConversion_Flag
False                 211859       1  False  211859
True                   48894       1   True   48894

We have more than four times as many "No Sale" rows (211,859) as rows showing a successful sale (48,894). This imbalance may limit how well our model predicts positive sales results
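
A quick way to see the imbalance as proportions (a small sketch using pandas value_counts):

# normalize=True returns fractions rather than raw counts
df_train["QuoteConversion_Flag"].value_counts(normalize=True)
# False ≈ 0.8125, True ≈ 0.1875 (roughly a 4.3:1 imbalance)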

First things first

Learning from my colleague Tim's work, we already know:

  • QuoteNumber is unique, so we can make it the index
  • The Original_Quote_Date column should be set as a date type

Additionally, we should make sure to apply any changes to data types to both train and test data so predictions don't fail later on

df_train = df_train.set_index('QuoteNumber')
df_test = df_test.set_index('QuoteNumber')

We may have some NaN values for Original_Quote_Date in either the training or test dataset, but let's confirm there are none.

df_train['Original_Quote_Date'].isna().sum(), df_test['Original_Quote_Date'].isna().sum()
(0, 0)
df_train['Original_Quote_Date'] = pd.to_datetime(df_train['Original_Quote_Date'])
df_test['Original_Quote_Date'] = pd.to_datetime(df_test['Original_Quote_Date'])

Add the date parts to see if this helps improve modeling

df_train = add_datepart(df_train, 'Original_Quote_Date')
df_test = add_datepart(df_test, 'Original_Quote_Date')
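
add_datepart drops the original column and replaces it with derived parts (year, month, week, day, day-of-week, elapsed time and several boolean flags). A quick way to inspect what it created:

# List the derived date columns; Original_Quote_Date itself has been dropped
[c for c in df_train.columns if c.startswith('Original_Quote_')]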

Goal: Better model training, refining fastai parameters, using EDA insights gathered to date

y_names = [y_column[0]]
y_names
['QuoteConversion_Flag']
cont_names, cat_names = cont_cat_split(df_train, dep_var=y_names)
len(cont_names), len(cat_names)
(155, 154)
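
For context, cont_cat_split assigns float columns, and integer columns with more than max_card unique values, to the continuous list, with everything else going to the categorical list. That is why float-typed but category-like fields such as Field9 land in cont_names. The call above is equivalent to spelling out the default:

cont_names, cat_names = cont_cat_split(df_train, max_card=20, dep_var=y_names)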

Modifying these lists here based on EDA notebook learnings to date

'Field8' in cont_names
True
'Field9' in cat_names, 'Field9' in cont_names
(False, True)
field9_categories = df_train['Field9'].unique()
df_train['Field9'] = df_train['Field9'].astype('category')
df_train['Field9'].cat.set_categories(field9_categories, inplace=True)
cont_names.remove('Field9')
cat_names.append('Field9')
'Field11' in cat_names, 'Field11' in cont_names
(False, True)
field11_categories = df_train['Field11'].unique()
df_train['Field11'] = df_train['Field11'].astype('category')
df_train['Field11'].cat.set_categories(field11_categories, inplace=True)
cont_names.remove('Field11')
cat_names.append('Field11')
'PropertyField25' in cat_names, 'PropertyField25' in cont_names
(False, True)
propertyfield25_categories = df_train['PropertyField25'].unique()
df_train['PropertyField25'] = df_train['PropertyField25'].astype('category')
df_train['PropertyField25'].cat.set_categories(propertyfield25_categories, inplace=True)
cont_names.remove('PropertyField25')
cat_names.append('PropertyField25')
df_train.drop('PropertyField29', axis=1, inplace=True)
df_train.drop('PersonalField84', axis=1, inplace=True)
cont_names.remove('PersonalField84')
cont_names.remove('PropertyField29')
"QuoteConversion_Flag" in cont_names, "QuoteConversion_Flag" in cat_names #Make sure we've gotten our y-column excluded
(False, False)
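
Since the same three steps repeat for each field, a small helper would tidy this up (a sketch of an alternative to the cells above; move_to_categorical is my own name, not a fastai function, and the separate set_categories calls are omitted because astype already infers the categories):

def move_to_categorical(df, col, cont_names, cat_names):
    # Convert to pandas categorical dtype and move the column name
    # from the continuous list to the categorical list
    df[col] = df[col].astype('category')
    cont_names.remove(col)
    cat_names.append(col)

for col in ['Field9', 'Field11', 'PropertyField25']:
    move_to_categorical(df_train, col, cont_names, cat_names)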
procs = [Categorify, FillMissing, Normalize]
splits = TrainTestSplitter(test_size=test_size, stratify=df_train[y_names])(df_train)
to = TabularPandas(df=df_train, procs=procs, cat_names=cat_names, cont_names=cont_names, y_names=y_names,splits=splits)
dls = to.dataloaders(bs=bs, val_bs=val_bs)
dls.valid.show_batch()
dls.valid.show_batch() output, truncated for readability: ten decoded validation rows shown across all of the processed categorical and continuous columns, including the added Original_Quote_* date parts, with QuoteConversion_Flag as the final column.
len(dls.train)*512, len(dls.valid)*128 # NB: 512 is stale from an earlier run; with bs=1024 the first term should be len(dls.train)*bs
(97280, 65280)
roc_auc_binary = RocAucBinary()
learn = tabular_learner(dls, metrics=roc_auc_binary)
type(roc_auc_binary)
fastai.metrics.AccumMetric
learn.lr_find()
SuggestedLRs(valley=tensor(0.0010))

Note: when prepping this notebook for publishing I first ran fit_one_cycle for 10 epochs, and the test predictions came out suspiciously heavy on 1 outputs, given that an earlier submission (5 epochs, though without the random seed set) was heavily weighted towards 0 outputs and scored 0.83. I think what happened is that after changing the splitter I got a differently shaped prediction tensor and was reading the wrong column, i.e. the alternate class instead of the column with the positive-class prediction.

learn.fit_one_cycle(epochs, lr, wd=wd)
epoch train_loss valid_loss roc_auc_score time
0 0.248215 0.201386 0.949011 00:13
1 0.196663 0.190094 0.955587 00:13
2 0.185024 0.185835 0.957160 00:13
3 0.181798 0.182034 0.958874 00:13
4 0.175761 0.179600 0.960017 00:13
5 0.168771 0.177265 0.961372 00:13
6 0.158945 0.179190 0.960952 00:13

Note that training loss kept falling while validation loss ticked up slightly in the final epoch (0.177265 to 0.179190), a first hint of overfitting at 7 epochs. For the check below I referenced another Kaggle notebook; we don't strictly need it, but it's good to see what the fastai metric is actually packaging up for you

preds, targs = learn.get_preds()
preds[0:1][0][0], preds[0:1][0][1]
(tensor(1.0000), tensor(1.0704e-05))
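
Each row of preds is a softmax over the two classes: column 0 is the probability of no sale (False) and column 1 the probability of a sale (True), so every row should sum to 1. A quick sanity check:

import torch

# Every row's class probabilities should sum to 1
assert torch.allclose(preds.sum(dim=1), torch.ones(len(preds)))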

Here was my mistake: I was looking at the wrong class column

len(preds)
65189
(preds[:,1] >= 0.5).sum(), (preds[:,1] < 0.5).sum()
(tensor(9910), tensor(55279))
from sklearn.metrics import roc_auc_score
valid_score = roc_auc_score(to_np(targs), to_np(preds[:,1]))
valid_score
0.9609515205141398

Doing inference based on this blog post from Walk With Fastai initially, then experimenting to arrive at the following

dl_test = dls.test_dl(df_test)
preds, _ = learn.get_preds(dl=dl_test)
(preds[:,1] >= 0.5).sum(), (preds[:,1] < 0.5).sum()
(tensor(26627), tensor(147209))
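
As a rough sanity check, compare the predicted positive rate on the test set with the training base rate (a small sketch using the objects above):

# ~15.3% of test rows are predicted positive vs ~18.75% positive in training
pred_pos_rate = (preds[:,1] >= 0.5).float().mean().item()
train_pos_rate = float(df_train['QuoteConversion_Flag'].mean())
print(f"test: {pred_pos_rate:.3f}, train: {train_pos_rate:.3f}")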

Submission To Kaggle

path.ls()
(#18) [Path('homesite-quote-conversion.zip'),Path('models'),Path('sample_submission.csv.zip'),Path('test.csv.zip'),Path('train.csv.zip'),Path('train.csv'),Path('test.csv'),Path('sample_submission.csv'),Path('submission.csv'),Path('submission2.csv')...]
file_extract(path/"sample_submission.csv.zip")
path.ls()
(#18) [Path('homesite-quote-conversion.zip'),Path('models'),Path('sample_submission.csv.zip'),Path('test.csv.zip'),Path('train.csv.zip'),Path('train.csv'),Path('test.csv'),Path('sample_submission.csv'),Path('submission.csv'),Path('submission2.csv')...]
df_submission = pd.read_csv(path/"sample_submission.csv") # I could add `low_memory=False` but it makes things slower
df_submission.head()
QuoteNumber QuoteConversion_Flag
0 3 0
1 5 0
2 7 0
3 9 0
4 10 0
df_submission.tail()
QuoteNumber QuoteConversion_Flag
173831 434570 0
173832 434573 0
173833 434574 0
173834 434575 0
173835 434589 0
len(df_test.index), len(preds[:,1])
(173836, 173836)
type(preds)
torch.Tensor
preds.dtype
torch.float32
preds[0,1]
tensor(0.0010)
preds[:1][:1]
tensor([[0.9990, 0.0010]])

We want the second value; this is the model's confidence that the quote converts to a sale

preds[:1][:1][:,1]
tensor([0.0010])
preds_for_submission = preds[:,1].tolist()
preds_for_submission[0:3]
[0.0010175276547670364, 0.03169810026884079, 0.025726569816470146]
fpfs = [float(pfs) for pfs in preds_for_submission]
fpfs[0:2]
[0.0010175276547670364, 0.03169810026884079]
integers = [[1], [2], [3]]

values = [int(integer[0]) for integer in integers]
values
[1, 2, 3]
submission = pd.DataFrame({'QuoteNumber': df_test.index, 'QuoteConversion_Flag': preds[:,1].tolist()}, columns=['QuoteNumber', 'QuoteConversion_Flag'])

Needed to figure out how to extract the floating-point value alone from the list to properly compose the CSV output DataFrame

type(submission.QuoteConversion_Flag)
pandas.core.series.Series
type(submission.QuoteConversion_Flag[0])
numpy.float64
submission.QuoteConversion_Flag[0]
0.0010175276547670364

Played around with the list comprehension example above to get it to work with the data I had

submission.QuoteConversion_Flag = [float(qcf) for qcf in submission.QuoteConversion_Flag]
submission.head()
QuoteNumber QuoteConversion_Flag
0 3 0.001018
1 5 0.031698
2 7 0.025727
3 9 0.000540
4 10 0.429832
submission.QuoteConversion_Flag = round(submission.QuoteConversion_Flag).astype('Int64')
submission.head()
QuoteNumber QuoteConversion_Flag
0 3 0
1 5 0
2 7 0
3 9 0
4 10 0
len(submission[submission.QuoteConversion_Flag==1])
26627
len(submission[submission.QuoteConversion_Flag==0])
147209
submission.to_csv(path/'submission11.csv', index=False)
api.competition_submit(path/'submission11.csv',message="Tenth pass", competition='homesite-quote-conversion')
100%|██████████| 1.45M/1.45M [00:01<00:00, 765kB/s]
Successfully submitted to Homesite Quote Conversion
learn.save('homesite_fastai_nn11')
Path('models/homesite_fastai_nn11.pth')