Another Pass Using Fastai Tabular For Homesite Competition
Here I use fastai with some changes to the defaults to make another submission to Kaggle for the Homesite competition.
- Introduction
- Setup fastai and Google drive
- Setup kaggle environment parameters
- Exploring the Homesite data
- First things first
- Goal: Better model training, refining fastai parameters, using EDA insights gathered to date
- Submission To Kaggle
Introduction
This is a modification of my "first pass" submission to the Homesite competition on Kaggle, again using Google Colab. This time I modify some of the default parameters and apply some of the learning from the initial exploratory data analysis, to see whether the baseline submission improves (or not).
Changes made:
- Changed from `RandomSplitter()` to `TrainTestSplitter()` so the training and validation sets are more fairly weighted, given the bias of the input data towards negative results
- Increased the batch size to 1024 to make training shorter while hopefully still getting a better predictor. Set a separate validation batch size of 128.
- Increased the validation percentage to 0.25
- Fixed the learning rate at 1e-2
- Increased epochs to 7, to see if the model overfits and what effect that has
- Modified the `cat_names` and `cont_names` arrays with the initial insights from the EDA notebook post
- Added date parts for the date column
- Added weight decay of 0.2
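The splitter change is worth a closer look. `TrainTestSplitter()` can stratify on the target (I believe it wraps `sklearn.model_selection.train_test_split`), so the roughly 5:1 negative/positive ratio is preserved in both the training and validation sets. A minimal pure-Python sketch of what stratification does, not fastai's implementation:

```python
import random

def stratified_split(labels, test_size=0.25, seed=42):
    """Split indices so each class keeps the same proportion in both sets."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    train, valid = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_valid = round(len(idxs) * test_size)
        valid += idxs[:n_valid]
        train += idxs[n_valid:]
    return sorted(train), sorted(valid)

# Imbalanced toy labels: 5 negatives for every positive
labels = [0] * 500 + [1] * 100
train, valid = stratified_split(labels, test_size=0.25)
train_pos = sum(labels[i] for i in train) / len(train)
valid_pos = sum(labels[i] for i in valid) / len(valid)
```

Both splits end up with the same positive-class fraction, which a plain random split would only approximate.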
!pip install -Uqq fastai
from fastai.tabular.all import *
The snippet below is only useful in Colab for accessing my Google Drive and is straight out of the fastbook source code on GitHub
global gdrive
gdrive = Path('/content/gdrive/My Drive')
from google.colab import drive
if not gdrive.exists(): drive.mount(str(gdrive.parent))
Only add the Kaggle bits below if I'm running locally; in Colab they're already here
!ls /content/gdrive/MyDrive/Kaggle/kaggle.json
Useful links here:
- Documentation on Path library
- Documentation on fastai extensions to Path library
Path.cwd()
!mkdir -p ~/.kaggle
!cp /content/gdrive/MyDrive/Kaggle/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
from kaggle import api
path = Path.cwd()
path.ls()
path = path/"gdrive/MyDrive/Kaggle/homesite_data"
path.mkdir(exist_ok=True)
Path.BASE_PATH = path
api.competition_download_cli('homesite-quote-conversion', path=path)
file_extract(path/"homesite-quote-conversion.zip")
file_extract(path/"train.csv.zip")
file_extract(path/"test.csv.zip")
path
path.ls()
First set the random seed so that the results are reproducible
set_seed(42)
bs = 1024
val_bs = 128
test_size = 0.25
epochs = 7
lr = 1e-2
wd=0.2
df_train = pd.read_csv(path/"train.csv", low_memory=False)
df_train.head()
df_train.shape
df_test = pd.read_csv(path/"test.csv", low_memory=False)
df_test.head()
df_test.shape
y_column = df_train.columns.difference(df_test.columns)
y_column
From this it looks like `QuoteConversion_Flag` is the value we want to predict. Let's take a look at it
type(df_train.QuoteConversion_Flag)
df_train.QuoteConversion_Flag.unique()
type(df_train.QuoteConversion_Flag.unique()[0])
Make this a boolean for the purpose of generating predictions as a binary classification
df_train.QuoteConversion_Flag = df_train.QuoteConversion_Flag.astype(dtype='boolean')
Let's see how the training data outcomes are balanced
df_train.QuoteConversion_Flag.describe()
train_data_balance = pd.DataFrame(df_train["QuoteConversion_Flag"]).groupby("QuoteConversion_Flag")
train_data_balance["QuoteConversion_Flag"].describe()
We have about five times as many "No Sale" rows as rows showing a successful sale. This imbalance may limit the model's ability to predict positive sales results
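The imbalance can also be read off directly by normalising the counts. A small stand-alone sketch using synthetic labels with the same roughly 5:1 shape (the real numbers come from `df_train.QuoteConversion_Flag`):

```python
from collections import Counter

# Synthetic stand-in for df_train.QuoteConversion_Flag (real data is ~5:1)
flags = [False] * 500 + [True] * 100

counts = Counter(flags)
ratio = counts[False] / counts[True]          # negatives per positive
fractions = {k: v / len(flags) for k, v in counts.items()}
print(ratio, fractions)
```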
First things first
Learning from my colleague Tim's work, we already know:

- `QuoteNumber` is unique, so we can make it the index
- The `Original_Quote_Date` column should be set as a date type

Additionally, we should apply any data type changes to both the train and test data so predictions don't fail later on
df_train = df_train.set_index('QuoteNumber')
df_test = df_test.set_index('QuoteNumber')
We may have some NaN values for Original_Quote_Date in either the training or test dataset, but let's confirm there are none.
df_train['Original_Quote_Date'].isna().sum(), df_test['Original_Quote_Date'].isna().sum()
df_train['Original_Quote_Date'] = pd.to_datetime(df_train['Original_Quote_Date'])
df_test['Original_Quote_Date'] = pd.to_datetime(df_test['Original_Quote_Date'])
Add the date_part to see if this helps improve modeling
df_train = add_datepart(df_train, 'Original_Quote_Date')
df_test = add_datepart(df_test, 'Original_Quote_Date')
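For reference, `add_datepart` replaces the datetime column with a set of engineered calendar features (year, month, day, day of week, month and quarter boundaries, elapsed time, and so on). A rough stdlib sketch of the kind of features it derives, not the actual fastai implementation:

```python
from datetime import date

def date_parts(d: date) -> dict:
    """Derive a few calendar features, add_datepart-style."""
    return {
        "Year": d.year,
        "Month": d.month,
        "Day": d.day,
        "Dayofweek": d.weekday(),  # Monday == 0
        "Is_month_start": d.day == 1,
        "Is_quarter_start": d.month in (1, 4, 7, 10) and d.day == 1,
    }

parts = date_parts(date(2013, 8, 16))
```

Each derived column gives the model a chance to pick up seasonal or weekly patterns that a raw timestamp hides.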
y_names = [y_column[0]]
y_names
cont_names, cat_names = cont_cat_split(df_train, dep_var=y_names)
len(cont_names), len(cat_names)
Modifying these lists here based on EDA notebook learnings to date
'Field8' in cont_names
'Field9' in cat_names, 'Field9' in cont_names
field9_categories = df_train['Field9'].unique()
df_train['Field9'] = df_train['Field9'].astype('category')
df_train['Field9'] = df_train['Field9'].cat.set_categories(field9_categories)
cont_names.remove('Field9')
cat_names.append('Field9')
'Field11' in cat_names, 'Field11' in cont_names
field11_categories = df_train['Field11'].unique()
df_train['Field11'] = df_train['Field11'].astype('category')
df_train['Field11'] = df_train['Field11'].cat.set_categories(field11_categories)
cont_names.remove('Field11')
cat_names.append('Field11')
'PropertyField25' in cat_names, 'PropertyField25' in cont_names
propertyfield25_categories = df_train['PropertyField25'].unique()
df_train['PropertyField25'] = df_train['PropertyField25'].astype('category')
df_train['PropertyField25'] = df_train['PropertyField25'].cat.set_categories(propertyfield25_categories)
cont_names.remove('PropertyField25')
cat_names.append('PropertyField25')
df_train.drop('PropertyField29', axis=1, inplace=True)
df_train.drop('PersonalField84', axis=1, inplace=True)
cont_names.remove('PersonalField84')
cont_names.remove('PropertyField29')
"QuoteConversion_Flag" in cont_names, "QuoteConversion_Flag" in cat_names #Make sure we've gotten our y-column excluded
procs = [Categorify, FillMissing, Normalize]
splits = TrainTestSplitter(test_size=test_size, stratify=df_train[y_names])(df_train)
to = TabularPandas(df=df_train, procs=procs, cat_names=cat_names, cont_names=cont_names, y_names=y_names,splits=splits)
dls = to.dataloaders(bs=bs, val_bs=val_bs)
dls.valid.show_batch()
len(dls.train)*bs, len(dls.valid)*val_bs
roc_auc_binary = RocAucBinary()
learn = tabular_learner(dls, metrics=roc_auc_binary)
type(roc_auc_binary)
learn.lr_find()
Note that I ran `fit_one_cycle` with a value of 10 when prepping this notebook for publishing, but the test results came out suspiciously heavy on `1` outputs, given that the test submission I ran before was heavily weighted towards `0` outputs and scored 0.83 when I ran with 5 epochs (though I hadn't set the random seed then). I think what happened was that, after changing the splitter, I got a different tensor output and was looking at the alternate value column instead of the column with the prediction value.
learn.fit_one_cycle(epochs, lr, wd=wd)
I referenced another Kaggle notebook for this. We don't need it, but it's good to see what the fastai metric is actually packaging up for you
preds, targs = learn.get_preds()
preds[0:1][0][0], preds[0:1][0][1]
Here was my mistake: I was looking at the wrong classifier value
len(preds)
(preds[:,1] >= 0.5).sum(), (preds[:,1] < 0.5).sum()
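What tripped me up: for a two-class problem, `get_preds` returns one row per sample with two probabilities that sum to 1, and with a boolean target here column 0 corresponds to `False` (no sale) and column 1 to `True` (a sale). A small pure-Python softmax sketch showing why reading the wrong column flips the interpretation:

```python
from math import exp

def softmax(logits):
    """Convert raw two-class logits into probabilities summing to 1."""
    exps = [exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# One sample's raw output: the model leans towards class 1 ("sale")
row = softmax([0.5, 2.0])
p_no_sale, p_sale = row[0], row[1]
```

Reading `p_no_sale` when you meant `p_sale` reverses the apparent confidence, which is exactly the symptom I saw in the earlier run.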
from sklearn.metrics import roc_auc_score
valid_score = roc_auc_score(to_np(targs), to_np(preds[:,1]))
valid_score
Doing inference based on this blog post from Walk With Fastai initially, but then experimenting to get this
dl_test = dls.test_dl(df_test)
preds, _ = learn.get_preds(dl=dl_test)
(preds[:,1] >= 0.5).sum(), (preds[:,1] < 0.5).sum()
path.ls()
file_extract(path/"sample_submission.csv.zip")
path.ls()
df_submission = pd.read_csv(path/"sample_submission.csv") #I could add `low_memory=False` but it makes things slower
df_submission.head()
df_submission.tail()
len(df_test.index), len(preds[:,1])
type(preds)
preds.dtype
preds[0,1]
preds[:1][:1]
We want the second value; this is what gives us our confidence that the quote will convert to a sale
preds[:1][:1][:,1]
preds_for_submission = preds[:,1].tolist()
preds_for_submission[0:3]
fpfs = [float(pfs) for pfs in preds_for_submission]
fpfs[0:2]
integers = [[1], [2], [3]]
values = [int(integer[0]) for integer in integers]
values
submission = pd.DataFrame({'QuoteNumber': df_test.index, 'QuoteConversion_Flag': preds[:,1].tolist()}, columns=['QuoteNumber', 'QuoteConversion_Flag'])
Needed to figure out how to extract the floating point value alone from the list to properly compose the csv output dataframe
type(submission.QuoteConversion_Flag)
type(submission.QuoteConversion_Flag[0])
submission.QuoteConversion_Flag[0]
I played around with the list comprehension example here to get it to work with what I had
submission.QuoteConversion_Flag = [float(qcf) for qcf in submission.QuoteConversion_Flag]
submission.head()
submission.QuoteConversion_Flag = round(submission.QuoteConversion_Flag).astype('Int64')
submission.head()
len(submission[submission.QuoteConversion_Flag==1])
len(submission[submission.QuoteConversion_Flag==0])
submission.to_csv(path/'submission11.csv', index=False)
api.competition_submit(path/'submission11.csv',message="Tenth pass", competition='homesite-quote-conversion')
learn.save('homesite_fastai_nn11')