Improving `cont_cat_split` in Fastai Tabular For Homesite Competition
Here I improve on fastai's `cont_cat_split` function and add some changes to the defaults to make another submission to Kaggle for the Homesite competition.
- Introduction
- Old stuff, read about in other notebooks
- Goal: Automate the bits we can, continue refining fastai parameters, continue using EDA insights gathered to date
- Submission To Kaggle
Introduction
This is a further modification of the "first pass" submission and the subsequent work on the Homesite competition on Kaggle, done in Google Colab. Some of the default parameters are modified using what was learnt from the initial exploratory data analysis, as well as from other reading. The major additions are listed below, but all are there to see whether applying what we have learnt so far improves (or not) our submission.
Changes made:
- Use a function to better split categorical and continuous fields, and set up the categories needed for training automatically
- Included in the function is a new `triage` list for fields that require manual analysis to determine the filling strategy that best suits modelling, or whether to ignore the field for training
- Changed from `RandomSplitter()` to `TrainTestSplitter()` so the training and validation sets are more fairly weighted, given the bias of the input data towards negative results
- Increase batch size to 1024 to make training shorter, but hopefully still get a better predictor. Set a separate validation batch size of 128
- Increase the validation percentage to 0.25
- Fix the learning rate to 1e-2
- Increase epochs to 7, see if it overfits and what effect that has
- Modified the `cat_names` and `cont_names` arrays with the initial insights from the EDA notebook post
- Add a date part for dates
- Add weight decay of 0.2
!pip install -Uqq fastai
from fastai.tabular.all import *
The snippet below is only useful in Colab for accessing my Google Drive, and is straight out of the fastbook source code on GitHub
global gdrive
gdrive = Path('/content/gdrive/My Drive')
from google.colab import drive
if not gdrive.exists(): drive.mount(str(gdrive.parent))
Only add the Kaggle bits below if I'm running locally; in Colab they're already here
!ls /content/gdrive/MyDrive/Kaggle/kaggle.json
Useful links here:
- Documentation on Path library
- Documentation on fastai extensions to Path library
Path.cwd()
!mkdir -p ~/.kaggle
!cp /content/gdrive/MyDrive/Kaggle/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
from kaggle import api
path = Path.cwd()
path.ls()
path = path/"gdrive/MyDrive/Kaggle/homesite_data"
path.mkdir(exist_ok=True)
Path.BASE_PATH = path
api.competition_download_cli('homesite-quote-conversion', path=path)
file_extract(path/"homesite-quote-conversion.zip")
file_extract(path/"train.csv.zip")
file_extract(path/"test.csv.zip")
path
path.ls()
Set the random seed so that the results are reproducible, and set other parameters so changes can be made quickly. Trying to avoid 'magic numbers' where possible
set_seed(42)
bs = 1024
val_bs = 128
test_size = 0.25
epochs = 5
lr = 0.0012
wd=0.2
df_train = pd.read_csv(path/"train.csv", low_memory=False)
df_train.head()
df_train.shape
df_test = pd.read_csv(path/"test.csv", low_memory=False)
df_test.head()
df_test.shape
y_column = df_train.columns.difference(df_test.columns)
y_column
From this it looks like `QuoteConversion_Flag` is the value we want to predict. Let's take a look at it
type(df_train.QuoteConversion_Flag)
df_train.QuoteConversion_Flag.unique()
type(df_train.QuoteConversion_Flag.unique()[0])
Make this a boolean for the purpose of generating predictions as a binary classification
df_train.QuoteConversion_Flag = df_train.QuoteConversion_Flag.astype(dtype='boolean')
Let's see how the training data outcomes are balanced
df_train.QuoteConversion_Flag.describe()
train_data_balance = pd.DataFrame(df_train["QuoteConversion_Flag"]).groupby("QuoteConversion_Flag")
train_data_balance["QuoteConversion_Flag"].describe()
We have about five times as many "No Sale" rows as rows showing a successful sale. This imbalance may have an impact on how effectively our model predicts positive sales results
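As a quick numeric check on that imbalance, `value_counts` gives the class proportions directly (a minimal sketch; it should agree with the `describe()` output above):

# Proportion of each outcome in the training data; expect roughly 80/20
df_train["QuoteConversion_Flag"].value_counts(normalize=True)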
First things first
Learning from my colleague Tim's work, we already know:
- `QuoteNumber` is unique, so we can make it the index
- The `Original_Quote_Date` column should be set as a date type
Additionally, we should make sure to apply any changes to data types to both train and test data so predictions don't fail later on
df_train = df_train.set_index('QuoteNumber')
df_test = df_test.set_index('QuoteNumber')
We may have some NaN values for Original_Quote_Date in either the training or test dataset, but let's confirm there are none.
df_train['Original_Quote_Date'].isna().sum(), df_test['Original_Quote_Date'].isna().sum()
df_train['Original_Quote_Date'] = pd.to_datetime(df_train['Original_Quote_Date'])
df_test['Original_Quote_Date'] = pd.to_datetime(df_test['Original_Quote_Date'])
Add the date parts with `add_datepart` to see if this helps improve modelling
df_train = add_datepart(df_train, 'Original_Quote_Date')
df_test = add_datepart(df_test, 'Original_Quote_Date')
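As a sanity check on what that did: `add_datepart` drops the original column and adds derived columns (year, month, day-of-week, the 'Is_...' boundary flags, and an elapsed time), all sharing the original column's prefix. A minimal way to list what it generated:

# Columns add_datepart derived from Original_Quote_Date
[c for c in df_train.columns if c.startswith('Original_Quote_')]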
y_names = [y_column[0]]
y_names
cont_names, cat_names = cont_cat_split(df_train, dep_var=y_names)
len(cont_names), len(cat_names)
# df_train.drop('PersonalField84', axis=1, inplace=True)
First I'll create a `triage` list for any fields that can't be programmatically optimized. These are the ones we have to do manual steps for until I find a way to do better
triage = L()
Let's take a quick look at the descriptions of the categorical and continuous splits that were done automatically
df_train[cont_names].astype('object').describe()
The first thing I notice is that a number of these fields have quite a low number of unique values. I also notice that some, like `PersonalField84`, are missing quite a bit of data. That is a lot of fields to go through, manually recategorize, and then derive categories for, but I am hoping I can do this programmatically.
Let's have a look at the existing categoricals to make sure there's nothing suspicious about these here either
df_train[cat_names].astype('object').describe()
So there may be some fields with missing data here too, like `PropertyField38`; we will have to look at a strategy for these as well
Question for later: should fields with only two categories be mapped as categorical or boolean? Would there be any impact here? See an example field below, and a hedged sketch of the boolean mapping after it
field = "Field12"
df_train[field].unique()
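One way to answer that question experimentally would be to map such a field to a boolean and compare training runs. This is a hypothetical sketch, not something this notebook runs; it assumes the field holds exactly two non-null values:

# Hypothetical: map a two-category field to a nullable boolean dtype
two_vals = df_train[field].unique()
df_train[field].map({two_vals[0]: False, two_vals[1]: True}).astype('boolean').value_counts()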
Here I define two functions, which will help:
- reset the `cont_names` and `cat_names` arrays to better fit the actual data fields, for those that are categorical but were put into the continuous array
- for all categorical fields, set up the categories for those fields and change their dtype
- for any fields that have null values, remove them from their respective array and place them in the `triage` list
def reassign_to_categorical(field, df, continuous, categorical, triage):
    # Fields with no missing values can be converted straight to categorical
    if df[field].isna().sum() == 0:
        field_categories = df[field].unique()
        df[field] = df[field].astype('category')
        # Assignment form: set_categories(inplace=True) was removed in pandas 2.0
        df[field] = df[field].cat.set_categories(field_categories)
        if field in continuous: continuous.remove(field)
        if field not in categorical: categorical.append(field)
    else:
        # Fields with NaNs need a manual fill strategy, so park them in triage
        if field in continuous: continuous.remove(field)
        if field in categorical: categorical.remove(field)
        triage.append(field)
    return df, continuous, categorical, triage
def categorize(df, cont_names, cat_names, triage, category_threshold):
    # Treat any field with few enough unique values as categorical,
    # unless it already has a categorical dtype
    for field in df.columns:
        if (len(df[field].unique()) <= category_threshold
                and not isinstance(df[field].dtype, pd.CategoricalDtype)):
            reassign_to_categorical(field, df, cont_names, cat_names, triage)
    return df, cont_names, cat_names, triage
field = 'Field8'
df_train[field].unique()
df_train, cont_names, cat_names, triage = categorize(df_train, cont_names, cat_names, triage, 100)
len(cont_names), len(cat_names)
So this is a big rebalancing between continuous and categorical fields. I saved a lot of time with this function, rather than doing it manually as I had for my initial data exploration. Let's take a look at how many fields still came up for triage, though
triage
And let's look again at `Field8` to see how it did the categorizations for me
field = 'Field8'
df_train[field].unique()
ToDo: Put in bits that triage those 9 fields that popped up. For now I will run the modelling ignoring these fields and see if any improvements happen.
triage
field = 'PersonalField7'
df_train[field].isna().any()  # note: 'in' on a Series checks the index, not the values
df_train[field].value_counts()
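A possible triage step for a field like this would be to impute the most frequent value and then re-run the reassignment. This is a hypothetical sketch, not something this notebook does; whether the mode is a sensible fill is exactly the manual judgement that put the field in `triage`, and the same fill would also need applying to `df_test`:

# Hypothetical fill strategy: impute the mode, then let the field rejoin cat_names
df_train[field] = df_train[field].fillna(df_train[field].mode()[0])
triage = L(t for t in triage if t != field)  # drop it from the triage list
reassign_to_categorical(field, df_train, cont_names, cat_names, triage)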
"QuoteConversion_Flag" in cont_names, "QuoteConversion_Flag" in cat_names #Make sure we've gotten our y-column excluded
if (y_column in cont_names): cont_names.remove(y_column)
if (y_column in cat_names): cat_names.remove(y_column)
procs = [Categorify, FillMissing, Normalize]
splits = TrainTestSplitter(test_size=test_size, stratify=df_train[y_names])(df_train)
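A quick sanity check that the stratification worked (a minimal sketch, assuming `splits` is the usual pair of train/validation index lists): the positive-class rate should be near-identical in both splits.

# Positive-class rate in the training and validation splits should match closely
train_idx, valid_idx = splits
(df_train[y_names[0]].iloc[list(train_idx)].mean(),
 df_train[y_names[0]].iloc[list(valid_idx)].mean())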
to = TabularPandas(df=df_train, procs=procs, cat_names=cat_names, cont_names=cont_names, y_names=y_names,splits=splits)
dls = to.dataloaders(bs=bs, val_bs=val_bs)
dls.valid.show_batch()
len(dls.train)*bs, len(dls.valid)*val_bs
roc_auc_binary = RocAucBinary()
learn = tabular_learner(dls, metrics=roc_auc_binary)
type(roc_auc_binary)
learn.lr_find()
Note: I ran `fit_one_cycle` with a value of 10 when prepping this notebook for publishing, but the test results came out suspiciously heavy on `1` outputs, given that the test submission I ran before was heavily weighted towards `0` outputs and got a 0.83 score when I ran with `5` (though I didn't set the random seed value then). What I think happened is that, having changed the splitter, I got a different tensor output and was looking at the alternate value column instead of the column with the prediction value.
learn.fit_one_cycle(epochs, lr, wd=wd)
I referenced another Kaggle notebook for this; we don't strictly need it, but it's good to see what the fastai metric is actually packaging up for you
preds, targs = learn.get_preds()
preds[0][0], preds[0][1]
Here was my mistake: I was looking at the wrong classifier column
len(preds)
(preds[:, 1] >= 0.5).sum(), (preds[:, 1] < 0.5).sum()
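To confirm the column layout (a minimal check; with a boolean target, fastai's default category ordering puts `False` in column 0 and `True` in column 1, so column 1 is the positive class): each row of `preds` holds the class probabilities and should sum to 1.

# Each row is [P(no sale), P(sale)]; rows should sum to 1
preds.sum(dim=1)[:5]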
from sklearn.metrics import roc_auc_score
valid_score = roc_auc_score(to_np(targs), to_np(preds[:, 1]))
valid_score
Inference here is based initially on this blog post from Walk With Fastai, with some experimenting to get to this
dl_test = dls.test_dl(df_test)
preds, _ = learn.get_preds(dl=dl_test)
(preds[:, 1] >= 0.5).sum(), (preds[:, 1] < 0.5).sum()
path.ls()
len(df_test.index), len(preds[:, 1])
preds[:1]
We want the second value; this is what gives us our prediction of how likely our model thinks this is going to be a sale
preds[:1, 1]
submission = pd.DataFrame({'QuoteNumber': df_test.index, 'QuoteConversion_Flag': preds[:, 1].tolist()}, columns=['QuoteNumber', 'QuoteConversion_Flag'])
I played around with the example on list comprehensions here to get it to work with my data (though `tolist()` should already give Python floats, this makes it explicit)
submission.QuoteConversion_Flag = [float(qcf) for qcf in submission.QuoteConversion_Flag]
submission.head()
submission.QuoteConversion_Flag = round(submission.QuoteConversion_Flag).astype('Int64')
submission.head()
len(submission[submission.QuoteConversion_Flag==1])
len(submission[submission.QuoteConversion_Flag==0])
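An aside, not what this submission does: since we tracked `RocAucBinary` during training, and if the leaderboard is scored on AUC as well, submitting the raw probabilities rather than rounding to 0/1 would preserve the ranking information that metric rewards. A hedged sketch of that alternative:

# Hypothetical alternative: submit raw probabilities instead of rounded labels
submission_probs = pd.DataFrame({'QuoteNumber': df_test.index,
                                 'QuoteConversion_Flag': preds[:, 1].tolist()})
submission_probs.head()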
submission.to_csv(path/'submission13.csv', index=False)
api.competition_submit(path/'submission13.csv',message="Thirteenth pass", competition='homesite-quote-conversion')
learn.save('homesite_fastai_nn13')