Using Fastai Tabular For Tabular Playground Series June 2021 Competition
Here I use fastai with some changes to the defaults to make a submission to Kaggle for the Tabular Playground Series June 2021 competition.
- Introduction
- Setup fastai and Google drive
- Environment Setup
- Setup kaggle environment parameters
- Exploring the Playground data
- First things first
- Goal: Better model training, refining fastai parameters, using EDA insights gathered to date
- EDA on the categorical and continuous feature splits
- Submission To Kaggle
Introduction
This is a modification of the "first pass" submission to the Homesite competition on Kaggle using Google Colab, but with some of the default parameters changed and some learning from the initial exploratory data analysis on other projects applied to this one, to see how what I've learnt so far fares in a submission.
Changes made:
- Use TrainTestSplitter() to make the training and validation sets more fairly weighted when the input data is biased towards particular outcomes (see the sketch after this list)
- Increase the batch size to 1024 to make training shorter while still hopefully getting a better predictor. Set a separate validation batch size of 128.
- Increase the validation percentage to 0.25
- Fix the learning rate to 1e-3
- Increase epochs to 5
- Modified the `cat_names` and `cont_names` arrays with any initial EDA insights
- Add a date part for dates
- Add weight decay of 0.2
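As a minimal sketch of the first change (toy data only, and assuming fastai is already installed; the real install and import cells follow below), TrainTestSplitter with stratify keeps the class proportions roughly the same in both splits:

from fastai.tabular.all import *

# Toy frame with a 90/10 imbalance to show that stratify preserves the ratio in each split
toy = pd.DataFrame({'x': range(1000), 'y': ['a']*900 + ['b']*100})
train_idx, valid_idx = TrainTestSplitter(test_size=0.25, stratify=toy['y'])(toy)
print(toy['y'].iloc[list(train_idx)].value_counts(normalize=True))
print(toy['y'].iloc[list(valid_idx)].value_counts(normalize=True))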
!pip install -Uqq fastai
from fastai.tabular.all import *
!pip install kaggle
Useful links here:
- Documentation on Path library
- Documentation on fastai extensions to Path library
Path.cwd()
Environment Setup
Set up the environment variables here (these will change as we play around to get the best model for submission)
- Set the random seed so that the results are reproducible
- Set the batch size for training
- Set the batch size for validation
- Set what portion of training data to set aside for validation
- Set the number of training epochs to run so as not to overfit
- Set the learning rate to be used
- Set the weight decay
set_seed(42)
bs = 1024
val_bs = 128
test_size = 0.25
epochs = 5
lr = 1e-3
wd = 0.2
from kaggle import api
path = Path.cwd()
path.ls()
This bit is to make sure I don't check my data into GitHub when I'm finished
!touch .gitignore
!echo "_data" > .gitignore
!head .gitignore
!mkdir _data
os.chdir('_data')
Path.cwd()
path = Path.cwd()/"playground_Jun_2021_data"
path.mkdir(exist_ok=True)
Path.BASE_PATH = path
api.competition_download_cli('tabular-playground-series-jun-2021', path=path)
path.ls()
file_extract(path/"tabular-playground-series-jun-2021.zip")
path.ls()
df_train = pd.read_csv(path/"train.csv", low_memory=False)
df_train.head()
df_train.shape
df_train.describe()
df_train.isna().values.any() # check whether any value in the training set is missing
df_test = pd.read_csv(path/"test.csv", low_memory=False)
df_test.head()
df_test.shape
df_test.isna().values.any() # check whether any value in the test set is missing
y_column = df_train.columns.difference(df_test.columns)
y_column
From this it looks like target is the value we want to predict. Let's take a look at this
type(df_train.target)
df_train.target.unique()
type(df_train.target.unique()[0])
df_train.target.isna().sum()
Make this a category for the purpose of generating predictions as a classification
df_train.target = df_train.target.astype(dtype='category')
target_categories = df_train['target'].unique()
target_categories
df_train['target'].cat.set_categories(target_categories, inplace=True)
Let's see how the training data outcomes are balanced
df_train.target.describe()
train_data_balance = pd.DataFrame(df_train["target"]).groupby("target")
train_data_balance["target"].describe()
It's not quite equally weighted, e.g. Class_4 and Class_5 have roughly a tenth as many rows as Class_6
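A quick way to quantify that imbalance is to look at the normalized class frequencies:

df_train.target.value_counts(normalize=True).sort_values() # proportion of rows per class, smallest first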
df_train.id.value_counts(), df_test.id.value_counts()
df_train = df_train.set_index('id')
df_test = df_test.set_index('id')
y_names = [y_column[0]]
y_names
cont_names, cat_names = cont_cat_split(df_train, dep_var=y_names)
len(cont_names), len(cat_names)
The goal here is to validate the split, rearrange as needed, and explicitly set the dtype and categories for categorical columns. I am going to reuse some functions I wrote for another post that help speed up evaluating the splits and reassigning them better than the fastai defaults.
First I'll create a triage list for any fields that can't be programmatically optimized. These are the ones we'll have to handle manually until I find a way to do better.
triage = L()
Let's take a quick look at the descriptions of the categorical and continuous splits automatically done
df_train[cont_names].astype('object').describe()
The first thing I notice is that a number of these fields have quite a low number of unique values. Luckily, I don't see any missing or NA values, as all the count totals equal the total number of rows of data. There are a lot of fields to go through, manually recategorize and then get the categories for, but I am hoping I can do this programmatically.
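One quick check for which of the continuous-split columns actually behave like categoricals is to count the unique values per column:

df_train[cont_names].nunique().sort_values().head(20) # columns with only a handful of distinct values are candidates for recategorizing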
Let's have a look at the existing categoricals to make sure there's nothing suspicious about these here either
df_train[cat_names].astype('object').describe()
Here I define two functions, which will help:
- to reset the `cont_names` and `cat_names` arrays with better fits of the actual data fields, for those that are categorical but were put into the continuous array
- for all categorical fields, to also set up the categories for those fields and change their dtype
- for any fields that have null values, to remove them from their respective array and place them in the `triage` list
def reassign_to_categorical(field, df, continuous, categorical, triage):
    "Move `field` into the categorical list and set its dtype, or triage it if it has missing values"
    if df[field].isna().sum() == 0:
        field_categories = df[field].unique()
        df[field] = df[field].astype('category')
        df[field].cat.set_categories(field_categories, inplace=True)
        if field in continuous: continuous.remove(field)
        if field not in categorical: categorical.append(field)
    else:
        if field in continuous: continuous.remove(field)
        if field in categorical: categorical.remove(field)
        triage.append(field)
    return df, continuous, categorical, triage
def categorize(df, cont_names, cat_names, triage, category_threshold):
    "Reassign any non-categorical column with at most `category_threshold` unique values to the categorical list"
    for field in df.columns:
        if ((len(df[field].unique()) <= category_threshold) and (type(df[field].dtype) != pd.core.dtypes.dtypes.CategoricalDtype)):
            reassign_to_categorical(field, df, cont_names, cat_names, triage)
    return df, cont_names, cat_names, triage
df_train, cont_names, cat_names, triage = categorize(df_train, cont_names, cat_names, triage, 100)
len(cont_names), len(cat_names)
So this is a big rebalancing between continuous and categorical fields. I saved a lot of time with this function compared to doing it manually like I had for my initial data exploration. Let's take a look at how many fields still came up for triaging though
triage
This was expected, since my triage fields would only be those that have NA values, so that we can look closer at the data to evaluate a FillMissing strategy or, if there are too many missing values, ignore the field when modelling
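If any fields had ended up in triage, a rough sketch of how I'd evaluate them is to look at what fraction of each is missing before choosing between FillMissing and dropping the column (hypothetical here, since the triage list is empty):

# Hypothetical check: the triage list is empty for this dataset, but if it
# weren't, the fraction of missing values per field would guide the decision
for field in triage:
    na_frac = df_train[field].isna().mean()
    print(f"{field}: {na_frac:.1%} missing")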
Check to make sure I didn't make any typos; if the counts show any missing rows, that's a sign
df_train.describe(include='all')
df_train.isna().values.any() # confirm nothing went missing after the reassignment
"id" in cont_names, "id" in cat_names #Make sure we've gotten our y-column excluded
procs = [Categorify, FillMissing, Normalize]
splits = TrainTestSplitter(test_size=test_size, stratify=df_train[y_names])(df_train)
to = TabularPandas(df=df_train, procs=procs, cat_names=cat_names, cont_names=cont_names, y_names=y_names,splits=splits)
dls = to.dataloaders(bs=bs, val_bs=val_bs)
dls.valid.show_batch()
len(dls.train)*bs, len(dls.valid)*val_bs
learn = tabular_learner(dls, layers=[500,1000], ps=[0.01,0.001], metrics=accuracy) # layers and ps are learner arguments, so they go here rather than in dataloaders
learn.lr_find()
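The return value of lr_find can also be captured and read programmatically rather than eyeballing the plot; exactly which suggestion fields it exposes (lr_min/lr_steep in some fastai releases, valley in later ones) depends on the version, so this is just a sketch:

# Re-runs the LR finder and prints whatever suggestion fields this fastai version provides
suggestions = learn.lr_find()
print(suggestions)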
lr = 0.0014
epochs = 4
learn.fit_one_cycle(epochs,lr, wd=wd)
preds, targs = learn.get_preds()
preds.shape
preds[0:1]
len(preds)
Doing inference based initially on this blog post from Walk With Fastai, then experimenting to get to this
dl_test = dls.test_dl(df_test)
preds, _ = learn.get_preds(dl=dl_test)
preds.shape
path.ls()
df_submission = pd.read_csv(path/"sample_submission.csv") #I could add `low_memory=False` but it makes things slower
df_submission.head()
df_submission.tail()
len(df_test.index), len(preds)
type(preds)
preds.dtype
preds_for_submission = preds.tolist()
preds_for_submission[0:1]
submission = pd.DataFrame({'id': df_test.index, 'Predictions': preds.tolist()}, columns=['id', 'Predictions'])
Needed to figure out how to extract the floating point values alone from the list to properly compose the CSV output dataframe
type(submission.Predictions)
type(submission.Predictions[0][0])
submission.Predictions[0][0]
submission.Predictions.tolist()[0]
submission.head()
submission.index
submission[['Class_1','Class_2', 'Class_3', 'Class_4', 'Class_5', 'Class_6', 'Class_7', 'Class_8', 'Class_9']] = pd.DataFrame(submission.Predictions.tolist(), index= submission.index)
submission.head()
submission.drop(["Predictions"],axis=1, inplace=True)
submission.head()
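As an aside, a rough alternative sketch for building the same per-class columns would be to go straight from the predictions tensor to a DataFrame, skipping the intermediate Predictions list column (like the approach above, this assumes the column order matches the model's vocab):

# Assumes the class column order matches dls.vocab / the sample submission header
class_cols = ['Class_1', 'Class_2', 'Class_3', 'Class_4', 'Class_5', 'Class_6', 'Class_7', 'Class_8', 'Class_9']
submission_alt = pd.DataFrame(preds.numpy(), columns=class_cols)
submission_alt.insert(0, 'id', df_test.index.values)
submission_alt.head()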
submission.to_csv(path/'submission10.csv', index=False)
api.competition_submit(path/'submission10.csv',message="Tenth pass", competition='tabular-playground-series-jun-2021')
learn.save('tabular-playground-series-jun-2021_fastai_nn9')