HomeSite Quote Conversion: A Fastai Tabular Approach
We used the Fastai Tabular library and Walk with Fastai functions to build our deep learning model for the HomeSite Quote Conversion competition on Kaggle. The techniques used in this notebook include permutation importance analysis, model ensembling, Bayesian optimisation for hyperparameter tuning and entity embeddings.
As Jeremy Howard mentioned in one of his lessons on tabular data, there are two main methods for modelling tabular or structured data:
- Ensembles of Decision Trees (Random Forest or Gradient Boosting Machine)
- Multilayered Neural Networks
The former method is far more popular and is usually our first approach when analysing a new tabular dataset. This is why he focused more on how to train and improve the performance of a random forest model rather than going deep into training a deep learning model.
However, Fastai itself has a lot of great functions for training and improving the performance of tabular models, and this is what we want to explore in this article.
Most of the techniques and code in this notebook were borrowed from Walk with Fastai by Zack Mueller, so please check out his official website: https://walkwithfastai.com/
When you think about improving your model’s performance, some of the methods that you will definitely consider include:
- Feature Engineering
- Hyperparameter Tuning
- Combining Models
This is also exactly what we are going to discuss in this article.
- Basic Deep Learning Model for Tabular Data
- Permutation Analysis to identify Important Features
- Model Ensembling
- Hyperparameter Tuning with Bayesian Optimisation
- Entity Embeddings
!pip install -Uqq fastai fastbook
import fastbook
fastbook.setup_book()
from fastai.tabular.all import *
import pandas as pd
import numpy as np
pd.options.mode.chained_assignment = None # default='warn'
from sklearn.metrics import roc_auc_score
Download your Kaggle API token (kaggle.json), store it in a folder in Google Drive and run the code below
!mkdir -p ~/.kaggle
!cp /content/gdrive/MyDrive/Kaggle/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
Assign the folder where you want to store the data to path
path = Path('/content/gdrive/MyDrive/Kaggle/' + 'data/homesite-quote')
path.mkdir(parents=True, exist_ok=True)
path
Download your data into that folder
!kaggle competitions download -c homesite-quote-conversion -p /content/gdrive/MyDrive/Kaggle/data/homesite-quote
Unzip your data
! unzip -q -n '{path}/train.csv.zip' -d '{path}'
! unzip -q -n '{path}/test.csv.zip' -d '{path}'
Import data and store it as a dataframe
df = pd.read_csv(path/'train.csv', low_memory=False, parse_dates=['Original_Quote_Date'])
test_df=pd.read_csv(path/'test.csv', low_memory=False, parse_dates=['Original_Quote_Date'])
df_train = df.copy()
df_test = test_df.copy()
df_train.QuoteConversion_Flag = df_train.QuoteConversion_Flag.astype(dtype='boolean')
df_train = df_train.set_index('QuoteNumber')
df_test = df_test.set_index('QuoteNumber')
df_train = add_datepart(df_train, 'Original_Quote_Date')
df_test = add_datepart(df_test, 'Original_Quote_Date')
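add_datepart expands the quote date into many numeric columns (year, month, day of week and so on) so the model can pick up seasonality. Below is a pandas-only sketch of the idea using made-up dates; fastai's actual version adds many more columns, such as Elapsed.

```python
import pandas as pd

# Made-up quote dates, standing in for Original_Quote_Date
dates = pd.to_datetime(pd.Series(['2014-08-16', '2015-01-02']))

# A few of the parts that add_datepart derives from each date
parts = pd.DataFrame({
    'Year': dates.dt.year,
    'Month': dates.dt.month,
    'Dayofweek': dates.dt.dayofweek,        # Monday=0 ... Sunday=6
    'Is_month_start': dates.dt.is_month_start,
})
print(parts)
```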
y_names = 'QuoteConversion_Flag'
cont_names, cat_names = cont_cat_split(df_train, dep_var=y_names)
len(cont_names), len(cat_names)
Set up hyperparameters
random_seed = 42
bs = 4096
val_bs = 512
test_size = 0.3
epochs = 5
lr = 1e-2
wd=0.002
layers = [10000,500]
dropout = [0.001, 0.01]
y_block=CategoryBlock()
emb_dropout=0.02
set_seed(42)
roc_auc_binary = RocAucBinary()
procs = [Categorify, FillMissing, Normalize]
splits = TrainTestSplitter(test_size=test_size, stratify=df_train[y_names])(df_train)
to = TabularPandas(df=df_train, procs=procs, cat_names=cat_names,
cont_names=cont_names, y_names=y_names,splits=splits,
y_block=y_block)
dls = to.dataloaders(bs=bs, val_bs=val_bs)
dls.valid.show_batch()
learn = tabular_learner(dls, layers=layers, config=tabular_config(embed_p=emb_dropout, ps=dropout), metrics=roc_auc_binary)
learn.lr_find(start_lr=1e-07, end_lr=1000,suggest_funcs=(valley, slide, minimum, steep))
learn.fit_one_cycle(epochs,lr, wd=wd)
preds, targs = learn.get_preds()
dl_roc_auc_score=roc_auc_score(to_np(targs), to_np(preds[:,1]))
class PermutationImportance():
    "Calculate and plot the permutation importance"
    def __init__(self, learn:Learner, df=None, bs=None):
        "Initialize with a test dataframe, a learner, and a metric"
        self.learn = learn
        self.df = df
        bs = bs if bs is not None else learn.dls.bs
        self.dl = learn.dls.test_dl(self.df, bs=bs) if self.df is not None else learn.dls[1]
        self.x_names = learn.dls.x_names.filter(lambda x: '_na' not in x)
        self.na = learn.dls.x_names.filter(lambda x: '_na' in x)
        self.y = learn.dls.y_names
        self.results = self.calc_feat_importance()
        self.plot_importance(self.ord_dic_to_df(self.results))

    def measure_col(self, name:str):
        "Measures change after column shuffle"
        col = [name]
        if f'{name}_na' in self.na: col.append(f'{name}_na')
        orig = self.dl.items[col].values
        perm = np.random.permutation(len(orig))
        self.dl.items[col] = self.dl.items[col].values[perm]
        metric = self.learn.validate(dl=self.dl)[1]
        self.dl.items[col] = orig
        return metric

    def calc_feat_importance(self):
        "Calculates permutation importance by shuffling each column in turn"
        print('Getting base error')
        base_error = self.learn.validate(dl=self.dl)[1]
        self.importance = {}
        pbar = progress_bar(self.x_names)
        print('Calculating Permutation Importance')
        for col in pbar:
            self.importance[col] = self.measure_col(col)
        for key, value in self.importance.items():
            self.importance[key] = (base_error - value) / base_error # relative change; this can be adjusted
        return OrderedDict(sorted(self.importance.items(), key=lambda kv: kv[1], reverse=True))

    def ord_dic_to_df(self, dict:OrderedDict):
        return pd.DataFrame([[k, v] for k, v in dict.items()], columns=['feature', 'importance'])

    def plot_importance(self, df:pd.DataFrame, limit=30, asc=False, **kwargs):
        "Plot importance with an optional limit on how many variables are shown"
        df_copy = df.copy()
        df_copy['feature'] = df_copy['feature'].str.slice(0, 25)
        df_copy = df_copy.sort_values(by='importance', ascending=asc)[:limit].sort_values(by='importance', ascending=not asc)
        ax = df_copy.plot.barh(x='feature', y='importance', sort_columns=True, **kwargs)
        for p in ax.patches:
            ax.annotate(f'{p.get_width():.4f}', ((p.get_width() * 1.005), p.get_y() * 1.005))
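The importance score reported for each column is the relative drop in the validation metric after shuffling that column. With made-up AUC numbers:

```python
# Hypothetical validation AUC before and after shuffling one column
base_auc = 0.962
shuffled_auc = 0.945

# Same formula as in calc_feat_importance: relative change from the base score
importance = (base_auc - shuffled_auc) / base_auc
print(f'{importance:.4f}')
```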
imp = PermutationImportance(learn)
From this, the most important fields are PropertyField37, PersonalField2, PersonalField1 and SalesField5.
import xgboost as xgb
n_estimators = 100
max_depth = 8
learning_rate = 0.1
subsample = 0.5
X_train, y_train = to.train.xs, to.train.ys.values.ravel()
X_valid, y_valid = to.valid.xs, to.valid.ys.values.ravel()
model = xgb.XGBClassifier(n_estimators=n_estimators, max_depth=max_depth, learning_rate=learning_rate, subsample=subsample, tree_method='gpu_hist')
xgb_model = model.fit(X_train, y_train)
xgb_preds = xgb_model.predict_proba(X_valid)
xg_roc_auc_score=roc_auc_score(y_valid, xgb_preds[:,1])
from xgboost import plot_importance
plot_importance(xgb_model, height=1,max_num_features=20,)
From this, the most important fields were SalesField1A, PersonalField9, Original_Quote_Elapsed, PersonalField10A, PersonalField10B and PropertyField37.
avgs = (to_np(preds) + xgb_preds) / 2
dl_roc_auc_score
xg_roc_auc_score
dlxg_roc_auc_score=roc_auc_score(y_valid, avgs[:,1])
dlxg_roc_auc_score
So we get slightly better performance by ensembling these two models.
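The ensemble here is just an unweighted average of the two models' predicted probabilities. With toy numbers for five validation rows:

```python
import numpy as np

# Made-up positive-class probabilities from the two models
dl_probs  = np.array([0.91, 0.12, 0.65, 0.40, 0.83])
xgb_probs = np.array([0.88, 0.20, 0.55, 0.45, 0.90])

# Simple unweighted average, as in avgs = (preds + xgb_preds) / 2
ensemble = (dl_probs + xgb_probs) / 2
print(ensemble)
```

A weighted average (giving the stronger model a larger weight) is a common next thing to try.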
from sklearn.ensemble import RandomForestClassifier
tree = RandomForestClassifier(n_estimators=100)
tree.fit(X_train, y_train)
!pip install rfpimp
from rfpimp import *
impTree = importances(tree, X_valid, to.valid.ys)
plot_importances(impTree)
So here the most important are PropertyField37
, Field7
,PersonalField1
, SalesField5
, PersonalField9
,PersonalField2
forest_preds = tree.predict_proba(X_valid)
rf_roc_auc_score=roc_auc_score(y_valid, forest_preds[:,1])
rf_roc_auc_score
new_avgs = (to_np(preds) + xgb_preds + forest_preds) / 3
dlxg_roc_auc_score
dlxgrf_roc_auc_score=roc_auc_score(y_valid, new_avgs[:,1])
dlxgrf_roc_auc_score
So it gets slightly worse when we add Random Forest to the ensemble.
There are 4 common methods of hyperparameter optimisation for machine learning:
- Manual - Manually test every single hyperparameter combination
- Grid Search - Set up a grid of model hyperparameters and run an automatic loop through every single scenario
- Random Search - Similar to Grid Search, but hyperparameter combinations are sampled at random
- Bayesian Optimisation
The biggest difference between Grid Search, Random Search and Bayesian Optimisation is that the former two do not take past evaluations into account while the latter does.
There is a whole field of research dedicated to this particular optimisation method, so if you want to understand the mathematical or statistical formulas behind it, I would suggest reading a book on the topic.
But simply put, Bayesian optimisation builds a probability model of the objective function and uses it to select the most promising hyperparameters to evaluate on the true objective function:
- Build a surrogate probability model of the objective function
- Find the hyperparameters that perform best on the surrogate
- Apply these hyperparameters to the true objective function
- Update the surrogate model incorporating the new results
- Repeat steps 2–4 until max iterations or time is reached
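The loop above can be sketched with a deliberately simplified toy: here a quadratic polynomial fit stands in for the probabilistic surrogate (bayes_opt actually uses a Gaussian process), and the "hyperparameter" is a single number x. All function names and numbers are made up for illustration.

```python
import numpy as np

def objective(x):
    # The true, expensive objective (imagine: train a model, return its AUC)
    return -(x - 2.0) ** 2 + 4.0

rng = np.random.default_rng(0)
X = list(rng.uniform(0, 4, size=3))   # a few initial random evaluations
Y = [objective(x) for x in X]

for _ in range(5):
    # 1. fit the surrogate to the evaluations collected so far
    coeffs = np.polyfit(X, Y, deg=2)
    # 2. find the candidate the surrogate rates best
    grid = np.linspace(0, 4, 201)
    x_next = grid[np.argmax(np.polyval(coeffs, grid))]
    # 3. evaluate the true objective there and 4. add the result to the data
    X.append(x_next)
    Y.append(objective(x_next))

best_x = X[int(np.argmax(Y))]   # converges towards the true maximiser x = 2
print(best_x)
```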
You can easily install a Bayesian optimisation Python library
!pip install bayesian-optimization
from bayes_opt import BayesianOptimization
Define an objective function to optimise. In this case, we are going to tune the learning rate, the dropout rates, the number of layers and the layer sizes.
def fit_with(lr:float, wd:float, dp:float, n_layers:float, layer_1:float, layer_2:float, layer_3:float):
    print(lr, wd, dp)
    if round(n_layers) == 2:
        layers = [round(layer_1), round(layer_2)]
    elif round(n_layers) == 3:
        layers = [round(layer_1), round(layer_2), round(layer_3)]
    else:
        layers = [round(layer_1)]
    # note: here wd is used as the linear-layer dropout probability ps
    config = tabular_config(embed_p=float(dp), ps=float(wd))
    learn = tabular_learner(dls, layers=layers, metrics=roc_auc_binary, config=config)
    with learn.no_bar(), learn.no_logging():
        learn.fit(5, lr=float(lr))
    preds, targs = learn.get_preds()
    auc_score = roc_auc_score(to_np(targs), to_np(preds[:,1]))
    return auc_score
We can also specify a range of values for each hyperparameter that we want to tune.
hps = {'lr': (1e-05, 1e-01),
'wd': (4e-4, 0.4),
'dp': (0.01, 0.5),
'n_layers': (1,3),
'layer_1': (50, 200),
'layer_2': (100, 1000),
'layer_3': (200, 2000)}
optim = BayesianOptimization(
f = fit_with, # our fit function
pbounds = hps, # our hyper parameters to tune
verbose = 2, # 1 prints out when a maximum is observed, 0 for silent, 2 prints out everything
random_state=1
)
%time optim.maximize(n_iter=10)
Print out the set of hyperparameters that produces the best AUC score and use it to train our deep learning model again
print(optim.max)
random_seed = 42
bs = 4096
val_bs = 512
test_size = 0.3
epochs = 5
lr = 0.02 #0.03
wd=0.3 #0.32
layers = [200] #200,150[50,1000]
dropout = [0.001, 0.01]
y_block=CategoryBlock()
emb_dropout=0.17 #0.13 #0.2
set_seed(42)
roc_auc_binary = RocAucBinary()
dls = to.dataloaders(bs=bs, val_bs=val_bs)
learn = tabular_learner(dls, layers=layers, config=tabular_config(embed_p=emb_dropout, ps=dropout), metrics=roc_auc_binary)
learn.fit_one_cycle(6,lr, wd=wd)
preds, targs = learn.get_preds()
dl_opt_roc_au_score = roc_auc_score(to_np(targs), to_np(preds[:,1]))
dl_roc_auc_score, dl_opt_roc_au_score
The final ROC AUC score increases from 0.962 to 0.963, which is quite significant for this particular dataset: if you have a look at the Kaggle leaderboard, scores start at around 0.95 and the highest is around 0.97.
The most common variable types in structured data are:
- continuous variables
- categorical variables
To use categorical variables in our model, we need to represent them as integers.
Sometimes there is an intrinsic ordering in those numbers (for example: low - 1, medium - 2, high - 3); these are called ordinal variables.
But more often than not, those integers are arbitrary and don't carry any information by themselves; these are called nominal variables. For example, sex or states (we cannot really rank states).
Our model will perform better if those integers contain some relevant information about that particular categorical variable.
One way to deal with this problem is to use one-hot encoding. But the issue with one-hot encoding is that it is computationally expensive.
What entity embeddings do is put similar values of a categorical variable closer together in the embedding space. You can first train a neural network with categorical embeddings and then use those embeddings instead of the raw categorical columns in other models.
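A toy illustration of the difference, with a made-up 5-level categorical: one-hot encoding turns each value into a sparse 5-dimensional vector, while an embedding table (learned during training in practice, random here) maps each integer code to a dense 2-dimensional vector, and identical codes get identical vectors.

```python
import numpy as np

n_levels, emb_dim = 5, 2
one_hot = np.eye(n_levels)                        # one sparse column per level
rng = np.random.default_rng(0)
emb_table = rng.normal(size=(n_levels, emb_dim))  # stands in for a learned embedding

codes = np.array([0, 3, 3, 1])     # integer-encoded categorical values
dense = emb_table[codes]           # embedding lookup: each code becomes a 2-vector
print(one_hot.shape, dense.shape)
```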
learn.model.embeds
A function to embed features, adapted from the Fastai forums
def embed_features(learner, xs):
    "Replace each categorical column in xs with its learned embedding values"
    xs = xs.copy()
    for i, feature in enumerate(learner.dls.cat_names):
        emb = learner.model.embeds[i]
        # look up the embedding vectors, then move them back to the CPU for pandas
        vec = emb(tensor(xs[feature].values, dtype=torch.int64).to(default_device()))
        new_feat = pd.DataFrame(vec.detach().cpu().numpy(), index=xs.index,
                                columns=[f'{feature}_{j}' for j in range(emb.embedding_dim)])
        xs.drop(columns=feature, inplace=True)
        xs = xs.join(new_feat)
    return xs
Run this function on both your training and validation sets
emb_xs = embed_features(learn, to.train.xs)
emb_valid_xs = embed_features(learn, to.valid.xs)
Use the embedded features to train your gradient boosting model
model = xgb.XGBClassifier(n_estimators = n_estimators, max_depth=max_depth, learning_rate=0.1, subsample=subsample, tree_method='gpu_hist')
xgb_model_emb=model.fit(emb_xs, y_train)
xgb_emb_preds = xgb_model_emb.predict_proba(emb_valid_xs)
xg_emb_roc_auc_score=roc_auc_score(y_valid, xgb_emb_preds[:,1])
xg_roc_auc_score, xg_emb_roc_auc_score
In this problem, it doesn't seem to change the AUC score. However, you might be more successful with other datasets.