Load a pickled model and convert a quote to successful
This notebook loads a previously trained model and uses it to predict quote success. Then any quote can be chosen (from train or test) and a sensitivity analysis will determine which fields can be changed to turn the quote's predicted outcome around.
- Set up
- Load Deep Learning model
- Load XGBoost model (optional)
- Create a DataFrame of all quotes, train and test
- Create a sensitivity analysis tool
- Use the sensitivity tool
import logging
from fastai.tabular.all import *
import pandas as pd
import numpy as np
pd.options.mode.chained_assignment = None # default='warn'
from sklearn.metrics import roc_auc_score
from IPython.utils import io # using io.capture_output
path = Path('data/homesite-quote')
logger = logging.getLogger("load_pickled_model")
logging.basicConfig(level=logging.INFO)
Load Deep Learning model
I created my pickled deep learning model using the notebook https://redditech.github.io/team-fast-tabulous/kaggle/fastai/2021/07/08/HomeSite-Quote-A-Fastai-Tabular-Approach.html. In the deep learning model section, after the cell
dl_roc_auc_score=roc_auc_score(to_np(targs), to_np(preds[:,1]))
I ran the command
save_pickle(path/"learn_0708.pkl", learn)
Now I can load that pickle into this notebook and check that it gives the same dl_roc_auc_score, which should be 0.963051
trained_dl_pkl = "learn_0708.pkl"
learn = load_pickle(path/trained_dl_pkl)
preds, targs = learn.get_preds()
print(f"Trained deep learning model {trained_dl_pkl} has a roc_auc_score of {roc_auc_score(to_np(targs), to_np(preds[:,1]))}")
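As a sanity check on the metric itself, roc_auc_score takes the true labels and the predicted probability of the positive class, which is why the notebook passes preds[:,1]. A tiny illustrative example (toy data, not from the competition):

```python
from sklearn.metrics import roc_auc_score

# Toy labels and predicted positive-class probabilities
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 (negative, positive) pairs are ranked correctly -> 0.75
print(roc_auc_score(y_true, y_prob))
```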
Load XGBoost model (optional)
I created my pickled XGBoost model using the notebook https://redditech.github.io/team-fast-tabulous/kaggle/fastai/2021/07/08/HomeSite-Quote-A-Fastai-Tabular-Approach.html. In the XGBoost model section, after the cell
plot_importance(xgb_model, height=1,max_num_features=20,)
I ran the commands
save_pickle(path/"to_0708.pkl", to)
save_pickle(path/"xgb_model_0708.pkl", xgb_model)
Now I can load those pickles into this notebook and check that they give the same xg_roc_auc_score, which should be 0.964158
to = load_pickle(path/"to_0708.pkl")
trained_xgb_pkl = "xgb_model_0708.pkl"
xgb_model = load_pickle(path/trained_xgb_pkl)
xgb_preds = xgb_model.predict_proba(to.valid.xs)
print(f"Trained XGBoost model {trained_xgb_pkl} has a roc_auc_score of {roc_auc_score(to.valid.ys.values.ravel(), xgb_preds[:, 1])}")
Create a DataFrame of all quotes, train and test
df_train = pd.read_csv(path/'train.csv', low_memory=False, parse_dates=['Original_Quote_Date'], index_col="QuoteNumber")
df_test = pd.read_csv(path/'test.csv', low_memory=False, parse_dates=['Original_Quote_Date'], index_col="QuoteNumber")
sr_conv = df_train['QuoteConversion_Flag']
df_train.drop('QuoteConversion_Flag', inplace=True, axis=1)
df = pd.concat([df_train, df_test])
df = add_datepart(df, 'Original_Quote_Date')
print(df.shape, df_train.shape, df_test.shape, sr_conv.shape)
df_train = None
df_test = None
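For context, add_datepart replaces the date column with engineered parts (year, month, day of week, and so on) prefixed by the field name. A rough pandas equivalent of a few of those parts, on a toy frame (the engineered column names here mimic fastai's prefixing but are illustrative, not fastai's exact output):

```python
import pandas as pd

# Toy frame standing in for df; the real column is Original_Quote_Date
df = pd.DataFrame({"Original_Quote_Date": pd.to_datetime(["2014-08-16", "2015-01-02"])})

# Engineer a few date parts, then drop the original date column
d = df["Original_Quote_Date"].dt
df["Original_Quote_Year"] = d.year
df["Original_Quote_Month"] = d.month
df["Original_Quote_Dayofweek"] = d.dayofweek  # Monday=0
df = df.drop("Original_Quote_Date", axis=1)

print(df)
```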
Looking at the actual conversion flags we can see a few which were successful, e.g. 25, 26, 32, 47
sr_conv.head(30)
Looking at the sorted df we can see all quote numbers are represented, at least in the first 30 ;)
df.sort_index().head(30)
Check that we have every quote number between 1 and 434589 inclusive. We do.
df.index.min(), df.index.max(), df.index.max() - df.index.min() + 1, df.shape[0], df.index.nunique()
To select a quote, use DataFrame.loc with the quote number in square brackets. For example, quote number 1
df.loc[1]
Similarly, to select an "actual" quote conversion, use the Series of conversions sr_conv with the quote number in square brackets
sr_conv[2]
We can see that our model predicts quote 25 to be successful, and it was
qn = 25
prd = learn.predict(df.loc[qn])
print("Predicted success", prd[1], "with probabilities", prd[2])
print("Actual success", sr_conv[qn])
Create a sensitivity analysis tool
A field is sensitive if changing the value of the field can change the outcome of the predicted quote success.
While the logging level is INFO, some logging will occur during a normal run. Setting the level to WARNING will only log if an unknown dtype is encountered. See the set up above to change the level.
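For example, to silence the per-field INFO messages and keep only warnings, flip the level on the logger named in the set up above:

```python
import logging

# Same logger name as in the set up cell
logger = logging.getLogger("load_pickled_model")
logger.setLevel(logging.WARNING)

# INFO messages are now suppressed; warnings still get through
assert not logger.isEnabledFor(logging.INFO)
assert logger.isEnabledFor(logging.WARNING)
```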
def sensitivity_analysis(qn):
    """Using data from quote number qn, do a sensitivity analysis on all independent variables"""
    # Independent variables
    ind_original = df.loc[qn]
    prd = learn.predict(ind_original)
    # Predicted quote conversion flag
    qcf_original = prd[1].item()
    # Probability that quote conversion flag is as predicted
    prb_original = prd[2][qcf_original].item()
    logger.info(f"Sensitivity Analysis for Quote {qn}")
    # Check if we actually know the correct answer
    if qn in sr_conv.index:
        logger.info(f"Actual QuoteConversion_Flag {sr_conv[qn]}")

    def tf_sensitive(f, v_original, lst_v, p_original):
        """Predicts quote success after changing field f from v_original to each value in lst_v.
        If the prediction changes then the quote is sensitive to the value of this field and True is returned"""
        # Create a DataFrame which has every row identical except for the field in question
        # Field f iterates through every value provided
        ind_other = df.loc[qn:qn].copy().drop(f, axis=1)  # fields other than f
        ind_f = pd.DataFrame(data={f: lst_v}, index=[qn] * len(lst_v))
        # Merge these two DataFrames to create one with all rows identical except field f
        ind = pd.merge(ind_other, ind_f, right_index=True, left_index=True)
        # Copy lines from learn.predict() because we want to predict several rows at once (faster than one at a time)
        dl = learn.dls.test_dl(ind)
        dl.dataset.conts = dl.dataset.conts.astype(np.float32)
        # Stop learn.get_preds() printing blank lines
        with io.capture_output() as captured:
            # Using get_preds() rather than predict() because get_preds can do multiple rows at once
            inp, preds, _, dec_preds = learn.get_preds(dl=dl, with_input=True, with_decoded=True)
        tf = False
        # Check if any predictions changed
        for i, dp in enumerate(dec_preds):
            qcf = dp.item()
            if qcf != qcf_original:
                prb = preds[i][qcf].item()
                logger.info(f"Changing {f} from {v_original} to {lst_v[i]} changes predicted quote conversion flag "
                            f"from {p_original:.2%} {qcf_original} to {prb:.2%} {qcf}")
                tf = True
        return tf

    set_sensitive = set()
    # Loop through all fields. Check different values of each field to see if the result is sensitive to it.
    for field in df.columns:
        ind = ind_original.copy()
        val_original = ind[field]
        tf_important = False
        num_unique = df[field].nunique()
        # If the number of unique values is under 30 then try every value (and likewise for objects)
        if num_unique < 30 or df.dtypes[field] == 'O':
            lst_unique = df[field].unique()
            if tf_sensitive(field, val_original, lst_unique, prb_original):
                tf_important = True
            if tf_important:
                logger.info(f"Possible values of {field} are {lst_unique}")
                set_sensitive.add(field)
        else:
            if df.dtypes[field] == "int64":
                vmin = df[field].min()
                vmax = df[field].max()
                lst_val = [vmin + (vmax - vmin) * i // 10 for i in range(11)]
                logger.debug(f"{field} {num_unique} {df.dtypes[field]!r} {vmin} {vmax} {lst_val}")
                if tf_sensitive(field, val_original, lst_val, prb_original):
                    tf_important = True
            elif df.dtypes[field] == "float64":
                vmin = df[field].min()
                vmax = df[field].max()
                lst_val = [vmin + (vmax - vmin) * i / 10 for i in range(11)]
                logger.debug(f"{field} {num_unique} {df.dtypes[field]!r} {vmin} {vmax} {lst_val}")
                if tf_sensitive(field, val_original, lst_val, prb_original):
                    tf_important = True
            else:
                logger.warning(f"Unknown type {field} {num_unique} {df.dtypes[field]!r}")
        if tf_important:
            set_sensitive.add(field)
    # Return the set of fields which had individual effects on the prediction
    return set_sensitive
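The candidate grids used above for high-cardinality numeric fields can be illustrated on their own: eleven evenly spaced values from the field's min to its max, with integer division keeping int64 candidates as integers. A standalone sketch (helper names are mine, not from the notebook):

```python
def int_grid(vmin, vmax, steps=10):
    # Integer division keeps candidate values as ints for int64 fields
    return [vmin + (vmax - vmin) * i // steps for i in range(steps + 1)]

def float_grid(vmin, vmax, steps=10):
    # True division gives evenly spaced floats for float64 fields
    return [vmin + (vmax - vmin) * i / steps for i in range(steps + 1)]

print(int_grid(0, 100))      # 11 values: 0, 10, ..., 100
print(float_grid(0.0, 1.0))  # 11 values from 0.0 to 1.0
```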
Use the sensitivity tool
Here are the results of running the sensitivity analysis on quote number 2.
Sensitivity Analysis for Quote 2
Changing SalesField5 from 5 to 3 changes predicted quote conversion flag from 97.04% 0 to 63.74% 1
Changing SalesField5 from 5 to 4 changes predicted quote conversion flag from 97.04% 0 to 51.11% 1
Changing PersonalField2 from 1 to 0 changes predicted quote conversion flag from 97.04% 0 to 65.11% 1
Changing PersonalField13 from 2 to 1 changes predicted quote conversion flag from 97.04% 0 to 68.25% 1
Changing PropertyField29 from nan to 10.0 changes predicted quote conversion flag from 97.04% 0 to 100.00% 1
Changing PropertyField37 from N to Y changes predicted quote conversion flag from 97.04% 0 to 95.74% 1
sensitivity_analysis(2)