Exploring Homesite Data (WORK IN PROGRESS)
Exploring Homesite Data in Colab
- Introduction
- Setup fastai and Google drive
- Setup kaggle environment
- Exploring the Homesite data
- First things first
- Goal: Identifying our categorical and continuous data columns
!pip install -Uqq fastai
from fastai.tabular.all import *
The snippet below is only useful in Colab for accessing my Google Drive and is straight out of the fastbook source code on GitHub
global gdrive
gdrive = Path('/content/gdrive/My Drive')
from google.colab import drive
if not gdrive.exists(): drive.mount(str(gdrive.parent))
Add the Kaggle bits
!pip install kaggle
!ls /content/gdrive/MyDrive/Kaggle/kaggle.json
Useful links here:
- Documentation on Path library
- Documentation on fastai extensions to Path library
Path.cwd()
!mkdir -p ~/.kaggle
!cp /content/gdrive/MyDrive/Kaggle/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
from kaggle import api
path = Path.cwd()
path.ls()
path = path/"gdrive/MyDrive/Kaggle/homesite_data"
path.mkdir(exist_ok=True)
Path.BASE_PATH = path
api.competition_download_cli('homesite-quote-conversion', path=path)
file_extract(path/"homesite-quote-conversion.zip")
file_extract(path/"train.csv.zip")
file_extract(path/"test.csv.zip")
path
path.ls()
df_train = pd.read_csv(path/"train.csv", low_memory=False)
df_train.head()
df_test = pd.read_csv(path/"test.csv", low_memory=False)
df_test.head()
y_column = df_train.columns.difference(df_test.columns)
y_column
From this it looks like `QuoteConversion_Flag` is the value we want to predict. Let's take a look at it.
type(df_train.QuoteConversion_Flag)
df_train.QuoteConversion_Flag.unique()
type(df_train.QuoteConversion_Flag.unique()[0])
What's interesting here is that the outcome we want to predict looks like a simple binary classification: either there's a sale (1) or there's no sale (0). If we keep this field as an integer, plain rounding means predictions above 50% get rounded up to a sale and predictions below 50% get classified as no sale.
In reality, 50% might not be the best threshold to keep, and we may want to vary this confidence cut-off. If we convert the field to floating point, we have more control over the threshold, and thus the final prediction, so we could, say, only predict a sale when there's more than 90% confidence in the prediction.
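For example, here's a minimal sketch (with made-up probabilities, purely for illustration) of how a chosen threshold turns predicted probabilities into sale / no-sale labels:
import numpy as np
# Hypothetical predicted probabilities from some model (illustration only)
preds = np.array([0.12, 0.55, 0.93, 0.49, 0.87])
# Plain rounding is equivalent to a 0.5 threshold
print((preds >= 0.5).astype(int))        # -> [0 1 1 0 1]
# A stricter threshold only predicts a sale at >= 90% confidence
threshold = 0.9
print((preds >= threshold).astype(int))  # -> [0 0 1 0 0]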
df_train.QuoteConversion_Flag = df_train.QuoteConversion_Flag.astype(dtype='float64')
First things first
From my colleague Tim's work we already know:
- `QuoteNumber` is unique, so make it the index
- `Original_Quote_Date` should be made into a date
Additionally, we should make sure to apply any changes to data types to both train and test data so predictions don't fail later on
df_train = df_train.set_index('QuoteNumber')
df_test = df_test.set_index('QuoteNumber')
df_train['Original_Quote_Date'] = pd.to_datetime(df_train['Original_Quote_Date'])
df_test['Original_Quote_Date'] = pd.to_datetime(df_test['Original_Quote_Date'])
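Since every data type change needs to happen on both frames, one way to avoid copy-paste slips is to wrap the change in a small helper that is applied to both. This is just a sketch; apply_to_both is a made-up name, not something from fastai or pandas:
# Hypothetical helper: apply the same column transformation to train and test
def apply_to_both(column, func):
    df_train[column] = func(df_train[column])
    df_test[column] = func(df_test[column])
# The date conversion above could then be written as:
# apply_to_both('Original_Quote_Date', pd.to_datetime)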
# Lists to track how we plan to treat each column
categorical = L()
continuous = L()
notused = L()
Let's now see how many columns we are working with
df_train.columns
df_train.info()
298 columns is a lot; let's split these by type
df_train.dtypes.unique()
Let's take a look at the smallest subset, the floating point columns, to see if there's anything interesting about them
(df_train[df_train.columns[df_train.dtypes=='float64']].drop(columns=['QuoteConversion_Flag'])).head()
Referencing this article for some good tips on EDA (Exploratory Data Analysis)
(df_train[df_train.columns[df_train.dtypes=='float64']].drop(columns=['QuoteConversion_Flag'])).describe()
df_train['Field8'].value_counts()
df_train['Field8'].isna().sum()
df_train["Field8"].plot.hist(bins=200)
`Field8` looks like a good candidate for a continuous field. Let's add it to that list
continuous.append('Field8')
continuous
df_train["Field9"].value_counts()
df_train["Field9"].isna().sum()
df_train["Field9"].plot.hist(bins=50)
Even though it is a floating point field, it looks very categorical in nature rather than continuous. We could peek at the test dataset to confirm this
df_test[~df_test["Field9"].isin(df_train["Field9"].unique())].value_counts()
The hypothesis holds true, so we're going with the assumption that `Field9` is categorical, not continuous
categorical.append("Field9")
df_train["Field9"].unique()
df_train["Field11"].value_counts()
df_train["Field11"].isna().sum()
df_train["Field11"].plot.hist(bins=200)
Like `Field9`, this too looks more categorical than continuous in nature. We can support this hypothesis by again "peeking" to see whether any test data values don't conform to this theory
df_test[~df_test["Field11"].isin(df_train["Field11"].unique())].value_counts()
Once again, the hypothesis holds true, so let's add `Field11` to our list of categorical fields
categorical.append("Field11")
df_train["PersonalField84"].value_counts()
df_train["PersonalField84"].isna().sum()
df_train["PersonalField84"].plot.hist(bins=100)
This one also seems a bit categorical. But there are quite a few NA values. We have a few choices here.
- Given the heavy skewing towards one value, it would arguably be a reasonable choice to fill the NAs with the `mode` value rather than the `mean` value (see the sketch below).
- Given that a significant number of NAs are present, we could leave it out of our model entirely and revisit it later if we want to fine-tune our input dataset.
- For now, let's ignore it.
notused.append("PersonalField84")
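If we later change our minds about PersonalField84, a minimal sketch of the mode-fill option could look like this (remembering to apply the same fill to both frames); fastai's FillMissing tabular proc can also handle this kind of missing-value filling when we build the dataloaders later:
# Sketch only: fill NAs with the most common value, in both train and test
mode_value = df_train['PersonalField84'].mode()[0]
df_train['PersonalField84'] = df_train['PersonalField84'].fillna(mode_value)
df_test['PersonalField84'] = df_test['PersonalField84'].fillna(mode_value)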
df_train["PropertyField25"].value_counts()
df_train["PropertyField25"].isna().sum()
df_train["PropertyField25"].plot.hist(bins=100)
This also looks very much like a categorical field rather than a continuous one. Let's "peek" once more at our test data to reassure ourselves that we're not making a bad assumption here
df_test[~df_test["PropertyField25"].isin(df_train["PropertyField25"].unique())].value_counts()
Once again, the hypothesis holds true, so let's add `PropertyField25` to our list of categorical fields
categorical.append("PropertyField25")
`PropertyField29` looks like it might be a boolean. Let's test this
df_train['PropertyField29'].value_counts()
df_train['PropertyField29'].isna().sum()
This one also seems a bit categorical. But there are quite a few NA values. We have a few choices here.
- Given the heavy skewing towards one value, it would arguably be a reasonable choice to fill the NAs with the `mode` value rather than the `mean` value.
- Given that a significant number of NAs are present, we could leave it out of our model entirely and revisit it later if we want to fine-tune our input dataset.
- For now, let's ignore it.
notused.append("PropertyField29")
What if we decided instead that we wanted to include this data in our model, with assumed values filled in for the NAs?
How do we fill the NA values?
Should we use 0 or should we use 1?
Let's first look at whether it correlates with any other field that might give us insight into how to treat it
While we're at it, let's also have a look at the ranges in each of these columns, to help identify whether they're continuous or categorical in nature (reference; see the sketch below).
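As a rough sketch of that kind of check, we can summarise each numeric column's min/max alongside how many distinct values it takes; columns with only a handful of distinct values are more likely categorical. The cut-off of 20 below is an arbitrary choice for illustration:
# Rough heuristic: range and number of distinct values per numeric column
numeric_cols = df_train.select_dtypes(include='number')
summary = pd.DataFrame({
    'min': numeric_cols.min(),
    'max': numeric_cols.max(),
    'n_unique': numeric_cols.nunique(),
}).sort_values('n_unique')
print(summary.head(10))                             # most categorical-looking columns
print(summary[summary['n_unique'] > 20].head(10))   # candidates for continuous columns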
correlations = df_train.corr()['PropertyField29'][:-1].sort_values(kind="quicksort")
print(correlations)
df_train[['PropertyField29','Field7']].head()
df_train[['PropertyField29','Field7']].plot.scatter(x='Field7',
                                                    y='PropertyField29',
                                                    c='DarkBlue')
categorical, continuous, notused
df_train[df_train.columns[df_train.dtypes=='int64']].head()
df_train[df_train.columns[df_train.dtypes=='object']].head()
There are a lot of fields; let's make sure those that look like they should be boolean columns are set as boolean data types
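A sketch of one way to do that, assuming the boolean-looking columns hold 'Y'/'N' strings (the exact values would need checking against the data first):
# Map object columns that only contain 'Y'/'N' to booleans, in both train and test
for col in df_train.columns[df_train.dtypes == 'object']:
    values = set(df_train[col].dropna().unique())
    if values <= {'Y', 'N'}:
        df_train[col] = df_train[col].map({'Y': True, 'N': False})
        df_test[col] = df_test[col].map({'Y': True, 'N': False})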
Using this article about correlation, let's see if there are any obvious correlations in our data columns
correlations = df_train.corr()
print(correlations)
We need to filter this a little better. Borrowing from a StackOverflow article to get the highest correlated values
df_train.shape
c = correlations.abs()
s = c.unstack()
so = s.sort_values(kind="quicksort")
# the sort is ascending, so the strongest correlations are at the end
print(so[-10:])
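The end of that sorted Series is dominated by each column's correlation with itself (always 1.0), and every pair appears twice as (A, B) and (B, A). A sketch of one way to filter those out before looking at the strongest pairs:
# Keep each pair once and drop self-correlations before looking at the top values
pairs = so[so.index.get_level_values(0) < so.index.get_level_values(1)]
print(pairs.sort_values(ascending=False)[0:10])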
Borrowing from the seaborn documentation to make this a little more visual
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(style="white")
# Generate a mask for the upper triangle of the correlation matrix
mask = np.triu(np.ones_like(correlations, dtype=bool))
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(correlations, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})