Identify problem type
This notebook runs in Google Colab. It downloads Kaggle's homesite-quote-conversion data, cleans up dates and integers stored as strings, and compares the quote dates in the training and test data to decide whether this should be treated as a time series problem.
Install packages recommended in fastbook Ch09
!pip install -Uqq fastbook kaggle waterfallcharts treeinterpreter dtreeviz
import fastbook
fastbook.setup_book()
from fastbook import *
from fastai.vision.widgets import *
from pandas.api.types import is_string_dtype, is_numeric_dtype, is_categorical_dtype
from fastai.tabular.all import *
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from dtreeviz.trees import *
from IPython.display import Image, display_svg, SVG
pd.options.display.max_rows = 20
pd.options.display.max_columns = 8
Upload your kaggle.json API key
btn_upload = widgets.FileUpload(description="kaggle.json")
btn_upload
Save credentials
cred_path = Path('~/.kaggle/kaggle.json').expanduser()
if not cred_path.parent.exists():
    cred_path.parent.mkdir()
if len(btn_upload.data) > 0:
    with open(cred_path, mode="wb") as cred_file:
        cred_file.write(btn_upload.data[-1])
    cred_path.chmod(0o600)
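As an optional sanity check (not part of the fastbook recipe), confirm the credentials file was written and has the owner-only permissions the kaggle package expects:
# Optional check: the file should exist with mode 0o600
if cred_path.exists():
    print("saved, mode:", oct(cred_path.stat().st_mode & 0o777))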
from kaggle import api
Note that '!pip install kaggle' does not update the kaggle CLI in Google Colab: the CLI stays at v1.5.4 while the Python kaggle.api package is v1.5.12
!kaggle --version
Python's kaggle.api is using a more recent version
api.__version__
Get the data from Kaggle, extract it, and store it in _data
path_hqc = (Path.cwd()/"_data")
path_hqc.mkdir(exist_ok=True)
Path.BASE_PATH = path_hqc
api.competition_download_cli('homesite-quote-conversion', path=path_hqc)
file_extract(path_hqc/"homesite-quote-conversion.zip")
file_extract(path_hqc/"train.csv.zip")
file_extract(path_hqc/"test.csv.zip")
Check what the data looks like
df = pd.read_csv(path_hqc/"train.csv", low_memory=False)
df.head()
Check how much data we have and whether QuoteNumber is unique
df.shape, len(df['QuoteNumber'].unique())
Conclusion: QuoteNumber is unique
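An equivalent check is pandas' built-in is_unique property on the column:
# True means every QuoteNumber appears exactly once
df['QuoteNumber'].is_unique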
We don't want to use QuoteNumber as a feature, but we could use it as the index
df = df.set_index('QuoteNumber')
Examine data types in train.csv
df.info()
Find the 28 fields which do not have numeric datatypes
from collections import defaultdict
dct_fields_by_dtype = defaultdict(list)
for i, dt in enumerate(df.dtypes):
    dct_fields_by_dtype[dt].append(df.dtypes.index[i])
print("dtypes in train.csv:", dct_fields_by_dtype.keys())
print("fields for object dtype:", dct_fields_by_dtype[np.dtype('O')])
print("number of fields of object dtype:", len(dct_fields_by_dtype[np.dtype('O')]))
Original_Quote_Date can be converted to datetime
df['Original_Quote_Date'] = pd.to_datetime(df['Original_Quote_Date'])
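If we later want calendar features from the quote date, fastai's add_datepart (already imported via fastai.tabular.all) expands a datetime column into parts such as year, month and day of week. A sketch on a copy of the frame, since add_datepart drops the source column:
# Sketch only: expand Original_Quote_Date into Original_Quote_Year, ..._Month, etc.
df_dates = add_datepart(df.copy(), 'Original_Quote_Date')
[c for c in df_dates.columns if c.startswith('Original_Quote_')]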
Recalculate breakdown now that we have changed dtype of Original_Quote_Date
dct_fields_by_dtype = defaultdict(list)
for i, dt in enumerate(df.dtypes):
    dct_fields_by_dtype[dt].append(df.dtypes.index[i])
df.info()
Compare Original_Quote_Date in train.csv and test.csv
df_test = pd.read_csv(path_hqc/"test.csv", low_memory=False)
df_test["Original_Quote_Date"] = pd.to_datetime(df_test["Original_Quote_Date"])
print("train.csv", df['Original_Quote_Date'].min(), df['Original_Quote_Date'].max(), df.shape)
print("test.csv ", df_test['Original_Quote_Date'].min(), df_test['Original_Quote_Date'].max(), df_test.shape)
Conclusion: the date ranges overlap (in fact they are identical), so we don't need to treat this as a time series problem
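A quick way to quantify the overlap is to count how many test quotes fall inside the training date range:
# Fraction of test rows whose quote date lies within the train date range
in_range = df_test['Original_Quote_Date'].between(
    df['Original_Quote_Date'].min(), df['Original_Quote_Date'].max())
in_range.mean()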
Check the non-numeric values in other object fields
for col in dct_fields_by_dtype[np.dtype('O')]:
    print(f"{col:20s} {df[col].unique()}")
Field10 looks like integers stored as strings, so convert it to ints
df['Field10'] = df['Field10'].str.replace(",", "").astype(int)
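The same cleanup will presumably be needed on the test set (assuming Field10 holds the same comma-formatted strings there), so train and test stay consistent:
# Apply the identical transformation to test.csv
df_test['Field10'] = df_test['Field10'].str.replace(",", "").astype(int)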
Recalculate breakdown now that we have changed dtype of Field10
dct_fields_by_dtype = defaultdict(list)
for i, dt in enumerate(df.dtypes):
    dct_fields_by_dtype[dt].append(df.dtypes.index[i])
df.info()