Python EDA Libraries for Large Datasets (Google Colab)
This notebook applies four Python EDA packages (Pandas Profiling, Sweetviz, D-Tale, and AutoViz) to the Kaggle Homesite Quote Conversion dataset (~200k rows and 300 columns).
Store your data in a Google Drive folder, then mount the drive so that Colab can read from it
from google.colab import drive
drive.mount('/content/drive')
Import relevant libraries
import pandas as pd
import numpy as np
from pathlib import Path
Set the path to the folder where you store your data
path = Path('/content/drive/MyDrive/Kaggle/data/homesite-quote')
path.mkdir(parents=True, exist_ok=True)
path
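Because `mkdir(parents=True, exist_ok=True)` silently creates the folder when the path is mistyped, a wrong path only shows up later as a `FileNotFoundError` from `read_csv`. A minimal sketch of an early check follows; `missing_files` is a hypothetical helper, not part of the original notebook, and it uses only `pathlib`.

```python
from pathlib import Path

def missing_files(folder, names):
    """Return the entries of `names` that are not regular files inside `folder`."""
    folder = Path(folder)
    return [name for name in names if not (folder / name).is_file()]

# Hypothetical usage with the folder and file names used in this notebook:
# missing_files(path, ['train_df.csv', 'test.csv'])
```

An empty list means both files are where the notebook expects them.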
Read the training and test data into dataframes
df = pd.read_csv(path/'train_df.csv', low_memory=False)
test_df = pd.read_csv(path/'test.csv', low_memory=False)
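Before running any EDA library on a frame of this size, it can help to confirm that the load produced the expected shape and to see how much memory it uses. The sketch below is one possible check, not part of the original notebook: `quick_overview` is a hypothetical helper, and `demo` is a tiny synthetic frame that stands in for `df` so the block runs on its own.

```python
import numpy as np
import pandas as pd

def quick_overview(frame):
    """Summarise shape, memory use, dtype mix and missing-value columns."""
    n_rows, n_cols = frame.shape
    return {
        "rows": n_rows,
        "columns": n_cols,
        # deep=True also counts the memory held by string/object values
        "memory_mb": round(frame.memory_usage(deep=True).sum() / 1024**2, 2),
        "dtype_counts": frame.dtypes.astype(str).value_counts().to_dict(),
        "columns_with_missing": int(frame.isna().any().sum()),
    }

# Synthetic stand-in for the real data
demo = pd.DataFrame({"a": [1, 2, 3], "b": [0.5, np.nan, 1.5], "c": ["x", "y", None]})
overview = quick_overview(demo)
print(overview)
```

On the real data, `quick_overview(df)` would be the equivalent call.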
EDA
There are four EDA libraries that we are exploring today. Each has its own advantages and disadvantages. This notebook draws on https://towardsdatascience.com/4-libraries-that-can-perform-eda-in-one-line-of-python-code-b13938a06ae
Pandas Profiling
Please refer to https://github.com/pandas-profiling/pandas-profiling for full instructions and examples
Run the code below to install Pandas Profiling straight from GitHub
! pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
Import the library
from pandas_profiling import ProfileReport
Create the report. For this report, I set minimal=True to disable expensive computations and save runtime. This mode gives only a basic analysis with no correlation matrix (with more than 300 columns, the correlation matrix is too big to display). I tried running the report in full mode, and it did not finish loading in a reasonable time.
profile = ProfileReport(df, title="Homesite Quote Conversion", minimal=True)
Run this cell in Google Colab to display the report
%%time
profile.to_notebook_iframe()
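If the correlation analysis is still wanted, one option is to profile a reduced frame instead of the whole dataset. With 300 columns there are 300 × 299 / 2 = 44,850 distinct column pairs, so cutting columns reduces the work much faster than cutting rows. The sketch below assumes a hypothetical helper, `reduce_for_profiling`, and a synthetic frame in place of `df`; neither comes from the original notebook.

```python
import numpy as np
import pandas as pd

def reduce_for_profiling(frame, max_rows, columns=None, seed=0):
    """Return a smaller frame: optional column subset, then a random row sample."""
    if columns is not None:
        frame = frame[list(columns)]
    if len(frame) > max_rows:
        # random_state makes the sample reproducible between runs
        frame = frame.sample(n=max_rows, random_state=seed)
    return frame

# Synthetic stand-in for the real data: 1,000 rows and 5 numeric columns
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(1000, 5)), columns=list("abcde"))
small = reduce_for_profiling(demo, max_rows=100, columns=["a", "b"])

# Number of distinct column pairs a correlation matrix has to cover
n_cols = 300
n_pairs = n_cols * (n_cols - 1) // 2
```

With the real data this might be used as `ProfileReport(reduce_for_profiling(df, max_rows=20_000), title="Homesite Quote Conversion")` without `minimal=True`. The row limit here is arbitrary, and whether the full report then finishes in acceptable time on Colab has not been tested.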