Download Data

Store your data in a Google Drive folder, then mount the drive to connect your notebook to Google Drive

from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Import relevant libraries

import pandas as pd
import numpy as np
from pathlib import Path

Set the path to the folder where your data is stored

path = Path('/content/drive/MyDrive/Kaggle/data/homesite-quote')
path.mkdir(parents=True, exist_ok=True)
path
PosixPath('/content/drive/MyDrive/Kaggle/data/homesite-quote')

Import data and store it as a dataframe

df = pd.read_csv(path/'train_df.csv', low_memory=False)
test_df = pd.read_csv(path/'test.csv', low_memory=False)

EDA

There are four EDA libraries that we are exploring today. Each has its own advantages and disadvantages. This page references https://towardsdatascience.com/4-libraries-that-can-perform-eda-in-one-line-of-python-code-b13938a06ae

Pandas Profiling

Please refer to https://github.com/pandas-profiling/pandas-profiling for the full instructions and examples

Run the code below to install Pandas Profiling straight from GitHub

! pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip

Import the library

from pandas_profiling import ProfileReport

Create the report. For this report, I set minimal=True to disable expensive computations and save runtime. This mode only gives you a basic analysis with no correlation matrix (because we have more than 300 columns, the correlation matrix would be too big to display). I tried running the full mode and it took forever to load.

profile = ProfileReport(df, title="Homesite Quote Conversion", minimal=True)

Run this code in Google Colab to show the report

%%time
profile.to_notebook_iframe()
CPU times: user 2min 38s, sys: 1min 2s, total: 3min 41s
Wall time: 2min 32s

Save an output file. This creates an HTML page in your Google Colab temporary folder. Download it or move it to Google Drive if you want to keep it

profile.to_file(output_file="Homesite_Quote_EDA_Data_Profilling.html")

A sidenote for people working on this dataset: looking at the descriptive analysis, it seems that many columns contain values from 1 to 25. These fields might have been encoded and transformed from categorical fields, but we don't know whether the codes are ordered by level or not.

picture
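
To make the sidenote above concrete, here is a small pandas sketch that flags numeric columns whose values all fall in the 1-25 range. It uses a synthetic stand-in frame (the real Homesite columns differ), and `likely_encoded_columns` is a made-up helper name, not part of any library.

```python
import pandas as pd

def likely_encoded_columns(frame, low=1, high=25):
    """Return numeric columns whose non-null values are all integers in [low, high]."""
    cols = []
    for col in frame.select_dtypes(include="number").columns:
        vals = frame[col].dropna()
        if len(vals) and (vals % 1 == 0).all() and vals.between(low, high).all():
            cols.append(col)
    return cols

# Synthetic stand-in for the Homesite data
demo = pd.DataFrame({
    "CoverageField1": [1, 5, 25, 13],    # looks encoded: integers in 1-25
    "SalesField8": [0.2, 3.7, 9.1, 4.4]  # continuous, so not flagged
})
print(likely_encoded_columns(demo))  # ['CoverageField1']
```

On the real frame, running this over `df` would give a quick list of candidate encoded fields to investigate further.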

We can also check the Warnings tab to see the issues each column might have

picture

DTale

D-Tale is the combination of a Flask back-end and a React front-end that brings you an easy way to view and analyze Pandas data structures. Please refer to https://pypi.org/project/dtale/ for the full instructions and examples

Install the library in Google Colab

!pip install -U dtale

Set up a server for the app. You can use either USE_NGROK or USE_COLAB

import dtale
import dtale.app as dtale_app

#dtale_app.USE_NGROK=True

dtale_app.USE_COLAB = True

Show the report

%%time
dtale.show(df)
https://vq30jy6l36k-496ff2e9c6d22116-40000-colab.googleusercontent.com/dtale/main/1
CPU times: user 10.5 s, sys: 1.6 s, total: 12.1 s
Wall time: 12.7 s

D-Tale is a very interactive app and contains rich information and many different charts.

picture

picture

This is also the only report that generated a correlation matrix in a very short time (12.6 s). The layout is also flexible enough to show all the values of the correlation matrix. We can see that fields with similar names (for example, CoverageField1, 2, 3, etc.) have quite high Pearson correlation scores

picture
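
The high correlation between similarly named fields can be reproduced outside D-Tale with plain pandas. The sketch below builds synthetic `CoverageField`-style columns from a shared component (the column names are illustrative, not the real dataset's) and computes their Pearson correlation matrix.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)

# Two columns that share a common component, mimicking the pattern D-Tale surfaced,
# plus one unrelated column
demo = pd.DataFrame({
    "CoverageField1": base + rng.normal(scale=0.1, size=500),
    "CoverageField2": base + rng.normal(scale=0.1, size=500),
    "SalesField1": rng.normal(size=500),
})

# Restrict to the same-prefix fields and compute pairwise Pearson correlation
coverage = demo.filter(like="CoverageField")
corr = coverage.corr(method="pearson")
print(corr.round(2))
```

On the real frame, `df.filter(like="CoverageField").corr()` would give the same view for just that family of columns without rendering the full 300x300 matrix.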

Sweetviz

Sweetviz is an open-source Python library that generates beautiful, high-density visualizations to kickstart EDA (Exploratory Data Analysis) with just two lines of code. The output is a fully self-contained HTML application. Please refer to https://pypi.org/project/sweetviz/ for the full instructions and examples. An additional feature of Sweetviz is that it allows users to compare two datasets (for example, a train set and a test set)

Install the library

!pip install sweetviz

Re-import pandas and numpy, and import sweetviz. The layout of the app might be affected if you don't re-import pandas and numpy

import pandas as pd
import numpy as np
import sweetviz as sv

When we run the code, Sweetviz also gives us clear instructions on how to handle errors. The three columns below caused some issues, so I handled them before generating the report

df['PropertyField29'] = df['PropertyField29'].fillna(-1)
test_df['PropertyField29'] = test_df['PropertyField29'].fillna(-1)
df['PersonalField84'] = df['PersonalField84'].fillna(-1)
test_df['PersonalField84'] = test_df['PersonalField84'].fillna(-1)
df['PropertyField37'] = df['PropertyField37'].astype('bool')
test_df['PropertyField37'] = test_df['PropertyField37'].astype('bool')
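
If more columns turn out to need the same treatment, the repeated lines above can be folded into a small helper applied to both frames. This is just a sketch with tiny stand-in frames; `clean_for_sweetviz` is a made-up name, not a Sweetviz API.

```python
import numpy as np
import pandas as pd

def clean_for_sweetviz(frame):
    """Apply the same per-column fixes in place to one frame (hypothetical helper)."""
    for col in ("PropertyField29", "PersonalField84"):
        frame[col] = frame[col].fillna(-1)
    frame["PropertyField37"] = frame["PropertyField37"].astype("bool")
    return frame

# Tiny stand-in frames; the real train/test frames share these column names
train = pd.DataFrame({"PropertyField29": [1.0, np.nan],
                      "PersonalField84": [np.nan, 2.0],
                      "PropertyField37": ["Y", "N"]})
test = pd.DataFrame({"PropertyField29": [np.nan, 3.0],
                     "PersonalField84": [4.0, np.nan],
                     "PropertyField37": ["N", "Y"]})

for frame in (train, test):
    clean_for_sweetviz(frame)

print(train["PropertyField29"].tolist())  # [1.0, -1.0]
```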

Generate the report. We give a name to each dataset (optional) and specify a target feature (also optional). Specifying a target feature is extremely valuable, as it shows how QuoteConversion_Flag is affected by each variable

%%time
comparison_report = sv.compare([df,'Train'], [test_df,'Test'], target_feat='QuoteConversion_Flag',pairwise_analysis='off')
CPU times: user 3min 44s, sys: 37.7 s, total: 4min 22s
Wall time: 3min 57s

Show the report. It can be output as a standalone HTML file or embedded in this notebook. For notebooks, we can specify the width and height of the window, as well as the scaling of the report itself. This line of code uses the default values (w="100%", h=750, layout="vertical")

comparison_report.show_notebook()

Save an output file. This creates an HTML page in your Google Colab temporary folder. Download it or move it to Google Drive if you want to keep it

comparison_report.show_html(filepath='SWEETVIZ_REPORT_Full.html', 
            open_browser=True, 
            layout='widescreen')
Report SWEETVIZ_REPORT_Full.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.

A sidenote for people working on this dataset: looking at the descriptive analysis comparing the train set and the test set, they are strikingly similar. The distribution of values is almost identical for most columns, which may mean the test set closely resembles the train set

picture
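
One way to quantify "strikingly similar" is to compare per-column value distributions directly in pandas, for example with total variation distance (0 means identical distributions, 1 means disjoint). The sketch below uses two tiny synthetic series; on the real data you would loop over the columns shared by `df` and `test_df`.

```python
import pandas as pd

# Synthetic stand-ins for one shared column of the train and test sets
train = pd.Series([1, 1, 2, 2, 3, 3, 3, 3])
test = pd.Series([1, 2, 2, 3, 3, 3])

train_dist = train.value_counts(normalize=True).sort_index()
test_dist = test.value_counts(normalize=True).sort_index()

# Total variation distance: half the summed absolute difference in proportions
tvd = train_dist.subtract(test_dist, fill_value=0).abs().sum() / 2
print(round(tvd, 3))  # 0.083
```

A small TVD across most columns would back up the visual impression from the Sweetviz comparison report.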

Autoviz

Automatically Visualize any dataset, any size with a single line of code. https://autoviz.io/

Install library

!pip install autoviz

Import

from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()

Download the dataset to your Google Colab temporary folder, because AutoViz only seems able to load datasets from the current working directory

!wget https://media.githubusercontent.com/media/redditech/team-fast-tabulous/master/dataset/train_df.csv

Run the report. This library doesn't seem to support large datasets with many variables; it threw some error messages after a few minutes

%%time
df_Au = AV.AutoViz('train_df.csv')
Shape of your Data Set: (208602, 299)
############## C L A S S I F Y I N G  V A R I A B L E S  ####################
Classifying variables in data set...
    Number of Numeric Columns =  5
    Number of Integer-Categorical Columns =  242
    Number of String-Categorical Columns =  12
    Number of Factor-Categorical Columns =  0
    Number of String-Boolean Columns =  11
    Number of Numeric-Boolean Columns =  21
    Number of Discrete String Columns =  5
    Number of NLP String Columns =  0
    Number of Date Time Columns =  0
    Number of ID Columns =  1
    Number of Columns to Delete =  2
    299 Predictors classified...
        This does not include the Target column(s)
        8 variables removed since they were ID or low-information variables
Since Number of Rows in data 208602 exceeds maximum, randomly sampling 150000 rows for EDA...
5 numeric variables in data exceeds limit, taking top 30 variables
Number of All Scatter Plots = 15
Could not draw Violin Plot
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
[... matplotlib rendering frames truncated ...]
ValueError: Image size of 1440x256680 pixels is too large. It must be less than 2^16 in each direction.
<Figure size 1440x256680 with 1425 Axes>
Time to run AutoViz (in seconds) = 496.620

 ###################### VISUALIZATION Completed ########################
CPU times: user 7min 55s, sys: 26.4 s, total: 8min 21s
Wall time: 8min 16s

Conclusion

All of these EDA libraries really help speed up your EDA process, especially when you are a beginner and don't want to spend too much time on data visualisation and analysis.

In my opinion, D-Tale has the best performance on large datasets, as it can produce detailed, complex data analyses in the shortest time without breaking. It is also quite interactive and lets you export code.

Below is the runtime of each library:

  • Pandas Profiling (minimal version): 2mins 32s
  • Dtale: 12.7 s
  • Sweetviz (comparing dataset): 3min 57s
  • Autoviz: more than 8 mins (doesn't seem to work with large datasets)