Getting Data From Kaggle
A tutorial of fastpages for Jupyter notebooks.
This is a riff on my learning team blog post about getting specific data from Kaggle. Refer to it for the reasoning behind the setup steps below, which install the Kaggle CLI and fastai. Good links to reference:
!pip install kaggle
!mamba install -c fastchan fastai -y
Let's say I want to find some getting-started competitions. The Kaggle CLI lets you do keyword searches:
!kaggle competitions list -s "Getting Started"
I know there's a Titanic competition, but the keyword search doesn't find it, so let's try the category filter, first with the display name and then with the identifier gettingStarted.
!kaggle competitions list --category "Getting Started"
!kaggle competitions list --category gettingStarted
There it is! This is where the fastai helper functions come in handy. I'm going to work with tabular data, so I'll import them from the tabular library.
from fastai.tabular.all import *
Path.cwd()
Now I need to put this data somewhere it won't get checked in when I commit my changes to GitHub. I want to create a folder named _data and add an entry for it to my .gitignore, so that the dataset I download into it is never committed.
!touch .gitignore
!echo "_data" >> .gitignore
!head .gitignore
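Re-running the cell above adds the entry again each time, so here is a minimal sketch of an idempotent alternative in plain Python. The helper name add_gitignore_entry is hypothetical (it is not part of fastai or the Kaggle CLI), and it assumes a simple .gitignore with one pattern per line.

```python
from pathlib import Path

def add_gitignore_entry(entry, gitignore=".gitignore"):
    """Append `entry` to the .gitignore file unless it is already listed.

    Hypothetical helper for illustration. Returns True if the entry was added.
    """
    gitignore = Path(gitignore)
    text = gitignore.read_text() if gitignore.exists() else ""
    # Skip the write if the exact pattern is already present on its own line.
    if entry in (line.strip() for line in text.splitlines()):
        return False
    # Start on a fresh line if the existing file does not end with a newline.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    with gitignore.open("a") as f:
        f.write(f"{prefix}{entry}\n")
    return True
```

Calling add_gitignore_entry("_data") from the repository root should leave any existing entries untouched and add _data at most once.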
!mkdir _data
os.chdir('_data')
Path.cwd()
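The mkdir cell above fails if _data already exists, and os.chdir changes the working directory for every later cell. A sketch of an alternative, assuming you would rather keep the working directory fixed and pass the data path around explicitly; ensure_data_dir is a hypothetical name used only for this example.

```python
from pathlib import Path

def ensure_data_dir(root=".", name="_data"):
    """Create root/name if it is missing and return it as a Path.

    Hypothetical helper for illustration; safe to call repeatedly.
    """
    data_dir = Path(root) / name
    # exist_ok=True makes re-running the notebook cell harmless.
    data_dir.mkdir(parents=True, exist_ok=True)
    return data_dir
```

With this approach the download command can be pointed at the folder through its -p option, for example kaggle competitions download -c titanic -p _data, instead of relying on the current directory.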
Ok, looks like I'm ready to download that Titanic dataset
!kaggle competitions download -c titanic
Ok, this was a head scratcher, but it turns out that before I can download a competition's data, I need to go to the competition page, join the competition, and accept the rules. The documentation says this, and I should have read it. The usual format for the rules URL is https://www.kaggle.com/c/<competition-name>/rules
!kaggle competitions download -c titanic
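The rules URL pattern mentioned above is easy to fill in by hand, but a tiny sketch makes it concrete. The function name rules_url is hypothetical, and the pattern is just the "usual format" quoted above, so Kaggle may redirect or change it.

```python
def rules_url(competition):
    """Build the rules page URL for a competition slug (hypothetical helper)."""
    # Follows the usual pattern https://www.kaggle.com/c/<competition-name>/rules
    return f"https://www.kaggle.com/c/{competition}/rules"

print(rules_url("titanic"))
```

Opening the printed URL in a browser is where the rules get accepted; the sketch only builds the link and does not join the competition for you.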
Now let's verify the file is there, and extract the data
path = Path.cwd()
path.ls()
file_extract('titanic.zip')
path.ls()
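If fastai is not installed, the same extraction step can be sketched with the standard library. This is an alternative to the file_extract call above, not what that helper does internally; extract_zip is a hypothetical name, and it assumes the archive is an ordinary zip file like the one the Kaggle CLI downloads.

```python
import zipfile
from pathlib import Path

def extract_zip(archive, dest=None):
    """Extract a zip archive and return the sorted names of its members.

    Hypothetical helper for illustration. By default the files are
    extracted next to the archive itself.
    """
    archive = Path(archive)
    dest = Path(dest) if dest is not None else archive.parent
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
        return sorted(zf.namelist())
```

Calling extract_zip("titanic.zip") from inside _data should unpack the competition files into that folder and return their names.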
Using the Kaggle API
After writing the section above, I read a chapter of the book showing that the Kaggle API is also available programmatically, so I can use it instead of the command line to download the data. Since I already have the tabular libraries loaded, I can also load the data and take a first look at the top rows.
from kaggle import api
api.competitions_list(category='gettingStarted')
api.competition_download_cli('titanic', path=path)
file_extract("titanic.zip")
df = pd.read_csv(path/'train.csv', skipinitialspace=True)
df.head()