Introduction

Here I play around with the Kaggle CLI to find an interesting competition and load its data.

This is a riff on my learning team blog post about getting specific data from Kaggle. Refer to that post for the why behind the setup steps below, which install the Kaggle CLI and fastai.

!pip install kaggle
Requirement already satisfied: kaggle in /usr/local/anaconda3/envs/fastai/lib/python3.8/site-packages (1.5.12)
Requirement already satisfied: python-dateutil in /usr/local/anaconda3/envs/fastai/lib/python3.8/site-packages (from kaggle) (2.8.1)
Requirement already satisfied: six>=1.10 in /usr/local/anaconda3/envs/fastai/lib/python3.8/site-packages (from kaggle) (1.16.0)
Requirement already satisfied: python-slugify in /usr/local/anaconda3/envs/fastai/lib/python3.8/site-packages (from kaggle) (5.0.2)
Requirement already satisfied: requests in /usr/local/anaconda3/envs/fastai/lib/python3.8/site-packages (from kaggle) (2.25.1)
Requirement already satisfied: urllib3 in /usr/local/anaconda3/envs/fastai/lib/python3.8/site-packages (from kaggle) (1.26.4)
Requirement already satisfied: tqdm in /usr/local/anaconda3/envs/fastai/lib/python3.8/site-packages (from kaggle) (4.59.0)
Requirement already satisfied: certifi in /usr/local/anaconda3/envs/fastai/lib/python3.8/site-packages (from kaggle) (2021.5.30)
Requirement already satisfied: text-unidecode>=1.3 in /usr/local/anaconda3/envs/fastai/lib/python3.8/site-packages (from python-slugify->kaggle) (1.3)
Requirement already satisfied: chardet<5,>=3.0.2 in /usr/local/anaconda3/envs/fastai/lib/python3.8/site-packages (from requests->kaggle) (4.0.0)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/anaconda3/envs/fastai/lib/python3.8/site-packages (from requests->kaggle) (2.10)
!mamba install -c fastchan fastai -y
                  __    __    __    __
                 /  \  /  \  /  \  /  \
                /    \/    \/    \/    \
███████████████/  /██/  /██/  /██/  /████████████████████████
              /  / \   / \   / \   / \  \____
             /  /   \_/   \_/   \_/   \    o \__,
            / _/                       \_____/  `
            |/
        ███╗   ███╗ █████╗ ███╗   ███╗██████╗  █████╗
        ████╗ ████║██╔══██╗████╗ ████║██╔══██╗██╔══██╗
        ██╔████╔██║███████║██╔████╔██║██████╔╝███████║
        ██║╚██╔╝██║██╔══██║██║╚██╔╝██║██╔══██╗██╔══██║
        ██║ ╚═╝ ██║██║  ██║██║ ╚═╝ ██║██████╔╝██║  ██║
        ╚═╝     ╚═╝╚═╝  ╚═╝╚═╝     ╚═╝╚═════╝ ╚═╝  ╚═╝

        mamba (0.13.0) supported by @QuantStack

        GitHub:  https://github.com/mamba-org/mamba
        Twitter: https://twitter.com/QuantStack

█████████████████████████████████████████████████████████████


Looking for: ['fastai']

pkgs/r/osx-64            [>                   ] (--:--) No change
pkgs/r/osx-64            [====================] (00m:00s) No change
pkgs/main/osx-64         [=>                  ] (--:--) No change
pkgs/main/osx-64         [====================] (00m:00s) No change
pkgs/main/noarch         [=>                  ] (--:--) No change
pkgs/main/noarch         [====================] (00m:00s) No change
pkgs/r/noarch            [>                   ] (--:--) No change
pkgs/r/noarch            [====================] (00m:00s) No change
fastchan/osx-64          [=>                  ] (--:--) No change
fastchan/osx-64          [====================] (00m:00s) No change
fastchan/noarch          [=>                  ] (--:--) No change
fastchan/noarch          [====================] (00m:00s) No change
Transaction

  Prefix: /usr/local/anaconda3/envs/fastai

  All requested packages already installed

Let's say I want to find some getting-started competitions. The Kaggle CLI lets you do keyword searches with the -s flag.

!kaggle competitions list -s "Getting Started"
ref                                     deadline             category            reward  teamCount  userHasEntered  
--------------------------------------  -------------------  ---------------  ---------  ---------  --------------  
tpu-getting-started                     2030-06-03 23:59:00  Getting Started  Knowledge        936           False  
nlp-getting-started                     2030-01-01 00:00:00  Getting Started  Knowledge       3171           False  
gan-getting-started                     2030-07-01 23:59:00  Getting Started     Prizes        321           False  
acm-sf-chapter-hackathon-small          2012-09-30 01:00:00  Research              $600         96           False  
getting-started                         2012-02-26 00:00:00  Featured           $10,000          0           False  
street-view-getting-started-with-julia  2017-01-07 00:00:00  Getting Started  Knowledge         56           False  
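If I wanted to filter a listing like this in Python rather than by eye, the CLI can also print CSV instead of a table (the `-v`/`--csv` flag). Here is a minimal sketch, assuming the CSV header uses the same column names as the table above. The `sample` string is just three rows from the listing retyped as CSV, and `open_competitions` is a hypothetical helper name of mine, not part of the Kaggle package.

```python
import csv
import io
from datetime import datetime

# Three rows from the listing above, retyped as CSV (column names assumed
# to match the table header printed by the CLI).
sample = """ref,deadline,category,reward,teamCount,userHasEntered
tpu-getting-started,2030-06-03 23:59:00,Getting Started,Knowledge,936,False
nlp-getting-started,2030-01-01 00:00:00,Getting Started,Knowledge,3171,False
street-view-getting-started-with-julia,2017-01-07 00:00:00,Getting Started,Knowledge,56,False
"""

def open_competitions(csv_text, now):
    """Return the refs of competitions whose deadline is later than `now`."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        row["ref"]
        for row in rows
        if datetime.strptime(row["deadline"], "%Y-%m-%d %H:%M:%S") > now
    ]

still_open = open_competitions(sample, now=datetime(2021, 7, 1))
print(still_open)
```

In a notebook the text could come from capturing the output of `!kaggle competitions list -s "Getting Started" --csv` instead of the hard-coded sample.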

I know there's a Titanic dataset, but the keyword search doesn't find it, so let's try filtering by category instead.

!kaggle competitions list --category "Getting Started"
Invalid category specified. Valid options are ['all', 'featured', 'research', 'recruitment', 'gettingStarted', 'masters', 'playground']
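The error message spells out the accepted values, and they are camelCase tokens rather than the display names shown in the category column. A small sketch of mapping one to the other follows. The list is copied from the error message above, and `to_cli_category` is a hypothetical helper, not something the Kaggle package provides.

```python
# Valid values copied from the CLI error message above.
VALID_CATEGORIES = ['all', 'featured', 'research', 'recruitment',
                    'gettingStarted', 'masters', 'playground']

def to_cli_category(label):
    """Map a display label such as 'Getting Started' to its CLI value."""
    wanted = label.replace(" ", "").lower()
    for value in VALID_CATEGORIES:
        if value.lower() == wanted:
            return value
    raise ValueError(
        f"Unknown category {label!r}; expected one of {VALID_CATEGORIES}"
    )

category = to_cli_category("Getting Started")
print(category)
```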
!kaggle competitions list --category gettingStarted
ref                                          deadline             category            reward  teamCount  userHasEntered  
-------------------------------------------  -------------------  ---------------  ---------  ---------  --------------  
contradictory-my-dear-watson                 2030-07-01 23:59:00  Getting Started     Prizes        188           False  
gan-getting-started                          2030-07-01 23:59:00  Getting Started     Prizes        321           False  
tpu-getting-started                          2030-06-03 23:59:00  Getting Started  Knowledge        936           False  
digit-recognizer                             2030-01-01 00:00:00  Getting Started  Knowledge       5879            True  
titanic                                      2030-01-01 00:00:00  Getting Started  Knowledge      48369           False  
house-prices-advanced-regression-techniques  2030-01-01 00:00:00  Getting Started  Knowledge      12623            True  
connectx                                     2030-01-01 00:00:00  Getting Started  Knowledge        951           False  
nlp-getting-started                          2030-01-01 00:00:00  Getting Started  Knowledge       3171           False  
facial-keypoints-detection                   2017-01-07 00:00:00  Getting Started  Knowledge        175           False  
street-view-getting-started-with-julia       2017-01-07 00:00:00  Getting Started  Knowledge         56           False  
word2vec-nlp-tutorial                        2015-06-30 23:59:00  Getting Started  Knowledge        577           False  
data-science-london-scikit-learn             2014-12-31 23:59:00  Getting Started  Knowledge        190           False  
just-the-basics-the-after-party              2013-03-01 01:00:00  Getting Started  Knowledge         48           False  
just-the-basics-strata-2013                  2013-02-26 20:30:00  Getting Started  Knowledge         49           False  

There it is! This is where those fastai helper functions come in handy. I'm going to work with tabular data, so I'll import everything from the tabular module.

from fastai.tabular.all import *
Path.cwd()
Path('/Users/nissan/code/reddi-hacking/_notebooks')

Now I need to put this data somewhere that won't get checked in when I commit my changes to GitHub. I'll create a folder called _data and add an entry for it to my .gitignore so the dataset I download into it stays out of the repository.

!touch .gitignore
!echo "_data" >> .gitignore
!head .gitignore
_data
!mkdir _data
os.chdir('_data')
Path.cwd()
Path('/Users/nissan/code/reddi-hacking/_notebooks/_data')
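The same step can be done in plain Python so that it is safe to re-run and leaves any existing .gitignore entries alone. This is only a sketch: `ignore_data_dir` is a hypothetical helper name, and the demo runs in a temporary directory rather than in my actual repository.

```python
import tempfile
from pathlib import Path

def ignore_data_dir(repo_dir, name="_data"):
    """Create `name` under repo_dir and list it in .gitignore exactly once."""
    repo_dir = Path(repo_dir)
    data_dir = repo_dir / name
    data_dir.mkdir(exist_ok=True)  # no error if the folder already exists

    gitignore = repo_dir / ".gitignore"
    lines = gitignore.read_text().splitlines() if gitignore.exists() else []
    if name not in lines:
        lines.append(name)
        gitignore.write_text("\n".join(lines) + "\n")
    return data_dir

# Demo in a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / ".gitignore").write_text("*.pyc\n")
    ignore_data_dir(tmp)
    ignore_data_dir(tmp)  # second call changes nothing
    result = (Path(tmp) / ".gitignore").read_text()

print(result)
```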

Ok, looks like I'm ready to download that Titanic dataset

!kaggle competitions download -c titanic
403 - Forbidden

OK, this was a head-scratcher, but it turns out that before I can download a competition's data, I need to go to the competition page, join the competition, and accept its rules. I should have read the documentation, which says exactly this. The usual format for the rules page URL is https://www.kaggle.com/c/<competition-name>/rules
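To save some clicking next time, the rules page address can be built from the competition ref. This is a tiny sketch based only on the URL pattern above. `rules_url` is a hypothetical helper, and Kaggle could change its URL scheme.

```python
def rules_url(competition):
    """Build the rules page URL from a competition ref such as 'titanic'."""
    return f"https://www.kaggle.com/c/{competition}/rules"

url = rules_url("titanic")
print(url)
```

From a local notebook, the standard library's `webbrowser.open(url)` would open that page in the browser, where the rules still have to be accepted by hand.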

!kaggle competitions download -c titanic
Downloading titanic.zip to /Users/nissan/code/reddi-hacking/_notebooks/_data
  0%|                                               | 0.00/34.1k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 34.1k/34.1k [00:00<00:00, 1.41MB/s]

Now let's verify the file is there, and extract the data

path = Path.cwd()
path.ls()
(#1) [Path('/Users/nissan/code/reddi-hacking/_notebooks/_data/titanic.zip')]
file_extract('titanic.zip')
path.ls()
(#4) [Path('/Users/nissan/code/reddi-hacking/_notebooks/_data/test.csv'),Path('/Users/nissan/code/reddi-hacking/_notebooks/_data/titanic.zip'),Path('/Users/nissan/code/reddi-hacking/_notebooks/_data/train.csv'),Path('/Users/nissan/code/reddi-hacking/_notebooks/_data/gender_submission.csv')]
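For anyone without fastai installed, roughly the same extraction can be done with the standard library. This is a sketch of an equivalent, not fastai's own implementation. `extract_zip` is a hypothetical helper, and the demo builds a tiny stand-in archive instead of using the real Kaggle download.

```python
import tempfile
import zipfile
from pathlib import Path

def extract_zip(archive, dest=None):
    """Extract `archive` next to itself (or into `dest`); return member names."""
    archive = Path(archive)
    dest = Path(dest) if dest is not None else archive.parent
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
        return sorted(zf.namelist())

# Demo with a tiny stand-in archive, not the real titanic.zip.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("train.csv", "a,b\n1,2\n")
        zf.writestr("test.csv", "a,b\n3,4\n")
    names = extract_zip(archive)
    extracted = sorted(p.name for p in Path(tmp).glob("*.csv"))

print(names)
```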

Using the Kaggle API

After writing the section above, I read a chapter of the book showing that the Kaggle API can also be called from Python, so I can use it instead of the command line to download the data. Since I already have the tabular libraries loaded, I can then read the CSV and take a first look at the top rows.

from kaggle import api
api.competitions_list(category='gettingStarted')
[contradictory-my-dear-watson,
 gan-getting-started,
 tpu-getting-started,
 digit-recognizer,
 titanic,
 house-prices-advanced-regression-techniques,
 connectx,
 nlp-getting-started,
 facial-keypoints-detection,
 street-view-getting-started-with-julia,
 word2vec-nlp-tutorial,
 data-science-london-scikit-learn,
 just-the-basics-the-after-party,
 just-the-basics-strata-2013]
api.competition_download_cli('titanic', path=path)
file_extract("titanic.zip")
titanic.zip: Skipping, found more recently modified local copy (use --force to force download)
df = pd.read_csv(path/'train.csv', skipinitialspace=True)
df.head()
   PassengerId  Survived  Pclass  Name                                                 Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                              male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Thayer)  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                               female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)         female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                             male    35.0  0      0      373450            8.0500   NaN    S