Messing around with fastai - based on the fastbook content and fastai forum notes
!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()
fastbook.setup_book() does some magic to set up my environment; here I want to see how hard it is to do the same thing manually. The first step is to load fastai instead of fastbook, so I still have all the fastai libraries at hand if I need any core or helper functions.
!rm -rf /content/sample_data/birds #cleanup for new run
!pip install -Uqq fastai
from fastai.vision.all import *
Running fastbook.setup_book?? showed that it checks a boolean IN_COLAB and, if that is set, calls another function, fastbook.setup_colab(), to do the setup.
Inside setup_colab, a global variable gdrive is set to the Google Drive path under /content/gdrive, and google.colab is imported so that drive can mount my Google Drive. I can do this myself here.
The snippets below are mostly straight out of the fastbook source code on GitHub.
global gdrive
gdrive = Path('/content/gdrive/My Drive')
from google.colab import drive
if not gdrive.exists(): drive.mount(str(gdrive.parent))
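To run the same cells outside Colab, the mount step can be guarded. This is a minimal sketch rather than fastbook's own code: in_colab and mount_gdrive are hypothetical helper names, and checking for the google.colab package through importlib is just one assumed way to detect the environment.

```python
import importlib.util
from pathlib import Path

def in_colab() -> bool:
    "Hypothetical helper: True if the google.colab package can be found."
    try:
        return importlib.util.find_spec("google.colab") is not None
    except (ImportError, ValueError):
        # Raised when the parent `google` package is missing or malformed.
        return False

def mount_gdrive(mount_point: Path = Path("/content/gdrive")):
    "Mount Google Drive when running in Colab; return None anywhere else."
    if not in_colab():
        return None
    from google.colab import drive  # only importable inside Colab
    gdrive = mount_point / "My Drive"
    if not gdrive.exists():
        drive.mount(str(mount_point))
    return gdrive

gdrive = mount_gdrive()
```

On a local machine gdrive simply ends up as None, so later cells can check for that before using it.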
path = Path.cwd()
path
There is also a nice little function for pulling images from DuckDuckGo, which avoids having to sign up to Microsoft Azure for a key to use the Bing image search API.
def search_images_ddg(term, max_images=200):
    "Search for `term` with DuckDuckGo and return a unique urls of about `max_images` images"
    assert max_images<1000
    url = 'https://duckduckgo.com/'
    res = urlread(url,data={'q':term})
    searchObj = re.search(r'vqd=([\d-]+)\&', res)
    assert searchObj
    requestUrl = url + 'i.js'
    params = dict(l='us-en', o='json', q=term, vqd=searchObj.group(1), f=',,,', p='1', v7exp='a')
    urls,data = set(),{'next':1}
    while len(urls)<max_images and 'next' in data:
        try:
            data = urljson(requestUrl,data=params)
            urls.update(L(data['results']).itemgot('image'))
            requestUrl = url + data['next']
        except (URLError,HTTPError): pass
        time.sleep(0.2)
    return L(urls)
This code references the L class from fastcore. According to the fastcore documentation, L is intended as a replacement for Python lists. I should dive deeper into this to understand more.
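To see what the loop in search_images_ddg is doing without hitting the network, here is a plain-Python sketch of the same accumulate-and-follow-next pattern. FAKE_PAGES and collect_image_urls are made up for illustration, and the list comprehension over item["image"] is my reading of what L(...).itemgot('image') amounts to.

```python
# Hypothetical canned pages standing in for DuckDuckGo's JSON responses.
FAKE_PAGES = {
    "i.js": {
        "results": [{"image": "https://example.com/a.jpg"},
                    {"image": "https://example.com/b.jpg"}],
        "next": "i.js?s=2",
    },
    "i.js?s=2": {  # no 'next' key, so this is the last page
        "results": [{"image": "https://example.com/b.jpg"},
                    {"image": "https://example.com/c.jpg"}],
    },
}

def collect_image_urls(fetch_page, first="i.js", max_images=200):
    "Follow 'next' links, collecting unique image URLs until enough are found."
    urls, data, request = set(), {"next": first}, first
    while len(urls) < max_images and "next" in data:
        data = fetch_page(request)
        # Plain-Python stand-in for L(data['results']).itemgot('image')
        urls.update(item["image"] for item in data["results"])
        request = data.get("next")
    return sorted(urls)

found = collect_image_urls(FAKE_PAGES.__getitem__)
first_two = collect_image_urls(FAKE_PAGES.__getitem__, max_images=2)
```

Using a set is what removes the duplicate b.jpg that appears on both pages.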
I originally wrote a Trinidad hummingbird classifier in a previous attempt at the fastai course, in honor of my native country of Trinidad and Tobago. This time, since I'm now in Australia and celebrating my 4th year as a Queenslander living in Brisbane, I wanted to take a look at classifying Brisbane birds. There are actually a lot of them, so I'll pick just a few that I remember seeing, which lets me do the data cleansing step with a fair level of confidence. I also added three types of ducks to make it a little tougher, although I'm not sure I know the differences myself, so we will see how my cleansing activities fare. The names are based on this list. Let's start with the most famous one, the kookaburra.
kookaburra_bird_images = search_images_ddg("laughing kookaburra bird")
kookaburra_bird_images[0]
type(kookaburra_bird_images)
type(kookaburra_bird_images[0])
In Lesson 2, the Bing version uses a method, attrgot(), to extract the file names from a column of the results. I read up on it in the documentation and got further clarification on the fastai forums here. DuckDuckGo returns just strings, so there's no need to extract a column attribute here.
path.ls()
dest = path/'sample_data/kookaburra_bird.jpg'
download_url(kookaburra_bird_images[0], dest)
im = Image.open(dest)
im.to_thumb(128,128)
brisbane_bird_types = "kookaburra","magpie goose","australian white ibis", "australian pelican", "pacific black duck", "plumed whistling duck", "australian wood duck"
path = path/'sample_data/birds'
path
if not path.exists():
    path.mkdir()
path
path.ls()
It may take some time to run with the default of 200 images.
for o in brisbane_bird_types:
    dest = (path/o)
    dest.mkdir(exist_ok=True)
    results = search_images_ddg(f'{o} bird')
    download_images(dest, urls=results)
Let's verify we got images as we expected
path.ls()
!ls "sample_data/birds"
get_image_files??
fns = get_image_files(path)
fns
len(fns)
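len(fns) only gives the overall total, so a per-folder count is a handy extra check that every bird type actually got images. A small sketch using only the standard library follows; count_images_per_class and the IMAGE_SUFFIXES set are my own illustrative names, and the demo runs on a throwaway directory with made-up files.

```python
import tempfile
from collections import Counter
from pathlib import Path

# Assumption: these are the extensions worth counting as images.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".bmp"}

def count_images_per_class(root):
    "Count image files under `root`, keyed by the name of their parent folder."
    counts = Counter()
    for f in Path(root).rglob("*"):
        if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES:
            counts[f.parent.name] += 1
    return dict(counts)

# Demo on a temporary directory tree with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for label, n in [("kookaburra", 3), ("magpie goose", 2)]:
        folder = Path(tmp) / label
        folder.mkdir()
        for i in range(n):
            (folder / f"{i:08d}.jpg").touch()
    demo_counts = count_images_per_class(tmp)
```

In the notebook itself the call would be count_images_per_class(path).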
Check for the corrupt images
failed = verify_images(fns)
failed
Remove corrupted images
failed.map(Path.unlink);
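As a rough illustration of why some downloads fail verification, the sketch below checks whether a file even starts with the signature bytes of a common image format. This is a much weaker stand-in that I wrote for illustration: as I understand it, verify_images does a fuller check by trying to open each file. looks_like_image is a hypothetical name.

```python
import tempfile
from pathlib import Path

# Leading bytes ("magic numbers") of JPEG, PNG and GIF files.
SIGNATURES = (b"\xff\xd8\xff", b"\x89PNG\r\n\x1a\n", b"GIF87a", b"GIF89a")

def looks_like_image(path):
    "True if the file starts with a known image signature (header check only)."
    with open(path, "rb") as f:
        head = f.read(8)
    return head.startswith(SIGNATURES)

# Demo: a file with a PNG header, and an HTML error page saved with a .jpg name.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "00000000.png"
    bad = Path(tmp) / "00000001.jpg"
    good.write_bytes(b"\x89PNG\r\n\x1a\n" + b"\x00" * 16)
    bad.write_bytes(b"<html>404 Not Found</html>")
    suspect = sorted(p.name for p in (good, bad) if not looks_like_image(p))
```

A search result that returns a web page instead of a picture is the typical case this catches.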
I'm going to bring the data clean-up steps from the book forward to before training, since I'm pretty sure some of these aren't bird images and I want to get rid of them. Since we don't have a classifier yet, we have to use ImagesCleaner for this pre-training cleaning step.
from fastai.vision.widgets import *
We need the widgets to do cleanup. Reference this forum post
We need a way to browse the individual directories, since I don't think ImagesCleaner provides the directory selector that ImageClassifierCleaner does, so let's peek at how each of them works.
ImageClassifierCleaner??
ImagesCleaner??
path.cwd()
path = Path('sample_data/birds')
path.ls()
Rerun the next three cells for each category, replacing kookaburra with magpie goose, australian white ibis, australian pelican, pacific black duck, plumed whistling duck and australian wood duck in turn, to do an initial pre-training clean-up of possibly irrelevant images from the search engine.
fns_to_clean = get_image_files(path/'kookaburra')
fns_to_clean
cleaner = ImagesCleaner()
cleaner.set_fns(fns_to_clean)
cleaner
Now that we've marked the ones that should be removed, let's take them out of the image data before we start training our model.
for idx in cleaner.delete(): cleaner.fns[idx].unlink()
path.ls()
Load the data into the datablock
birds = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    splitter=RandomSplitter(valid_pct=0.2, seed=42),
    get_y=parent_label,
    item_tfms=Resize(128))
dls = birds.dataloaders(path)
dls.valid.show_batch(max_n=10, nrows=2)
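Two of the DataBlock arguments are easy to picture in plain Python: the label comes from the folder name, and the splitter holds back a seeded random 20% for validation. The sketch below is my own approximation of those two ideas, not fastai's implementation; label_from_parent and random_split are hypothetical names and the file paths are made up.

```python
import random
from pathlib import Path

def label_from_parent(path):
    "Use the containing folder's name as the label."
    return Path(path).parent.name

def random_split(items, valid_pct=0.2, seed=42):
    "Shuffle indices with a fixed seed and hold back `valid_pct` for validation."
    idxs = list(range(len(items)))
    random.Random(seed).shuffle(idxs)
    cut = int(valid_pct * len(items))
    valid = [items[i] for i in idxs[:cut]]
    train = [items[i] for i in idxs[cut:]]
    return train, valid

# Made-up file list: five images in each of two folders.
files = [Path(f"birds/{name}/{i:08d}.jpg")
         for name in ("kookaburra", "australian pelican")
         for i in range(5)]
train, valid = random_split(files)
labels = sorted({label_from_parent(f) for f in files})
```

Because the seed is fixed, rerunning the split on the same file list gives the same validation set, which keeps results comparable between runs.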
Squishing images so the whole picture fits the target size, at the cost of distorting the aspect ratio
birds = birds.new(item_tfms=Resize(128, ResizeMethod.Squish))
dls = birds.dataloaders(path)
dls.valid.show_batch(max_n=10, nrows=2)
Padding the image borders so they fit
birds = birds.new(item_tfms=Resize(128, ResizeMethod.Pad, pad_mode='zeros'))
dls = birds.dataloaders(path)
dls.valid.show_batch(max_n=10, nrows=2)
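The difference between cropping, squishing and padding is easiest to see as arithmetic on a non-square image. The sketch below works out the geometry for a 640x480 image and a 128 pixel target. resize_geometry is a hypothetical helper, and the rounding is my assumption, so fastai's exact pixel values may differ slightly.

```python
def resize_geometry(w, h, size=128):
    "Describe what crop, squish and pad each do to a w x h image."
    # Squish: each axis is scaled independently, so the aspect ratio changes.
    squish_scale = (size / w, size / h)
    # Pad: scale so the longer side fits, then fill the shorter side with a border.
    pad_w, pad_h = round(w * size / max(w, h)), round(h * size / max(w, h))
    # Crop: scale so the shorter side fits, then trim the overflow on the longer side.
    crop_w, crop_h = round(w * size / min(w, h)), round(h * size / min(w, h))
    return {
        "squish_scale": squish_scale,
        "pad_resized": (pad_w, pad_h),
        "pad_border": (size - pad_w, size - pad_h),
        "crop_resized": (crop_w, crop_h),
        "crop_trimmed": (crop_w - size, crop_h - size),
    }

geometry = resize_geometry(640, 480)
```

For this image, padding adds 32 pixels of border in total along the height, while cropping throws away 43 pixels along the width.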
Randomly cropping a different part of each image on every epoch and resizing it, so the model learns from specific parts of an image
birds = birds.new(item_tfms=RandomResizedCrop(128, min_scale=0.3))
dls = birds.dataloaders(path)
dls.train.show_batch(max_n=10, nrows=2, unique=True)
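To get a feel for what min_scale controls, here is a sketch of one common way to sample a random crop box: pick a fraction of the image area between min_scale and 1, pick an aspect ratio, then place the box at a random position. random_crop_box is a hypothetical function, and the 3/4 to 4/3 aspect ratio range is an assumption on my part rather than something taken from the notebook.

```python
import math
import random

def random_crop_box(w, h, min_scale=0.3, ratio=(3 / 4, 4 / 3), rng=None):
    "Return (left, top, width, height) of a random crop inside a w x h image."
    rng = rng or random.Random()
    # Area of the crop, as a random fraction of the full image area.
    area = rng.uniform(min_scale, 1.0) * w * h
    # Sample the aspect ratio on a log scale so wide and tall are equally likely.
    aspect = math.exp(rng.uniform(math.log(ratio[0]), math.log(ratio[1])))
    crop_w = min(w, max(1, round(math.sqrt(area * aspect))))
    crop_h = min(h, max(1, round(math.sqrt(area / aspect))))
    left = rng.randint(0, w - crop_w)
    top = rng.randint(0, h - crop_h)
    return left, top, crop_w, crop_h

boxes = [random_crop_box(640, 480, rng=random.Random(seed)) for seed in range(100)]
```

Each box would then be resized to the target size, which is why the same photo shows up at different zoom levels in the batch above.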
Starting the data augmentation step here. aug_transforms applies a set of random transforms to each batch, on top of the per-item resize.
TODO: Add the individual keywords for image rotation, flipping, perspective warping, brightness changes and contrast changes.
birds = birds.new(item_tfms=Resize(128), batch_tfms=aug_transforms(mult=2))
dls = birds.dataloaders(path)
dls.train.show_batch(max_n=8, nrows=2, unique=True)
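As a start on the TODO, these are the aug_transforms keywords I believe correspond to each kind of change. The names and the values shown are the defaults as I remember them from the fastai documentation, so treat them as assumptions and confirm with aug_transforms?? before relying on them. The import is guarded so the cell still runs where fastai is not installed.

```python
# Keyword names as I recall them from the fastai docs; verify with `aug_transforms??`.
aug_kwargs = dict(
    do_flip=True,      # random horizontal flipping
    flip_vert=False,   # also allow vertical flips when True
    max_rotate=10.0,   # maximum rotation in degrees
    max_zoom=1.1,      # maximum zoom factor
    max_warp=0.2,      # strength of perspective warping
    max_lighting=0.2,  # strength of brightness and contrast changes
)

try:
    from fastai.vision.all import aug_transforms
    batch_tfms = aug_transforms(mult=2, **aug_kwargs)
except ImportError:
    batch_tfms = None  # fastai is not available in this environment
```

If the names hold up, batch_tfms can be passed to birds.new(item_tfms=Resize(128), batch_tfms=batch_tfms) in place of the call above.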
Putting it all together now to train our model
birds = birds.new(
    item_tfms=RandomResizedCrop(224, min_scale=0.5),
    batch_tfms=aug_transforms())
dls = birds.dataloaders(path)
learn = cnn_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(4)
Take a look at how well it did
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()
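The confusion matrix and the error_rate metric are both just counts over (actual, predicted) pairs. A sketch with made-up labels follows; confusion_counts and error_rate here are my own plain-Python versions for illustration, not the fastai functions.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    "Count how often each (actual, predicted) pair occurs."
    return Counter(zip(actual, predicted))

def error_rate(actual, predicted):
    "Fraction of predictions that do not match the actual label."
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)

# Made-up results for four validation images.
actual = ["pacific black duck", "pacific black duck", "australian wood duck", "kookaburra"]
predicted = ["pacific black duck", "australian wood duck", "australian wood duck", "kookaburra"]

matrix = confusion_counts(actual, predicted)
rate = error_rate(actual, predicted)
```

Off-diagonal entries such as ("pacific black duck", "australian wood duck") are the confusions I expect to see between the three duck categories.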
Let's see where it went wrong
interp.plot_top_losses(10, nrows=10)
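The images shown here are ranked by loss. Assuming the loss is cross-entropy, which I believe is the default for this kind of classifier, the loss for one image is the negative log of the probability the model gave to the correct class, so confident wrong answers rise to the top. A sketch with made-up probabilities:

```python
import math

def cross_entropy(prob_of_true_class):
    "Loss for a single image: -log of the probability given to the correct class."
    return -math.log(prob_of_true_class)

# Made-up (file name, probability assigned to the true class) pairs.
preds = [("a.jpg", 0.95), ("b.jpg", 0.10), ("c.jpg", 0.60)]

# Highest loss first, which is the order a top-losses view would use.
ranked = sorted(preds, key=lambda item: cross_entropy(item[1]), reverse=True)
worst_file, worst_prob = ranked[0]
```

An image where the model gave the true class only a 10% chance carries far more loss than one it was merely unsure about.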
If I missed some miscategorized images before training, I can now use the code from the book to fix those mistakes here
cleaner = ImageClassifierCleaner(learn)
cleaner
for idx in cleaner.delete(): cleaner.fns[idx].unlink()
for idx,cat in cleaner.change(): shutil.move(str(cleaner.fns[idx]), path/cat)
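One thing to watch with the relabelling step: shutil.move raises an error when a file with the same name already exists in the destination folder, which can happen if downloaded files in different folders share numbered names. A possible workaround is sketched below; move_without_clobber is a hypothetical helper, and appending a counter to the name is just one choice of renaming scheme.

```python
import shutil
import tempfile
from pathlib import Path

def move_without_clobber(src, dest_dir):
    "Move `src` into `dest_dir`, adding a numeric suffix if the name is taken."
    src, dest_dir = Path(src), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / src.name
    n = 1
    while target.exists():
        target = dest_dir / f"{src.stem}_{n}{src.suffix}"
        n += 1
    shutil.move(str(src), str(target))
    return target

# Demo: both folders contain a file called 00000001.jpg.
with tempfile.TemporaryDirectory() as tmp:
    src_dir = Path(tmp) / "pacific black duck"
    dest_dir = Path(tmp) / "australian wood duck"
    src_dir.mkdir()
    dest_dir.mkdir()
    (src_dir / "00000001.jpg").write_text("relabelled image")
    (dest_dir / "00000001.jpg").write_text("existing image")
    moved_name = move_without_clobber(src_dir / "00000001.jpg", dest_dir).name
    kept_existing = (dest_dir / "00000001.jpg").read_text() == "existing image"
```

In the notebook, the loop body would become move_without_clobber(cleaner.fns[idx], path/cat).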
I can now re-run the learner with the fixed dataset and see if that helps improve anything. I removed the young birds, which looked starkly different from the adults, as well as some pictures of eggs and some drawings that were not photos.
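learn.dls = birds.dataloaders(path)  # rebuild the loaders so deleted and moved files are no longer referenced
learn.fine_tune(2)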
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()
interp.plot_top_losses(10, nrows=10)
Possible improvement notes:
- I need to learn what these birds actually look like so I can clean the training data better
- I may be running fine_tune for too many epochs, as the model seems to be starting to overfit, but I need to read up more on this
- Learning from my hummingbird app experience, I need to investigate whether males and females of a species differ enough to need separate categories for better training. The same goes for chicks, ducklings and other young birds, which may look very different from adults of the species
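On the overfitting note, one rough signal is the validation loss starting to rise while the training loss keeps falling. The sketch below uses made-up loss values, not numbers from my runs, and best_epoch is a hypothetical helper. It is only a heuristic; the metric on the validation set matters more than the loss alone.

```python
def best_epoch(valid_losses):
    "Index of the epoch with the lowest validation loss."
    return min(range(len(valid_losses)), key=valid_losses.__getitem__)

# Made-up per-epoch losses for illustration only.
train_losses = [1.20, 0.80, 0.55, 0.40, 0.30, 0.22]
valid_losses = [0.90, 0.70, 0.60, 0.58, 0.63, 0.70]

stop_at = best_epoch(valid_losses)

# Epochs where training loss improved but validation loss got worse.
diverging = [e for e in range(1, len(valid_losses))
             if valid_losses[e] > valid_losses[e - 1]
             and train_losses[e] < train_losses[e - 1]]
```

With these numbers, epoch 3 is the best stopping point and epochs 4 and 5 show the diverging pattern.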