This is a demonstration of using CoreML to recognize succulents from images. It is still very much in its early stages.
## Overview of the process
- Create an R script that scrapes the plant names from World of Succulents.
- Create a shell script that uses 'googliser' to download the images to a directory called "images/", with a subdirectory for each plant.
- Use TensorFlow to retrain an image classifier with my new data set.
- Use the `tfcoreml` Python package to convert the TensorFlow model into one that can be imported into Xcode for CoreML.
## Current Status
- I have made the web-scraping script and created a list of over 1,500 succulents.
- I have 'googliser' functioning and a job-array submission system to parallelize the process across plants.
- Here, I have demonstrated the feasibility of the workflow using a sample of 5 plants.
I scraped plant names from World of Succulents using 'rvest' to retrieve and parse the HTML. The code is in "make_plant_list.r" and outputs a list of names to "plant_names.txt".

```shell
Rscript make_plant_list.r
```
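As a toy illustration of what that script does (the real "make_plant_list.r" uses rvest, which parses HTML properly; the regex-based sketch below is only illustrative, and the sample HTML is made up):

```shell
# Toy sketch of the scraping idea: pull the text of each <a> tag out of an
# HTML listing, one plant name per line. NOT the project's actual approach --
# the real script parses the HTML with rvest in R.

extract_names() {
    sed -n 's/.*<a[^>]*>\([^<]*\)<\/a>.*/\1/p'
}

# Made-up sample of what a listing page might contain:
html='<li><a href="/euphorbia-obesa">Euphorbia obesa</a></li>
<li><a href="/lithops-aucampiae">Lithops aucampiae</a></li>'

printf '%s\n' "$html" | extract_names
# -> Euphorbia obesa
# -> Lithops aucampiae
```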
I am using the bash tool 'googliser' to download plant images. It currently has a limit of 1,000 images per query. This should be sufficient for my needs, though.
The tool can be installed from GitHub using the following command.
```shell
wget -qN git.io/googliser.sh && chmod +x googliser.sh
```
It requires ImageMagick, which is available on O2.

```shell
module load imageMagick/6.9.1.10
```
Below is an example command to download 20 images of Euphorbia obesa.
```shell
./googliser.sh \
    --phrase "Euphorbia obesa" \
    --number 20 \
    --no-gallery \
    --output images/Euphorbia_obesa
```
I downloaded all of the images for every plant by submitting a job-array, where each job downloads N images for a single plant. The script "download_google_images.sh" takes an integer (the job number) and downloads the images for the plant on that line of "plant_names.txt".
```shell
# sbatch options must come before the script name; anything after the script
# is passed to the script as an argument
sbatch \
    --array=1-$(wc -l < plant_names.txt) \
    --constraint="scratch2" \
    download_google_images.sh plant_names.txt
```
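For reference, the per-task logic of that script might be sketched as follows (a hypothetical reconstruction; the real "download_google_images.sh" may differ in names, image count, and googliser options):

```shell
# Hypothetical sketch of the per-task logic in download_google_images.sh:
# each Slurm array task looks up the plant on line $SLURM_ARRAY_TASK_ID of
# the list file and downloads that plant's images.

# Look up the plant name on a given 1-based line of the list file.
plant_for_task() {
    local list_file=$1 task_id=$2
    sed -n "${task_id}p" "$list_file"
}

# Demo with a throwaway list (the real script would use plant_names.txt
# and $SLURM_ARRAY_TASK_ID):
printf '%s\n' "Euphorbia obesa" "Lithops aucampiae" > demo_names.txt

plant=$(plant_for_task demo_names.txt 2)
out_dir="images/${plant// /_}"    # spaces become underscores in the directory name

echo "$plant"      # -> Lithops aucampiae
echo "$out_dir"    # -> images/Lithops_aucampiae

# The real script would then run something like:
#   ./googliser.sh --phrase "$plant" --number 1000 --no-gallery --output "$out_dir"
rm -f demo_names.txt
```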
(The following step may no longer be necessary since each image is reportedly a JPEG.)
Some of the images were corrupted or in WEBP format, which the TensorFlow script could not accept. These were filtered out using another R script.

```shell
module load imageMagick/6.9.1.10
Rscript filter_bad_images.r
```
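The same filtering idea can be sketched in plain bash (a hypothetical stand-in for "filter_bad_images.r", using `file` to check MIME types rather than ImageMagick):

```shell
# Sketch of the filtering step: delete any file in an image directory whose
# MIME type is not image/jpeg (corrupted downloads and WEBP files both fail
# this check). A stand-in for the project's filter_bad_images.r script.

filter_non_jpeg() {
    local dir=$1
    find "$dir" -type f | while read -r img; do
        mime=$(file --brief --mime-type "$img")
        if [ "$mime" != "image/jpeg" ]; then
            echo "removing $img ($mime)"
            rm -f "$img"
        fi
    done
}

# Usage:
#   filter_non_jpeg images/Euphorbia_obesa
```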
The R Markdown file "check_images_downloaded.Rmd" checks that each plant has images downloaded. It outputs an HTML file of the results.
```shell
Rscript -e 'rmarkdown::render("check_images_downloaded.Rmd")'
```
In addition, if any plants did not get all of their images downloaded (allowing a margin of 50 images below the expected number), it creates the file "failed_dwnlds_plant_names.txt" with a list of plant names to be run again.
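The check itself could be sketched in bash along these lines (a hypothetical equivalent of the logic in "check_images_downloaded.Rmd"; the expected count and the 50-image margin are illustrative):

```shell
# Sketch of the completeness check: print every plant whose image directory
# holds more than 50 fewer files than expected, so those downloads can be
# resubmitted. (Hypothetical equivalent of check_images_downloaded.Rmd.)

list_failed_downloads() {
    local list_file=$1 expected=$2
    while read -r plant; do
        dir="images/${plant// /_}"
        count=$(find "$dir" -type f 2>/dev/null | wc -l)
        if [ "$count" -lt $((expected - 50)) ]; then
            echo "$plant"
        fi
    done < "$list_file"
}

# Usage:
#   list_failed_downloads plant_names.txt 1000 > failed_dwnlds_plant_names.txt
```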
```shell
# as before, sbatch options must precede the script name
sbatch \
    --array=1-$(wc -l < failed_dwnlds_plant_names.txt) \
    --constraint="scratch2" \
    download_google_images.sh failed_dwnlds_plant_names.txt
```
I began by following the tutorial How to Retrain an Image Classifier for New Categories to retrain a general image classifier to recognize the images. I can then convert the model to a CoreML object and import it into a simple iOS app that tries to predict the succulent in a new image.
TensorFlow is an incredibly powerful machine learning framework that is used extensively in education, research, and production. (Excitingly, there is also Swift for TensorFlow, though it was still in beta as of August 18, 2019.)
"TensorFlow Hub is a library for the publication, discovery, and consumption of reusable parts of machine learning models."
To install both, we can use pip from within a virtual environment.
```shell
# create and activate a virtual environment
module load python/3.6.0
python3 -m venv image-download
source image-download/bin/activate

# install the necessary packages
pip3 install --upgrade pip
pip3 install "setuptools>=41.0.0"  # quoted so the shell does not treat >= as a redirect
pip3 install tensorflow tensorflow-hub
pip3 install coremltools==3.0b5 tfcoreml==0.4.0b1  # betas required for CoreML 3
```
The tutorial includes an example of retraining ImageNet to identify several different plants by their flowers. All of this was performed in a subdirectory called "flowers_example".
```shell
mkdir flowers_example
cd flowers_example
```
The images were downloaded and unarchived.
```shell
curl -LO http://download.tensorflow.org/example_images/flower_photos.tgz
tar xzf flower_photos.tgz
ls flower_photos
#> daisy  dandelion  LICENSE.txt  roses  sunflowers  tulips
```
The retraining script was downloaded from GitHub.
```shell
curl -LO https://github.com/tensorflow/hub/raw/master/examples/image_retraining/retrain.py
```
The script was run on the plant images.
```shell
python retrain.py --image_dir ./flower_photos
```
If the connection to O2 is set up correctly, TensorBoard can be run and opened locally.

```shell
tensorboard --logdir /tmp/retrain_logs
#> TensorBoard 1.14.0 at http://compute-e-16-229.o2.rc.hms.harvard.edu:6006/ (Press CTRL+C to quit)
```
Finally, the new model was used to classify a photo using the "label_image.py" script (downloaded from GitHub).
```shell
# download the script
curl -LO https://github.com/tensorflow/tensorflow/raw/master/tensorflow/examples/label_image/label_image.py

# run it on an image
python label_image.py \
    --graph=/tmp/output_graph.pb \
    --labels=/tmp/output_labels.txt \
    --input_layer=Placeholder \
    --output_layer=final_result \
    --image=./flower_photos/daisy/21652746_cc379e0eea_m.jpg
#> daisy 0.99798715
#> sunflowers 0.0011478926
#> dandelion 0.00045892605
#> tulips 0.0003524925
#> roses 5.3392014e-05
```
It worked!
You can see the results from a small-scale experiment here. Overall, it went well, but the plants used were obviously different from each other, so it may be worth running a test with more similar types of plants.
Activate the virtual environment.
```shell
module load python/3.6.0
source image-download/bin/activate
```
Retrain ImageNet.
```shell
python3 imageClassifierModel/retrain.py \
    --image_dir=/n/scratch2/jc604_plantimages \
    --output_graph=imageClassifierModel/tf_succulent_classifier.pb \
    --output_labels=imageClassifierModel/tf_output_labels.txt \
    --summaries_dir=imageClassifierModel/tf_summaries \
    --output_layer=plant_classifier \
    --random_brightness=5
```
Test on some images.
```shell
python label_image.py \
    --graph=imageClassifierModel/tf_succulent_classifier.pb \
    --labels=imageClassifierModel/tf_output_labels.txt \
    --input_layer=Placeholder \
    --output_layer=plant_classifier \
    --image="imageClassifierModel/my_plant_images/Euphorbia obesa_5.JPG"
```

(The image path is quoted because the file name contains a space.)
Convert to CoreML format. (untested)
```python
import tfcoreml as tf_converter

tf_converter.convert(
    tf_model_path='my_model.pb',
    mlmodel_path='my_model.mlmodel',
    output_feature_names=['softmax'],
    input_name_shape_dict={'input': [1, 227, 227, 3]},
    use_coreml_3=True
)
```
- Meghan Kane - Bootstrapping the Machine Learning Training Process
- There are models already available from Apple: https://developer.apple.com/machine-learning/models/
- use "transfer learning" to apply knowledge learned from a source task (e.g. MobileNet or SqueezeNet) to the target task
- "tensorboard" to track learning during TensorFlow training
- Google Images Download python library (can be installed with pip)
- googliser