ai-exp

ML experiments with Aerial Intelligence data for wheat yield forecasting.

Problem description

Tabular geographic, time, weather and NDVI data for corresponding wheat yields is given. Through machine learning methods an estimator should be created which can forecast future yields based on measurements.

Timeline

Read full instructions on github
Check the given data files:

How many columns?
What columns exist?
What are their data types?

Load CSV files with python and do more checks:

How many data rows?
- 288033
Is data missing?
- precipIntensity, precipIntensityMax, precipProbability, pressure, visibility has missing entries (-> drop them)

Visual analysis:

plot lat/lon with colorized yield on map
- strong variations by lat/lon visible
other columns have almost zero (visible!) effect on yield

Machine learning

standardize data (zero mean and standard deviation of 1)
first try: linear regression (naiv and fast)
- results in low accuracy (probably not linear relations!)
second try: decision tree (fast on big data sets)
- better accuracy but maybe bagging methods give better results
thrid try: random forest (still fast but uses averaging over a set of (random) decision trees)
- accuracy: 0.972 (+/- 0.001) very good!
- 15 trees give reasonable accuracy in acceptable runtime
final improvements tried: different column combinations

Best approach

Random forest regressor Accuracy: 0.972 (+/- 0.001) with 5 fold cross validation on training set (80% of the data) Explained variance score: 0.98 Median absolute error: 0.09

Technical choices

Programming language: python2.7
- most experience and well-suited for ML
Data manipulation toolkits: numpy and pandas
- imho best choices for python
ML toolkit: scikit-learn
- works well with numpy (and pandas) + quick implementations possible
Visualization toolkit: matplotlib
- standard choice

Challenges faced

Display of lat/lon data on map in python (never done before)
Big data amount reduced possible ML methods

Lessons learned

How to use matplotlib basemap module

Possible To-Do

Impute missing data (e.g. with average or most frequent value) and use these columns too
Try to reduce dimensions with PCA

PS: Since this is just an experiment I commented out the 'old' regression estimators (linear, tree). I would not leave commented code in a commit otherwise.

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.gitignore		.gitignore
README.md		README.md
randomforest.py		randomforest.py
viz.py		viz.py
wheat-2013-supervised.csv		wheat-2013-supervised.csv
wheat-2014-supervised.csv		wheat-2014-supervised.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

ai-exp

Problem description

Timeline

Best approach

Technical choices

Challenges faced

Lessons learned

Possible To-Do

About

Uh oh!

Releases

Packages

Uh oh!

Languages

wolfm89/ai-exp

Folders and files

Latest commit

History

Repository files navigation

ai-exp

Problem description

Timeline

Best approach

Technical choices

Challenges faced

Lessons learned

Possible To-Do

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages