- In this notebook, we implement a system that uses recipe-ingredient data from Kaggle's What's Cooking challenge to predict the cuisine of a given recipe
- We implement and evaluate techniques from the following domains:
- Simple (baseline) heuristics
- Baseline #1: For each ingredient in the given test ingredient list, find the cuisine in which that ingredient is used the most; among all such cuisines, pick the most frequently occurring one (sketched in code after this list)
- Baseline #2: Rank the training recipes by the number of ingredients they share with the test ingredient list, assign each recipe a weight based on its rank, and add that weight to the score of the recipe's cuisine; finally, choose the cuisine with the highest score (also sketched after this list)
- Machine Learning (trained on one-hot ingredient vectors; a featurization sketch appears after this list)
- Neural Network
- Support Vector Machine (SVM)
- Machine Learning + Network
- node2vec embeddings of the unipartite projection network (where each node is an ingredient) of the recipe-ingredient bipartite network are first obtained (this pipeline is sketched after this list)
- A Neural Network, an SVM, and a GRU are then trained with these embeddings as features
- Network-based (clustering) heuristics
- Cluster the ingredients with K-means on the node2vec embeddings of the ingredient-ingredient network, setting K = the number of cuisines. We analyze whether the generated clusters correspond one-to-one with the different cuisines; if so, these clusters yield a prediction heuristic that picks the cuisine whose cluster contains the most test ingredients (i.e., the cluster holding the most nodes out of the given set of nodes). A sketch appears after this list
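To make Baseline #1 concrete, here is a minimal Python sketch. It assumes recipes are dicts with `cuisine` and `ingredients` keys, as in Kaggle's JSON format, and is not a line-for-line copy of the notebook:

```python
from collections import Counter

def baseline1_predict(test_ingredients, train_recipes):
    """Baseline #1: each test ingredient votes for the cuisine
    that uses it the most; the most common vote wins."""
    # Count how often each ingredient appears in each cuisine.
    ingredient_cuisines = {}
    for recipe in train_recipes:
        for ing in recipe["ingredients"]:
            ingredient_cuisines.setdefault(ing, Counter())[recipe["cuisine"]] += 1

    # For each test ingredient, take the cuisine in which it is used the most...
    votes = Counter()
    for ing in test_ingredients:
        if ing in ingredient_cuisines:
            votes[ingredient_cuisines[ing].most_common(1)[0][0]] += 1

    # ...and pick the most frequently occurring cuisine among those.
    return votes.most_common(1)[0][0] if votes else None
```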
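Baseline #2 can be sketched in the same way. The rank-to-weight mapping below (`1/rank`) is an illustrative assumption; see the notebook for the exact weighting used:

```python
from collections import defaultdict

def baseline2_predict(test_ingredients, train_recipes):
    """Baseline #2: rank training recipes by ingredient overlap with
    the test list, weight them by rank, and score their cuisines."""
    test_set = set(test_ingredients)

    # Rank recipes by the number of ingredients shared with the test list.
    ranked = sorted(
        train_recipes,
        key=lambda r: len(test_set & set(r["ingredients"])),
        reverse=True,
    )

    # Add a rank-based weight to each recipe's cuisine; 1/rank is an
    # illustrative choice, not necessarily the notebook's weighting.
    scores = defaultdict(float)
    for rank, recipe in enumerate(ranked, start=1):
        if test_set & set(recipe["ingredients"]):
            scores[recipe["cuisine"]] += 1.0 / rank

    # Choose the cuisine with the highest score.
    return max(scores, key=scores.get) if scores else None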
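For the machine-learning models, each recipe is featurized as a one-hot (binary presence) vector over the ingredient vocabulary, as the "1-hot" rows in the results table indicate. Below is a minimal scikit-learn sketch of the SVM variant; the choice of `LinearSVC` and its default hyperparameters are assumptions, not necessarily the notebook's:

```python
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

def train_onehot_svm(train_recipes):
    """Encode each recipe as a binary ingredient-presence vector
    and fit a linear SVM on top of it."""
    mlb = MultiLabelBinarizer()
    X = mlb.fit_transform([r["ingredients"] for r in train_recipes])
    y = [r["cuisine"] for r in train_recipes]
    clf = LinearSVC().fit(X, y)
    return mlb, clf

# Usage (ingredients unseen at training time are ignored by transform):
# mlb, clf = train_onehot_svm(train_recipes)
# pred = clf.predict(mlb.transform([["soy sauce", "ginger", "scallion"]]))
```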
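The embedding pipeline can be sketched as follows. This sketch uses networkx's bipartite projection and the `node2vec` PyPI package for brevity (the project itself references Grover's implementation; all hyperparameters here are illustrative):

```python
import networkx as nx
from networkx.algorithms import bipartite
from node2vec import Node2Vec  # pip install node2vec

def ingredient_embeddings(train_recipes, dimensions=64):
    """Build the recipe-ingredient bipartite graph, project it onto
    ingredients, and learn node2vec embeddings on the projection."""
    B = nx.Graph()
    for i, recipe in enumerate(train_recipes):
        B.add_node(("recipe", i), bipartite=0)
        for ing in recipe["ingredients"]:
            B.add_node(ing, bipartite=1)
            B.add_edge(("recipe", i), ing)

    # Unipartite projection: two ingredients are linked iff they co-occur
    # in at least one recipe (edge weight = co-occurrence count).
    ingredients = {n for n, d in B.nodes(data=True) if d["bipartite"] == 1}
    G = bipartite.weighted_projected_graph(B, ingredients)

    # Walk/window hyperparameters here are illustrative only.
    model = Node2Vec(G, dimensions=dimensions, walk_length=30,
                     num_walks=10, workers=4).fit(window=5, min_count=1)
    return {ing: model.wv[ing] for ing in ingredients}
```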
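Finally, a sketch of the clustering heuristic. Mapping each cluster to its majority cuisine over the training data is an assumption made for this illustration (it presumes the hoped-for cluster-to-cuisine correspondence); the notebook's exact mapping may differ:

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def fit_ingredient_clusters(embeddings, n_cuisines):
    """K-means over the ingredient embeddings, with K = number of cuisines."""
    names = list(embeddings)
    km = KMeans(n_clusters=n_cuisines, n_init=10, random_state=0)
    labels = km.fit_predict(np.array([embeddings[n] for n in names]))
    return dict(zip(names, labels))

def label_clusters(ingredient_cluster, train_recipes):
    """Map each cluster to its majority cuisine over the training data
    (an assumption for this sketch)."""
    per_cluster = {}
    for recipe in train_recipes:
        for ing in recipe["ingredients"]:
            if ing in ingredient_cluster:
                c = ingredient_cluster[ing]
                per_cluster.setdefault(c, Counter())[recipe["cuisine"]] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in per_cluster.items()}

def cluster_predict(test_ingredients, ingredient_cluster, cluster_cuisine):
    """Pick the cluster containing the most test ingredients; return its cuisine."""
    votes = Counter(ingredient_cluster[i] for i in test_ingredients
                    if i in ingredient_cluster)
    return cluster_cuisine[votes.most_common(1)[0][0]] if votes else None
```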
- Download the recipe-ingredient data from Kaggle's challenge page (go to the Data tab and click on Download All)
- Unzip and place the `whats-cooking` directory inside the root directory of this project
- The `cuisine-prediction.ipynb` notebook contains the full code along with the results obtained
- If GitHub is unable to render the above notebook in your browser, you can instead download and view its HTML export, `cuisine_prediction.html` (see the `Exports/` directory)
- The `Submissions/` directory contains predictions of various models on Kaggle's test data (these can be submitted directly on Kaggle)
- We split Kaggle's train data into `my_train_split.json` and `my_test_split.json` so that we can do a more sophisticated analysis of the results than accuracy alone; this is necessary since the ground truth of Kaggle's test data is not available (a minimal split sketch appears after this list)
- The `embeddings/` directory contains files that hold the node2vec embeddings of the nodes of the `my_train_split` network
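A minimal way to produce such a split is shown below; the 80/20 ratio, stratification by cuisine, and file paths are assumptions, not necessarily what the notebook does:

```python
import json
from sklearn.model_selection import train_test_split

# Load Kaggle's training data (path assumes the setup step above).
with open("whats-cooking/train.json") as f:
    recipes = json.load(f)

# Stratify by cuisine so both splits preserve the class distribution;
# the 80/20 ratio is an illustrative choice.
train, test = train_test_split(
    recipes, test_size=0.2, random_state=0,
    stratify=[r["cuisine"] for r in recipes],
)

with open("my_train_split.json", "w") as f:
    json.dump(train, f)
with open("my_test_split.json", "w") as f:
    json.dump(test, f)
```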
Note: See the notebook for an in-depth analysis of the results.
| Model | Accuracy (%) | Weighted F1-score | Unweighted F1-score |
|---|---|---|---|
| Baseline #1 | 53.20 | 0.459 | 0.268 |
| Baseline #2 | 40.92 | 0.319 | 0.140 |
| Baseline #2(b) | 52.97 | 0.450 | 0.249 |
| NN (1-hot) | 77.82 | 0.776 | 0.701 |
| SVM (1-hot) | 76.71 | 0.763 | 0.684 |
| NN (embedding) | 72.41 | 0.714 | 0.600 |
| SVM (embedding) | 69.25 | 0.660 | 0.489 |
| GRU (embedding) | 65.11 | 0.619 | 0.441 |
| Clustering heuristic | 40.03 | 0.382 | 0.246 |
- What's Cooking challenge at Kaggle
- node2vec: Scalable Feature Learning for Networks by Aditya Grover and Jure Leskovec
- Reference implementation of node2vec by Aditya Grover