Oral Cancer Prediction Using Microbiome Data

This project builds a machine learning pipeline to predict oral cancer using microbiome data sourced from The Cancer Microbiome Atlas (TCMA). It includes end-to-end preprocessing, feature selection, model training, evaluation, and explainability using SHAP values.

Overview

Goal: Predict oral cancer based on microbial features derived from 16S rRNA and WGS data.
Model Used: Random Forest Classifier
Explainability: SHAP (SHapley Additive exPlanations)
Tools: scikit-learn, pandas, matplotlib, shap, joblib

Data Source

Due to data licensing and privacy considerations, the full TCMA dataset is not included in this repository.

To Reproduce:

Please download the following data files from TCMA:

bacteria.WGS.solid.case.clr.txt
metadata.WGS.solid.case.txt

Place them in the following directory:

data/raw/TCMA/

Install the required Python packages:

pip install -r requirements.txt

Then, run the preprocessing script as described below.
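
Before running preprocessing, it can help to confirm that both files are where the script expects them. The following is a minimal sketch; the helper name `missing_tcma_files` is hypothetical and not part of this repository.

```python
from pathlib import Path

# File names and directory are taken from the instructions above.
RAW_DIR = Path("data/raw/TCMA")
REQUIRED_FILES = (
    "bacteria.WGS.solid.case.clr.txt",
    "metadata.WGS.solid.case.txt",
)


def missing_tcma_files(raw_dir=RAW_DIR):
    """Return the required TCMA file names that are absent from raw_dir."""
    raw_dir = Path(raw_dir)
    return [name for name in REQUIRED_FILES if not (raw_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_tcma_files()
    if missing:
        print("Missing from", RAW_DIR, ":", ", ".join(missing))
    else:
        print("All required TCMA files found.")
```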

Preprocessing Pipeline

Only TCMA data is used; HOMD was excluded because of data inconsistency issues.
The script merges, cleans, imputes, and scales the data, then performs feature selection.
Sequential Feature Selection (SFS) selects the 17 most informative features.
Feature 1678.0 is explicitly dropped because it is noisy.
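
The imputation, scaling, and SFS steps above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the contents of src/preprocessing.py: the median imputation, standard scaling, forward direction, 3-fold cross-validation, the small Random Forest used as the SFS estimator, and the function name `select_features` are all assumptions. The column name "1678.0" is assumed to be stored as a string.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler


def select_features(X, y, n_features=17, drop=("1678.0",), random_state=0):
    """Impute, scale, and run forward SFS; return the chosen column names."""
    # Drop the explicitly excluded feature if it is present.
    X = X.drop(columns=[c for c in drop if c in X.columns])
    # Assumption: median imputation followed by standardization.
    values = SimpleImputer(strategy="median").fit_transform(X)
    values = StandardScaler().fit_transform(values)
    # Assumption: a small Random Forest scores each candidate subset.
    estimator = RandomForestClassifier(n_estimators=20, random_state=random_state)
    sfs = SequentialFeatureSelector(
        estimator, n_features_to_select=n_features, direction="forward", cv=3
    )
    sfs.fit(values, y)
    return list(X.columns[sfs.get_support()])


if __name__ == "__main__":
    # Synthetic stand-in data, only to show the call shape.
    rng = np.random.default_rng(0)
    names = [f"taxon_{i}" for i in range(5)] + ["1678.0"]
    demo = pd.DataFrame(rng.normal(size=(60, 6)), columns=names)
    demo.iloc[0, 0] = np.nan
    labels = (demo["taxon_1"] > 0).astype(int).to_numpy()
    print(select_features(demo, labels, n_features=2))
```

Forward SFS refits the estimator once per remaining candidate feature at every step, which is why the preprocessing runtime grows quickly with the number of input features.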

Run preprocessing (runtime depends on your CPU):

python src/preprocessing.py

This will generate:

data/processed/merged_with_labels.csv
data/processed/selected_features.txt
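
One possible way to load these two outputs for modeling is sketched below. The file layouts are not documented here, so this assumes one feature name per line in selected_features.txt and a label column called `label` in the CSV; both are hypothetical, as is the helper name `load_processed`.

```python
import pandas as pd


def load_processed(csv_path="data/processed/merged_with_labels.csv",
                   features_path="data/processed/selected_features.txt",
                   label_col="label"):
    """Return (X, y) restricted to the selected features.

    Assumptions: one feature name per line in the text file, and a label
    column called `label_col`; both are hypothetical.
    """
    with open(features_path, encoding="utf-8") as fh:
        features = [line.strip() for line in fh if line.strip()]
    df = pd.read_csv(csv_path)
    # Fail early if the feature list and the CSV disagree.
    missing = [f for f in features if f not in df.columns]
    if missing:
        raise KeyError(f"Selected features not found in CSV: {missing}")
    return df[features], df[label_col]
```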

Model Training & Evaluation

This project presents a predictive approach to assessing oral cancer likelihood based on microbiome profiles. The model analyzes microbial patterns and provides a probability-based prediction that supports non-invasive diagnostic decision-making.

To improve predictive value, especially for anticipating cancer before clinical onset, future iterations could integrate longitudinal data, enabling time-aware modeling and early detection. Methods such as survival analysis, Cox regression, or deep learning-based time-to-event modeling would support forecasting the onset or progression of oral cancer over time.

The Random Forest is trained with class_weight='balanced', which weights classes inversely to their frequency.
Evaluation metrics:
- Accuracy: 92.89%
- AUROC: 0.9714
- PR-AUC: 0.9588
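
The training and evaluation step can be sketched as follows. Only the Random Forest and class_weight='balanced' come from this README; the 80/20 stratified split, the use of `average_precision_score` as the PR-AUC estimate, and the function name `train_and_evaluate` are assumptions, and src/modeling.py may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split


def train_and_evaluate(X, y, test_size=0.2, random_state=42):
    """Fit a class-weighted Random Forest and return it with hold-out metrics."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state
    )
    model = RandomForestClassifier(class_weight="balanced", random_state=random_state)
    model.fit(X_train, y_train)
    # Probability of the positive class drives the ranking metrics.
    proba = model.predict_proba(X_test)[:, 1]
    metrics = {
        "accuracy": accuracy_score(y_test, model.predict(X_test)),
        "auroc": roc_auc_score(y_test, proba),
        # Assumption: average precision stands in for PR-AUC.
        "pr_auc": average_precision_score(y_test, proba),
    }
    return model, metrics


if __name__ == "__main__":
    # Synthetic stand-in data; results here say nothing about the real model.
    X_demo, y_demo = make_classification(
        n_samples=300, n_features=17, weights=[0.7, 0.3], random_state=0
    )
    _, demo_metrics = train_and_evaluate(X_demo, y_demo)
    print(demo_metrics)
```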

Key Visual Outputs (saved under outputs/):

confusion_matrix.png
roc_curve.png
shap_summary_plot.png

Run the full pipeline:

python main.py

Or step through it via notebook:

notebooks/_01_OralCancer_Modeling.ipynb

Project Structure

OralCancerPrediction/
├── data/
│   ├── raw/                  # Place downloaded TCMA files here
│   └── processed/            # Outputs from preprocessing
│       ├── merged_with_labels.csv
│       └── selected_features.txt
├── notebooks/               # Jupyter notebooks
│   ├── _00_brief.ipynb
│   ├── _01_OralCancer_Modeling.ipynb
│   ├── _02_SHAP_Explainability.ipynb
│   └── _03_Deployment_Testing.ipynb
├── models/
│   └── rf_model.pkl          # Trained Random Forest model
├── outputs/                 # Evaluation results
│   ├── confusion_matrix.png
│   ├── roc_curve.png
│   ├── shap_summary_plot.png
│   ├── metrics_summary.json
│   └── metrics_summary.txt
├── src/                     # Source code
│   ├── preprocessing.py
│   ├── modeling.py
│   ├── evaluation.py
│   └── utils.py
├── main.py                  # Main script to run pipeline
├── requirements.txt
└── README.md

SHAP Explainability

Global feature importance is visualized with shap.summary_plot.
Additional force plots explain individual predictions.

See: notebooks/_02_SHAP_Explainability.ipynb
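
A minimal sketch of the global summary plot is shown below; the notebook may do this differently. The helper names are hypothetical, and the code assumes the cancer class sits at index 1 of the model's classes. The small normalization helper exists because the shape returned by `TreeExplainer.shap_values` for classifiers differs between shap versions.

```python
import numpy as np


def positive_class_shap(shap_values):
    """Reduce SHAP output for a binary classifier to a 2-D (samples, features) array.

    Depending on the shap version, TreeExplainer.shap_values returns either a
    list with one array per class or one 3-D array (samples, features, classes).
    """
    if isinstance(shap_values, list):
        return np.asarray(shap_values[1])
    values = np.asarray(shap_values)
    if values.ndim == 3:
        return values[:, :, 1]
    return values


def plot_global_importance(model, X, out_path="outputs/shap_summary_plot.png"):
    """Save a SHAP summary plot for a fitted tree model."""
    import matplotlib
    matplotlib.use("Agg")  # render without a display
    import matplotlib.pyplot as plt
    import shap

    explainer = shap.TreeExplainer(model)
    values = positive_class_shap(explainer.shap_values(X))
    # show=False keeps the figure open so it can be saved to disk.
    shap.summary_plot(values, X, show=False)
    plt.savefig(out_path, bbox_inches="tight")
    plt.close()
```

Calling `plot_global_importance(model, X_test)` with a fitted Random Forest and a feature DataFrame would write the figure to the given path, provided the output directory already exists.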

Contributors

Yassien Tawfik

Madonna Mosaad

Mazen Marwan
