GitHub - YassienTawfikk/Oral-Cancer-Prediction: Machine learning-based tool to predict oral cancer from oral microbiome data, leveraging advanced analytics to aid early diagnosis and prevention.

Oral Cancer Prediction Using Microbiome Data

This project builds a machine learning pipeline to predict oral cancer using microbiome data sourced from The Cancer Microbiome Atlas (TCMA). It includes end-to-end preprocessing, feature selection, model training, evaluation, and explainability using SHAP values.


Overview

  • Goal: Predict oral cancer based on microbial features derived from 16S rRNA and WGS data.
  • Model Used: Random Forest Classifier
  • Explainability: SHAP (SHapley Additive exPlanations)
  • Tools: scikit-learn, pandas, matplotlib, shap, joblib

Data Source

Due to data licensing and privacy considerations, the full TCMA dataset is not included in this repository.

To Reproduce:

Download the required data files from TCMA and place them in the following directory:

data/raw/TCMA/

Install the required Python packages with:

pip install -r requirements.txt

Then, run the preprocessing script as described below.


Preprocessing Pipeline

  • We use only TCMA (not HOMD) due to data inconsistency issues.
  • Merging, cleaning, imputing, scaling, and feature selection are performed.
  • Sequential Feature Selection (SFS) selects the 17 most informative features.
  • Feature 1678.0 is explicitly dropped due to noise.
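
The steps listed above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the code in src/preprocessing.py: the function name select_features, the label column name, median imputation, standard scaling, the forward direction, and the Random Forest used inside the selector are all assumptions. The demo selects 3 features instead of 17 to keep it fast.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler


def select_features(df, label_col="label", n_features=17, drop_cols=("1678.0",)):
    """Hypothetical helper: impute, scale, then run sequential feature selection."""
    # Drop the label and any explicitly excluded feature columns that are present.
    excluded = [c for c in drop_cols if c in df.columns]
    X = df.drop(columns=[label_col, *excluded])
    y = df[label_col]

    # Assumed choices: median imputation followed by standardization.
    X_imputed = SimpleImputer(strategy="median").fit_transform(X)
    X_scaled = StandardScaler().fit_transform(X_imputed)

    # Forward SFS adds one feature at a time, keeping the best cross-validated score.
    sfs = SequentialFeatureSelector(
        RandomForestClassifier(n_estimators=20, random_state=0),
        n_features_to_select=n_features,
        direction="forward",
        cv=3,
    )
    sfs.fit(X_scaled, y)
    return list(X.columns[sfs.get_support()])


# Synthetic stand-in for the merged TCMA table (column names are made up).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(60, 6)), columns=["1678.0", "a", "b", "c", "d", "e"])
demo["label"] = (demo["a"] + demo["b"] > 0).astype(int)
demo.loc[0, "c"] = np.nan  # one missing value to exercise the imputer

selected = select_features(demo, n_features=3)
print(selected)
```

In a real pipeline the imputer and scaler would normally be fitted on the training split only, so that no information from held-out samples leaks into feature selection.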

Run preprocessing (runtime depends on your CPU):

python src/preprocessing.py

This will generate:

  • data/processed/merged_with_labels.csv
  • data/processed/selected_features.txt

Model Training & Evaluation

This project presents a predictive approach to assessing oral cancer likelihood based on microbiome profiles. The model analyzes microbial patterns and provides a probability-based prediction that supports non-invasive diagnostic decision-making.
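
As a rough illustration of what a probability-based prediction looks like with a saved scikit-learn model, the sketch below trains a small Random Forest on synthetic data, round-trips it through joblib, and reads the positive-class probability. The synthetic features and the temporary file location are assumptions; only the file name rf_model.pkl and the use of joblib come from this README.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 100 samples with 17 features (not real microbiome data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 17))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=0)
model.fit(X, y)

# Save and reload the model, as one might with models/rf_model.pkl.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rf_model.pkl")
    joblib.dump(model, path)
    loaded = joblib.load(path)

# Column 1 of predict_proba holds the probability of the positive class (label 1).
proba = loaded.predict_proba(X[:5])[:, 1]
print(proba)
```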

To further enhance the predictive value, especially for anticipating cancer before clinical onset, future iterations could integrate longitudinal data, enabling time-aware modeling and early detection frameworks. Incorporating methods such as survival analysis, Cox regression, or deep learning-based time-to-event modeling would support forecasting the onset or progression of oral cancer more precisely over time.

  • Random Forest is trained with class_weight='balanced'

  • Evaluation is done via:

    • Accuracy: 92.89%
    • AUROC: 0.9714
    • PR-AUC: 0.9588
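
The three metrics above could be computed along these lines. This is a sketch on synthetic, mildly imbalanced data, so its numbers will not match the reported ones. The train/test split, the hyperparameters other than class_weight='balanced', and the use of average precision as the PR-AUC estimate are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with a minority positive class (not real microbiome data).
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 17))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=400) > 0.8).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# class_weight="balanced" reweights classes inversely to their frequency.
clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=42)
clf.fit(X_tr, y_tr)

scores = clf.predict_proba(X_te)[:, 1]  # positive-class probabilities
metrics = {
    "accuracy": accuracy_score(y_te, clf.predict(X_te)),
    "auroc": roc_auc_score(y_te, scores),
    "pr_auc": average_precision_score(y_te, scores),  # assumed PR-AUC estimator
}
print(metrics)
```

Accuracy depends on the 0.5 decision threshold, while AUROC and PR-AUC are computed from the ranking of predicted probabilities, which is why the sketch passes scores rather than hard labels to the latter two.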

Key Visual Outputs:

confusion_matrix, roc_curve, shap_summary_plot (saved as PNG files in outputs/)


Run the full pipeline:

python main.py

Or step through it via notebook:

notebooks/_01_OralCancer_Modeling.ipynb

Project Structure

OralCancerPrediction/
├── data/
│   ├── raw/                  # Place downloaded TCMA files here
│   └── processed/            # Outputs from preprocessing
│       ├── merged_with_labels.csv
│       └── selected_features.txt
├── notebooks/               # Jupyter notebooks
│   ├── _00_brief.ipynb
│   ├── _01_OralCancer_Modeling.ipynb
│   ├── _02_SHAP_Explainability.ipynb
│   └── _03_Deployment_Testing.ipynb
├── models/
│   └── rf_model.pkl          # Trained Random Forest model
├── outputs/                 # Evaluation results
│   ├── confusion_matrix.png
│   ├── roc_curve.png
│   ├── shap_summary_plot.png
│   ├── metrics_summary.json
│   └── metrics_summary.txt
├── src/                     # Source code
│   ├── preprocessing.py
│   ├── modeling.py
│   ├── evaluation.py
│   └── utils.py
├── main.py                  # Main script to run pipeline
├── requirements.txt
└── README.md

SHAP Explainability

  • Global feature importance is visualized using summary_plot
  • Additional force plots explain individual predictions

See: notebooks/_02_SHAP_Explainability.ipynb

