This repository contains a reproducible set of scripts to analyze biomarker data and distinguish responders from non-responders to natalizumab treatment in multiple sclerosis. The analysis is based on two patient cohorts and aims to identify the most relevant features for predicting treatment response.
The approach uses machine learning techniques, specifically Random Forest classifiers, to evaluate the importance of features and their combinations. We already have a set of features that are known to be important for the analysis, and we will use these features to train our models. The scripts are designed to run sequentially, covering hyperparameter tuning, feature selection, and evaluation of feature combinations.
The analysis is organized into three scripts that should be run in sequence:
-
grid_search_tuning.py
Performs hyperparameter tuning withGridSearchCV
on a RandomForest classifier and saves the best parameters in the fileparameters/selected_grid_search_parameters.tsv
. -
feature_selection.py
Selects the top 20 most important features using the best model parameters and saves them toresults/new_features_selected.txt
. -
machine_learning.py
Loads the selected features, tests all possible combinations, evaluates them on multiple metrics, and outputs a full report.
data/
— Contains input cohort files (first_cohort.tsv
,second_cohort.tsv
).parameters/
— Stores selected hyperparameters (selected_grid_search_parameters.tsv
).results/
— Saves feature importance, selected features, and combination results.
All dependencies are listed in environment.yml
.
To set up the environment:
conda env create -f environment.yml
conda activate natalizumab_ml
- Run grid search tuning:
python 01_grid_search_tuning.py
- Run feature selection:
python 02_feature_selection.py
- Run machine learning combination evaluation:
python 03_machine_learning.py
parameters/selected_grid_search_parameters.tsv
— Best hyperparameters.results/feature_importances.tsv
— Feature importance scores.results/new_features_selected.txt
— Top N selected features.results/report_combination_metrics.tsv
— Performance metrics for all feature combinations.
- You can switch datasets (
first_cohort.tsv
orsecond_cohort.tsv
) by editing the dataset loading line in each script. - Make sure
selected_features.txt
inresults/
matches the feature file you want to test (you can renamenew_features_selected.txt
if needed).
For questions or collaboration, please contact the repository owner.