An easy-to-run tool to classify cancer samples to defined TCGA subtypes using molecular profile data
- Introduction
- Data Availability and Publication
- Quickstart Guide
- Tutorials
- Troubleshooting
- Acknowledgment and Funding
- Maintainers
The TMP toolkit is designed to classify cancer samples to subtypes using molecular data. This tool can provide reliable subtype classification on non-TCGA studies, clinical trials, or other user datasets.
The top-performing models (of the hundreds of thousands of models evaluated) are pre-trained and available within Docker containers for ease of use.
The TMP toolkit is applicable to 26 different cancer cohorts (e.g., breast invasive carcinoma, colon adenocarcinoma) and has been trained on TCGA primary tumor samples to assign any of 106 cancer subtypes to new samples.
Cancer cohorts include:
- ACC, BLCA, BRCA, CESC, COADREAD, ESCC, GEA, HNSC, KIRCKICH, KIRP, LGGGBM, LIHCCHOL, LUAD, LUSC, MESO, OV, PAAD, PCPG, PRAD, SARC, SKCM, TGCT, THCA, THYM, UCEC, UVM
The full cohort names for these abbreviations can be found here. Note that our study combined several of the TCGA primary tumor cohorts in the linked list (for example, COADREAD combines colon and rectum adenocarcinoma).
The data platforms supported are gene expression, DNA methylation, miRNA, copy number, and/or mutation calls.
Visit Cancer Cell for the publication of this work (open access):
Data are freely available for download:
-
Note: The data required to run the tool must be downloaded from the above link.
Install requirements - detailed instructions are found on the Requirements page:
- Install Python 3+
- Install Docker Desktop (or Docker)
- Install Synapse Client
- Install AWS Client
Ensure that the steps on the Requirements page are completed (this includes creating the working environment, signing in, and manually downloading the required data).
Alternatively, Docker images can be built directly. Instructions are found on the Requirements page.
Activate the Python virtual environment:
source venv/bin/activate
User input data must be in tab-separated format (.tsv), with rows labeled by sample and columns labeled by feature (e.g., genes).
Ensure your data matches the above description.
Translate your feature names (e.g., genes) from Entrez IDs to our unique TMP toolkit IDs and transpose the matrix.
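Before converting, it can help to confirm that the file really has the layout described above. The following is a minimal sketch, not part of the TMP toolkit: `check_tsv_layout` is a hypothetical helper, and it assumes the first column holds sample IDs and the header row holds feature names.

```python
import csv
import io


def check_tsv_layout(handle):
    """Return (n_samples, n_features) for a samples-by-features TSV.

    Assumes the first column is the sample ID and the header row
    lists the feature names. Raises ValueError on a malformed file.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if header is None or len(header) < 2:
        raise ValueError("expected a sample-ID column and at least one feature column")
    features = header[1:]
    if len(set(features)) != len(features):
        raise ValueError("duplicate feature names in header")
    n_samples = 0
    for line_no, row in enumerate(reader, start=2):
        # Every sample row must have exactly one value per header field.
        if len(row) != len(header):
            raise ValueError(f"line {line_no}: expected {len(header)} fields, found {len(row)}")
        n_samples += 1
    if n_samples == 0:
        raise ValueError("no sample rows found")
    return n_samples, len(features)


# Tiny in-memory example: two samples, two features named by Entrez ID.
example = "sample\t7157\t672\nS1\t5.1\t2.0\nS2\t4.8\t3.3\n"
print(check_tsv_layout(io.StringIO(example)))
```

To check a real file, open it with `open(path, newline="")` and pass the handle to the function.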
Example: convert gene TP53 to feature N:GEXP::TP53:7157:
python tools/convert.py \
--data <path/to/my-data.tsv> \
--out <path/to/my-updated-data.tsv> \
--cancer <cancer>
If the data contains a metadata column, use the optional argument --delete_i_col n to delete the specified column (where n is an integer with zero-based indexing). If it is not specified, the conversion runs with no column deletions.
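To make the conversion step more concrete, here is a rough sketch of what it does to a tiny matrix. It is not the code in tools/convert.py. The ID pattern `N:GEXP::<symbol>:<Entrez ID>:` is an assumption inferred from the single TP53 example above, and `to_feature_id`, `transpose`, and the `symbols` lookup are hypothetical names used only for illustration.

```python
def to_feature_id(symbol, entrez_id):
    # Assumed pattern, generalized from the one documented example
    # (TP53 -> N:GEXP::TP53:7157:). Other platforms may differ.
    return f"N:GEXP::{symbol}:{entrez_id}:"


def transpose(rows):
    # Swap rows and columns of a rectangular list-of-lists matrix.
    return [list(col) for col in zip(*rows)]


# Hypothetical Entrez-to-symbol lookup; the real tool ships its own mapping.
symbols = {"7157": "TP53"}

# Samples as rows, features (Entrez IDs) as columns.
matrix = [
    ["sample", "7157"],
    ["S1", "5.1"],
    ["S2", "4.8"],
]

# Rename the feature columns, then transpose so features become rows.
header = [matrix[0][0]] + [to_feature_id(symbols[e], e) for e in matrix[0][1:]]
converted = transpose([header] + matrix[1:])

for row in converted:
    print("\t".join(row))
```

In practice, run tools/convert.py as shown above rather than reimplementing the mapping.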
Next, data must be transformed with a quantile rescale prior to running machine learning algorithms.
# Transform - creates transformed-data.tsv
bash tools/run_transform.sh \
<path/to/my-updated-data.tsv> \
<cancer>
python tools/zero_floor.py \
-in user-transformed-data/transformed-data.tsv \
-out user-transformed-data/transformed-data.tsv
The rescaled output file will be written to disk at user-transformed-data/transformed-data.tsv.
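For intuition only, the sketch below shows one generic way to rescale a feature by quantile, mapping each value to its rank-based position in [0, 1]. The exact transform applied by tools/run_transform.sh is not described in this README and may differ; `quantile_rescale` is a hypothetical function, not part of the toolkit.

```python
def quantile_rescale(values):
    """Map each value to its rank-based quantile in [0, 1].

    Tied values share the average of their ranks. This is a generic
    illustration, not the transform used by the TMP toolkit.
    """
    n = len(values)
    if n == 0:
        return []
    if n == 1:
        return [0.0]
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    # Divide by the largest possible rank so results span [0, 1].
    return [rank / (n - 1) for rank in ranks]


print(quantile_rescale([10, 30, 20]))
```

Always use the provided scripts for real data, since the pre-trained models expect the toolkit's own transform.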
Run a single command to predict the molecular subtype of all samples.
bash RUN_model.sh <cancer> <platform> <method> <user-transformed-data/transformed-data.tsv>
Available cancers: ACC, BLCA, BRCA, CESC, COADREAD, ESCC, GEA, HNSC, KIRCKICH, KIRP, LGGGBM, LIHCCHOL, LUAD, LUSC, MESO, OV, PAAD, PCPG, PRAD, SARC, SKCM, TGCT, THCA, THYM, UCEC, UVM
Available platforms: GEXP, METH, MUTA, MIR, CNVR (see model details for OVERALL, TOP, MULTI, All, allcohorts)
Available methods: skgrid, aklimate, cloudforest, jadbio, and subscope
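When scripting many runs, it can be useful to check the arguments against the lists above before launching a container. The following is a minimal sketch under that assumption; `build_command` is a hypothetical helper that only assembles the command line, and it does not verify that a model exists for every cancer, platform, and method combination.

```python
import shlex

CANCERS = {
    "ACC", "BLCA", "BRCA", "CESC", "COADREAD", "ESCC", "GEA", "HNSC",
    "KIRCKICH", "KIRP", "LGGGBM", "LIHCCHOL", "LUAD", "LUSC", "MESO", "OV",
    "PAAD", "PCPG", "PRAD", "SARC", "SKCM", "TGCT", "THCA", "THYM", "UCEC",
    "UVM",
}
PLATFORMS = {"GEXP", "METH", "MUTA", "MIR", "CNVR"}
# Extra platform values listed above; see the model details before using them.
SPECIAL_PLATFORMS = {"OVERALL", "TOP", "MULTI", "All", "allcohorts"}
METHODS = {"skgrid", "aklimate", "cloudforest", "jadbio", "subscope"}


def build_command(cancer, platform, method,
                  data_path="user-transformed-data/transformed-data.tsv"):
    """Return the RUN_model.sh invocation as an argument list."""
    if cancer not in CANCERS:
        raise ValueError(f"unknown cancer cohort: {cancer}")
    if platform not in PLATFORMS | SPECIAL_PLATFORMS:
        raise ValueError(f"unknown platform: {platform}")
    if method not in METHODS:
        raise ValueError(f"unknown method: {method}")
    return ["bash", "RUN_model.sh", cancer, platform, method, data_path]


print(shlex.join(build_command("BRCA", "GEXP", "skgrid")))
```

The returned list can be passed to `subprocess.run(..., check=True)` from the repository root to launch the model.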
Examples:
- Run the SK Grid model that was trained on gene expression for the breast invasive carcinoma cohort
bash RUN_model.sh BRCA GEXP skgrid user-transformed-data/transformed-data.tsv
- Run the AKLIMATE model that was trained on DNA methylation for the pancreatic adenocarcinoma cohort
bash RUN_model.sh PAAD METH aklimate user-transformed-data/transformed-data.tsv
- Run the CloudForest model that was trained on somatic mutations for the colon adenocarcinoma + rectum adenocarcinoma combined cohort
bash RUN_model.sh COADREAD MUTA cloudforest user-transformed-data/transformed-data.tsv
For a guided tutorial of running our models for subtype classification, see the Guided Tutorial page.
To understand the specific parameters and other details of individual containerized models, see the Explore Models page.
To interpret and convert the TMP Toolkit subtype abbreviations, see the Understanding Subtype Abbreviations page.
- Our models use the BRCA_1 abbreviation to denote the luminal A subtype. Learn how to automatically convert TMP Toolkit abbreviations to common names for all our subtypes.
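As a rough illustration of such a conversion, the sketch below uses a plain lookup table. It is not the toolkit's own converter: `SUBTYPE_NAMES` and `common_name` are hypothetical, and the table holds only the one mapping stated above. The full mapping is on the Understanding Subtype Abbreviations page.

```python
# Only the mapping stated in this README; extend from the official table.
SUBTYPE_NAMES = {
    "BRCA_1": "luminal A",
}


def common_name(abbreviation):
    """Return the common subtype name, or the abbreviation if unknown."""
    return SUBTYPE_NAMES.get(abbreviation, abbreviation)


predictions = ["BRCA_1", "BRCA_1"]
print([common_name(p) for p in predictions])
```

Falling back to the original abbreviation keeps unmapped predictions visible instead of silently dropping them.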
See How to Fix Common Issues for common error messages.
We would like to thank the National Cancer Institute for support.
Current maintainers:
- Jordan Tagle (GitHub jordan2lee)
- Kyle Ellrott (GitHub kellrott)
- Brian Karlberg (GitHub briankarlberg)