IMVRL-GCN: Multi-View Representation Learning for Identification of Novel Cancer Genes and Their Causative Biological Mechanisms
Tumorigenesis arises from the dysfunction of cancer genes, leading to uncontrolled cell proliferation through various mechanisms. Establishing a complete cancer gene catalogue will make precision oncology possible. Although existing methods based on Graph Neural Networks (GNN) are effective in identifying cancer genes, they fall short in integrating data from multiple views and interpreting predictive outcomes. To address these shortcomings, an interpretable representation learning framework IMVRL-GCN is proposed to capture both shared and specific representations from multi-view data, offering significant insights for the identification of cancer genes.
This repository contains the source code and datasets for our paper, "Multi-View Representation Learning for Identification of Novel Cancer Genes and Their Causative Biological Mechanisms".
The code requires a PyTorch environment on Linux; the operating system used is CentOS Linux release 7.7.1908. The main Python packages are listed below:
- pytorch 1.13.1
- torch_geometric 2.3.1
- scikit-learn 0.22
- numpy 1.21.6
- pandas 1.1.5
- scipy 1.4.1
# Create a virtual environment and install the requirements
conda create -n [ENVIRONMENT NAME] python==3.7.0
conda activate [ENVIRONMENT NAME]
pip install -r requirements.txt
- ./data/CPDB_datasets.pkl contains the PPI network (as an adjacency matrix for input into the GCN, $n\times n$) extracted from the CPDB database and the feature matrix X ($n\times d$, where $d$ is the size of the feature dimension; here $d=64$).
- ./data/k_sets.pkl contains the five-fold cross-validation setup used to evaluate the performance of our model.
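The exact contents of ./data/k_sets.pkl are not described here, so the following is only a minimal sketch of how five-fold splits over labelled genes could be generated. The function name `stratified_k_fold`, the toy labels, and the choice of stratification (keeping the positive/negative ratio similar in every fold) are assumptions for illustration, not the repository's implementation.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Return k (train_idx, test_idx) pairs with class ratios roughly preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin into the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    splits = []
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        splits.append((train_idx, test_idx))
    return splits


# Toy labels (hypothetical): 1 = known cancer gene, 0 = non-cancer gene.
labels = [1] * 10 + [0] * 40
splits = stratified_k_fold(labels, k=5, seed=42)
for fold, (train_idx, test_idx) in enumerate(splits):
    n_pos = sum(labels[i] for i in test_idx)
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)} test_positives={n_pos}")
```

In practice, `sklearn.model_selection.StratifiedKFold` offers the same kind of split and is already covered by the scikit-learn dependency listed above.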
Run the model from the command line:
python IMVRL-GCN.py
Description of some important functions and classes:
- Function Args() in IMVRL-GCN.py contains hyper-parameters, such as device and epochs. Suitable parameters can be set according to the actual situation.
- Function load_datasets() in IMVRL-GCN.py is used to load the data and the experimental setup for five-fold cross-validation.
- Class Experiment() in IMVRL-GCN.py is used to evaluate the performance of IMVRL-GCN with five-fold cross-validation.
Expected output: The output file is saved in the output directory and contains detailed training and testing results. The evaluation metrics are AUC and AUPR.
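For readers unfamiliar with the two metrics, the sketch below computes them from scratch on a toy example. It is not taken from IMVRL-GCN.py: the function names `roc_auc` and `average_precision` are hypothetical, and AUPR is estimated here as average precision, which is one common estimator of the area under the precision-recall curve.

```python
def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def average_precision(y_true, y_score):
    """Average precision: sum over thresholds of (recall step) * precision."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision needs at least one positive sample")
    ranked = sorted(zip(y_score, y_true), key=lambda t: -t[0])
    tp = 0
    prev_recall = 0.0
    ap = 0.0
    i = 0
    while i < len(ranked):
        # Treat all samples sharing the same score as a single threshold.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            j += 1
        recall = tp / n_pos
        precision = tp / j  # j samples have been predicted positive so far
        ap += (recall - prev_recall) * precision
        prev_recall = recall
        i = j
    return ap


# Toy example (hypothetical scores): labels are 0/1, scores are predicted probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print("AUC :", roc_auc(y_true, y_score))
print("AUPR:", average_precision(y_true, y_score))
```

With the scikit-learn dependency installed, `sklearn.metrics.roc_auc_score` and `sklearn.metrics.average_precision_score` compute the same quantities.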
To run IMVRL-GCN on your own dataset, use ./data/CPDB_datasets.pkl and ./data/k_sets.pkl as templates for preparing your own adjacency matrix, feature matrix, and five-fold cross-validation setup. Then modify the relevant code in the function load_datasets() in IMVRL-GCN.py.
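As a rough illustration of the preparation step, the sketch below builds a symmetric $n\times n$ adjacency matrix from an undirected edge list, pairs it with an $n\times d$ feature matrix, and writes both to a pickle file. The internal layout of ./data/CPDB_datasets.pkl is not documented here, so the dictionary keys, the file name `my_datasets.pkl`, the helper `build_adjacency`, and the toy genes are all assumptions; inspect the provided file and mirror its actual structure before adapting load_datasets().

```python
import os
import pickle
import tempfile


def build_adjacency(genes, edges):
    """Build a symmetric 0/1 adjacency matrix (n x n) from undirected gene pairs."""
    index = {gene: i for i, gene in enumerate(genes)}
    n = len(genes)
    adj = [[0] * n for _ in range(n)]
    for a, b in edges:
        i, j = index[a], index[b]
        adj[i][j] = 1
        adj[j][i] = 1  # PPI edges are undirected
    return adj


# Hypothetical toy inputs; a real PPI network covers thousands of genes.
genes = ["TP53", "MDM2", "BRCA1", "BARD1"]
edges = [("TP53", "MDM2"), ("BRCA1", "BARD1"), ("TP53", "BRCA1")]
d = 3  # feature dimension (the provided dataset uses d = 64)
features = [[0.0] * d for _ in genes]  # placeholder n x d feature matrix

adj = build_adjacency(genes, edges)

# Assumed layout: these key names are illustrative, not the repository's format.
dataset = {"gene_names": genes, "adjacency": adj, "features": features}

path = os.path.join(tempfile.mkdtemp(), "my_datasets.pkl")
with open(path, "wb") as fh:
    pickle.dump(dataset, fh)

# Reload to confirm the file round-trips and the shapes are as intended.
with open(path, "rb") as fh:
    loaded = pickle.load(fh)
print(len(loaded["adjacency"]), "x", len(loaded["adjacency"][0]))
print(len(loaded["features"]), "x", len(loaded["features"][0]))
```

Nested lists keep the sketch dependency-free; for a real network, a NumPy array or SciPy sparse matrix would be the more practical container.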
If you find this repository useful, please cite the following paper:
@article{10.1093/bib/bbae418,
author = {Yang, Jianye and Fu, Haitao and Xue, Feiyang and Li, Menglu and Wu, Yuyang and Yu, Zhanhui and Luo, Haohui and Gong, Jing and Niu, Xiaohui and Zhang, Wen},
title = "{Multiview representation learning for identification of novel cancer genes and their causative biological mechanisms}",
journal = {Briefings in Bioinformatics},
volume = {25},
number = {5},
pages = {bbae418},
year = {2024},
month = {08},
issn = {1477-4054},
doi = {10.1093/bib/bbae418},
url = {https://doi.org/10.1093/bib/bbae418},
}