RAPiDock: Pushing the Boundaries of Protein-peptide Docking with Rational and Accurate Diffusion Generative Model
RAPiDock is a diffusion generative model designed for rational, accurate, and rapid protein-peptide docking at an all-atomic level.
RAPiDock relies on external software/libraries to handle protein and peptide dataset files, to compute atom features, and to perform neural network calculations. The following is the list of required libraries and programs, along with the versions on which they were tested.
- Python (3.9).
- Pytorch (1.11.0). Used to build, train, and evaluate the neural networks.
- CUDA Toolkit (11.5.1).
- PyG (2.1.0). Used for implementing neural networks.
- MDAnalysis (2.6.1). Used to handle amino-acid residue data.
- BioPython (1.84). To parse PDB files.
- E3NN (0.5.1). Used to implement E(3) equivariant neural network.
- RDKit (2022.9.5). Used to process and extract atomic features.
- *PyRosetta (2024.10). Used to further optimize and rescore the peptides generated by the model. (Optional)
We can set up the environment using either Anaconda or Docker.
git clone https://github.com/huifengzhao/RAPiDock.git
Pre-trained models: We provide two pre-trained model checkpoints: rapidock_local.pt for local docking and rapidock_global.pt for global docking. Download them and place them in train_models/CGTensorProductEquivariantModel so that the model can run.
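As a quick convenience check (our own snippet, not part of RAPiDock), we can confirm from the repository root that the checkpoints sit where the commands below expect them:

# Convenience check (run from the RAPiDock repository root): confirm that the
# pre-trained checkpoints are in the folder later passed to --model_dir.
from pathlib import Path

ckpt_dir = Path("train_models/CGTensorProductEquivariantModel")
for ckpt in ("rapidock_local.pt", "rapidock_global.pt"):
    path = ckpt_dir / ckpt
    print(path, "found" if path.is_file() else "MISSING")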
We can easily install the environment using the provided rapidock_env.yaml and requirement.txt files. These configuration files are compatible with CUDA 12.4.
conda env create -f rapidock_env.yaml -n RAPiDock
conda activate RAPiDock # activating the constructed environment
pip install -r requirement.txt # conda and pip dependencies are kept separate for a smoother installation
python -c 'import pyrosetta_installer; pyrosetta_installer.install_pyrosetta()' # Installation of PyRosetta (Optional)
If we fail to install the environment via the provided .yaml file, we can also install the environment manually through the following steps:
conda create -n RAPiDock python=3.9 pytorch=1.11.0 cudatoolkit=11.5.1 pyg=2.1.0 pytorch-cluster pytorch-scatter pytorch-sparse MDAnalysis=2.6.1 mkl=2023.1.0 pyyaml -c pytorch -c nvidia -c pyg -c conda-forge
pip install e3nn==0.5.1 rdkit-pypi==2022.9.5 fair-esm==2.0.0 pyrosetta-installer==0.1.0
python -c 'import pyrosetta_installer; pyrosetta_installer.install_pyrosetta()' # Installation of PyRosetta (Optional)
If we are using a higher CUDA version, such as CUDA 12.6 on an H800, the PyTorch build and other packages used in the method above will no longer be compatible. In that case, we should install the environment manually through the following steps:
conda create -n RAPiDock python=3.9 pytorch torchvision torchaudio pytorch-cuda=12.4 MDAnalysis pyg pytorch-cluster pytorch-scatter pytorch-sparse pyyaml -c pytorch -c nvidia -c pyg -c conda-forge
pip install e3nn rdkit-pypi fair-esm pyrosetta-installer
python -c 'import pyrosetta_installer; pyrosetta_installer.install_pyrosetta()' # Installation of PyRosetta (Optional)
If we are not accustomed to using Conda for environment configuration, we also provide a Dockerfile, which is suitable for more general scenarios. We can easily build the environment through the following steps:
docker build --no-cache -t rapidock-image . # build the image
docker run -it --runtime=nvidia rapidock-image /bin/bash # start a container from the built image to run RAPiDock
The environment required by the model is now in place.
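Before moving on, a short sanity check of our own (not part of RAPiDock) can confirm that the key packages import and that PyTorch sees the GPU:

# Sanity check: import the core dependencies, print their versions, and
# confirm that PyTorch can see a CUDA device.
import torch
import torch_geometric
import MDAnalysis
import Bio
import e3nn
import rdkit

for name, module in [("PyTorch", torch), ("PyG", torch_geometric),
                     ("MDAnalysis", MDAnalysis), ("BioPython", Bio),
                     ("E3NN", e3nn), ("RDKit", rdkit)]:
    print(f"{name}: {module.__version__}")
print("CUDA available:", torch.cuda.is_available())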
Now, we can run the code as described below.
We are ready to run inference. To test the model, an example dataset is provided in the data/ directory.
python inference.py --config default_inference_args.yaml --protein_peptide_csv data/protein_peptide_example.csv --output_dir results/default
By default, we will use 5 CPUs for computing. If we want to change the number of CPUs, we can simply run:
python inference.py --config default_inference_args.yaml --protein_peptide_csv data/protein_peptide_example.csv --output_dir results/default --cpu 10
We offer several processed datasets, including two primary test sets for model performance evaluation—RefPepDB-RecentSet and PepSet—and datasets for extensive case analyses, covering four key application scenarios: PLK-PBD, SHP2, Import-a, and pHLA.
Each dataset is organized as follows:
Dataset Name/
    PdbID/
        protein.pdb
        peptide.pdb
        peptide_sequence
    PdbID/
        ...
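Once a dataset has been downloaded and extracted (as walked through below for RefPepDB-RecentSet), this layout can be traversed programmatically. The following is only a sketch; some sets prefix the file names with the PDB ID (e.g. 7aos_peptide_sequence), so files are matched by pattern:

# Sketch: enumerate the complexes of an extracted test set laid out as above.
# File names are matched by pattern because some sets prefix them with the
# PDB ID (e.g. 7aos_protein_pocket.pdb, 7aos_peptide_sequence).
from pathlib import Path

dataset = Path("testdataset/RefPepDB-RecentSet")  # any extracted dataset folder
for complex_dir in sorted(p for p in dataset.iterdir() if p.is_dir()):
    protein_pdbs = sorted(complex_dir.glob("*protein*.pdb"))
    sequence_files = sorted(complex_dir.glob("*peptide_sequence*"))
    if not protein_pdbs or not sequence_files:
        continue  # folder does not follow the expected layout
    peptide_sequence = sequence_files[0].read_text().strip()
    print(complex_dir.name, protein_pdbs[0].name, peptide_sequence)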
Take the RefPepDB-RecentSet dataset as an example:
- Download and extract the dataset
  Download RefPepDB-RecentSet.tar.gz and place it in /path/to/RAPiDock/testdataset/.
  Extract the dataset:
  cd /path/to/RAPiDock/testdataset
  tar -xzvf RefPepDB-RecentSet.tar.gz
  Navigate to the dataset directory; we will see 523 folders, each named with the PDB ID of one of the protein-peptide complexes that make up RefPepDB-RecentSet.
- Run docking for a single example (e.g., 7aos)
  Return to the /path/to/RAPiDock/ root directory and execute:
  python inference.py \
      --complex_name 7aos \
      --protein_description testdataset/RefPepDB-RecentSet/7aos/7aos_protein_pocket.pdb \
      --peptide_description $(cat testdataset/RefPepDB-RecentSet/7aos/7aos_peptide_sequence) \
      --output_dir results/RefPepDB-RecentSet \
      --N 10 \
      --model_dir train_models/CGTensorProductEquivariantModel \
      --ckpt rapidock_local.pt \
      --scoring_function ref2015 \
      --batch_size 4 \
      --no_final_step_noise \
      --inference_steps 16 \
      --actual_steps 16 \
      --conformation_partial 1:1:1 \
      --cpu 10
- Visualize results
  Docking results will be saved in the directory specified by --output_dir.
  Predicted protein-peptide binding affinities are stored in ref2015_score.csv within the output_dir.
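To post-process the results programmatically, the score table can be read with the standard csv module. This is only a sketch: the exact location and column layout of ref2015_score.csv should be checked against the actual output.

# Sketch: peek at the ref2015 score table written to the output directory.
# The path and the column layout below are assumptions; adjust to the actual file.
import csv

with open("results/RefPepDB-RecentSet/ref2015_score.csv", newline="") as fh:
    rows = list(csv.reader(fh))

header, body = rows[0], rows[1:]
print("columns:", header)
for row in body[:5]:  # first few entries; sort on the score column as needed
    print(dict(zip(header, row)))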
We support multiple input formats, depending on the specific task.
- Local docking. When we perform local docking, the binding pocket and/or the protein structure is known, so we need to prepare:
  - Protein: the .pdb file of the protein or of its pocket (for better accuracy).
    💡 The pocket structure of a protein can be generated using pocket_trunction.py.
  - Peptide: the .pdb file or the sequence of the peptide.
    📖 A peptide .pdb file is only used for sequence extraction (see the sketch after this list); no 3D information of the peptide is kept.
  For example:
  python inference.py [--other options] --protein_description protein.pdb --peptide_description peptide.pdb
  or just using the sequence information:
  python inference.py [--other options] --protein_description protein.pdb --peptide_description HKILHRLLQDS
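Because a peptide .pdb passed to --peptide_description is only used for sequence extraction, the same one-letter sequence can be obtained up front with BioPython. This is a sketch that assumes a single chain of canonical residues; non-canonical residues would need the bracketed codes described later in this README.

# Sketch: extract a peptide's one-letter sequence from a .pdb file with BioPython.
# Assumes one peptide chain of canonical residues; non-canonical residues would
# need the bracketed three-letter codes described in the supported-residues section.
from Bio.PDB import PDBParser
from Bio.SeqUtils import seq1

structure = PDBParser(QUIET=True).get_structure("peptide", "peptide.pdb")
chain = next(structure[0].get_chains())  # assume a single peptide chain
sequence = "".join(seq1(res.get_resname()) for res in chain if res.id[0] == " ")
print(sequence)  # e.g. HKILHRLLQDS, usable directly as --peptide_description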
- Global docking. When we perform global docking, the binding pocket is unknown, and we may not even know the protein structure, so we need to prepare:
  - Protein: the .pdb file of the protein or the sequence of the protein (the default ESMFold method is used to fold the protein; a pre-folding sketch follows this list).
    💡 We can also generate the protein structure in other ways, such as AlphaFold or homology modeling.
  - Peptide: the .pdb file or the sequence of the peptide.
  For example:
  python inference.py [--other options] --protein_description ...SLAPYASLTEIEHLVQSVCKSYRETCQLRLEDLLRQRSNIFSREEVTGYQ... --peptide_description HKILHRLLQDS
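When a sequence is given, inference.py folds the protein with ESMFold internally. If we prefer to pre-fold the protein ourselves and pass a .pdb file instead, a minimal sketch with the fair-esm package looks like this (it assumes the optional ESMFold extras of fair-esm are installed, which the environment files above do not cover, and that a GPU is available):

# Sketch: pre-fold a protein sequence with ESMFold via fair-esm and save a .pdb
# that can then be passed to --protein_description. Requires the ESMFold extras
# of fair-esm (not installed by the environment files above) and a GPU.
import torch
import esm

model = esm.pretrained.esmfold_v1().eval().cuda()
sequence = "SLAPYASLTEIEHLVQSVCKSYRETCQLRLEDLLRQRSNIFSREEVTGYQ"  # fragment from the example above
with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)
with open("protein_esmfold.pdb", "w") as fh:
    fh.write(pdb_string)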
- Virtual screening. When we perform virtual screening, we usually know the binding pocket and/or the protein structure and have multiple peptide sequences, so we provide a multi-task submission mode:
  - Protein and peptide description: a .csv file containing multiple protein-peptide pairs (it can also be written programmatically, as sketched after this list). The .csv file looks like:

    | complex_name | protein_description | peptide_description |
    |---|---|---|
    | complex1 | /path/to/protein.pdb | HKILHRLLQDS |
    | complex2 | /path/to/protein.pdb | EKHKILHRLLQDS |
    | ... | ... | ... |
    | complexN | /path/to/protein.pdb | LSGFMELCQ |

  Then, we can simply launch the model with the following command:
  python inference.py [--other options] --protein_peptide_csv /path/to/virtual_screening.csv
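The screening .csv can also be written programmatically; a minimal sketch with the standard csv module (the protein path and peptide list are placeholders):

# Sketch: write a virtual-screening .csv with the three columns expected by
# --protein_peptide_csv. The protein path and peptide sequences are placeholders.
import csv

protein = "/path/to/protein.pdb"
peptides = ["HKILHRLLQDS", "EKHKILHRLLQDS", "LSGFMELCQ"]

with open("virtual_screening.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["complex_name", "protein_description", "peptide_description"])
    for i, peptide in enumerate(peptides, start=1):
        writer.writerow([f"complex{i}", protein, peptide])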
- Multiple customized tasks. We can also perform multiple customized tasks in one shot using a .csv file.
  - Protein and peptide description: a .csv file containing multiple protein-peptide pairs. The .csv file looks like:

    | complex_name | protein_description | peptide_description |
    |---|---|---|
    | complex1 | /path/to/protein.pdb | AAAARLLQDS |
    | complex2 | ...SLAPYASLTEIEHLVQSVCKSYRETCQLRLEDLLRQRSNIFSREEVTGYQ... | /path/to/peptide.pdb |
    | ... | ... | ... |
    | complexN | /path/to/protein.pdb | /path/to/peptide.pdb |

  Then, we can simply launch the model with the following command:
  python inference.py [--other options] --protein_peptide_csv /path/to/customized_tasks.csv
In the current version of RAPiDock, we support 92 residue types for protein-peptide binding pattern prediction. The supported residues are listed below.
For all 92 residue types, we define a special input format: canonical amino acids are written as one-letter codes, and non-canonical residues are written as three-letter codes enclosed in square brackets:
G | A | V | I | L | M | F | Y | W | P |
---|---|---|---|---|---|---|---|---|---|
S | T | N | Q | D | E | C | R | H | K |
[HYP] | [SEP] | [TYS] | [ALY] | [TPO] | [PTR] | [DAL] | [MLE] | [M3L] | [DLE] |
[DLY] | [AIB] | [MSE] | [DPR] | [MVA] | [NLE] | [MLY] | [SAR] | [ABA] | [FME] |
[DAR] | [ORN] | [CGU] | [DPN] | [DTY] | [DTR] | [4BF] | [DGL] | [DCY] | [MK8] |
[MP8] | [GHP] | [ALC] | [BMT] | [MLZ] | [DVA] | [3FG] | [DAS] | [7ID] | [DSN] |
[AR7] | [MEA] | [PHI] | [MAA] | [LPD] | [KCR] | [PCA] | [DGN] | [2MR] | [DHI] |
[ASA] | [MLU] | [YCP] | [DSG] | [DTH] | [OMY] | [FP9] | [DPP] | [HCS] | [SET] |
[DBB] | [BTK] | [DAM] | [IIL] | [3MY] | [SLL] | [PFF] | [HRG] | [DIL] | [DNE] |
[MED] | [D0C] |
Then, we can simply launch the model to predict peptides containing non-canonical amino acids with the following command:
python inference.py [--other options] --protein_description protein.pdb --peptide_description HK[HYP]RL[PTR]QDS
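As an illustration of this notation only (not RAPiDock's actual parser), a peptide string in this format can be split into residue tokens with a simple regular expression:

# Illustration of the input notation only (not RAPiDock's parser): split a peptide
# string into canonical one-letter residues and bracketed non-canonical codes.
import re

def tokenize(peptide: str):
    return re.findall(r"\[[A-Z0-9]{3}\]|[A-Z]", peptide)

print(tokenize("HK[HYP]RL[PTR]QDS"))
# ['H', 'K', '[HYP]', 'R', 'L', '[PTR]', 'Q', 'D', 'S']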
For a detailed description of model parameters, run:
python inference.py --help
Then, we will see all the parameters supported by inference.py:
usage: inference.py [-h] [--config CONFIG] [--protein_peptide_csv PROTEIN_PEPTIDE_CSV] [--complex_name COMPLEX_NAME]
[--protein_description PROTEIN_DESCRIPTION] [--peptide_description PEPTIDE_DESCRIPTION] [--output_dir OUTPUT_DIR]
[--save_visualisation] [--N N] [--model_dir MODEL_DIR] [--ckpt CKPT] [--scoring_function SCORING_FUNCTION] [--fastrelax]
[--confidence_model_dir CONFIDENCE_MODEL_DIR] [--confidence_ckpt CONFIDENCE_CKPT] [--batch_size BATCH_SIZE]
[--no_final_step_noise] [--inference_steps INFERENCE_STEPS] [--actual_steps ACTUAL_STEPS]
[--conformation_partial CONFORMATION_PARTIAL] [--conformation_type CONFORMATION_TYPE] [--cpu CPU]
optional arguments:
-h, --help show this help message and exit
--config CONFIG
--protein_peptide_csv PROTEIN_PEPTIDE_CSV
Path to a .csv file specifying the multiple inputs as described in the README. If this is not None, it will be used
instead of the --protein_description and --peptide_description parameters
--complex_name COMPLEX_NAME
Name that the docked complex result will be saved with
--protein_description PROTEIN_DESCRIPTION
Either the path to a protein .pdb file or a sequence of the input protein for ESMFold
--peptide_description PEPTIDE_DESCRIPTION
Either the path to a peptide .pdb file or a sequence of the input peptide
--output_dir OUTPUT_DIR
Directory where the outputs will be written to
--save_visualisation Save a .pdb file with all of the steps of the reverse diffusion
--N N Number of samples to generate
--model_dir MODEL_DIR
Path to folder with trained score model and hyperparameters
--ckpt CKPT Checkpoint to use for the score model
--scoring_function SCORING_FUNCTION
The scoring function to use (confidence/ref2015)
--fastrelax Use FastRelax to optimize the generated peptide. This option is on if --scoring_function is chosen to be ref2015
--confidence_model_dir CONFIDENCE_MODEL_DIR
Path to folder with trained confidence model and hyperparameters; this is used if --scoring_function is chosen to be
confidence
--confidence_ckpt CONFIDENCE_CKPT
Checkpoint to use for the confidence model; this is used if --scoring_function is chosen to be confidence
--batch_size BATCH_SIZE
The batch size used in inference process
--no_final_step_noise
Use no noise in the final step of the reverse diffusion
--inference_steps INFERENCE_STEPS
Number of denoising steps
--actual_steps ACTUAL_STEPS
Number of denoising steps that are actually performed
--conformation_partial CONFORMATION_PARTIAL
The ratio of initial peptide conformation types, given as H:E:P.
--conformation_type CONFORMATION_TYPE
The initial type of peptide conformation. H: Helical conformation: φ = -57°, ψ = -47°; E: Extended conformation: φ =
-139°, ψ = 135°; P: Polyproline II conformation: φ = -78°, ψ = 149°, this is ignored if --conformation_partial is not
None
--cpu CPU The number of CPUs used in the inference process
We also provide visualization of the model inference process. We can simply run:
python inference.py --config default_inference_args.yaml --protein_peptide_csv data/protein_peptide_example.csv --output_dir results/default --save_visualisation
Then the reverse-diffusion process of each generated peptide will be saved as rankN_reverseprocess.pdb.
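The saved file can then be inspected step by step with MDAnalysis (already a dependency). The sketch below assumes each reverse-diffusion step is stored as a separate MODEL record, which MDAnalysis exposes as trajectory frames:

# Sketch: iterate over the reverse-diffusion steps in a rankN_reverseprocess.pdb,
# assuming each step is a separate MODEL record (exposed as a trajectory frame).
import MDAnalysis as mda

u = mda.Universe("results/default/rank1_reverseprocess.pdb")  # adjust to the actual path
print(f"{len(u.trajectory)} reverse-diffusion frames, {u.atoms.n_atoms} atoms each")
for ts in u.trajectory:
    print(f"frame {ts.frame}: center of geometry = {u.atoms.center_of_geometry()}")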
We hope you enjoy using RAPiDock.