The PLAS-HES-5k dataset is a curated collection of 5,000 protein-ligand complexes (PLCs) designed specifically for machine learning applications in drug discovery. This comprehensive dataset includes both bound and unbound conformations, annotated with binding free energies derived from non-equilibrium molecular dynamics simulations and approximate free energy calculations. The structural and energetic diversity represented in PLAS-HES-5k makes it an ideal benchmark and training resource for predictive and generative ML/DL models aimed at improving drug-target binding predictions.
The dataset contains:
- 5,000 protein-ligand complexes in two conformational variants:
  - PLAS: Bound conformations starting from the PDB database
  - HES: Synthetic unbound conformations derived from PLAS-20K, including both low- and high-energy states
- For each complex:
  - Atomic coordinates
  - Binding free energy data
  - Parameter files for simulation
Each protein-ligand complex in the dataset includes:
- Bound conformations from PLAS-20K
  - Complete atomic coordinates in PDB format
  - Parameter files (.prmtop)
  - Binding affinity calculations
- Synthetic unbound conformations seeded from PLAS-20K
  - Both low and high energy conformational states
  - Complete atomic coordinates in PDB format
  - Parameter files (.prmtop)
  - Binding affinity calculations
The data are distributed in the following file formats:
- `.tar.gz` archives containing collections of PLCs
- `.txt` files listing which PLCs are present in each archive
- `.pdb` files containing structural information
- `.prmtop` files containing molecular dynamics parameters
- `.csv` files with binding affinity data
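As a rough illustration of working with these file types, the sketch below unpacks one `.tar.gz` archive and groups the extracted files by extension. It uses only the Python standard library. The function names `extract_archive` and `index_by_extension` are hypothetical helpers, and the archive name in the usage comment is a placeholder; the internal directory layout of the real archives is not described here, so the code makes no assumption about it beyond searching recursively.

```python
import tarfile
from collections import defaultdict
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unpack a gzip-compressed tar archive into dest_dir and return that directory."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Use the safer "data" extraction filter when this Python version supports it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest, **kwargs)
    return dest


def index_by_extension(root):
    """Map each lowercase file extension (e.g. '.pdb') to a sorted list of paths under root."""
    index = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            index[path.suffix.lower()].append(path)
    return dict(index)


# Example usage (the archive name is a placeholder, not a real file name):
#   out_dir = extract_archive("some_archive.tar.gz", "plas_hes_data")
#   files = index_by_extension(out_dir)
#   print(len(files.get(".pdb", [])), "structure files found")
```

Grouping by extension keeps the sketch independent of how complexes are arranged inside each archive; once the actual layout is known, the index can be regrouped per PDB ID.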
The PLAS-HES-5k dataset is designed for:
- Training ML/DL models for predicting protein-ligand binding affinities
- Evaluating generative models in drug design
- Research on conformational sampling and binding dynamics
- Benchmarking novel ML/DL architectures for drug discovery applications
- Environment Setup
  - Environment configuration files for simulations are located in the `Env` folder
  - A YAML file is provided for setting up the conda environment with Plumed2 activated
- Steered Molecular Dynamics
  - Simulation scripts are available in the `Steered_Molecular_Dynamics` folder
  - Submit jobs using the command: `sh Complete_simulation_setup_hs.sh "Index_Number" "PDB_ID" "cpu-number" "partition"`
- Trajectory Validation
  - The `Trajectory_validation` folder contains scripts to validate trajectories through:
    - Sigmoidal curve fitting
    - RMSD analysis of protein and ligand
    - Center-of-mass distance separation measurements
- Dataset Access
  - File structures are generated for each PDB ID
  - The complete dataset is publicly available on the India-Data website: https://india-data.org/dataset-details/ef3a1c5b-6ff2-49f7-ae7a-a99f69003849
  - Extract the `.tar.gz` archives to access individual PLC data
- Energy Component Analysis
  - The `Distribution_Of_Energy_Components` folder contains scripts for:
    - Reproducing energy component distributions across all PLCs
    - Analyzing energy components for individual PLCs
- Training Machine Learning Models
  - The dataset is suitable for various ML/DL approaches:
    - Graph Neural Networks
    - 3D Convolutional Networks
    - Equivariant Neural Networks
    - Attention-based models
- Benchmarking
  - Use the binding affinity data to evaluate model performance
  - Compare model predictions across both PLAS (bound) and HES (unbound) conformations
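One way to carry out the benchmarking comparison is to compute error metrics separately for each conformation type. The sketch below is a minimal, standard-library example and not the dataset authors' evaluation protocol: the function names, the `(conformation, reference, predicted)` record layout, and the choice of RMSE and Pearson correlation as metrics are all assumptions made for illustration. In practice the reference values would come from the dataset's `.csv` files, whose column names are not specified here.

```python
import math
from collections import defaultdict


def rmse(reference, predicted):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(x, y):
    """Pearson correlation coefficient (undefined if either input is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def evaluate_by_conformation(records):
    """Group (conformation, reference, predicted) records and score each group.

    `conformation` is a label such as "PLAS" or "HES" (assumed record layout).
    """
    grouped = defaultdict(lambda: ([], []))
    for conformation, reference, predicted in records:
        grouped[conformation][0].append(reference)
        grouped[conformation][1].append(predicted)
    return {
        label: {
            "n": len(refs),
            "rmse": rmse(refs, preds),
            "pearson_r": pearson_r(refs, preds),
        }
        for label, (refs, preds) in grouped.items()
    }


# Toy, made-up numbers purely to show the call shape (not real dataset values).
toy_records = [
    ("PLAS", -10.0, -10.0), ("PLAS", -8.0, -8.0), ("PLAS", -6.0, -6.0),
    ("HES", -3.0, -2.0), ("HES", -2.0, -1.0), ("HES", -1.0, 0.0),
]
toy_report = evaluate_by_conformation(toy_records)
```

Reporting the two groups side by side makes it visible when a model that performs well on bound (PLAS) structures degrades on unbound (HES) ones.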
If you use this dataset in your research, please cite: [Citation information to be provided by dataset creators after the publication of the dataset]
If you have any questions about the dataset, contact Prathit Chatterjee (prathit.chatterjee@ihub-data.iiit.ac.in).
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
Version: 1.0.0