
Yagr49/Photocatalyst_NN

Classical machine learning approach for predicting the photodynamic properties of photoredox catalysts

Photocatalysis is a rapidly developing area of chemistry in which the creation of new substances capable of converting the energy of visible light into reaction energy is a fundamental task.

Classical approaches based on human intuition are often quite complex, and quantum mechanical calculations of the photodynamic properties of molecules are cumbersome and time-consuming. Machine learning approaches reduce the time cost, but they have so far been limited to organic compounds. This repository provides Python notebooks and fine-tuned models that predict photodynamic properties such as absorption wavelength, emission wavelength, and fluorescence quantum yield not only for organic compounds but also for organometallic and metal-organic complexes.

[Figure: graphical abstract]

Preparation of data

The list of organic molecules, solvents, absorption wavelengths, emission wavelengths, and quantum yields was taken from the paper (https://www.nature.com/articles/s41597-020-00634-8). All corresponding structures were optimized with the GFN2-xTB semiempirical method.

The same parameters for metal complexes and photocatalysts were taken from a published paper. All corresponding structures were also optimized with the GFN2-xTB method.
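
The optimized structures are distributed as archives of .xyz files. As a minimal sketch of how such a file can be read into element symbols and Cartesian coordinates, the function below parses the standard XYZ layout (atom count, comment line, then one atom per line). The function name `parse_xyz` and the water geometry are hypothetical and are not taken from the repository.

```python
def parse_xyz(text):
    """Parse the contents of an .xyz file into (symbols, coordinates)."""
    lines = text.strip().splitlines()
    n_atoms = int(lines[0])                # line 1: number of atoms
    symbols, coords = [], []
    for line in lines[2:2 + n_atoms]:      # line 2 is a free-form comment
        parts = line.split()
        symbols.append(parts[0])
        coords.append(tuple(float(v) for v in parts[1:4]))
    if len(symbols) != n_atoms:
        raise ValueError("atom count does not match the header")
    return symbols, coords


# Hypothetical water geometry in angstrom, used only for illustration.
WATER_XYZ = """3
water
O  0.000000  0.000000  0.117300
H  0.000000  0.757200 -0.469200
H  0.000000 -0.757200 -0.469200
"""

symbols, coords = parse_xyz(WATER_XYZ)
print(symbols)
print(coords)
```

The list of coordinates returned here is already a point cloud in the sense used for the topological features.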

Our models were trained on classical chemical descriptors such as Morgan fingerprints and MACCS fingerprints, prepared with RDKit. For metal complexes we use a combination of SLATM (from qml), Coulomb matrix (from openbabel), and Bag of Bonds (from mol). Additionally, the Coulomb matrix and Bag of Bonds were subjected to PCA because of the large size of the data. All prepared datasets are published in the repository (named it) and include information about the target molecule and the solvent in which the measurement was carried out.
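
For readers unfamiliar with the Coulomb matrix, the sketch below computes it from nuclear charges and coordinates using the usual definition: 0.5 * Z_i^2.4 on the diagonal and Z_i * Z_j / |R_i - R_j| off the diagonal. It is a plain-Python illustration, not the repository's code. The function name, the water geometry, and the use of angstrom (the original definition uses atomic units) are assumptions.

```python
import math


def coulomb_matrix(charges, coords):
    """Coulomb matrix: 0.5*Z_i**2.4 on the diagonal, Z_i*Z_j/|R_i-R_j| elsewhere."""
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                distance = math.dist(coords[i], coords[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


# Hypothetical water geometry (angstrom); the units are an assumption.
charges = [8, 1, 1]
coords = [(0.0, 0.0, 0.1173), (0.0, 0.7572, -0.4692), (0.0, -0.7572, -0.4692)]

cm = coulomb_matrix(charges, coords)
for row in cm:
    print([round(value, 3) for value in row])
```

The matrix is symmetric and grows quadratically with the number of atoms, which is one reason a dimensionality reduction step such as PCA is useful for larger molecules.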

Persistence barcode features (sum, mean, std, and entropy for H0, H1, and H2) were prepared with giotto-tda. The procedure is as follows: the 3D structure of each compound is transformed into a point cloud, and giotto-tda then computes persistence barcodes for the H0, H1, and H2 homology groups. The sum, mean, standard deviation, and entropy are calculated from the resulting barcodes for each homology dimension.

[Figure: persistent homology]
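
The summary step described above can be sketched in plain Python. The example assumes a diagram is given as (birth, death, dimension) triples, summarizes bar lifetimes (death minus birth), uses the population standard deviation, and uses the natural logarithm for the entropy. These conventions and the function name `barcode_features` are assumptions, so the repository's exact values may differ.

```python
import math


def barcode_features(diagram, dims=(0, 1, 2)):
    """Sum, mean, std and entropy of bar lifetimes for each homology dimension."""
    features = {}
    for dim in dims:
        # Keep only bars of this dimension with a positive lifetime.
        lifetimes = [death - birth for birth, death, k in diagram
                     if k == dim and death > birth]
        n = len(lifetimes)
        total = sum(lifetimes)
        mean = total / n if n else 0.0
        std = math.sqrt(sum((x - mean) ** 2 for x in lifetimes) / n) if n else 0.0
        # Persistence entropy: Shannon entropy of the normalized lifetimes.
        probs = [x / total for x in lifetimes] if total > 0 else []
        entropy = sum(-p * math.log(p) for p in probs)
        features[f"H{dim}"] = {"sum": total, "mean": mean,
                               "std": std, "entropy": entropy}
    return features


# Hypothetical diagram: two H0 bars and one H1 bar, no H2 bars.
diagram = [(0.0, 1.0, 0), (0.0, 1.0, 0), (0.5, 1.5, 1)]
features = barcode_features(diagram)
for name, values in features.items():
    print(name, values)
```

Each molecule then contributes a fixed-length vector of twelve numbers (four statistics for each of H0, H1, and H2), regardless of how many atoms it has.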

The combination of Coulomb matrix and Morgan fingerprints for metal complexes was prepared by taking a 10-atom (10x10) Coulomb matrix and transforming the SMILES of the ligands into Morgan fingerprints. The resulting descriptors contain information about the environment of the largest atom and about the ligands of the metal complex; adding topological features brings in information about the entire structure.

[Figure: Coulomb matrix]
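
A minimal sketch of how such a combined feature vector could be assembled is given below. The README does not state how the 10 atoms are selected, so the example assumes the Coulomb matrix rows are already ordered with the atoms of interest first. It takes the top-left 10x10 block (zero-padded for smaller molecules), flattens it, and appends the fingerprint bits and topological features. All function names and the toy inputs are hypothetical.

```python
def fixed_size_block(matrix, size=10):
    """Flatten the top-left size x size block, zero-padding smaller matrices."""
    n = len(matrix)
    flat = []
    for i in range(size):
        for j in range(size):
            flat.append(matrix[i][j] if i < n and j < n else 0.0)
    return flat


def combine_features(coulomb, fingerprint_bits, topology, size=10):
    """Concatenate Coulomb block, ligand fingerprint bits and topology features."""
    return fixed_size_block(coulomb, size) + list(fingerprint_bits) + list(topology)


# Toy inputs: a 2x2 "Coulomb matrix", a 4-bit fingerprint, 3 topology features.
toy_matrix = [[36.9, 5.5], [5.5, 0.5]]
toy_bits = [0, 1, 1, 0]
toy_topology = [2.0, 1.0, 0.69]

vector = combine_features(toy_matrix, toy_bits, toy_topology)
print(len(vector))
```

Fixing the block size keeps the feature vector the same length for every complex, which tree-based models such as CatBoost and XGBoost require.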

The idea of combining ligand and metal-center descriptors comes from Kulik's paper.

To read more about the problems of representing metal complexes with SMILES, check this ChemRxiv paper.

Selection of descriptors and models

We evaluated the most popular classical machine learning methods (Kernel Ridge Regression, Decision Trees, and Gradient Boosting) with combinations of chemical descriptors for predicting quantum yield, emission wavelength, and absorption wavelength. The SLATM and Coulomb matrix combinations with topological features and fingerprints yielded the lowest errors.

[Figure: heatmap for absorption wavelength]

Check model on organic compounds

As lead features, we chose Coulomb Matrix + Morgan fingerprints + topology and SLATM + topology. The image below displays the correlation and mean absolute error (MAE) for the final models, which are the best results we obtained by tuning XGBoost and CatBoost regressor models with Optuna. The SLATM and Coulomb matrix combinations show nearly the same metrics, with the Coulomb matrix combination marginally better. However, the Coulomb matrix combination requires the ligand structures to be obtained manually, so a model based on the SLATM combination will be easier to build for future use.

[Figure: correlation plots]

Validate approach on metal complexes

We verified the best-performing approach on metal complexes, where the CatBoost model gave the best results. The image below shows the RMSE and MAE of each machine learning approach, together with the error of the quantum chemical calculation of absorption wavelengths carried out with ORCA.

[Figure: errors for metal complexes]
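
For reference, the two error metrics used in this comparison can be computed as in the short sketch below. The wavelength values are made-up numbers for illustration and are not results from this repository.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Hypothetical measured and predicted wavelengths in nm.
measured = [450.0, 520.0, 610.0]
predicted = [460.0, 515.0, 600.0]

print(f"MAE:  {mae(measured, predicted):.2f} nm")
print(f"RMSE: {rmse(measured, predicted):.2f} nm")
```

RMSE penalizes large individual errors more strongly than MAE, so reporting both gives a sense of whether a model's error is dominated by a few outliers.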

How the repository works

Datasets and trained models are available on Google Drive.

MSU_AI_Photocatal/
└── Models
    ├── CatBoost_Absorption_metal_CoulombMatrix.cbm                   # CatBoost model for absorption wavelength prediction, trained on Coulomb Matrix 10x10 + ligand Morgan fingerprints + topology features
    ├── CatBoost_Absorption_metal_SLATM.cbm                           # CatBoost model for absorption wavelength prediction, trained on SLATM + topology features
    ├── CatBoost_Emission_metal_CoulombMatrix.cbm                     # CatBoost model for emission wavelength prediction, trained on Coulomb Matrix 10x10 + ligand Morgan fingerprints + topology features
    ├── CatBoost_Emission_metal_SLATM.cbm                             # CatBoost model for emission wavelength prediction, trained on SLATM + topology features
    ├── Catboost_Quantum_Yield_org_compound_CoulombMatrix.cbm         # CatBoost model for quantum yield prediction, trained on Coulomb Matrix 10x10 + ligand Morgan fingerprints + topology features
    ├── Catboost_Quantum_Yield_org_compound_SLATM.cbm                 # CatBoost model for quantum yield prediction, trained on SLATM + topology features
    ├── XGB_Absorption_org_compound_CoulombMatrix.json                # XGBoost model for absorption wavelength prediction, trained on Coulomb Matrix 10x10 + ligand Morgan fingerprints + topology features
    ├── XGB_Absorption_org_compound_SLATM.json                        # XGBoost model for absorption wavelength prediction, trained on SLATM + topology features
    ├── XGB_Emission_org_compound_CoulombMatrix.json                  # XGBoost model for emission wavelength prediction, trained on Coulomb Matrix 10x10 + ligand Morgan fingerprints + topology features
    ├── XGB_Emission_org_compound_SLATM.json                          # XGBoost model for emission wavelength prediction, trained on SLATM + topology features
└── Datasets
    └── Coordinates_xyz
        ├── Metal_complexes_optimize_XTB_coordinates.zip              # Archive of GFN2-xTB-optimized metal complex structures in .xyz format
        ├── Metal_complexes_solvents_optimize_XTB_coordinates.zip     # Archive of GFN2-xTB-optimized solvent structures for the metal complexes in .xyz format
        ├── Organic_compounds_optimize_MM_coordinates.zip             # Archive of molecular-mechanics-optimized organic compound structures in .xyz format
        ├── Organic_compounds_optimize_XTB_coordinates.zip            # Archive of GFN2-xTB-optimized organic compound structures in .xyz format
        ├── Organic_compounds_solvents_optimize_XTB_coordinates.zip   # Archive of GFN2-xTB-optimized solvent structures for the organic compounds in .xyz format
    └── Metal_complexes_descriptors
        ├── Metal_complexes_ligands_SMILES.xlsx                       # SMILES of the ligands of each metal complex
        ├── Metal_complexes_coulomb_matrix_10x10.csv                  # Coulomb matrices (10x10) for metal complexes and solvents
        ├── Metal_complexes_ligands_MorganFingerprints_2048.csv       # Morgan fingerprints of the ligands of each metal complex and of the solvents
        ├── SLATM_metal_complexes_with_solv.csv                       # SLATM for metal complexes and solvents
        ├── SLATM_organic_compound_with_solv_metal_inf.csv            # SLATM for organic compounds and solvents, with metal complex charge distribution
    └── Organic_compounds_descriptors
        ├── SLATM_organic_compounds_with_solv.csv                     # SLATM for organic compounds and solvents
        ├── BoB_PCA_organic_compoubs_with_solv.csv                    # Bag of Bonds after PCA for organic compounds and solvents
        ├── CM_pca_organic_compounds_with_solv.csv                    # Coulomb matrix after PCA for organic compounds and solvents
        ├── Organic_compounds_coulomb_matrix_10x10.csv                # Coulomb matrices (10x10) for organic compounds and solvents
        ├── Organic_compounds_Morgan_FingerPrints.csv                 # Morgan fingerprints for organic compounds and solvents
        ├── Organic_compounds_MACCS_FingerPrints.csv                  # MACCS fingerprints for organic compounds and solvents
    └── Target_dataset
        ├── Metal_complexes_dataset.xlsx                              # Absorption and emission wavelengths, molecular weight, and SMILES of the corresponding solvents for metal complexes
        ├── Organic_compound_final_dataset.xlsx                       # Imputed absorption and emission wavelengths, quantum yield, molecular weight, and SMILES of the corresponding solvent for organic compounds
        ├── Organic_compound_final_dataset.xlsx                       # Absorption and emission wavelengths, quantum yield, molecular weight, and SMILES of the corresponding solvent for organic compounds from [paper](https://www.nature.com/articles/s41597-020-00634-8)
    └── Topology
        ├── topology_features_metal                                   # Folder with topology features (sum, mean, std, entropy of barcodes) for metal complexes
            ├── diagrams_basic_0_conc.csv                             # Example of topology features file
        ├── topology_features_metal_solvent                           # Folder with topology features (sum, mean, std, entropy of barcodes) for metal complex solvents
            ├── diagrams_basic_c(cl)cl_conc.csv                       # Example of topology features file
        ├── XYZ_persistence_barcodes_metal                            # Folder with persistence barcodes for metal complexes
            ├── diagrams_basic_0.pkl                                  # Example of persistence barcodes file
        ├── XYZ_persistence_barcodes_metal                            # Folder with persistence barcodes for metal complex solvents
            ├── diagrams_basic_cc(=o)n(c)c.pkl                        # Example of persistence barcodes file
        ├── topology_features_organic_compounds                       # Folder with topology features (sum, mean, std, entropy of barcodes) for organic compounds
            ├── diagrams_basic_0_conc.csv                             # Example of topology features file
        ├── topology_features_organic_compounds_solv                  # Folder with topology features (sum, mean, std, entropy of barcodes) for organic compound solvents
            ├── diagrams_basic_c(cl)cl_conc.csv                       # Example of topology features file
        ├── XYZ_persistence_barcodes_organic_compounds                # Folder with persistence barcodes for organic compounds
            ├── diagrams_basic_0.pkl                                  # Example of persistence barcodes file
        ├── XYZ_persistence_barcodes_organic_compounds_solv           # Folder with persistence barcodes for organic compound solvents
            ├── ddiagrams_basic_[2h]c(cl)(cl)cl.pkl                   # Example of persistence barcodes file
    └── Coulomb_matrix
        ├── Metal_complex                                             # Folder with Coulomb matrices for metal complexes
            ├── 0_quad.csv                                            # Example of Coulomb matrix file
        ├── Metal_complex_solv                                        # Folder with Coulomb matrices for metal complex solvents
            ├── CN(C)C=O.csv                                          # Example of Coulomb matrix file
        ├── Organic_compounds                                         # Folder with Coulomb matrices for organic compounds
            ├── 0.pkl                                                 # Example of Coulomb matrix file
        ├── Organic_compounds_solv                                    # Folder with Coulomb matrices for organic compound solvents
            ├── cc(=o)n(c)c.pkl                                       # Example of Coulomb matrix file

To prepare descriptors, run the notebooks from the /notebooks/preparing_datasets folder. Input file paths point to the corresponding folders on Google Drive.

To validate models for organic compounds, run the notebooks from the /notebooks/validation folder. Input file paths point to the corresponding folders on Google Drive.

To get trained models for organic compounds, run the notebooks from the /notebooks/optimization folder or load the trained models. Input file paths point to the corresponding folders on Google Drive.

To get trained models for metal complexes, run the notebooks from the /src/optimization folder or load the trained models. Input file paths point to the corresponding folders on Google Drive.
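
Loading a downloaded model could look like the sketch below. It assumes the .cbm files were saved from a CatBoost regressor and the .json files from an XGBoost regressor, as the file descriptions suggest, and it picks the loader from the file extension. The helper names `model_kind` and `load_model` are hypothetical. The libraries are imported lazily, so only the one needed for a given file has to be installed.

```python
from pathlib import Path


def model_kind(path):
    """Guess the library from the file extension (.cbm or .json)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".cbm":
        return "catboost"
    if suffix == ".json":
        return "xgboost"
    raise ValueError(f"unrecognized model file: {path}")


def load_model(path):
    """Load a trained regressor; assumes it was saved with save_model()."""
    if model_kind(path) == "catboost":
        from catboost import CatBoostRegressor  # lazy import
        model = CatBoostRegressor()
    else:
        from xgboost import XGBRegressor  # lazy import
        model = XGBRegressor()
    model.load_model(str(path))
    return model


print(model_kind("Models/CatBoost_Absorption_metal_SLATM.cbm"))
print(model_kind("Models/XGB_Emission_org_compound_SLATM.json"))
```

After loading, `model.predict(X)` expects the same feature columns, in the same order, as the descriptor tables used for training.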

To run the notebooks locally, download enviroment.yml from the /enviroment folder and create the environment from it.

Acknowledgements

This work was greatly supported by the Non-commercial Foundation for the Advancement of Science and Education INTELLECT and by my mentor, Sergey Kolpinskiy.
