✨ Welcome to the official repository for "RILLIE: RNA In Silico Evolution via LLM and Inverse folding".
-- 🔗 Read our paper: ArXiv --
🚀 We introduce RILLIE, a general RNA foundation model that integrates sequence and structure information to evolve RNA in a zero-shot fashion. By integrating a large language model with an inverse folding model, RILLIE can generate functional RNA sequences that align with natural evolutionary patterns at the sequence level while preserving the structural integrity of key functional regions. Using RILLIE, we successfully evolved two engineered RNA aptamers, Broccoli and Pepper, with a high success rate, low sequence similarity, and improved binding affinity and fluorescence in live cells.
conda env create -f RILLIE.yaml -y
conda activate RILLIE
# Create a conda environment
conda create -y -n RILLIE python=3.10
conda activate RILLIE
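# Install PyTorch and CUDA dependencies (the +cu121 builds are served from the PyTorch wheel index, not PyPI)
pip install torch==2.1.0+cu121 torchvision==0.16.0+cu121 torchaudio==2.1.0+cu121 --index-url https://download.pytorch.org/whl/cu121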
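# Install PyTorch Geometric and related dependencies (prebuilt torch-scatter wheels are served from the PyG wheel index)
pip install torch-geometric==2.6.1 torch-scatter==2.1.2+pt21cu121 -f https://data.pyg.org/whl/torch-2.1.0+cu121.html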
# Install Bioinformatics and Structural Biology packages
pip install \
biopython==1.84 \
bio==1.7.1 \
biothings-client==0.3.1 \
biotite==1.1.0 \
biotraj==1.2.2 \
mygene==3.2.2 \
prody==2.4.1 \
pymatgen==2024.8.9 \
spglib==2.5.0 \
openmm==8.1.1 \
simtk==0.1.0 \
rdkit-pypi==2021.3.4
# Install Machine Learning and Deep Learning dependencies
pip install \
scikit-learn==1.6.0 \
torchdrug==0.2.1 \
transformers==4.47.0 \
pytorch-lightning==2.4.0 \
lightning==2.4.0 \
torchmetrics==1.6.0 \
peft==0.14.0
# Install Data Processing and Computation Libraries
pip install \
numpy==1.26.3 \
scipy==1.14.1 \
pandas==2.2.2 \
numba==0.60.0 \
sympy==1.12 \
tqdm==4.66.5 \
joblib==1.4.2 \
threadpoolctl==3.5.0
# Install Visualization Tools
pip install \
matplotlib==3.9.2 \
seaborn==0.13.2 \
plotly==5.23.0 \
bokeh==3.6.2 \
datashader==0.16.3 \
holoviews==1.20.0
# Install Web & API Utilities
pip install \
requests==2.32.3 \
aiohttp==3.11.10 \
huggingface-hub==0.26.5 \
pyyaml==6.0.2 \
urllib3==1.26.13
# Install Miscellaneous Tools
pip install \
rna-fm==0.2.2 \
ml-collections==0.1.1 \
uncertainties==3.2.2 \
markdown==3.7 \
jsonargparse==4.34.1
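After either installation route, it can help to confirm that the environment actually contains the pinned versions. The snippet below is a minimal sketch and not part of the repository: `PINNED` and `check_pins` are hypothetical names, the pins are a small sample copied from the commands above, and only the standard-library `importlib.metadata` module is used.

```python
from importlib import metadata

# A few pins copied from the install commands above; extend as needed.
# (Hypothetical helper, not shipped with RILLIE.)
PINNED = {
    "numpy": "1.26.3",
    "scipy": "1.14.1",
    "transformers": "4.47.0",
    "biopython": "1.84",
}


def check_pins(pins):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

An empty result means every listed package matches its pin. Anything printed points to a package to reinstall.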
We collected 6 ncRNA deep mutational scanning (DMS) datasets, covering tRNAs, RNA aptamers, and ribozymes, from previous papers or private data:
Our benchmark includes the following models:
- RNA language models:
- DNA language models:
- RNA inverse-folding models:
We use RhoFold and AlphaFold3 to generate RNA 3D structures.
We use Chai to generate RNA 2D structures (used as the input of RILLIE).
The average Spearman and Pearson correlations across the 6 datasets can be visualized with the following command:
python ./RILLIE/utils/ncRNA_fitness_prediction_average.py
Spearman correlations are visualized as follows:
The per-dataset Spearman and Pearson correlations for the 6 datasets can be visualized with the following command:
python ./RILLIE/utils/ncRNA_fitness_prediction_all.py
Spearman correlations are visualized as follows:
You can download the model checkpoint from the Google Drive link.
Then place the downloaded files in the ./RILLIE/model/IFM/checkpoint directory.
The secondary structure (.npy) can be generated by RhoFold or Chai.
The tertiary structure (.pdb) can be generated by Chai, AlphaFold3, or RhoFold.
Tip: the .pdb file and the .npy file for a given RNA must have the same base name.
Example: test_1.pdb and test_1.npy
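Because the two files are matched by name, a quick check before scoring can catch missing or misnamed inputs. The helper below is a hypothetical sketch (`pair_structure_files` is not part of the repository), and it assumes both file types sit in the same directory, which the text above does not specify.

```python
from pathlib import Path


def pair_structure_files(directory):
    """Match .pdb and .npy files by base name.

    Returns (pairs, unmatched): `pairs` maps each shared base name to its
    (pdb_path, npy_path); `unmatched` lists base names that have only one
    of the two files.
    """
    directory = Path(directory)
    pdbs = {p.stem: p for p in directory.glob("*.pdb")}
    npys = {p.stem: p for p in directory.glob("*.npy")}
    shared = sorted(pdbs.keys() & npys.keys())
    pairs = {stem: (pdbs[stem], npys[stem]) for stem in shared}
    unmatched = sorted(pdbs.keys() ^ npys.keys())
    return pairs, unmatched
```

For a directory containing `test_1.pdb`, `test_1.npy`, and `test_2.pdb`, the function pairs `test_1` and reports `test_2` as unmatched.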
python ./RILLIE/model/IFModel/src/score_sequence_joint_likelihood.py
Example Output:
Processing pdb files for sequence: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 27.23it/s]
Sequence Sequence_63: IFM score = -3.0081936583227042, LLM score = -1.1260515451431274
Scoring sequence Sequence_64...
Processing pdb files for sequence: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 26.98it/s]
Sequence Sequence_64: IFM score = -2.9159689144212373, LLM score = -1.0990711450576782
Threshold for IFM (top 10.0%): -2.946614666374362
Threshold for LLM (top 10.0%): -1.0952333569526673
Selected 2 sequences out of 64.
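The log above reports one cutoff per model and then a small selected set, which suggests that a sequence is kept only when it clears both cutoffs. The sketch below illustrates that idea under stated assumptions: higher scores are better, the cutoff is a simple rank-based one, and the two filters are intersected. The function names are hypothetical, and the actual rule in `score_sequence_joint_likelihood.py` (for example, how the threshold is interpolated) may differ.

```python
import math


def top_fraction_threshold(scores, fraction):
    """Smallest score that still falls in the top `fraction` of `scores`."""
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    keep = max(1, math.ceil(len(scores) * fraction))
    return sorted(scores, reverse=True)[keep - 1]


def select_sequences(ifm_scores, llm_scores, fraction=0.10):
    """Return names whose IFM and LLM scores both reach their cutoffs.

    Both arguments map sequence name -> score (assumed: higher is better).
    """
    ifm_cut = top_fraction_threshold(list(ifm_scores.values()), fraction)
    llm_cut = top_fraction_threshold(list(llm_scores.values()), fraction)
    return [
        name
        for name in ifm_scores
        if name in llm_scores
        and ifm_scores[name] >= ifm_cut
        and llm_scores[name] >= llm_cut
    ]
```

Requiring both scores to pass is stricter than either filter alone, which is why the selected set can be much smaller than 10% of the candidates.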
Based on the wet-lab testing results, we can discard harmful mutations to increase the success rate while introducing new mutations to help directed evolution escape local optima and discover global optima. This approach enables efficient directed evolution without retraining the model and is especially useful when very few variants have been tested.
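One simple way to picture the "discard harmful mutations" step is shown below. This is a hypothetical heuristic written for illustration, not the repository's implementation: it handles equal-length substitution variants only, treats a mutation as harmful when it appears exclusively in tested variants that scored below the wild type, and all function names are assumptions.

```python
def point_mutations(wild_type, variant):
    """List substitutions as strings such as 'G1A' (1-based position)."""
    if len(wild_type) != len(variant):
        raise ValueError("this sketch only handles equal-length variants")
    return [
        f"{wt}{i}{mut}"
        for i, (wt, mut) in enumerate(zip(wild_type, variant), start=1)
        if wt != mut
    ]


def harmful_mutations(wild_type, tested, wt_fitness):
    """Mutations seen only in tested variants that scored below wild type.

    `tested` maps variant sequence -> measured fitness
    (for example, fluorescence or affinity).
    """
    bad, good = set(), set()
    for seq, fitness in tested.items():
        target = bad if fitness < wt_fitness else good
        target.update(point_mutations(wild_type, seq))
    return bad - good


def filter_candidates(wild_type, candidates, harmful):
    """Drop candidate sequences that carry any mutation flagged as harmful."""
    return [
        c for c in candidates
        if not harmful.intersection(point_mutations(wild_type, c))
    ]
```

With only a handful of tested variants this rule is deliberately conservative: a mutation that also occurs in at least one variant at or above wild-type fitness is not flagged.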
1. Visualize a fitness heatmap (such as fluorescence or affinity) based on the previous wet-lab testing results
python ./RILLIE/utils/wet_data_analysis.py
The fitness heatmap is visualized as follows:
Mutations shown in dark colors are harmful.
python ./RILLIE/utils/visualize_mutational_distribution.py
The mutational distribution (e.g., for Broccoli) is visualized as follows:
python ./RILLIE/model/IFModel/src/score_sequence_joint_likelihood.py
Commercial use of the model or the generated data is not permitted. See license.md for details.
Our work builds upon AIDO.RNA (1.6B), RiNALMo, RNAFM, RNAMSM, Evo 1, Nucleotide Transformer, Grover, GENA, RhoDesign, and RhoFold. Thanks for their excellent work and open-source contributions.