RNA-FM (RNA Foundation Model) is a state-of-the-art pretrained language model for RNA sequences, serving as the cornerstone of an integrated RNA research ecosystem. Trained on 23+ million non-coding RNA (ncRNA) sequences via self-supervised learning, RNA-FM extracts comprehensive structural and functional information from RNA sequences without relying on experimental labels. Its companion model, mRNA-FM, is a direct extension of RNA-FM trained exclusively on 45 million mRNA coding sequences (CDS); it is specifically designed to capture information unique to mRNA and has demonstrated excellent performance on related tasks. RNA-FM generates general-purpose RNA embeddings suitable for a broad range of downstream tasks, including but not limited to secondary and tertiary structure prediction, RNA family clustering, and functional RNA analysis.
Originally introduced in Nature Methods as a foundational model for RNA biology, RNA-FM outperforms all evaluated single-sequence RNA language models across a wide range of structure and function benchmarks, enabling unprecedented accuracy in RNA analysis. Building upon this foundation, our team developed an integrated RNA pipeline that includes:
- RhoFold – High-accuracy RNA tertiary structure prediction (sequence → structure).
- RiboDiffusion – Diffusion-based inverse folding for RNA 3D design (structure → sequence).
- RhoDesign – Geometric deep learning approach to RNA design (structure → sequence).
These tools work alongside RNA-FM to predict RNA structures from sequence, design new RNA sequences that fold into desired 3D structures, and analyze functional properties. Our integrated ecosystem is built to advance the development of RNA therapeutics, drive innovation in synthetic biology, and deepen our understanding of RNA structure-function relationships.
References
@article{chen2022interpretable,
title={Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions},
author={Chen, Jiayang and Hu, Zhihang and Sun, Siqi and Tan, Qingxiong and Wang, Yixuan and Yu, Qinze and Zong, Licheng and Hong, Liang and Xiao, Jin and King, Irwin and others},
journal={arXiv preprint arXiv:2204.00300},
year={2022}
}
RNA-FM Ecosystem Components: Our platform comprises four integrated tools, each addressing a critical step in the RNA analysis and design pipeline:
Model | Task | Description | Code | Paper |
---|---|---|---|---|
RNA-FM | Foundation Model (Representation) | Pretrained transformer (BERT-style) for ncRNA sequences (for RNA-FM) and messenger RNA sequences (for mRNA-FM); extracts embeddings and predicts base-pairing probabilities | GitHub | Nature Methods |
RhoFold | 3D Structure Prediction | RNA-FM-powered model for sequence-to-structure prediction (3D coordinates + secondary structure) | GitHub | Nature Methods |
RiboDiffusion | Inverse Folding | Generative diffusion model for structure-to-sequence RNA design | GitHub | ISMB'2024 |
RhoDesign | Inverse Folding | Geometric deep learning model (GVP+Transformer) for structure-to-sequence design | GitHub | Nature Computational Science |
Model | Training Corpus | # Sequences | Layers / Hidden | Params | Typical Use-cases |
---|---|---|---|---|---|
RNA-FM | non-coding RNAs | 23.7 M | 12 / 640 | 99 M | ncRNA structure & function, aptamer design |
mRNA-FM | messenger RNAs | 45 M | 12 / 1280 | 239 M | mRNA expression modelling, CDS analysis |
- RNA-FM is a 12-layer BERT encoder pre-trained with masked-token prediction on 23.7 M non-coding RNA sequences (RNAcentral100). It yields 640-dimensional embeddings that already encode secondary structure, 3D proximity, and even evolutionary signals, making it the representation backbone for every downstream tool in the ecosystem.
Click to expand RNA-FM details
- RNA-FM for Secondary Structure Prediction:
- Outperforms classic physics-based and machine learning methods (e.g., LinearFold, SPOT-RNA, UFold) by up to 20–30% in F1-score on challenging datasets.
- Performance gains are especially notable for long RNAs (>150 nucleotides) and low-homology families.
- mRNA-FM for Coding Sequences:
- mRNA-FM, an extension of RNA-FM, is exclusively trained on 45 million mRNA coding sequences (CDS). Purpose-built to model mRNA-specific features, it achieves state-of-the-art performance in mRNA-related tasks.
- RhoFold (Tertiary Structure Prediction) – An RNA-FM–powered predictor for RNA 3D structures. Given an RNA sequence, RhoFold rapidly predicts its tertiary structure (3D coordinates in PDB format) along with the secondary structure (CT file) and per-residue confidence scores. It achieves high accuracy on RNA 3D benchmarks by combining RNA-FM embeddings with a structure prediction network, significantly outperforming prior methods in the RNA-Puzzles challenge.
Click to expand RhoFold details
RhoFold leverages the powerful embeddings from RNA-FM to revolutionize RNA tertiary structure prediction. By combining deep learning with structural biology principles, RhoFold translates RNA sequences directly into accurate 3D coordinates. The model employs a multi-stage architecture that first converts RNA-FM's contextual representations into distance maps and torsion angles, then assembles these into complete three-dimensional structures. Unlike previous approaches that often struggle with RNA's complex folding landscapes, RhoFold's foundation model approach captures subtle sequence-structure relationships, enabling state-of-the-art performance on challenging benchmarks like RNA-Puzzles. The system works in both single-sequence mode for rapid predictions and can incorporate multiple sequence alignments (MSA) when higher accuracy is needed, making it versatile for various research applications from small RNAs to complex ribozymes and riboswitches.
- RhoFold for Tertiary Structure:
- Delivers top accuracy on RNA-Puzzles / CASP-type tasks.
- Predicts 3D structures within seconds (single-sequence mode) and integrates MSA for further accuracy gains.
- Published in Nature Methods, with benchmark results that generalize to novel RNA families.
- RiboDiffusion (Inverse Folding – Diffusion) – A diffusion-based inverse folding model for RNA design. Starting from a target 3D backbone structure, RiboDiffusion iteratively generates RNA sequences that fold into that shape. This generative approach yields higher sequence recovery (≈11–16% improvement) than previous inverse folding algorithms, while offering tunable diversity in the designed sequences.
Click to expand RiboDiffusion details
RiboDiffusion represents a breakthrough in RNA inverse folding through diffusion-based generative modeling. While traditional RNA design methods often struggle with the vast sequence space, RiboDiffusion employs a novel approach inspired by recent advances in generative AI. Starting with random noise, the model iteratively refines RNA sequences to conform to target 3D backbones through a carefully controlled diffusion process. This approach allows RiboDiffusion to explore diverse sequence solutions while maintaining structural fidelity, a critical balance in biomolecular design. The diffusion framework inherently provides sequence diversity, enabling researchers to generate and test multiple candidate designs that all satisfy structural constraints. Published benchmarks demonstrate that RiboDiffusion achieves superior sequence recovery rates compared to previous methods, making it particularly valuable for designing functional RNAs like riboswitches, aptamers, and other structured elements where sequence-structure relationships are crucial.
- RiboDiffusion for Inverse Folding:
- A diffusion-based generative approach that surpasses prior methods by ~11–16% in sequence recovery rate.
- Provides tunable diversity in design, exploring multiple valid sequences for a single target shape.
- RhoDesign (Inverse Folding – Deterministic) – A deterministic geometric deep learning model for RNA design. RhoDesign uses graph neural networks (GVP) and Transformers to directly decode sequences for a given 3D structure (optionally incorporating secondary structure constraints). It achieves state-of-the-art accuracy in matching target structures, with sequence recovery rates exceeding 50% on standard benchmarks (nearly double traditional methods) and the highest structural fidelity (TM-scores) among current solutions.
Click to expand RhoDesign details
RhoDesign introduces a deterministic approach to RNA inverse folding using geometric deep learning. Unlike diffusion-based methods, RhoDesign directly translates 3D structural information into RNA sequences through a specialized architecture combining Graph Vector Perceptrons (GVP) and Transformer networks. This architecture effectively captures both local geometric constraints and global structural patterns in RNA backbones. RhoDesign can incorporate optional secondary structure constraints, allowing researchers to specify certain base-pairing patterns while letting the model optimize the remaining sequence. Benchmark tests demonstrate that RhoDesign achieves remarkable sequence recovery rates exceeding 50% on standard datasets—nearly double the performance of traditional methods. Moreover, the designed sequences exhibit the highest structural fidelity (as measured by TM-score) among current approaches. This combination of accuracy and efficiency makes RhoDesign particularly suitable for precision RNA engineering applications where structural integrity is paramount.
- RhoDesign for Inverse Folding:
- A deterministic GVP + Transformer model with >50% sequence recovery on standard 3D design benchmarks, nearly double that of older algorithms.
- Achieves highest structural fidelity (TM-score) among tested methods, validated in Nature Computational Science.
Unified Workflow: These tools operate in concert to enable end-to-end RNA engineering. For any RNA sequence of interest, one can predict its structure (secondary and tertiary) using RNA-FM and RhoFold. Conversely, given a desired RNA structure, one can design candidate sequences using RiboDiffusion or RhoDesign (or both for cross-validation). Designed sequences can then be validated by folding them back with RhoFold, closing the loop. This forward-and-inverse design cycle, all powered by RNA-FM embeddings, creates a powerful closed-loop workflow for exploring RNA structure-function space. By seamlessly integrating prediction and design, the RNA-FM ecosystem accelerates the design-build-test paradigm in RNA science, laying the groundwork for breakthroughs in RNA therapeutics, synthetic biology constructs, and our understanding of RNA biology.
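To make this closed loop concrete, the sketch below chains the command-line tools from the usage sections later in this README via Python's `subprocess` module. It is a minimal illustration rather than an official pipeline script; paths and checkpoint names mirror the examples below and may need adjusting for your installation.

```python
# Minimal closed-loop sketch: predict -> design -> (re-)predict.
# Assumes RhoFold and RiboDiffusion are cloned side by side with their
# pretrained checkpoints in place, as in the usage examples below.
import subprocess

def rhofold_predict(fasta, out_dir):
    """Forward folding: RNA sequence -> 3D structure (PDB + CT)."""
    subprocess.run(
        ["python", "inference.py",
         "--input_fas", fasta,
         "--output_dir", out_dir,
         "--ckpt", "./pretrained/RhoFold_pretrained.pt"],
        cwd="RhoFold", check=True)

def ribodiffusion_design(pdb, n_samples=5):
    """Inverse folding: 3D backbone -> candidate sequences (FASTA)."""
    subprocess.run(
        ["python", "main.py",
         "--PDB_file", pdb,
         "--config.eval.n_samples", str(n_samples)],
        cwd="RiboDiffusion", check=True)

# 1. Predict a structure for a sequence of interest.
rhofold_predict("./example/input/5t5a.fasta", "./example/output/5t5a/")
# 2. Design new sequences for a target backbone.
ribodiffusion_design("examples/R1107.pdb", n_samples=5)
# 3. Fold the designed sequences back with RhoFold to close the loop.
```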
- Accurate RNA 3D structure prediction using a language-model–based deep learning approach – introduces RhoFold+, which couples RNA-FM embeddings with a geometry module to reach SOTA accuracy on CASP/RNA-Puzzles benchmarks (PAPER, CODE)
- NuFold: end-to-end RNA tertiary-structure prediction – integrates RNA-FM features into a U-former backbone, achieving accuracy competitive with state-of-the-art fold predictors (PAPER, CODE)
- TorRNA – improved backbone-torsion prediction by leveraging large language models – uses RNA-FM as sequence encoder and cuts median torsion-angle error by 2–16% versus previous methods (PAPER)
- Deep generative design of RNA aptamers using structural predictions – employs RhoDesign to create Mango aptamer variants with >3-fold fluorescence gain (wet-lab verified) (PAPER, CODE)
- RiboDiffusion: tertiary-structure-based RNA inverse folding with generative diffusion models – diffusion sampler trained on RhoFold-generated data; boosts native-sequence recovery by 11–16% over secondary-structure baselines (PAPER, CODE)
- gRNAde: geometric deep learning for 3-D RNA inverse design – validates every design by forward-folding with RhoFold, achieving 56% native-sequence recovery vs 45% for Rosetta (PAPER, CODE)
- RILLIE framework – integrates a 1.6 B-parameter RNA LM with RhoDesign for in-silico directed evolution of Broccoli/Pepper aptamers (CODE)
- RNALoc-LM: RNA subcellular localisation prediction with a pre-trained RNA language model – replaces one-hot inputs with RNA-FM embeddings, raising MCC by 4–8 % for lncRNA, circRNA and miRNA localisation (PAPER, CODE)
- PlantRNA-FM: an interpretable RNA foundation model for plant transcripts – adapts the RNA-FM architecture to >25 M plant RNAs; discovers translation-related structural motifs and attains F1 = 0.97 on genic-region annotation (PAPER, CODE)
- ZHMolGraph: network-guided deep learning for RNA–protein interaction prediction – combines RNA-FM (for RNAs) and ProtTrans (for proteins) embeddings within a GNN, boosting AUROC by up to 28 % on unseen RNA–protein pairs (PAPER, CODE)
Take-away: Across structure prediction, de novo sequence design, functional annotation and interaction modelling, the community is steadily adopting RNA-FM and its RhoFold/RiboDiffusion/RhoDesign toolkit as reliable building blocks—demonstrating the ecosystem’s versatility and real-world impact.
Below, we outline the environment setup for RNA-FM and its extended pipeline (e.g., RhoFold) locally.
(If you prefer not to install locally, refer to the Online Server mentioned earlier.)
- Clone the repository and create the Conda environment:
git clone https://github.com/ml4bio/RNA-FM.git
cd RNA-FM
conda env create -f environment.yml
- Activate and enter the workspace:
conda activate RNA-FM
cd ./redevelop
- Download the pre-trained models from our Hugging Face repo and place the `.pth` files into the `pretrained` folder. For mRNA-FM, ensure that your input RNA sequences have lengths that are multiples of 3 (codons) and place the specialized mRNA-FM weights in the same `pretrained` folder.
Once the environment is ready and weights are downloaded, you can perform common tasks as follows:
Use RNA-FM to extract nucleotide-level embeddings for input sequences:
python launch/predict.py \
--config="pretrained/extract_embedding.yml" \
--data_path="./data/examples/example.fasta" \
--save_dir="./results" \
--save_frequency 1 \
--save_embeddings
This command processes sequences in `example.fasta` and saves 640-dimensional embeddings per nucleotide to `./results/representations/`.
- Using mRNA-FM: To use the mRNA-FM variant instead of the default ncRNA model, add the model name argument and ensure input sequences are codon-aligned:
python launch/predict.py \
--config="pretrained/extract_embedding.yml" \
--data_path="./data/examples/example.fasta" \
--save_dir="./results" \
--save_frequency 1 \
--save_embeddings \
--save_embeddings_format raw \
MODEL.BACKBONE_NAME mrna-fm
For mRNA-FM, pass the extra argument `MODEL.BACKBONE_NAME mrna-fm`. Remember that mRNA-FM uses codon tokenization, so each sequence must have a length divisible by 3.
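Because a sequence whose length is not a multiple of 3 breaks codon tokenization, it is worth checking your FASTA file before launching the command above. The helper below is a hypothetical convenience using only the Python standard library; it is not part of the RNA-FM codebase.

```python
# Hypothetical pre-flight check for mRNA-FM inputs: flag sequences whose
# length is not divisible by 3 (codon tokenization requirement).
def check_codon_aligned(fasta_path):
    bad, name, chunks = [], None, []
    with open(fasta_path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if name is not None and len("".join(chunks)) % 3 != 0:
                    bad.append(name)
                name, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if name is not None and len("".join(chunks)) % 3 != 0:
        bad.append(name)
    return bad  # names of sequences that are NOT codon-aligned

print(check_codon_aligned("./data/examples/example.fasta"))
```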
Predict an RNA secondary structure (base-pairing) from sequence using RNA-FM:
python launch/predict.py \
--config="pretrained/ss_prediction.yml" \
--data_path="./data/examples/example.fasta" \
--save_dir="./results" \
--save_frequency 1
RNA-FM will output base-pair probability matrices (`.npy`) and secondary structures (`.ct`) to `./results/r-ss`.
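The saved matrices can be post-processed directly in Python. The sketch below assumes one `.npy` file per input sequence holding an L×L base-pair probability matrix (the exact file name follows your FASTA header) and thresholds it into a discrete pair list.

```python
import numpy as np

# Load an L x L base-pair probability matrix written by the command above.
# (File name is hypothetical; it follows the corresponding FASTA header.)
probs = np.load("./results/r-ss/example.npy")

# Keep confident, non-redundant pairs (i < j) above a probability cutoff.
i_idx, j_idx = np.where(np.triu(probs, k=1) > 0.5)
pairs = list(zip(i_idx.tolist(), j_idx.tolist()))
print(f"{len(pairs)} predicted base pairs, e.g.:", pairs[:10])
```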
If you prefer not to install anything locally, you can use our RNA-FM server. The server provides a simple web interface where you can:
- Submit an RNA sequence to get its predicted secondary structure and/or embeddings.
- Obtain results without needing local compute resources or setup.
(A separate RhoFold server is also available for tertiary structure prediction of single RNA sequences.)
If you only want to use the pretrained model (rather than run all pipeline scripts), you can install RNA-FM directly:
pip install rna-fm
Alternatively, for the latest version from GitHub:
cd ./RNA-FM
pip install .
Then, load RNA-FM within your own Python project:
import torch
import fm
# 1. Load RNA-FM model
model, alphabet = fm.pretrained.rna_fm_t12()
batch_converter = alphabet.get_batch_converter()
model.eval() # disables dropout for deterministic results
# 2. Prepare data
data = [
("RNA1", "GGGUGCGAUCAUACCAGCACUAAUGCCCUCCUGGGAAGUCCUCGUGUUGCACCCCU"),
("RNA2", "GGGUGUCGCUCAGUUGGUAGAGUGCUUGCCUGGCAUGCAAGAAACCUUGGUUCAAUCCCCAGCACUGCA"),
("RNA3", "CGAUUCNCGUUCCC--CCGCCUCCA"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)
# 3. Extract embeddings (on CPU)
with torch.no_grad():
results = model(batch_tokens, repr_layers=[12])
token_embeddings = results["representations"][12]
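A common next step is to collapse the per-nucleotide embeddings into a single vector per sequence, e.g., for clustering or classifier training. The pooling below is a standard recipe rather than an official API; it assumes the batch converter prepends one special token per sequence (ESM-style), so adjust the slicing if your tokenization differs.

```python
# Sequence-level embeddings via mean pooling over real residues.
# token_embeddings: (batch, tokens, 640) from the snippet above; shorter
# sequences in the batch are padded, so we slice per sequence length.
seq_embeddings = torch.stack([
    token_embeddings[i, 1:len(seq) + 1].mean(dim=0)  # skip special token
    for i, (_, seq) in enumerate(data)
])
print(seq_embeddings.shape)  # (3, 640)
```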
For mRNA-FM, load with `fm.pretrained.mrna_fm_t12()` and ensure input sequences are codon-aligned (as shown in the Quick Start above).
import torch
import fm
# 1. Load mRNA-FM model
model, alphabet = fm.pretrained.mrna_fm_t12()
batch_converter = alphabet.get_batch_converter()
model.eval() # disables dropout for deterministic results
# 2. Prepare data
data = [
("CDS1", "AUGGGGUGCGAUCAUACCAGCACUAAUGCCCUCCUGGGAAGUCCUCGUGUUGCACCCCUA"),
("CDS2", "AUGGGGUGUCGCUCAGUUGGUAGAGUGCUUGCCUGGCAUGCAAGAAACCUUGGUUCAAUCCCCAGCACUGCA"),
("CDS3", "AUGCGAUUCNCGUUCCC--CCGCCUCC"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)
# 3. Extract embeddings (on CPU)
with torch.no_grad():
results = model(batch_tokens, repr_layers=[12])
token_embeddings = results["representations"][12]
More tutorials can be found on GitHub; the related notebooks are stored in the tutorials folder.
Get started with RNA-FM through our comprehensive tutorials:
Tutorial | Description | Format |
---|---|---|
RNA Family Clustering & Type Classification | How to extract RNA-FM embeddings for clustering RNA families and classifying RNA types. This tutorial covers visualization of embeddings and training simple classifiers on top of them. | Jupyter Notebook |
RNA Secondary Structure Prediction | How to use RNA-FM to predict RNA secondary structures, output base-pairing probability matrices, and visualize the predicted base-pairing (secondary structure). | Python Script |
UTR Function Prediction | How to leverage RNA-FM embeddings to predict functional properties of untranslated regions (5′ and 3′ UTRs) in mRNAs. This includes training a model to predict gene expression or protein translation metrics from UTR sequences. | Jupyter Notebook |
mRNA Expression Prediction | How to use the mRNA-FM variant to predict gene expression levels from mRNA sequences. This tutorial demonstrates loading the specialized mRNA model, extracting embeddings, and building a classifier to differentiate between high- and low-expression genes. | Jupyter Notebook |
Additional Resources:
- Video Tutorial (Chinese) – Step-by-step guide to using RNA-FM for various RNA analysis tasks
These tutorials cover the core applications of RNA-FM from basic embedding extraction to advanced functional predictions. Each provides hands-on examples you can run immediately in your browser or local environment.
We recommend exploring the advanced RhoFold, RiboDiffusion, and RhoDesign projects for tasks like 3D structure prediction or RNA design. Below are brief usage samples:
Click to expand RNA-FM Ecosystem details
# Example: Predict 3D structure for an RNA sequence in FASTA.
cd RhoFold
python inference.py \
--input_fas ./example/input/5t5a.fasta \
--output_dir ./example/output/5t5a/ \
--ckpt ./pretrained/RhoFold_pretrained.pt
Outputs:
- `unrelaxed_model.pdb` / `relaxed_1000_model.pdb` (3D coordinates)
- `ss.ct` (secondary structure)
- `results.npz` (distance/angle predictions + confidence scores)
- `log.txt` (run logs, pLDDT, etc.)
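`results.npz` is a standard NumPy archive, so its contents can be inspected without any RhoFold-specific code. The loop below simply enumerates whatever arrays the run produced; treat the key names as run-dependent.

```python
import numpy as np

# Enumerate the arrays RhoFold wrote into the prediction archive.
with np.load("./example/output/5t5a/results.npz", allow_pickle=True) as npz:
    for key in npz.files:
        arr = npz[key]
        print(f"{key}: shape={getattr(arr, 'shape', '?')}, "
              f"dtype={getattr(arr, 'dtype', '?')}")
```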
cd RiboDiffusion
CUDA_VISIBLE_DEVICES=0 python main.py \
--PDB_file examples/R1107.pdb \
--config.eval.n_samples 5
This will generate 5 candidate RNA sequences that fold into the structure provided in `R1107.pdb`. The output FASTA files will be saved under the `exp_inf/fasta/` directory.
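Because the sampler is stochastic, the candidates differ from one another; a quick pairwise-identity check shows how diverse a batch is. This is illustrative post-processing, assuming each output FASTA under `exp_inf/fasta/` holds a single designed sequence of the same length.

```python
import glob

# Read each designed candidate (assumption: one sequence per FASTA file).
def read_seq(path):
    with open(path) as fh:
        return "".join(l.strip() for l in fh if not l.startswith(">"))

seqs = [read_seq(p) for p in sorted(glob.glob("exp_inf/fasta/*.fasta"))]

# Pairwise identity between equal-length designs for the same backbone.
for a in range(len(seqs)):
    for b in range(a + 1, len(seqs)):
        ident = sum(x == y for x, y in zip(seqs[a], seqs[b])) / len(seqs[a])
        print(f"design {a} vs design {b}: {ident:.1%} identical")
```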
cd RhoDesign
python src/inference.py \
--pdb ../example/2zh6_B.pdb \
--ss ../example/2zh6_B.npy \
--save ../example/
This produces a designed RNA sequence predicted to fold into the target 3D shape (PDB file `2zh6_B.pdb`, with an optional secondary structure constraint from `2zh6_B.npy`). The output sequence will be saved in the specified folder. You can adjust parameters like the sampling temperature to explore more diverse or high-fidelity designs.
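If the native sequence of the target structure is known, sequence recovery (the metric behind the >50% figure above) is simply the fraction of matching positions between the designed and native sequences:

```python
# Sequence recovery: fraction of positions where the designed sequence
# matches the native one, assuming equal lengths (no alignment needed,
# since inverse folding preserves the backbone length).
def sequence_recovery(designed: str, native: str) -> float:
    assert len(designed) == len(native), "sequences must have equal length"
    return sum(d == n for d, n in zip(designed, native)) / len(native)

# Toy values for illustration only.
print(sequence_recovery("GGACUUCGGUCC", "GGACAUCGGACC"))  # ~0.83
```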
Each project in the RNA-FM ecosystem comes with both command-line interfaces and Python modules:
- RNA-FM: Core module `fm` for embedding extraction and secondary structure prediction.
  - `fm.pretrained.rna_fm_t12()` – load the 12-layer ncRNA model
  - `fm.pretrained.mrna_fm_t12()` – load the 12-layer mRNA (codon) model
- RhoFold: Use the `RhoFoldModel` class or the `inference.py` script.
  - `inference.py` takes a FASTA sequence (and optionally an MSA) and outputs a 3D structure.
  - Add `--single_seq_pred True` to run without an MSA (single-sequence mode).
- RiboDiffusion: Use the `main.py` script or import the diffusion model classes.
  - `main.py` takes a PDB structure as input and outputs designed sequences.
  - Modify settings in `configs/` (e.g., `cond_scale`, `n_samples`) to tune the generation.
- RhoDesign: Use the `inference.py` script or import the design model module.
  - `inference.py` takes a PDB (and optional secondary structure/contact map) and outputs a designed sequence.
  - The GVP+Transformer architecture can incorporate partial structure constraints and supports advanced sampling strategies.
For further details, see each repo’s documentation or the notebooks in the tutorials folder.
Name | Dataset | Modality | Tokenization | Architecture | Backbone | Pre‑training Task | Layers | Model Params | Data Size | Code | Weights | Data | License |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RNA‑FM | ncRNA | Sequence | Base | Enc‑only | Transformer | MLM | 12 | 100 M | 23 M | GitHub | HuggingFace | RNAcentral | MIT |
RNABERT | ncRNA | Sequence | Base | Enc‑only | Transformer | MLM / SAL | 6 | 0.5 M | 0.76 M | GitHub | Drive | Rfam 14.3 | MIT |
RNA‑MSM | ncRNA | Seq + MSA | Base | Enc‑only | MSA‑Transformer | MLM | 12 | 95 M | 3932 families | GitHub | Drive | Rfam 14.7 | MIT |
AIDO.RNA | ncRNA | Sequence | Base | Enc‑only | Transformer | MLM | 32 | 1.6 B | 42 M | GitHub | HuggingFace | Public ncRNA mix | Apache‑2.0 |
ERNIE‑RNA | ncRNA | Sequence | Base | Enc‑only | Transformer | MLM | 12 | 86 M | 20.4 M | GitHub | GitHub | Rfam + RNAcentral | MIT |
GenerRNA | ncRNA | Sequence | BPE | Dec‑only | Transformer | CLM | 24 | 350 M | 16.09 M | GitHub | HuggingFace | Public ncRNA mix | Apache‑2.0 |
RFamLlama | ncRNA | Sequence | Base | Dec‑only | Llama | CLM | 6‑10 | 13‑88 M | 0.6 M | HuggingFace | HuggingFace | Rfam 14.10 | CC BY‑NC‑4.0 |
RNA‑km | ncRNA | Sequence | Base | Enc‑only | Transformer | MLM | 12 | 152 M | 23 M | GitHub | Drive | Rfam + RNAcentral | MIT |
RNAErnie | ncRNA | Sequence | Base | Enc‑only | Transformer | MLM | 12 | 105 M | 23 M | GitHub | GitHub | Public ncRNA mix | Apache‑2.0 |
OPED | pegRNA | Sequence | k‑mer | Enc‑Dec | Transformer | Regression | n/a | n/a | 40 k | GitHub | — | Public pegRNA eff. | MIT |
GARNET | rRNA | Sequence | k‑mer | Dec‑only | Transformer | CLM | 18 | 19 M | 89 M tokens | GitHub | Release | Public rRNA | MIT |
IsoCLR | pre‑mRNA | Sequence | One‑hot | Enc‑only | CNN | Contrast Learning | 8 | 1‑10 M | 1 M | GitHub | — | Ensembl / RefSeq | — |
SpliceBERT | pre‑mRNA | Sequence | Base | Enc‑only | Transformer | MLM | 6 | 20 M | 2 M | GitHub | Zenodo | UCSC/GENCODE | MIT |
Orthrus | pre‑mRNA | Sequence | Base | Enc‑only | Mamba | Contrast Learning | 3‑6 | 1‑10 M | 49 M | GitHub | HuggingFace | Ortholog set | Apache‑2.0 |
LoRNA | pre‑mRNA | Sequence | Base | Dec‑only | StripedHyena | Contrast Learning | 16 | 6.5 M | 100 M | GitHub | (announced) | SRA (long‑read) | MIT |
CodonBERT | mRNA CDS | Sequence | Codon | Enc‑only | Transformer | MLM / HSP | 12 | 87 M | 10 M | GitHub | HuggingFace | NCBI mRNA | Apache‑2.0 |
UTR‑LM | 5′UTR | Sequence | Base | Enc‑only | Transformer | MLM / SSP / MFE | 6 | 1 M | 0.7 M | GitHub | GitHub | Public 5′UTR set | MIT |
3UTRBERT | 3′UTR | Sequence | k‑mer | Enc‑only | Transformer | MLM | 12 | 86 M | 20 k | GitHub | HuggingFace | Public 3′UTR | MIT |
G4mer | mRNA | Sequence | k‑mer | Enc‑only | Transformer | MLM | 6 | — | — | — | — | — | — |
HELM | mRNA | Sequence | Codon | Multi | Multi | MLM + CLM | — | 50 M | 15.3 M | — | — | — | — |
RiNALMo | RNA | Sequence | Base | Enc‑only | Transformer | MLM | 33 | 135‑650 M | 36 M | GitHub | (request) | Public ncRNA | MIT |
UNI‑RNA | RNA | Sequence | Base | Enc‑only | Transformer | MLM | 24 | 400 M | 500 M | — | — | — | — |
ATOM‑1 | RNA | Sequence | Base | Enc‑Dec | Transformer | — | — | — | — | — | — | — | — |
BiRNA‑BERT | RNA | Sequence | Base + BPE | Enc‑only | Transformer | MLM | 12 | 117 M | 36 M | GitHub | HuggingFace | Public ncRNA | MIT |
ChaRNABERT | RNA | Sequence | GBST | Enc‑only | Transformer | MLM | 6‑33 | 8‑650 M | 62 M | — | (8 M demo) | Public ncRNA | — |
DGRNA | RNA | Sequence | Base | Enc‑only | Mamba | MLM | 12 | 100 M | 100 M | — | — | — | — |
LAMAR | RNA | Sequence | Base | Enc‑only | Transformer | MLM | 12 | 150 M | 15 M | GitHub | (announced) | Public ncRNA | MIT |
OmniGenome | RNA | Sequence, Structure | Base | Enc‑only | Transformer | MLM / Seq2Str / Str2Seq | 16‑32 | 52‑186 M | 25 M | GitHub | HuggingFace | Public multi‑omics | Apache‑2.0 |
PlantRNA‑FM | RNA | Sequence, Structure | Base | Enc‑only | Transformer | MLM / SSP / CLS | 12 | 35 M | 25 M | HuggingFace | HuggingFace | Plant RNA set | CC BY‑NC‑4.0 |
MP‑RNA | RNA | Sequence, Structure | Base | Enc‑only | Transformer | SSP / SNMR / MRLM | 12 | 52‑186 M | 25 M | GitHub | (planned) | Public ncRNA mix | Apache‑2.0 |
If you use RNA-FM or any components of this ecosystem in your research, please cite the relevant papers. Below is a collection of key publications (in BibTeX format) covering the foundation model and associated tools:
BibTeX Citations
@article{chen2022interpretable,
title={Interpretable RNA foundation model from unannotated data for highly accurate RNA structure and function predictions},
author={Chen, Jiayang and Hu, Zhihang and Sun, Siqi and Tan, Qingxiong and Wang, Yixuan and Yu, Qinze and Zong, Licheng and Hong, Liang and Xiao, Jin and Shen, Tao and others},
journal={arXiv preprint arXiv:2204.00300},
year={2022}
}
@article{shen2024accurate,
title={Accurate RNA 3D structure prediction using a language model-based deep learning approach},
author={Shen, Tao and Hu, Zhihang and Sun, Siqi and Liu, Di and Wong, Felix and Wang, Jiuming and Chen, Jiayang and Wang, Yixuan and Hong, Liang and Xiao, Jin and others},
journal={Nature Methods},
pages={1--12},
year={2024},
publisher={Nature Publishing Group US New York}
}
@article{chen2020rna,
title={RNA secondary structure prediction by learning unrolled algorithms},
author={Chen, Xinshi and Li, Yu and Umarov, Ramzan and Gao, Xin and Song, Le},
journal={arXiv preprint arXiv:2002.05810},
year={2020}
}
@article{wang2025deep,
title={Deep learning for RNA structure prediction},
author={Wang, Jiuming and Fan, Yimin and Hong, Liang and Hu, Zhihang and Li, Yu},
journal={Current Opinion in Structural Biology},
year={2025},
doi={10.1016/j.sbi.2025.102991},
url={https://www.sciencedirect.com/science/article/pii/S0959440X25000090}
}
@article{wong2024deep,
title={Deep generative design of RNA aptamers using structural predictions},
author={Wong, Felix and He, Dongchen and Krishnan, Aarti and Hong, Liang and Wang, Alexander Z and Wang, Jiuming and Hu, Zhihang and Omori, Satotaka and Li, Alicia and Rao, Jiahua and others},
journal={Nature Computational Science},
pages={1--11},
year={2024},
publisher={Nature Publishing Group US New York}
}
@article{huang2024ribodiffusion,
title={RiboDiffusion: tertiary structure-based RNA inverse folding with generative diffusion models},
author={Huang, Han and Lin, Ziqian and He, Dongchen and Hong, Liang and Li, Yu},
journal={Bioinformatics},
volume={40},
number={Supplement\_1},
pages={i347--i356},
year={2024},
publisher={Oxford University Press}
}
@article{wei2022protein,
title={Protein--RNA interaction prediction with deep learning: structure matters},
author={Wei, Junkang and Chen, Siyuan and Zong, Licheng and Gao, Xin and Li, Yu},
journal={Briefings in bioinformatics},
volume={23},
number={1},
pages={bbab540},
year={2022},
publisher={Oxford University Press}
}
@article{lam2019deep,
title={A deep learning framework to predict binding preference of RNA constituents on protein surface},
author={Lam, Jordy Homing and Li, Yu and Zhu, Lizhe and Umarov, Ramzan and Jiang, Hanlun and H{\'e}liou, Am{\'e}lie and Sheong, Fu Kit and Liu, Tianyun and Long, Yongkang and Li, Yunfei and others},
journal={Nature communications},
volume={10},
number={1},
pages={4941},
year={2019},
publisher={Nature Publishing Group UK London}
}
@article{wei2024pronet,
title={ProNet DB: a proteome-wise database for protein surface property representations and RNA-binding profiles},
author={Wei, Junkang and Xiao, Jin and Chen, Siyuan and Zong, Licheng and Gao, Xin and Li, Yu},
journal={Database},
volume={2024},
pages={baae012},
year={2024},
publisher={Oxford University Press UK}
}
@article{han2022self,
title={Self-supervised contrastive learning for integrative single cell RNA-seq data analysis},
author={Han, Wenkai and Cheng, Yuqi and Chen, Jiayang and Zhong, Huawen and Hu, Zhihang and Chen, Siyuan and Zong, Licheng and Hong, Liang and Chan, Ting-Fung and King, Irwin and others},
journal={Briefings in Bioinformatics},
volume={23},
number={5},
pages={bbac377},
year={2022},
publisher={Oxford University Press}
}
@article{fan2022highly,
title={The highly conserved RNA-binding specificity of nucleocapsid protein facilitates the identification of drugs with broad anti-coronavirus activity},
author={Fan, Shaorong and Sun, Wenju and Fan, Ligang and Wu, Nan and Sun, Wei and Ma, Haiqian and Chen, Siyuan and Li, Zitong and Li, Yu and Zhang, Jilin and others},
journal={Computational and Structural Biotechnology Journal},
volume={20},
pages={5040--5044},
year={2022},
publisher={Elsevier}
}
This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.
Our framework and model training were inspired by several prior works; we thank their authors for providing excellent foundations for RNA-FM.
Thank you for using RNA-FM!
For issues or questions, open a GitHub Issue or consult the documentation. We welcome contributions and collaboration from the community.