Download model configurations and weights, and place them in model/pretrained/[Model Name].
RNAFM:
Download the pretrained model from:
https://proj.cse.cuhk.edu.hk/rnafm/api/download?filename=RNA-FM_pretrained.pth
RNABERT and RNAMSM:
Download weights from Link and Link. Referenced from RNAErnie.
RNAErnie:
We use the PyTorch version of the model provided by the authors: https://huggingface.co/LLM-EDA/RNAErnie/tree/main
SpliceBERT:
Model weights are available on Zenodo.
DNABERT:
We use the popular DNA_bert_3.
DNABERT2:
Available at link.
GENA-LM:
Available at link.
UTRLM:
The model is available at this link.
Nucleotide Transformer:
We use the best-reported version: nucleotide-transformer-v2-500m-multi-species.
We are currently in the process of packaging and uploading all model weights to Google Drive for your convenience. The upload will take some additional time to complete.
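Once the weights are downloaded, a short script can confirm that each model has its own non-empty folder under model/pretrained. The following is a minimal sketch, not part of the repository: the folder names in EXPECTED_MODELS are assumptions for illustration and should be replaced with the [Model Name] values the model loaders actually expect.

```python
from pathlib import Path

# Hypothetical folder names; adjust to the [Model Name] values used by the code.
EXPECTED_MODELS = [
    "RNAFM", "RNABERT", "RNAMSM", "RNAErnie", "SpliceBERT",
    "DNABERT", "DNABERT2", "GENA-LM", "UTRLM", "NucleotideTransformer",
]


def missing_model_dirs(root, model_names):
    """Return the model names whose folder under `root` is absent or empty."""
    root = Path(root)
    missing = []
    for name in model_names:
        model_dir = root / name
        # A folder counts as present only if it exists and holds at least one entry.
        if not model_dir.is_dir() or not any(model_dir.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in missing_model_dirs("model/pretrained", EXPECTED_MODELS):
        print(f"missing or empty: model/pretrained/{name}")
```

The check only looks at folder presence; it does not validate the contents of the weight files.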
All analyses were conducted on a cluster node with 32 CPU cores and 4 NVIDIA Tesla A100 40 GB GPUs. Each task requires at least one GPU.
A Linux system is required. Use conda and pip to manage dependencies:
conda env create -f environment_1019.yml
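Before launching a task, it can help to confirm that the machine meets the requirements above (Linux, at least one GPU). The sketch below is an assumption-based helper, not part of the repository. It relies only on the standard library and on the nvidia-smi command-line tool being on PATH, and reports zero GPUs if that tool is missing.

```python
import platform
import shutil
import subprocess


def count_visible_gpus():
    """Count GPUs listed by `nvidia-smi -L`; return 0 if the tool is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return 0
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return 0
    # Each physical GPU is listed on a line starting with "GPU <index>:".
    return sum(1 for line in result.stdout.splitlines() if line.startswith("GPU "))


def check_environment():
    """Summarize whether the host is Linux and how many GPUs are visible."""
    return {
        "is_linux": platform.system() == "Linux",
        "gpu_count": count_visible_gpus(),
    }


if __name__ == "__main__":
    status = check_environment()
    print(status)
    if not status["is_linux"] or status["gpu_count"] < 1:
        print("Warning: a Linux system with at least one GPU is required.")
```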
Datasets can be sourced from the manuscript's Data Availability sections. We are preparing a repository to release the code for building final datasets.
Essential data files are also available on Google Drive. Download and place them in ./dataset.
- Datasets for ncRNA, m6A, and MRL are directly available.
- For splicing prediction, run scripts/makedata_splice.sh to generate the final dataset (~50GB).
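Because the splicing dataset is roughly 50GB, checking free disk space before generating it avoids a failed run partway through. This is a minimal sketch using only the standard library. The 50 GB threshold is taken from the estimate above, and the choice of directory to check is an assumption.

```python
import shutil
from pathlib import Path

REQUIRED_GB = 50  # approximate size of the final splicing dataset


def has_free_space(path, required_gb):
    """Return True if the filesystem holding `path` has at least `required_gb` GiB free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3


if __name__ == "__main__":
    # Check ./dataset if it exists, otherwise the current directory.
    target = Path("dataset") if Path("dataset").is_dir() else Path(".")
    if has_free_space(target, REQUIRED_GB):
        print(f"Enough free space under {target} for the splicing dataset.")
    else:
        print(f"Less than {REQUIRED_GB} GiB free under {target}.")
```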
Example script: scripts/cls/HPC_run_1.sh.
Example script: scripts/m6A/HPC_run_1.sh.
- Run scripts/makedata_splice.sh to create datasets.
- Example script: scripts/splice/HPC_run_1.sh.
Example script: scripts/mrl/HPC_run_1.sh.
- Extract test results from program output and compile them into a table.
  - Separate stdout and stderr for clarity:
    bash scripts/run_splice_train_test_53.sh > output.txt 2> error_output.txt
  - On Slurm clusters, stdout and stderr are written to separate files when the job script sets both the --output and --error options.
- Convert output to a table using parse_output.py in the analyzer folder:
  cd analyzer
  python parse_output.py -i tables/m6a101_4_0.1.txt
  Example output: analyzer/tables/m6a101_4_0.1_collected_data.csv.
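The stdout/stderr separation described above can also be done from Python when runs are launched programmatically. The sketch below is illustrative only: run_with_split_logs is a hypothetical helper, not a function from this repository, and it simply mirrors the shell redirection shown earlier.

```python
import subprocess
from pathlib import Path


def run_with_split_logs(command, stdout_path, stderr_path):
    """Run `command`, writing stdout and stderr to separate files; return the exit code."""
    with open(stdout_path, "w") as out, open(stderr_path, "w") as err:
        completed = subprocess.run(command, stdout=out, stderr=err, check=False)
    return completed.returncode


if __name__ == "__main__":
    script = Path("scripts/run_splice_train_test_53.sh")
    if script.exists():
        code = run_with_split_logs(
            ["bash", str(script)], "output.txt", "error_output.txt"
        )
        print(f"finished with exit code {code}")
    else:
        print(f"{script} not found; run this from the repository root")
```

Only output.txt then needs to be passed to the parsing step, which keeps warnings and tracebacks out of the results table.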
The generated table serves as input for plotting. See analyzer/analyze.ipynb for an example.
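For a quick look at the collected table outside the notebook, the CSV can be read with the standard library. This is a minimal sketch that makes no assumption about the column names written by parse_output.py. It only lists whatever columns the file contains, and load_table is a hypothetical helper name.

```python
import csv
from pathlib import Path


def load_table(csv_path):
    """Read a CSV file into (column_names, rows), where rows is a list of dicts."""
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        columns = reader.fieldnames or []
    return list(columns), rows


if __name__ == "__main__":
    table_path = Path("analyzer/tables/m6a101_4_0.1_collected_data.csv")
    if table_path.exists():
        columns, rows = load_table(table_path)
        print(f"{len(rows)} rows, columns: {columns}")
    else:
        print(f"{table_path} not found; run parse_output.py first")
```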
- dataset: Scripts and utilities for dataset creation and loading.
- evaluator: Functions for model loading, training, and evaluation.
- logs: Directory for log files.
- model: Model definitions and implementations.
- scripts: Reference scripts for running the project.
Main entry points: seq_cls.py, m6a_cls.py, splice_cls.py, and mrl_pred.py. Customize these scripts for specific tests.