biom-benchmark

Setup

Model Weights

Download model configurations and weights, and place them in model/pretrained/[Model Name].

RNAFM:
Download the pretrained model from:
https://proj.cse.cuhk.edu.hk/rnafm/api/download?filename=RNA-FM_pretrained.pth
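
For scripted setups, the download can be automated. The following is a minimal sketch using only the Python standard library; the `download` helper and the target directory name `RNAFM` are assumptions for illustration, not part of this repository.

```python
import shutil
import urllib.request
from pathlib import Path

RNAFM_URL = "https://proj.cse.cuhk.edu.hk/rnafm/api/download?filename=RNA-FM_pretrained.pth"


def download(url, dest):
    """Stream url to dest, creating parent directories; skip if dest exists."""
    dest = Path(dest)
    if dest.exists():
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # is never mistaken for a complete file.
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(dest)
    return dest


# Example call (the directory name "RNAFM" is an assumption):
# download(RNAFM_URL, "model/pretrained/RNAFM/RNA-FM_pretrained.pth")
```

Writing to a `.part` file and renaming it at the end keeps a partially downloaded checkpoint from being loaded by accident.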

RNABERT and RNAMSM:
Download the weights from Link and Link. These links are the ones referenced by RNAErnie.

RNAErnie:
We use the PyTorch version of the model provided by the authors:
https://huggingface.co/LLM-EDA/RNAErnie/tree/main

SpliceBERT:
Model weights are available on Zenodo.

DNABERT:
We use the popular DNA_bert_3.

DNABERT2:
Available at link.

GENA-LM:
Available at link.

UTRLM:
The model is available at this link.

Nucleotide Transformer:
We use the version with the best reported performance: nucleotide-transformer-v2-500m-multi-species.

For convenience, we are packaging all model weights for upload to Google Drive. The upload is not yet complete.
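
Once the weights are downloaded, it can help to confirm that every model has a populated directory under model/pretrained before launching a run. The sketch below is one way to do that. The directory names in `EXPECTED_MODELS` are hypothetical placeholders for [Model Name] and should be replaced with the names the scripts actually expect.

```python
from pathlib import Path

# Hypothetical directory names; replace with the [Model Name] values in use.
EXPECTED_MODELS = [
    "RNAFM", "RNABERT", "RNAMSM", "RNAErnie", "SpliceBERT",
    "DNABERT", "DNABERT2", "GENA-LM", "UTRLM", "NucleotideTransformer",
]


def missing_models(root="model/pretrained", names=EXPECTED_MODELS):
    """Return the names whose directory under root is absent or empty."""
    root = Path(root)
    missing = []
    for name in names:
        model_dir = root / name
        if not model_dir.is_dir() or not any(model_dir.iterdir()):
            missing.append(name)
    return missing


# Example: print(missing_models()) lists models that still need downloading.
```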

Hardware Requirements

All analyses were run on a cluster node with 32 CPU cores and 4 NVIDIA A100 40 GB GPUs. Each task requires at least one GPU.
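
To check whether a node meets the GPU requirement before submitting a job, a small helper can query `nvidia-smi`. This is a sketch under the assumption that the NVIDIA driver tools are on the PATH; the `list_gpus` function is illustrative and not part of this repository.

```python
import shutil
import subprocess


def list_gpus():
    """Return one 'name, memory' string per visible GPU, or [] if none are found."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        # The tool is installed but no usable driver or device was found.
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


# Example: an empty list means no GPU is available for a task.
# print(list_gpus())
```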

Software Environment

A Linux system is required. Dependencies are managed with conda and pip. Create the environment from the provided file:

conda env create -f environment_1019.yaml

Running Pipelines

Prepare Datasets

The source datasets are listed in the Data Availability section of the manuscript. We are preparing a repository to release the code for building the final datasets.

Essential data files are also available on Google Drive. Download and place them in ./dataset.

Datasets for ncRNA, m6A, and MRL are ready to use once downloaded.
For splicing prediction, run scripts/makedata_splice.sh to generate the final dataset (~50GB).
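
Because the splicing dataset is large, it may be worth checking free disk space before running the script. A minimal sketch follows; the `has_free_space` helper is hypothetical, and the 50 GB default simply mirrors the approximate size quoted above, so extra headroom for intermediate files may be needed.

```python
import shutil


def has_free_space(path=".", required_gb=50):
    """Return True if the filesystem holding path has at least required_gb free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 10**9


# Example: check the current directory before generating the dataset.
# if not has_free_space("."):
#     raise SystemExit("Not enough disk space for the splicing dataset")
```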

nRC Prediction

Example script: scripts/cls/HPC_run_1.sh.

m6A Prediction

Example script: scripts/m6A/HPC_run_1.sh.

Splicing Prediction

First run scripts/makedata_splice.sh to build the dataset (see Prepare Datasets).
Example script: scripts/splice/HPC_run_1.sh.

MRL Prediction

Example script: scripts/mrl/HPC_run_1.sh.

Gather Results

Extract test results from program output and compile them into a table.
- Separate stdout and stderr for clarity:
```
bash scripts/run_splice_train_test_53.sh > output.txt 2>error_output.txt
```
- On Slurm clusters, stdout and stderr are automatically separated.
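
When runs are launched from Python rather than from a shell, the same separation of stdout and stderr can be achieved with `subprocess`. This is a sketch only; `run_and_split` is a hypothetical helper, and the default file names mirror the shell example above.

```python
import subprocess


def run_and_split(cmd, out_path="output.txt", err_path="error_output.txt"):
    """Run cmd, writing stdout and stderr to separate files; return the exit code."""
    with open(out_path, "w") as out, open(err_path, "w") as err:
        completed = subprocess.run(cmd, stdout=out, stderr=err, check=False)
    return completed.returncode


# Example, mirroring the shell command above:
# run_and_split(["bash", "scripts/run_splice_train_test_53.sh"])
```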

Convert the output to a table with parse_output.py in the analyzer folder:
```
cd analyzer
python parse_output.py -i tables/m6a101_4_0.1.txt
```
Example output: analyzer/tables/m6a101_4_0.1_collected_data.csv.

The generated table serves as input for plotting. See analyzer/analyze.ipynb for an example.
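
Outside the notebook, the collected CSV can be inspected with the standard library before plotting. The column names of the file are not documented here, so this sketch only loads the rows and reports whatever header is present; `load_table` and `column_names` are hypothetical helpers.

```python
import csv


def load_table(path):
    """Read a CSV file into a list of row dictionaries keyed by the header."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def column_names(rows):
    """Return the column names of the loaded table, or [] if it has no rows."""
    return list(rows[0].keys()) if rows else []


# Example, using the sample output path mentioned above:
# rows = load_table("analyzer/tables/m6a101_4_0.1_collected_data.csv")
# print(column_names(rows), len(rows))
```

All values are read as strings, so numeric columns need an explicit conversion before plotting.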

Code Structure

dataset: Scripts and utilities for dataset creation and loading.
evaluator: Functions for model loading, training, and evaluation.
logs: Directory for log files.
model: Model definitions and implementations.
scripts: Reference scripts for running the project.

Main entry points: seq_cls.py, m6a_cls.py, splice_cls.py, and mrl_pred.py. Customize these scripts for specific tests.
