This repository contains the code for 'ChromBERT: Uncovering Chromatin State Motifs in the Human Genome using a BERT-based Approach'. If you use our models or code, please cite our paper. We are continuously developing this repository and welcome issue reports.
This package provides the source code for the ChromBERT model, which draws significant inspiration from DNABERT (Y. Ji et al., "DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome", Bioinformatics, 2021). ChromBERT includes pre-trained models for promoter regions (2 kb upstream to 4 kb downstream of the TSS) and whole-genome regions, covering both the 15-chromatin state system (127 cell types from the ROADMAP database) and the 18-chromatin state system (1699 cell types from the IHEC database). Fine-tuned models for gene expression classification and regression (15-chromatin state system) are also provided. For downstream analysis, ChromBERT offers a DTW-based motif clustering and visualization tool.
Utility functions for data preprocessing and analysis are available in the processing/chrombert_utils directory. A Google Colab tutorial is provided for dataset preparation and curation, which we recommend completing before proceeding to the training stage in the training/examples directory.
If you use this repository in your research, please cite our paper:
ChromBERT: Uncovering Chromatin State Motifs in the Human Genome Using a BERT-based Approach
Authors: Seohyun Lee, Che Lin, Chien-Yu Chen, and Ryuichiro Nakato
bioRxiv, July 26, 2024. DOI: 10.1101/2024.07.25.605219
- Operating System: Linux (Ubuntu 22.04 LTS recommended)
- Python: 3.11
- PyTorch: 2.6.0 (check your installed version with conda list | grep torch)
We have tested and confirmed that the following configuration works well for running our model:
| Component | Version / Info |
|---|---|
| CUDA Version | 12.4 |
| cuDNN Version | 9.1.0 |
| NVIDIA Driver Version | 550.78 |
| GPU | Tested with NVIDIA RTX 6000 Ada Generation |
- Recommended GPU: NVIDIA RTX 6000 Ada Generation or higher with appropriate CUDA compatibility
- Memory: 251 GB RAM recommended. This recommendation is based on the memory usage we observed during testing, which included processing large datasets while keeping the model running efficiently.
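To quickly check whether your machine matches this configuration, the following commands can be used (a simple sanity check rather than a strict requirement; run the Python line inside the training environment described below):
$ nvidia-smi   # Reports the NVIDIA driver version, CUDA version, and detected GPUs
$ python -c "import torch; print(torch.__version__, torch.cuda.is_available())"   # PyTorch version and GPU visibility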
To ensure optimal performance and avoid dependency conflicts, we recommend setting up separate environments for data preprocessing and model training. An environment.yml file is provided for each environment for easy setup using Conda (or Mamba for faster dependency resolution). Follow the instructions below to clone the repository and create the environments.
To download the source code to your local machine, execute:
$ git clone https://github.com/caocao0525/ChromBERT
$ cd ChromBERT
Follow these steps to create and activate an environment for data processing and analysis:
Using Conda:
$ conda env create -f environment.yml
$ conda activate chrombert # Activate the environment
# Note: The prompt will change to reflect the current environment name, shown as (chrombert)$
(chrombert)$ conda deactivate # Deactivate current environment
Or, using Mamba (if installed):
$ mamba env create -f environment.yml # Create environment from file
$ mamba activate chrombert
The chrombert_utils package is essential for data preprocessing and downstream analysis of chromatin state sequences. To install it, make sure you are working in the data processing environment and follow these steps:
$ conda activate chrombert
(chrombert)$ cd processing
(chrombert)$ pip install -e .
Follow these steps to create and activate an environment specifically for training:
$ cd training
$ mamba env create -f environment.yml
$ conda activate chrombert_training # Activate the environment
# Note: The prompt will change to reflect the current environment name, shown as (chrombert_training)$
(chrombert_training)$ conda deactivate # Deactivate current environment
Next, in the chrombert_training environment, install the packages for training as follows:
$ conda activate chrombert_training
(chrombert_training)$ cd training
(chrombert_training)$ pip install -e . --config-settings editable_mode=compat # for pip ≥ 25.0 compatibility with editable installs
We highly recommend using the Colab tutorial for preparing your pretraining and fine-tuning data:
For pre-training, fine-tuning, and to replicate our results, we recommend users download the ChromBERT.zip file from the Zenodo link below:
For organized access, please store the downloaded file in an appropriate directory, such as training/examples/prom/pretrain_data.
In this section, we provide procedures for the 4-mer dataset. However, users can change the value of k by modifying the line export KMER=4 in each script to suit their specific requirements.
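For example, to switch a script from 4-mers to 3-mers, the KMER line can be edited in place. The following is a minimal sketch using the pre-training script referenced in the next section; adjust the file name to whichever script you are running:
$ sed -i 's/^export KMER=4/export KMER=3/' run_pretrain.sh   # Replaces "export KMER=4" with "export KMER=3"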
The pre-training script is located in the training/examples/prom/script_pre/ directory. If you change the directory or the names of the training data files, adjust the file names within the script accordingly. The model outputs will be saved in the ../pretrain_result/ directory.
(chrombert_training) $ cd training/examples/prom/script_pre
(chrombert_training) $ bash run_pretrain.sh \
--train_file ../pretrain_data/pretraining_small.txt \
--test_file ../pretrain_data/pretraining_small.txt
Optional arguments:
| Argument | Description | Default value |
|---|---|---|
| --train_file | Path to training data file | ../pretrain_data/pretraining_small.txt |
| --test_file | Path to evaluation data file | Same as training file |
| --max_steps | Maximum number of training steps | 500 |
| --learning_rate | Learning rate for optimizer | 2e-4 |
| --mlm_prob | Masked Language Modeling probability | 0.025 |
| --train_batch | Training batch size per GPU | 5 |
| --eval_batch | Evaluation batch size per GPU | 3 |
Note: The default pretraining_small.txt is a quick test dataset extracted from chromosome 1 of cell type E003.
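As an example, several of the optional arguments can be combined in a single call; the values below are purely illustrative:
(chrombert_training) $ bash run_pretrain.sh \
--train_file ../pretrain_data/pretraining_small.txt \
--max_steps 2000 \
--train_batch 8 \
--eval_batch 4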
Due to the large size of the IHEC pretraining dataset (1699 cell types), the process is divided into two steps:
- Splitting the full dataset into smaller shuffled chunks
- Running looped pretraining over each chunk
The scripts are located in the training/examples/prom_ihec/script_pre/ directory. Before running them, make sure to place the pretraining data file (promoter_ihec_all_4mer_wo_4R.txt, downloaded from Zenodo) in the training/examples/prom_ihec/pretrain_data/ directory.
Step 1: Split the data into shuffled chunks
Run the following script to split the full dataset into chunks of 100,000 lines each. The shuffled and chunked files will be saved under ../pretrain_data/split_chunks/.
(chrombert_training) $ cd training/examples/prom_ihec/script_pre
(chrombert_training) $ bash split_chunk.sh
This script shuffles the input file and splits it into evenly sized chunks for sequential training.
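Conceptually, this amounts to a shuffle followed by a fixed-size split. A minimal sketch of the same idea with GNU coreutils is shown below; the actual split_chunk.sh may use different file names and options:
$ mkdir -p ../pretrain_data/split_chunks
$ shuf ../pretrain_data/promoter_ihec_all_4mer_wo_4R.txt \
| split -l 100000 -d --additional-suffix=.txt - ../pretrain_data/split_chunks/chunk_
# Produces 100,000-line chunks named chunk_00.txt, chunk_01.txt, ...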
Step 2: Run looped pretraining over the chunks
Use the provided pretraining_loop.sh script to sequentially train on each chunk of the shuffled data.
(chrombert_training) $ cd training/examples/prom_ihec/script_pre
(chrombert_training) $ bash pretraining_loop.sh
The model outputs for each chunk will be saved in the ../pretrain_result/ directory.
Following pre-training, the model parameters are saved in the training/examples/prom/pretrain_result/ directory. To replicate our classification results, place the files train.tsv and dev.tsv directly in the examples/prom/ft_data/classification directory. This location contains data for classifying promoter regions between genes that are highly expressed (RPKM > 50) and those that are not expressed (RPKM = 0). Note that the ChromBERT.zip file provides promoter-region fine-tuning data for 57 different cell types under the promoter_finetune_data directory; copy the required files into place accordingly.
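After placing the files, the layout relative to the repository root should look like this:
training/examples/prom/ft_data/classification/
├── train.tsv
└── dev.tsv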
(chrombert_training) $ cd training/examples/prom/script_ft
(chrombert_training) $ bash run_4mer_classification_finetune.sh
Optional arguments:
| Argument | Description | Default |
|---|---|---|
| --model_path | Path to the pre-trained model | ../pretrain_result |
| --data_path | Path to the fine-tuning dataset | ../ft_data/classification |
| --output_path | Directory to save the fine-tuned model | ../ft_result/classification |
| --epochs | Number of training epochs | 10.0 |
| --lr | Learning rate | 2e-5 |
| --batch_size | Batch size for both training and evaluation | 32 |
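For instance, a run that overrides several of these defaults might look like the following; the values are illustrative rather than recommended settings:
(chrombert_training) $ bash run_4mer_classification_finetune.sh \
--model_path ../pretrain_result \
--data_path ../ft_data/classification \
--epochs 5.0 \
--lr 2e-5 \
--batch_size 16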
To replicate our regression results, users should place the files train.tsv and dev.tsv, which contain sequence and log-transformed RPKM value pairs, directly in the training/examples/prom/ft_data/regression directory.
(chrombert_training) $ cd training/examples/prom/script_ft
(chrombert_training) $ bash run_4mer_regression_finetune.sh
To obtain an attention matrix for the prediction result, execute the scripts in the following order: first, run run_4mer_pred.sh in the training/examples/prom/script_pred directory.
(chrombert_training) $ cd training/examples/prom/script_pred
(chrombert_training) $ bash run_4mer_pred.sh
Optional arguments:
| Position | Argument | Description | Default |
|---|---|---|---|
| 1 | KMER | K-mer size used for the tokenizer | 4 |
| 2 | MODEL_PATH | Path to the fine-tuned model | ../ft_result/classification |
| 3 | DATA_PATH | Path to the input data for prediction | ../ft_data/classification |
| 4 | PREDICTION_PATH | Directory to save prediction results | ../predict_result |
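Because the arguments are positional, a fully spelled-out call looks like the following (these are simply the defaults listed above; adjust the paths to your own layout):
(chrombert_training) $ bash run_4mer_pred.sh 4 ../ft_result/classification ../ft_data/classification ../predict_result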
The identification of chromatin state motifs proceeds in two phases: Motif Detection and Motif Clustering. During the Motif Detection phase, chromatin state sequences that have high attention scores and are uniquely associated with the class of interest (for example, the promoter region) are identified and organized into a dataframe. These sequences are then clustered using Dynamic Time Warping (DTW) in the Motif Clustering phase, yielding the definitive chromatin state motifs.
(chrombert) $ cd training/motif/prom
(chrombert) $ bash ./motif_prom.sh
Executing the script as described above generates an init_df.csv file in the ./result directory. This file includes a comprehensive list of chromatin state sequences. To adjust settings such as the window size, minimum sequence length, and minimum occurrence threshold, users can modify the script's arguments as demonstrated below:
(chrombert) $ bash ./motif_prom.sh --window_size 12 --min_len 5 --min_n_motif 2
Optional arguments:
| Argument | Description | Default value |
|---|---|---|
| --window_size | Sliding window size for motif scanning | 12 |
| --min_len | Minimum length of motifs to report | 5 |
| --min_n_motif | Minimum number of motif instances required | 2 |
| --data_path | Path to the input data directory | ../../examples/prom/ft_data |
| --predict_path | Path to the prediction results directory | ../../examples/prom/predict_result |
| --motif_path | Directory to save discovered motifs and plots | ./result |
For further assistance, the --help option provides a detailed explanation of all available arguments, their default settings, and an illustrative example of how to use them:
(chrombert) $ bash ./motif_prom.sh --help
For motif clustering, we recommend using the "Motif Clustering" section in the Colab tutorial linked below:
This project is licensed under the Apache License, Version 2.0. See the LICENSE file for details.
Thank you for checking out ChromBERT. If you found this project useful, please consider starring it on GitHub to help it gain more visibility.