This repository contains the code for 'ChromBERT: Uncovering Chromatin State Motifs in the Human Genome using a BERT-based Approach'. If you use our models or code, please cite our paper. We are continuously developing this repository and welcome issue reports.
This package provides the source code for the ChromBERT model, which draws significant inspiration from DNABERT (Y. Ji et al., "DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome", Bioinformatics, 2021). ChromBERT includes pre-trained models for promoter regions (2 kb upstream to 4 kb downstream of the TSS) and whole-genome regions, covering both the 15-chromatin state system (127 cell types from the ROADMAP database) and the 18-chromatin state system (1699 cell types from the IHEC database). Fine-tuned models for gene expression classification and regression (15-chromatin state system) are also provided. For downstream analysis, ChromBERT offers a DTW-based motif clustering and visualization tool.
Utility functions for data preprocessing and analysis are available in the processing/chrombert_utils directory. A Google Colab tutorial is provided for dataset preparation and curation, which we recommend completing before proceeding to the training stage in the training/examples directory.
If you use this repository in your research, please cite our paper:
ChromBERT: Uncovering Chromatin State Motifs in the Human Genome Using a BERT-based Approach
Authors: Seohyun Lee, Che Lin, Chien-Yu Chen, and Ryuichiro Nakato
bioRxiv, July 26, 2024. DOI: 10.1101/2024.07.25.605219
- Operating System: Linux (Ubuntu 22.04 LTS recommended)
- Python: 3.11
- PyTorch: 2.6.0 (check your installed version with conda list | grep torch)
We have tested and confirmed that the following configuration works well for running our model:
| Component | Version / Info |
|---|---|
| CUDA Version | 12.4 |
| cuDNN Version | 9.1.0 |
| NVIDIA Driver Version | 550.78 |
| GPU | Tested with NVIDIA RTX 6000 Ada Generation |
- Recommended GPU: NVIDIA RTX 6000 Ada Generation or higher with appropriate CUDA compatibility
- Memory: 251 GB RAM recommended. This recommendation is based on the memory usage we observed during testing, which included processing large datasets while keeping the model running efficiently.
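To quickly check whether your machine matches this configuration, the following commands can be used (a simple sanity check rather than a strict requirement; run the Python line inside the training environment described below):
$ nvidia-smi   # Reports the NVIDIA driver version, CUDA version, and detected GPUs
$ python -c "import torch; print(torch.__version__, torch.cuda.is_available())"   # PyTorch version and GPU visibility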
To ensure optimal performance and avoid dependency conflicts, we recommend setting up separate environments for data preprocessing and model training. An environment.yml file is provided for each environment for easy setup using Conda (or Mamba for faster dependency resolution). Follow the instructions below to clone the repository and create the environments.
To download the source code to your local machine, execute:
$ git clone https://github.com/caocao0525/ChromBERT
$ cd ChromBERT
Follow these steps to create and activate an environment for data processing and analysis:
Using Conda:
$ conda env create -f environment.yml
$ conda activate chrombert # Activate the environment
# Note: The prompt will change to reflect the current environment name, shown as (chrombert)$
(chrombert)$ conda deactivate # Deactivate current environment
Or, using Mamba (if installed):
$ mamba env create -f environment.yml # Create environment from file
$ mamba activate chrombert
The chrombert_utils package is essential for data preprocessing and downstream analysis of chromatin state sequences. To install it, make sure you are working in the data processing environment and follow these steps:
$ conda activate chrombert
(chrombert)$ cd processing
(chrombert)$ pip install -e .
Follow these steps to create and activate an environment specifically for training:
$ cd training
$ mamba env create -f environment.yml
$ conda activate chrombert_training # Activate the environment
# Note: The prompt will change to reflect the current environment name, shown as (chrombert_training)$
(chrombert_training)$ conda deactivate # Deactivate current environment
Next, in the chrombert_training environment, install the packages for training as follows:
$ conda activate chrombert_training
(chrombert_training)$ cd training
(chrombert_training)$ pip install -e . --config-settings editable_mode=compat # for pip ≥ 25.0 compatibility with editable installs
We highly recommend using the Colab tutorial for preparing your pretraining and fine-tuning data:
For pre-training, fine-tuning, and to replicate our results, we recommend users download the ChromBERT.zip file from the Zenodo link below:
For organized access, please store the downloaded file in an appropriate directory, such as training/examples/prom/pretrain_data.
In this section, we provide procedures for the 4-mer dataset. However, users can change the value of k by modifying the line export KMER=4 in each script to suit their specific requirements.
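For example, to switch a script from 4-mers to 3-mers, the KMER line can be edited in place. The following is a minimal sketch using the pre-training script referenced in the next section; adjust the file name to whichever script you are running:
$ sed -i 's/^export KMER=4/export KMER=3/' run_pretrain.sh   # Replaces "export KMER=4" with "export KMER=3"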
The pre-training script is located in the training/examples/prom/script_pre/ directory. If you change the directory or the names of the training data files, adjust the file names within the script accordingly. The model outputs will be saved in the ../pretrain_result/ directory.
(chrombert_training) $ cd training/examples/prom/script_pre
(chrombert_training) $ bash run_pretrain.sh \
--train_file ../pretrain_data/pretraining_small.txt \
--test_file ../pretrain_data/pretraining_small.txt
Optional arguments:
| Argument | Description | Default value |
|---|---|---|
| --train_file | Path to training data file | ../pretrain_data/pretraining_small.txt |
| --test_file | Path to evaluation data file | Same as training file |
| --max_steps | Maximum number of training steps | 500 |
| --learning_rate | Learning rate for optimizer | 2e-4 |
| --mlm_prob | Masked Language Modeling probability | 0.025 |
| --train_batch | Training batch size per GPU | 5 |
| --eval_batch | Evaluation batch size per GPU | 3 |
Note: The default pretraining_small.txt is a quick test dataset extracted from chromosome 1 of cell type E003.
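As an example, several of the optional arguments can be combined in a single call; the values below are purely illustrative:
(chrombert_training) $ bash run_pretrain.sh \
--train_file ../pretrain_data/pretraining_small.txt \
--max_steps 2000 \
--train_batch 8 \
--eval_batch 4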
Due to the large size of the IHEC pretraining dataset (1699 cell types), the process is divided into two steps:
- Splitting the full dataset into smaller shuffled chunks
- Running looped pretraining over each chunk
The scripts are located in the training/examples/prom_ihec/script_pre/ directory. Before running them, make sure to place the pretraining data file (promoter_ihec_all_4mer_wo_4R.txt, downloaded from Zenodo) in the training/examples/prom_ihec/pretrain_data/ directory.
Step 1: Split the data into shuffled chunks
Run the following script to split the full dataset into chunks of 100,000 lines each. The shuffled and chunked files will be saved under ../pretrain_data/split_chunks/.
(chrombert_training) $ cd training/examples/prom_ihec/script_pre
(chrombert_training) $ bash split_chunk.sh
This script shuffles the input file and splits it into evenly sized chunks for sequential training.
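Conceptually, this amounts to a shuffle followed by a fixed-size split. A minimal sketch of the same idea with GNU coreutils is shown below; the actual split_chunk.sh may use different file names and options:
$ mkdir -p ../pretrain_data/split_chunks
$ shuf ../pretrain_data/promoter_ihec_all_4mer_wo_4R.txt \
| split -l 100000 -d --additional-suffix=.txt - ../pretrain_data/split_chunks/chunk_
# Produces 100,000-line chunks named chunk_00.txt, chunk_01.txt, ...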
Step 2: Run looped pretraining over the chunks
Use the provided pretraining_loop.sh script to sequentially train on each chunk of the shuffled data.
(chrombert_training) $ cd training/examples/prom_ihec/script_pre
(chrombert_training) $ bash pretraining_loop.sh
The model outputs for each chunk will be saved in the ../pretrain_result/ directory.
Following pre-training, the model parameters are saved in the training/examples/prom/pretrain_result/ directory. To replicate our classification results, place the files train.tsv and dev.tsv directly in the examples/prom/ft_data/classification directory. This location contains data for classifying promoter regions between genes that are highly expressed (RPKM > 50) and those that are not expressed (RPKM = 0). Note that the ChromBERT.zip file provides promoter-region fine-tuning data for 57 different cell types under the promoter_finetune_data directory; copy the required files into place accordingly.
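After placing the files, the layout relative to the repository root should look like this:
training/examples/prom/ft_data/classification/
├── train.tsv
└── dev.tsv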
(chrombert_training) $ cd training/examples/prom/script_ft
(chrombert_training) $ bash run_4mer_classification_finetune.sh
Optional arguments:
| Argument | Description | Default |
|---|---|---|
| --model_path | Path to the pre-trained model | ../pretrain_result |
| --data_path | Path to the fine-tuning dataset | ../ft_data/classification |
| --output_path | Directory to save the fine-tuned model | ../ft_result/classification |
| --epochs | Number of training epochs | 10.0 |
| --lr | Learning rate | 2e-5 |
| --batch_size | Batch size for both training and evaluation | 32 |
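For instance, a run that overrides several of these defaults might look like the following; the values are illustrative rather than recommended settings:
(chrombert_training) $ bash run_4mer_classification_finetune.sh \
--model_path ../pretrain_result \
--data_path ../ft_data/classification \
--epochs 5.0 \
--lr 2e-5 \
--batch_size 16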
To replicate our regression results, users should place the files train.tsv and dev.tsv, which contain sequence and log-transformed RPKM value pairs, directly in the training/examples/prom/ft_data/regression directory.
(chrombert_training) $ cd training/examples/prom/script_ft
(chrombert_training) $ bash run_4mer_regression_finetune.sh
To obtain an attention matrix for the prediction result, execute the scripts in the following order: first, run run_4mer_pred.sh in the training/examples/prom/script_pred directory.
(chrombert_training) $ cd training/examples/prom/script_pred
(chrombert_training) $ bash run_4mer_pred.sh
Optional arguments:
| Position | Argument | Description | Default |
|---|---|---|---|
| 1 | KMER | K-mer size used for the tokenizer | 4 |
| 2 | MODEL_PATH | Path to the fine-tuned model | ../ft_result/classification |
| 3 | DATA_PATH | Path to the input data for prediction | ../ft_data/classification |
| 4 | PREDICTION_PATH | Directory to save prediction results | ../predict_result |
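Because the arguments are positional, a fully spelled-out call looks like the following (these are simply the defaults listed above; adjust the paths to your own layout):
(chrombert_training) $ bash run_4mer_pred.sh 4 ../ft_result/classification ../ft_data/classification ../predict_result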
The identification of chromatin state motifs proceeds in two phases: Motif Detection and Motif Clustering. During the Motif Detection phase, chromatin state sequences that have high attention scores and are uniquely associated with the class of interest (for example, the promoter region) are identified and organized into a dataframe. These sequences are then clustered using Dynamic Time Warping (DTW) in the Motif Clustering phase, yielding the definitive chromatin state motifs.
(chrombert) $ cd training/motif/prom
(chrombert) $ bash ./motif_prom.sh
Executing the script as described above generates an init_df.csv file in the ./result directory. This file includes a comprehensive list of chromatin state sequences. To adjust settings such as the window size, minimum sequence length, and minimum occurrence threshold, users can modify the script's arguments as demonstrated below:
(chrombert) $ bash ./motif_prom.sh --window_size 12 --min_len 5 --min_n_motif 2
Optional arguments:
| Argument | Description | Default value |
|---|---|---|
| --window_size | Sliding window size for motif scanning | 12 |
| --min_len | Minimum length of motifs to report | 5 |
| --min_n_motif | Minimum number of motif instances required | 2 |
| --data_path | Path to the input data directory | ../../examples/prom/ft_data |
| --predict_path | Path to the prediction results directory | ../../examples/prom/predict_result |
| --motif_path | Directory to save discovered motifs and plots | ./result |
For further assistance, the --help option provides a detailed explanation of all available arguments, their default settings, and an illustrative example of how to use them:
(chrombert) $ bash ./motif_prom.sh --help
For motif clustering, we recommend using the "Motif Clustering" section in the Colab tutorial linked below:
This project is licensed under the Apache License, Version 2.0. See the LICENSE file for details.
Thank you for checking out ChromBERT. If you found this project useful, please consider starring it on GitHub to help it gain more visibility.