ChromaFold is a deep learning model that predicts 3D contact maps from scATAC-seq data alone, using pseudobulk chromatin accessibility and co-accessibility from scATAC-seq, together with predicted CTCF motif tracks, as input features.
General
- python=3.8
- pytorch=1.11
- numpy=1.21
- pandas=1.4
- scipy=1.7
Visualization
- coolbox=0.3
- matplotlib=3.2
- seaborn=0.11
- tabix=1.11
R
To use ArchR from R for data preprocessing, you can create an R environment, chromafold_env, by following the steps in R_env.sh.
For deploying ChromaFold, you can create a conda environment using the provided .yml file:
conda env create -f chromafold.yml
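Once the environment is active, the pinned versions listed above can be checked with a short script. This is a minimal sketch, not part of ChromaFold: the helper names are hypothetical, and it assumes the packages are visible under their PyPI distribution names (in particular that pytorch is installed as `torch`).

```python
from importlib import metadata

# Pinned versions copied from the dependency list above.
# Assumption: keys are PyPI distribution names ("torch" for pytorch).
PINNED = {
    "torch": "1.11",
    "numpy": "1.21",
    "pandas": "1.4",
    "scipy": "1.7",
    "matplotlib": "3.2",
    "seaborn": "0.11",
}


def matches_pin(installed: str, pinned: str) -> bool:
    """True when the leading version components equal the pinned ones."""
    pin_parts = pinned.split(".")
    return installed.split(".")[: len(pin_parts)] == pin_parts


def check_environment(pins=PINNED):
    """Return {name: (installed_version_or_None, ok)} for each pinned package."""
    report = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        ok = installed is not None and matches_pin(installed, pinned)
        report[name] = (installed, ok)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_environment().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name:12s} pinned={PINNED[name]:6s} installed={installed} {status}")
```

Comparing only the leading components means that a build such as 1.11.0 still satisfies the 1.11 pin, while 1.2.1 does not satisfy 1.21.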
Raw data preparation
Sample raw and processed input data can be downloaded from https://drive.google.com/drive/folders/1p6dulb2z51NF_WA6RnAG4hHuUaKfFPrR?usp=sharing
a) Input data preparation
- Prepare CTCF motif data: CTCF motif data are extracted from the CTCF introduction from the R package AnnotationHub. R scripts for generating motifs for hg38 and mm10 can be found at process_input/ctcf_motif. We also provide ready-to-use CTCF motif scores for hg38, hg19, and mm10 in the Google Drive folder.
- Prepare scATAC data for inference: please refer to the full instructions at preprocessing_pipeline.
- A toy processed input folder containing only chr19 can be found at data_subset. The full set of processed input files can be found in the Google Drive folder.
b) Target data preparation
- An example raw Hi-C file for IMR-90 can be downloaded from ENCODE (https://www.encodeproject.org/files/ENCFF843MZF/@@download/ENCFF843MZF.hic).
- Prepare normalized Hi-C library for target: please refer to the full instructions at process input/hic_normalization (also shown below).
- The HiCDC+ normalized training target for IMR-90 (all chromosomes) is available in the Google Drive folder, and a chr19 subset is available at data_subset.
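The ENCODE download above can also be scripted. The following is a minimal sketch using only the Python standard library; the function names and the destination directory are hypothetical and not part of the ChromaFold code base. Only the URL comes from the instructions above.

```python
import os
import posixpath
import urllib.request
from urllib.parse import urlparse

# URL of the example IMR-90 Hi-C file given above.
ENCODE_HIC_URL = (
    "https://www.encodeproject.org/files/ENCFF843MZF/@@download/ENCFF843MZF.hic"
)


def filename_from_url(url: str) -> str:
    """Use the last path component of the URL as the local file name."""
    return posixpath.basename(urlparse(url).path)


def download(url: str, dest_dir: str = ".") -> str:
    """Download url into dest_dir unless the file already exists; return its path."""
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, filename_from_url(url))
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest


# Example call (the .hic file is large, so this is left commented out):
# hic_path = download(ENCODE_HIC_URL, dest_dir=".")
```

Skipping the download when the file is already present makes the script safe to rerun, although it does not detect a partially downloaded file.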
Integration for training
Prepare Hi-C data for training
- Run process input/Process Input - Hi-C.ipynb.
- The Juicer Tools jar file can be downloaded from https://s3.amazonaws.com/hicfiles.tc4ga.com/public/juicer/juicer_tools_1.22.01.jar . If this version of Juicer Tools is not compatible with your Java installation, please use an earlier version of Juicer Tools.
1. Run inference on germinal center B cell with ChromaFold
- Run inference on full chromosome without offset
python ./chromafold/inference.py --data-path ./data/processed_input/ -ct imr90 --model-path ./checkpoints/chromafold_CTCFmotif.pth.tar -chrom 19 -offset -2000000 --genome hg38
- Run inference only on regions with complete input information
python ./chromafold/inference.py --data-path ./data/processed_input/ -ct imr90 --model-path ./checkpoints/chromafold_CTCFmotif.pth.tar -chrom 19 -offset 0 --genome hg38
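To run the same inference command over several chromosomes, the call can be wrapped in a small driver script. This is a sketch under assumptions: the helper names are hypothetical, the flags are copied verbatim from the two commands above, and it assumes inference.py is invoked once per chromosome and that processed input exists for each chromosome requested (the toy data_subset folder contains only chr19).

```python
import subprocess
import sys


def build_inference_cmd(
    chrom,
    offset,
    cell_type="imr90",
    genome="hg38",
    data_path="./data/processed_input/",
    model_path="./checkpoints/chromafold_CTCFmotif.pth.tar",
):
    """Build the argument list for one inference.py call (flags as shown above)."""
    return [
        sys.executable,
        "./chromafold/inference.py",
        "--data-path", data_path,
        "-ct", cell_type,
        "--model-path", model_path,
        "-chrom", str(chrom),
        "-offset", str(offset),
        "--genome", genome,
    ]


def run_inference(chroms, offset, dry_run=True):
    """Print (dry_run=True) or execute one inference command per chromosome."""
    cmds = [build_inference_cmd(c, offset) for c in chroms]
    for cmd in cmds:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the loop at the first failing chromosome.
            subprocess.run(cmd, check=True)
    return cmds
```

For example, `run_inference([19], offset=0)` prints the second command above with the current Python interpreter in place of `python`; passing `dry_run=False` executes it from the repository root.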
1. Training on 3 cell types
- Train model without co-accessibility component
python ./chromafold/train_bulkOnly.py --data-path ./data/processed_input/ -ct gm12878_hg38 umb_endo imr90
- Train model with co-accessibility
python ./chromafold/train.py --data-path ./data/processed_input/ -ct gm12878_hg38 umb_endo imr90 --mod-name bothInput
- Train deterministic model for full reproducibility
python ./chromafold/train_bulkOnly.py --data-path ./data/processed_input/ -ct gm12878_hg38 umb_endo imr90 --deterministic --mod-name deterministic
2. Training on 1 cell type
- Train model on HUVEC without co-accessibility component
python ./chromafold/train_bulkOnly.py --data-path ./data/processed_input/ -ct umb_endo