Analysis of Paired-Damage-seq datasets

Paired-Damage-seq is a single-cell multiomics sequencing technique for joint analysis of transcriptome with oxidative DNA damage and single-stranded DNA breaks.

Based on in situ labeling with "base excision repair" proteins, damaged DNA sites, including 8-oxoG, AP sites, nicks, and gaps, can be labelled with biotinylated dUTP, which enables downstream targeted tagmentation with anti-biotin antibodies and a protein A-Tn5 fusion protein. As a new member of the "Paired series" of multiomics techniques, Paired-Damage-seq also uses ligation-based combinatorial barcoding (first introduced by SPLiT-seq), offering ultra-high-throughput, cost-effective single-cell indexing without requiring specialized instruments.

Using the paired transcriptome, we can computationally "sort" cells and analyze the regions most frequently damaged across different cell types and states. Such selective genome vulnerability is associated with loss of epigenetic memory over time and could contribute to disease risk.

We used customized barcode designs, and the code here is specific to pre-processing Paired-Damage-seq datasets. If you are using a different set of barcode sequences, you may need to prepare your own barcode whitelist files.

Preparation

The code and barcode whitelist files used for analysis of Paired-Damage-seq are organized into four parts:

The pre-processing pipeline to extract cell barcodes, map to the reference genome, and generate the RNA cell-to-gene count matrix.
The code for analysis of HeLa Paired-Damage-seq datasets.
The code for analysis of mouse brain Paired-Damage-seq datasets.
The cell barcode reference and gene annotation reference (for generation of the RNA count matrix), which are uploaded to the "resources" folder.

For additional resources, see the Additional resources section.

Note

Please have the following software installed before running the preprocessing pipeline.

Package Requirements

If you have previously set up the environment for analysis of Paired-Tag or SIMPLE-seq, you may not need to re-install all of them.

Name	Link
bowtie 1.x	http://bowtie-bio.sourceforge.net/index.shtml
bowtie 2.x	http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
samtools (>1.3)	https://www.htslib.org
STAR	https://github.com/alexdobin/STAR
Trim_galore	https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/
FastQC (Optional)	https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Generally, we do not require specific versions of the above software, as long as the parameter formats in the bash scripts here match the installed versions.

An up-to-date GCC compiler is required.
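
Before compiling, it can help to confirm that the required tools are visible on your PATH. The sketch below is only a convenience check, not part of the pipeline: `missing_tools` is a hypothetical helper, and the executable names assume default installations (adjust them if your installation uses different names).

```shell
# Print the name of every tool that is NOT found on PATH (hypothetical helper).
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Executable names below assume default installs; adjust if yours differ.
missing=$(missing_tools bowtie bowtie-build bowtie2 samtools STAR trim_galore gcc)
if [ -n "$missing" ]; then
    printf 'Not found on PATH:\n%s\n' "$missing" >&2
fi
```

An empty result means every listed executable was found; it does not check versions.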

A custom program is provided for extracting and converting the cell barcodes from "Paired" series datasets. To compile the code, please follow these steps:

# Download and uncompress the scripts
wget https://github.com/czhulab/Paired-Damage-seq/archive/refs/heads/main.zip
unzip main.zip

# Compile the "preprocessing"
cd Paired-Damage-seq-main/01.Preprocessing/preproc
sh make.sh

# Compile the "reachtools":
cd ../reachtools
sh make.sh

# Build the Cellular Barcodes reference for bowtie 1.x:
cd ../../resources
unzip Paired_Tag3_384_ID_ref.fa.zip
bowtie-build ./Paired_Tag3_384_ID_ref.fa ./Paired_Tag3_384_ID_ref

Important

We have updated the length and number of barcode combinations for the split-and-pool ligation, which differ from those in SPLiT-seq and Paired-Tag (2021). The preproc tool is provided for correctly extracting the cellular barcodes.
reachtools is only used for generating cell-to-gene/bin count matrices.

Pre-processing of Paired-Damage-seq datasets

Step 1. Initial QC (Optional).

As with other combinatorial barcoding-based "Paired series" datasets, a quick QC can be done with FastQC. FastQC summarizes several key quality metrics of the fastq files generated by the Illumina bcl2fastq program. The key metrics are similar to those of previous Paired-Tag datasets.

# If FastQC is installed:
fastqc Sample_1_R1.fq.gz
fastqc Sample_1_R2.fq.gz

Representative FastQC report from Paired-Tag dataset

The image below shows the "Per base sequence content" and "Adapter Content" sections of the FastQC output file from a representative Paired-Tag library.

As shown in the Read1 report, there is a high fraction of G at the 2nd base, which is expected from the library construction (no trimming is needed for this part, as Bowtie2 will handle it properly).
For Read2, there are 3 base-balanced regions (UMI and barcode) between 3 base-imbalanced linkers (as indicated in the image).
- If the linker regions do not show high fluctuation as in the representative image, that may indicate low ligation efficiency.
For the "Adapter Content" section, we typically expect a low fraction of Nextera adaptor sequence in the Read2 library (5%-20% at the 100th bp; expect a higher percentage if sequenced to 150 bp or longer) and negligible adaptor content in the Read1 library.
A higher fraction of adaptors:
- In the RNA library, indicates that the amount of N5-Tn5 is too high in the 2nd adaptor tagging step of the RNA library.
- In the DNA library, indicates that tagmentation efficiency (antibody efficiency) is low; low library complexity may be expected.

Step 2. Barcodes extraction, mapping, and matrix generation.

During this step, the shell script performs barcode extraction, read cleaning and mapping, PCR duplicate removal, and generation of the cell-counts matrix, using the environment prepared in the Preparation section.

Important

Don't forget to update the paths to your fastq files, barcode references, genome_reference, preprocessing and reachtools folders, and gene_annotation file in the "01.Preprocessing/per_run.sh" bash script.
Cell barcode whitelists, barcode references, and gene_annotation files for mm10 and hg38 are available in the "resources" folder.

# Run this for each individual sub-library.

# DNA_ID and RNA_ID correspond to the prefixes of the fastq files. For example, given:
# Sample_01_DNA_R1.fq.gz, Sample_01_DNA_R2.fq.gz, Sample_01_RNA_R1.fq.gz, Sample_01_RNA_R2.fq.gz
DNA_ID="Sample_01_DNA"
RNA_ID="Sample_01_RNA"

sh per_run.sh "${DNA_ID}" "${RNA_ID}"

For batch job submission, we prepared a simple perl script (01.Preprocessing/01.submit_run.pl) and an example sample table (sample_list.txt). Please modify the job submission line in the perl script to match your own batch submission system.
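
If you prefer not to use the perl script, the same idea can be sketched in plain shell. This is an illustration only: `print_run_commands` is a hypothetical helper, and the table layout it reads (two whitespace-separated columns, DNA_ID then RNA_ID) is an assumption that may not match the provided sample_list.txt.

```shell
# Print one per_run.sh command per sub-library listed in a sample table.
# Assumed table layout (not taken from the repository): two whitespace-separated
# columns, DNA_ID then RNA_ID; blank lines and lines starting with '#' are skipped.
print_run_commands() {
    while read -r dna_id rna_id rest; do
        case "$dna_id" in ''|'#'*) continue ;; esac
        printf 'sh per_run.sh %s %s\n' "$dna_id" "$rna_id"
    done < "$1"
}
```

Printing the commands first lets you inspect them before running anything. From there you could pipe the output to `sh` to run the sub-libraries one after another, or wrap each line in your scheduler's submission command.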

The key output files after this step include:

01.rawdata/*combined_DNA.fq.gz: Extracted barcode reads with DNA restriction cutting sites, which are derived from PAT tagmentation.

01.rawdata/*combined_RNA.fq.gz: Extracted barcode reads with RNA restriction cutting sites, which are derived from reverse transcription.

01.rawdata/*combined_UND.fq.gz: Extracted barcode reads that cannot be assigned to the DNA or RNA modality, possibly due to PCR/sequencing errors and dimer fragments.

01.rawdata/*BC_cov.fq.gz: Converted fastq files with barcode IDs attached to ReadName lines.

02.trimmed/*BC_cov_trimmed.fq.gz: Cleaned reads that will be used for mapping. An additional optional QC can be performed on them.

03.mapping_mm10/*_mm10_sorted.bam: Mapped DNA and RNA bam files, before PCR duplicates removal.

03.mapping_mm10/*_mm10_sorted_rmdup.bam: Mapped DNA and RNA bam files, after PCR duplicates removal.

04.*mtx2/: Cell-counts matrix for individual sub-libraries, in 10X format.
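
Because both the pre- and post-deduplication BAM files are kept, a rough duplication rate can be estimated by comparing their mapped read counts. The sketch below is an optional check and not part of the pipeline: `dup_fraction` is a hypothetical helper, and the file names in the usage comments are illustrative expansions of the patterns listed above.

```shell
# Fraction of mapped reads removed as PCR duplicates, from two read counts.
dup_fraction() {
    awk -v total="$1" -v kept="$2" 'BEGIN {
        if (total <= 0) { print "NA"; exit }
        printf "%.3f\n", (total - kept) / total
    }'
}

# Example usage (file names are illustrative, following the patterns above).
# "samtools view -c -F 4" counts alignments, excluding unmapped reads.
# total=$(samtools view -c -F 4 03.mapping_mm10/Sample_01_DNA_mm10_sorted.bam)
# kept=$(samtools view -c -F 4 03.mapping_mm10/Sample_01_DNA_mm10_sorted_rmdup.bam)
# dup_fraction "$total" "$kept"
```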

Step 3. Pre-filtering, and merging matrices for sub-libraries.

In Paired-Damage-seq, we aliquot the barcoded nuclei into sub-libraries containing 2-10k cells each for library preparation and sequencing. The best approach is to QC and filter each sub-library pair individually, and then merge them for downstream analyses.

The matrix files are in standard 10X format, so you can use your own scripts to perform this task.
You can also use our previous Paired-Tag scripts for this step.

Filtering & merging matrices

We recommend filtering out barcodes with low read numbers before merging sub-libraries. The same scripts used for Paired-Tag are compatible with Paired-Damage-seq.

Count and plot read numbers using R: plot_reads_numbers.R
Filter matrix pairs using perl: perl filt_mtx.pl
- Do not forget to modify the variables to point to your specific files/prefixes in the perl code.
- The metadata file is generated by the R code above.
Merge matrices using perl: perl merge_mtx.pl merge_list.txt
- Example merge_list.txt format is annotated in the script.
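
If you would rather write your own filtering, the per-barcode totals can be read directly from the matrix file. The sketch below is not a replacement for filt_mtx.pl. It assumes an uncompressed Matrix Market coordinate file in which rows are genes/bins and columns are barcodes (the usual 10X layout). The helper names are hypothetical, and the count threshold is something you choose from your own read-count plots.

```shell
# Sum counts per barcode (column) from a Matrix Market coordinate file.
# Output: "<barcode column index> <total counts>", sorted by index.
counts_per_barcode() {
    awk '
        /^%/ { next }                      # skip header/comment lines
        !seen_dims { seen_dims = 1; next } # skip the "rows cols entries" line
        { total[$2] += $3 }
        END { for (col in total) print col, total[col] }
    ' "$1" | sort -n
}

# Print barcode column indices with at least min_counts total counts.
barcodes_passing() {
    counts_per_barcode "$1" | awk -v min="$2" '$2 >= min { print $1 }'
}
```

The column index refers to the line number in the accompanying barcodes file, so the passing indices can be mapped back to barcode names from there.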

Caution

When merging sub-libraries, always use a unique prefix for each DNA-RNA library pair.
- Cells in different sub-libraries may have the same barcode combinations (BC#3:#2:#1). Merging them without adding a sub-library pair-specific prefix (BC#4) may result in barcode conflicts.
- The PCR index is used as the 4th barcode (adding BC#4 -> BC#4:#3:#2:#1) to provide a sufficient number of barcodes.
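
To illustrate why the prefix matters, the sketch below prepends a sub-library label to each barcode and checks for collisions across files. It is a minimal example under assumptions: one barcode per line, ":" as the separator (following the BC#4:#3:#2:#1 notation above), and hypothetical helper names. Check how merge_mtx.pl handles prefixes before applying anything like this to your own files.

```shell
# Prepend a sub-library-specific prefix to every barcode in a barcodes file.
prefix_barcodes() {
    awk -v prefix="$1" '{ print prefix ":" $0 }' "$2"
}

# Print barcodes that occur more than once across the given files
# (empty output means no conflicts).
duplicated_barcodes() {
    cat "$@" | sort | uniq -d
}
```

Running `duplicated_barcodes` on the unprefixed barcode files of two sub-libraries will typically report shared barcodes. After prefixing each file with its own label, the output should be empty.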

Step 4. Downstream analyses.

We recommend performing cell clustering on the transcriptome profiles of Paired-Damage-seq datasets and then pseudobulking the DNA damage signals.

The computational tools for single-cell genomics are rapidly evolving, and it is impossible to recommend the best ones. Here is some of the software we used for these downstream analyses:

Seurat: https://satijalab.org/seurat/
Scanpy: https://scanpy.readthedocs.io/en/stable/
SnapATAC2: https://github.com/kaizhang/SnapATAC2
SEACells: https://github.com/dpeerlab/SEACells

The code we used to produce the presented figures is deposited in the HeLa cell data (02.Analysis_HeLa) and mouse brain data (03.Analysis_Brain) folders. We do not recommend specific versions of the packages; this code and these notebooks are for reference only.

Note

Please check the official documentation for the packages/software used in the above analyses.
The paths to our original files are kept for the record.

Additional resources

If you are interested in applying our methods or datasets, here are the links for some useful resources:

Read our publication: Paper
Download the dataset: GSE268567
We regularly maintain the step-by-step protocol: here.

Our other techniques:

Paired-seq: Paper, Protocol, Codes, Data
Paired-Tag: Paper, Protocol, Codes, Data
Droplet Paired-Tag: Paper, Protocol, Codes, Data
SIMPLE-seq: Paper, Protocol, Codes, Data

Please feel free to contact us if you have any questions or need anything else.

@ New York Genome Center & Weill Cornell Medicine
