HALE (Haplotype-Aware Long-read Error correction) is a haplotype-aware error correction tool designed for long reads. It has been primarily evaluated on PacBio HiFi data, with planned extensions to support ONT simplex reads in the future.
- Linux OS (tested on RHEL 8 and Ubuntu 22.04)
- Rust (install using rustup, the recommended way to get Rust)
- Python 3.10 or above (and conda) for data preprocessing
Make sure the following system packages are installed (Linux):
- build-essential, autoconf, libtool, pkg-config
Note: On most Linux systems these packages are already installed, especially if development tools have been set up previously.
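A quick way to check whether these tools are present is a small shell loop like the one below. The `apt-get` hint in the message assumes a Debian/Ubuntu system (on RHEL, use `dnf` instead); `gcc` stands in for `build-essential`, which is a meta-package rather than a command:

```shell
# Report which of the required build tools are on PATH.
status=""
for tool in gcc autoconf libtool pkg-config; do
  if command -v "$tool" >/dev/null 2>&1; then
    status="$status $tool:found"
  else
    status="$status $tool:missing"
    echo "$tool missing (e.g. sudo apt-get install build-essential autoconf libtool pkg-config)"
  fi
done
echo "checked:$status"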
The entire test workflow below will take about 5-6 minutes. Users can either run the commands one by one or copy the commands into an executable script.
```shell
# Install HALE
git clone https://github.com/at-cg/HALE.git
cd HALE && RUSTFLAGS="-Ctarget-cpu=native" cargo build -q --release

# Create conda env
conda env create --file scripts/hale-env.yml
conda activate hale

mkdir -p test_run && cd test_run/

# Download the small test dataset
wget -O HG002.chr19_10M_12M.fastq.gz 'https://zenodo.org/records/14048797/files/HG002.chr19_10M_12M.fastq.gz?download=1'

# Get all read ids in a separate file
seqkit seq -ni HG002.chr19_10M_12M.fastq.gz > HG002.chr19_10M_12M.read_ids

# Run all-vs-all overlap
../scripts/create_batched_alignments.sh HG002.chr19_10M_12M.fastq.gz HG002.chr19_10M_12M.read_ids 8 batch_alignments

# Run hale correct
../target/release/hale correct --read-alns batch_alignments -t 8 HG002.chr19_10M_12M.fastq.gz HG002.chr19_10M_12M_corrected.fa
```
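After the run finishes, a quick sanity check is to count the number of corrected reads in the output FASTA. The snippet below demonstrates the check on a tiny inline example; in a real run, point it at HG002.chr19_10M_12M_corrected.fa instead of the toy file:

```shell
# Each corrected read is one FASTA record; count the header lines.
# toy_corrected.fa is a stand-in for HG002.chr19_10M_12M_corrected.fa.
printf '>read1\nACGTACGT\n>read2\nGGCCTTAA\n' > toy_corrected.fa
n_reads=$(grep -c '^>' toy_corrected.fa)
echo "corrected reads: $n_reads"   # → corrected reads: 2
```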
For large inputs, users are recommended to increase the thread count according to the number of cores available. With 64 threads on a CPU-based Perlmutter node, HALE takes about 16 minutes for all-vs-all overlap and 9 minutes for correction, consuming ~50 GB of RAM, on a 60x HiFi dataset of human chr9 (HG002).
- Clone the repository:

```shell
git clone https://github.com/at-cg/HALE.git
```

- Compile the source code:

```shell
cd HALE
RUSTFLAGS="-Ctarget-cpu=native" cargo build -q --release
```

- Create the conda environment:

```shell
conda env create --file scripts/hale-env.yml
conda activate hale
```
- minimap2 alignment and batching:

```shell
scripts/create_batched_alignments.sh <input_fastq/input_fastq.gz> <read_ids> <num_of_threads> <directory_for_batches_of_alignments>
```

We use the same minimap2 parameters as HERRO.

Note: Read ids can be obtained with seqkit:

```shell
seqkit seq -ni <input_fastq/input_fastq.gz> > <read_ids>
```
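If seqkit is not installed, an equivalent read-id file can be produced with plain awk: in FASTQ, every fourth line starting from line 1 is a header, and the id is the first word of the header with the leading `@` stripped. This sketch assumes uncompressed input (pipe through `zcat` first for .gz files) and uses a toy FASTQ for illustration:

```shell
# Build a toy FASTQ with two reads, then extract their ids.
printf '@readA some description\nACGT\n+\nIIII\n@readB\nTTAA\n+\nIIII\n' > toy.fastq
awk 'NR % 4 == 1 { sub(/^@/, ""); print $1 }' toy.fastq > toy.read_ids
cat toy.read_ids
```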
- Error correction:

```shell
hale correct --read-alns <directory_for_batches_of_alignments> -t 64 <input_fastq/input_fastq.gz> <fasta_output>
```

Note: The `-t` flag specifies the number of threads.
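Rather than hard-coding `-t 64`, the thread count can be derived from the machine, e.g. with coreutils' `nproc`. The commented-out `hale` invocation below keeps the placeholder paths from the command above:

```shell
# Use all available cores; consider capping this on a shared node.
THREADS=$(nproc)
echo "running hale correct with $THREADS threads"
# hale correct --read-alns <directory_for_batches_of_alignments> -t "$THREADS" \
#     <input_fastq/input_fastq.gz> <fasta_output>
```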
This work leverages components of HERRO, developed by Stanojevic et al. (2024). While we designed a new algorithm independent of HERRO's deep learning approach, we adopted key preprocessing steps such as Minimap2 alignment, windowing, and post-processing for consensus generation with minimal modifications.