at-cg/HALE

Introduction

HALE (Haplotype-Aware Long-read Error correction) is a haplotype-aware error correction tool for long reads. It has been evaluated primarily on PacBio HiFi data; support for ONT simplex reads is planned.

Requirements

  • Linux OS (tested on RHEL 8 and Ubuntu 22.04)

  • Rust toolchain (install it with rustup, the recommended way to get Rust)

  • Python 3.1 or above (and conda) for data preprocessing

  • Make sure the following system packages are installed (Linux):

    • build-essential, autoconf, libtool, pkg-config

    Note: If you're using a Linux system, there's a good chance these system packages are already installed, especially if development tools have been previously set up.
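The requirements above can be checked before building. The following is a minimal sketch, not part of HALE: the helper name missing_tools is hypothetical, and the command names are assumptions about what each package provides (for example, make for build-essential and libtoolize for libtool), which may differ between distributions.

```shell
# Hypothetical helper: print the name of each command that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed command names for the listed requirements; adjust for your distribution.
missing=$(missing_tools cargo conda python3 make autoconf libtoolize pkg-config)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing tools:" $missing
else
    echo "All required tools found."
fi
```

Anything reported as missing can be installed with the system package manager (or rustup, for cargo) before running the build.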

Try HALE on Small Test Data

The entire test workflow below will take about 5-6 minutes. Users can either run the commands one by one or copy the commands into an executable script.

# Install HALE 
git clone https://github.com/at-cg/HALE.git
cd HALE && RUSTFLAGS="-Ctarget-cpu=native" cargo build -q --release

# Create conda env
conda env create --file scripts/hale-env.yml
conda activate hale

mkdir -p test_run && cd test_run/

# download small test dataset
wget -O HG002.chr19_10M_12M.fastq.gz "https://zenodo.org/records/14048797/files/HG002.chr19_10M_12M.fastq.gz?download=1"

# Get all read ids in a separate file
seqkit seq -ni HG002.chr19_10M_12M.fastq.gz > HG002.chr19_10M_12M.read_ids

# Run all-vs-all overlap
../scripts/create_batched_alignments.sh HG002.chr19_10M_12M.fastq.gz HG002.chr19_10M_12M.read_ids 8 batch_alignments

# Run hale correct
../target/release/hale correct --read-alns batch_alignments -t 8 HG002.chr19_10M_12M.fastq.gz HG002.chr19_10M_12M_corrected.fa

For large inputs, we recommend increasing the thread count to match the number of available cores. On a 60x HiFi dataset for chr9 of the HG002 human genome, HALE takes about 16 minutes for all-vs-all overlap and 9 minutes for correction, using ~50 GB of RAM with 64 threads on a Perlmutter CPU node.
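One way to follow this recommendation is to read the core count at run time instead of hard-coding it. This is a sketch that assumes the coreutils nproc command is available (it is on most Linux systems); the fallback of 8 simply mirrors the test workflow above.

```shell
# Use all available cores; fall back to 8 if nproc is not installed.
threads=$(nproc 2>/dev/null || echo 8)
echo "Using ${threads} threads"

# Then pass "$threads" wherever the test workflow uses the literal 8, e.g.:
#   ../scripts/create_batched_alignments.sh HG002.chr19_10M_12M.fastq.gz HG002.chr19_10M_12M.read_ids "$threads" batch_alignments
#   ../target/release/hale correct --read-alns batch_alignments -t "$threads" HG002.chr19_10M_12M.fastq.gz HG002.chr19_10M_12M_corrected.fa
```

On shared nodes, a value lower than the full core count may be more appropriate.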

Installation

  1. Clone the repository:
git clone https://github.com/at-cg/HALE.git
  2. Compile the source code:
cd HALE
RUSTFLAGS="-Ctarget-cpu=native" cargo build -q --release
  3. Create and activate the conda environment:
conda env create --file scripts/hale-env.yml
conda activate hale

Usage

  1. minimap2 alignment and batching
scripts/create_batched_alignments.sh <input_fastq/input_fastq.gz> <read_ids> <num_of_threads> <directory_for_batches_of_alignments> 

We use the same minimap2 parameters as HERRO.
Note: Read ids can be obtained with seqkit: seqkit seq -ni <input_fastq/input_fastq.gz> > <read_ids>

  2. Error correction
hale correct --read-alns <directory_for_batches_of_alignments> -t 64 <input_fastq/input_fastq.gz> <fasta_output> 

Note: The -t flag sets the number of threads.
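The steps above can be chained for a single input file. The sketch below only prints the commands so they can be reviewed first. The helper names read_prefix and hale_commands and the output naming scheme (<prefix>.read_ids, <prefix>_batches, <prefix>_corrected.fa) are illustrative assumptions, not part of HALE. It also assumes that file names contain no spaces and that hale is on PATH.

```shell
# Hypothetical helper: derive a prefix from the input name
# by stripping a trailing .gz, then .fastq or .fq.
read_prefix() {
    name=$(basename "$1")
    name=${name%.gz}
    name=${name%.fastq}
    name=${name%.fq}
    printf '%s\n' "$name"
}

# Hypothetical helper: print the read-id, alignment, and correction
# commands for one input file and a given thread count.
hale_commands() {
    input=$1
    threads=$2
    prefix=$(read_prefix "$input")
    echo "seqkit seq -ni $input > $prefix.read_ids"
    echo "scripts/create_batched_alignments.sh $input $prefix.read_ids $threads ${prefix}_batches"
    echo "hale correct --read-alns ${prefix}_batches -t $threads $input ${prefix}_corrected.fa"
}

hale_commands HG002.chr19_10M_12M.fastq.gz 8
```

To execute the printed commands rather than just inspect them, pipe the output into sh from the repository root.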

Acknowledgement

This work leverages components of HERRO, developed by Stanojevic et al. (2024). While we designed a new algorithm independent of HERRO's deep learning approach, we adopted key preprocessing steps such as Minimap2 alignment, windowing, and post-processing for consensus generation with minimal modifications.
