strobealign

Strobealign is a fast short-read aligner. It achieves its speedup by using a dynamic seed size obtained from syncmer-thinned strobemers. Strobealign is multithreaded, implements both alignment (SAM output) and mapping (PAF output), and has been benchmarked on single-end (SE) and paired-end (PE) reads of lengths between 100 and 300 bp. A preprint describing v0.4 is available; see CREDITS.
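
To illustrate how linking thinned k-mers can give a variable seed length, here is a toy Python sketch. It is not strobealign's implementation: it assumes open syncmers (a k-mer is kept if its smallest-hash s-mer starts at offset t) and a simplified randstrobe-style link that picks the downstream syncmer minimizing a hash of the two k-mers joined together. The function names, the hash, and all parameter values are illustrative assumptions.

```python
import hashlib
import random

def h(text):
    """Deterministic 64-bit hash (a stand-in for a fast hash such as xxhash)."""
    return int.from_bytes(hashlib.blake2b(text.encode(), digest_size=8).digest(), "big")

def open_syncmers(seq, k, s, t):
    """Start positions of k-mers whose smallest-hash s-mer begins at offset t."""
    positions = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smer_hashes = [h(kmer[j:j + s]) for j in range(k - s + 1)]
        if smer_hashes.index(min(smer_hashes)) == t:
            positions.append(i)
    return positions

def randstrobes(seq, k, s, t, w_min, w_max):
    """Link each syncmer to one of the syncmers w_min..w_max places downstream.

    Returns tuples (start of first strobe, start of second strobe, seed span).
    The linking rule here is a simplification chosen for illustration.
    """
    pos = open_syncmers(seq, k, s, t)
    seeds = []
    for idx, p1 in enumerate(pos):
        window = pos[idx + w_min: idx + w_max + 1]
        if not window:
            break  # no downstream syncmer left to link to
        first = seq[p1:p1 + k]
        p2 = min(window, key=lambda p: h(first + seq[p:p + k]))
        seeds.append((p1, p2, p2 + k - p1))
    return seeds

# Toy input: 300 random bases with a fixed random seed for reproducibility.
rng = random.Random(0)
seq = "".join(rng.choice("ACGT") for _ in range(300))
seeds = randstrobes(seq, k=8, s=4, t=2, w_min=1, w_max=3)
print(sorted({span for _, _, span in seeds}))  # the distinct seed spans
```

Because the distance between linked syncmers depends on the sequence, the spans printed at the end typically differ from seed to seed, which is the sense in which the seed size is dynamic.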

Current version is 0.6.

v0.6 implements:

A crucial fix for a bug in v0.5 that occasionally caused alignments to very long reference regions.
Several improvements, such as identifying symmetrical hash collisions and testing the reverse orientation. These lead to a further slight increase in alignment accuracy over previous versions, particularly for shorter read lengths.
A fix for the reporting of the template length field in SAM output when the alignment contains a deletion.

v0.5 implements:

Several improvements for downstream SNP and INDEL calling. See the benchmark below.
Option to report secondary alignments.
Option to set base-level alignment parameters.
And more (see the release notes).

INSTALLATION

Precompiled binaries for Linux and macOS, compiled with -O3 -mavx2, are available from the release page.

It has been reported that strobealign is even faster if compiled with the flag -march=skylake-avx512 on processors that support AVX-512.

To compile from source, you need a recent g++ and zlib installed. Then do the following:

git clone https://github.com/ksahlin/StrobeAlign
cd StrobeAlign
# Needs a newer g++ version. Tested with version 8 and upwards.
g++ -std=c++14 main.cpp source/index.cpp source/xxhash.c source/ksw2_extz2_sse.c source/ssw_cpp.cpp source/ssw.c -lz -fopenmp -o strobealign -O3 -mavx2

Zlib linking

If zlib is installed, with zlib.h in the folder /path/to/zlib/include and libz.so in /path/to/zlib/lib, but compilation fails with

main.cpp:12:10: fatal error: zlib.h: No such file or directory
 #include <zlib.h>
          ^~~~~~~~
compilation terminated.

add -I/path/to/zlib/include -L/path/to/zlib/lib to the compilation, that is

g++ -std=c++14 -I/path/to/zlib/include -L/path/to/zlib/lib main.cpp source/index.cpp source/xxhash.c source/ksw2_extz2_sse.c source/ssw_cpp.cpp source/ssw.c -lz -fopenmp -o strobealign -O3 -mavx2

USAGE

Alignment

The parameter -r <read_length> sets seed parameters suited to the approximate read length. Specifically, it sets the parameters -k, -l and -u. If -r is not specified, it defaults to 150. The value given to -r does not have to match the exact read length.

For alignment to SAM file:

strobealign -r <read_length> ref.fa reads.fa > output.sam

To report secondary alignments, set the parameter -N [INT], where [INT] is the maximum number of secondary alignments to report.

Mapping

For mapping to PAF file (option -x):

strobealign -r <read_length> -x ref.fa reads.fa > output.paf
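
When the commands above are launched from a script, a small wrapper can keep the option handling in one place. The following Python sketch is one possible way to do it and uses only the options described in this README (-r, -x, -N). The helper names strobealign_command and run_to_file are hypothetical, and the helper does not check whether a given combination of options is meaningful.

```python
import shlex
import subprocess

def strobealign_command(ref, reads, read_length=150, paf=False, max_secondary=None):
    """Build an argument list for strobealign from the options in this README."""
    cmd = ["strobealign", "-r", str(read_length)]
    if paf:
        cmd.append("-x")  # mapping mode, PAF output
    if max_secondary is not None:
        cmd += ["-N", str(max_secondary)]  # report up to this many secondary alignments
    cmd += [ref, reads]
    return cmd

def run_to_file(cmd, out_path):
    """Run the command and write its standard output to out_path."""
    with open(out_path, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)

# Print the command line that would be run for PAF mapping of 250 bp reads.
print(shlex.join(strobealign_command("ref.fa", "reads.fa", read_length=250, paf=True)))
```

Calling run_to_file(strobealign_command("ref.fa", "reads.fa"), "output.sam") would then correspond to the SAM alignment command shown earlier, with the default -r 150 passed explicitly.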

VARIANT CALLING BENCHMARK

A small SNV and indel calling benchmark is provided below. It uses bcftools to call SNVs and indels on a simulated repetitive genome, based on alignments from strobealign, BWA-MEM, and minimap2. The genome is a 16.8 Mbp sequence built from 500 concatenated copies of a 40 kbp sequence, which is mutated through substitutions (5%) and removal of segments of size 1 bp to 1 kbp (0.5%) along the original 20 Mbp string.

Then, 2 million paired-end reads (lengths 100, 150, 200, 250, 300) are simulated from a related genome with a high variation rate: 0.5% SNVs and 0.5% indels. The challenge is to find the right location of the reads in the repetitive genome in order to predict the SNVs and indels in the related genome. The genome from which the reads are simulated contains about 78k SNVs and about 78k indels. The precision (P), recall (R), and F-score are computed from these numbers. Results are in the table below.

In these experiments, strobealign is in general the fastest tool, has the highest SNV precision, and has the highest precision, recall, and F-score for indels.

Indels are frequent in this dataset (one every 200 bases on average), so most reads require base-level alignment. Between 65% and 85% of strobealign's runtime is spent on base-level alignments with the third-party SSW alignment module. The longer the reads, the higher the percentage of time spent on base-level alignment. Speed improvements to base-level alignment libraries would greatly reduce runtime on this dataset.

Read length	Tool	SNVs (P)	SNVs (R)	SNVs (F-score)	Indels (P)	Indels (R)	Indels (F-score)	Alignment time (s)
100	strobealign	97.8	93.5	95.6	55.5	41.1	47.3	455
	minimap2	91.4	94.3	92.8	55.2	39.1	45.8	605
	bwa_mem	93.7	95.9	94.8	55.3	30.0	38.9	1020

150	strobealign	96.5	92.7	94.6	55.2	46.3	50.3	369
	minimap2	89.8	94.6	92.1	54.9	44.8	49.3	902
	bwa_mem	96.0	96.0	96.0	55.0	39.6	46.1	1010

200	strobealign	97.2	94.1	95.6	55.3	45.8	50.1	515
	minimap2	88.1	96.7	92.2	55.0	44.7	49.3	1290
	bwa_mem	95.2	96.5	95.8	55.1	42.3	47.8	12623

250	strobealign	96.4	93.3	94.8	55.1	45.0	49.6	733
	minimap2	87.7	94.8	91.1	54.9	43.8	48.7	998
	bwa_mem	94.3	96.2	95.2	55.1	42.3	47.8	1593

300	strobealign	95.7	92.7	94.2	55.1	44.5	49.2	976
	minimap2	88.2	94.3	91.2	54.8	43.4	48.4	1046
	bwa_mem	93.7	96.4	95.0	54.9	42.0	47.6	1988
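
As a sanity check on the table, the F-score columns can be recomputed from the P and R columns. The Python sketch below assumes the F-score is the usual harmonic mean of precision and recall (F1), which the README does not state explicitly. Because the table shows rounded values, a recomputed score can differ from the reported one in the last digit.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (tool, SNV precision, SNV recall, reported SNV F-score) for read length 100,
# copied from the table above.
rows = [
    ("strobealign", 97.8, 93.5, 95.6),
    ("minimap2", 91.4, 94.3, 92.8),
    ("bwa_mem", 93.7, 95.9, 94.8),
]

for tool, p, r, reported in rows:
    print(f"{tool}: recomputed {f_score(p, r):.1f}, reported {reported}")
```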

For the results, we ran

bcftools mpileup -O z --fasta-ref ref aligned.bam > aligned.vcf.gz
bcftools call -v -c -O v aligned.vcf.gz > aligned.variants.vcf

# Split into SNVs and INDELs (header lines are kept in both files)
grep -v -E -e "INDEL;" aligned.variants.vcf > aligned.variants.SNV.vcf
grep "^#" aligned.variants.vcf > aligned.variants.INDEL.vcf
grep -E -e "INDEL;" aligned.variants.vcf >> aligned.variants.INDEL.vcf

for type in SNV INDEL
do
	# Sort, compress and index each call set, then intersect it with the true variants
	bcftools sort -Oz aligned.variants.$type.vcf -o aligned.variants.sorted.$type.vcf.gz
	bcftools index aligned.variants.sorted.$type.vcf.gz
	bcftools isec --nfiles 2 -O u true_variants.sorted.$type.vcf.gz aligned.variants.sorted.$type.vcf.gz -p out_$type
done
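
Once the intersection has been written to out_$type, precision and recall follow from three counts: the variants shared between the true set and the calls, the total number of calls, and the total number of true variants. The Python sketch below shows one way to get those counts from uncompressed VCF text. It is an assumption about how the numbers could be derived, not the script used for the table, and the helper names are hypothetical.

```python
def count_records(vcf_lines):
    """Count variant records, skipping header lines (starting with '#') and blanks."""
    return sum(1 for line in vcf_lines if line.strip() and not line.startswith("#"))

def precision_recall(n_shared, n_called, n_truth):
    """Precision and recall in percent from raw variant counts."""
    precision = 100.0 * n_shared / n_called if n_called else 0.0
    recall = 100.0 * n_shared / n_truth if n_truth else 0.0
    return precision, recall

# Toy VCF text standing in for a file read from disk.
toy_vcf = """##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
ref\t100\t.\tA\tG\t50\t.\t.
ref\t250\t.\tC\tT\t50\t.\t.
ref\t900\t.\tG\tA\t50\t.\t.
""".splitlines()

n_shared = count_records(toy_vcf)  # records found in both sets (toy value)
precision, recall = precision_recall(n_shared, n_called=4, n_truth=6)
print(precision, recall)
```

With real data, the lines would come from the decompressed files for the shared, called, and true variant sets rather than from an inline string.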

CREDITS

Kristoffer Sahlin. Flexible seed size enables ultra-fast and accurate read alignment. bioRxiv, 2021. doi:10.1101/2021.06.18.449070.

VERSION INFO

See release page

LICENCE

GPL v3.0, see LICENSE.txt.
