Releases · ksahlin/strobealign

Improvements mainly for large repetitive genomes.

Introduces a maximum limit on repetitive seeds before calling the merged match finder (which is optimized for repetitive reads). This significantly reduces runtime when the genome is large and repetitive, e.g., maize (2.4 Gb) or rye (7.8 Gb).
Fixes SAM header issue #22.
Removes dependency on ksw2.

Major update to the parallelization. The new parallel implementation allows a much more efficient interplay between reading input, aligning, and writing output. This results in much better CPU usage as the number of threads increases. For example, I observed an almost 2x speedup (30-50% reduced runtime) across four larger datasets when using 16 cores (SIM and GIAB 150bp and 250bp reads, see README benchmarks).

For reference, the previous naive parallelization ran in sequential order: (1) read a batch of reads with one thread, (2) align the batch in parallel with OpenMP, (3) write the output with one thread. The new parallelization performs steps 1-3 across threads, with a mutex on input and output. This type of parallelization is commonly applied in other tools.
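The scheme described above can be sketched as follows. This is a minimal illustration, not strobealign's actual code: `align_read`, `run_pipeline`, the in-memory vectors standing in for the input and output files, and the batch size are all assumptions made for the example. Each worker thread takes a batch under the input mutex, aligns it without holding any lock, and appends its results under the output mutex.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for the aligner: tags the read instead of aligning it.
std::string align_read(const std::string& read) {
    return read + "\taligned";
}

// Every worker performs all three steps (read, align, write) itself.
// Only the read and write steps are serialized by a mutex.
std::vector<std::string> run_pipeline(const std::vector<std::string>& reads,
                                      std::size_t n_threads,
                                      std::size_t batch_size) {
    assert(n_threads > 0 && batch_size > 0);
    std::mutex input_mutex;
    std::mutex output_mutex;
    std::size_t next = 0;             // shared cursor into the input
    std::vector<std::string> output;  // shared output, stands in for the SAM file

    auto worker = [&]() {
        while (true) {
            std::size_t begin = 0;
            std::size_t end = 0;
            {
                // Step 1: claim the next batch of reads.
                std::lock_guard<std::mutex> lock(input_mutex);
                if (next >= reads.size()) return;
                begin = next;
                end = std::min(reads.size(), begin + batch_size);
                next = end;
            }
            // Step 2: align the batch; no lock is held here.
            std::vector<std::string> local;
            local.reserve(end - begin);
            for (std::size_t i = begin; i < end; ++i) {
                local.push_back(align_read(reads[i]));
            }
            {
                // Step 3: write the finished batch.
                std::lock_guard<std::mutex> lock(output_mutex);
                output.insert(output.end(), local.begin(), local.end());
            }
        }
    };

    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < n_threads; ++t) threads.emplace_back(worker);
    for (auto& t : threads) t.join();
    return output;
}
```

With this layout, batches may be written in a different order than they were read, so the output order is not guaranteed to match the input order in this sketch.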

This release also includes:

Implemented automatic inference of read length, which removes the need to specify -r (as reported in #19).
Some minor bugfixes. For example, this bug is fixed.

This release has identical or near-identical alignments to the previous version v0.6.1 (same accuracy and SV calling stats across tested datasets)

Fixed a crucial runtime bug introduced in v0.6, triggered when the mates within a read pair have different lengths and enter rescue mode.

The runtime bug revealed an issue in considering reverse complements (due to false symmetrical seed hits) in rescue mode. Therefore, this fix leads to a small increase in properly mapped read pairs.

Version 0.6 fixes a crucial bug introduced in v0.5 and has two additional bug fixes that improve accuracy. It is highly recommended to update to this version.

Crucial fix for a coordinate bug introduced in v0.5 that occasionally caused alignments to very long reference regions, which was detrimental to speed.
Identifies symmetrical hash collisions and, in those cases, tests the reverse orientation. This leads to a further slight bump in alignment accuracy over previous versions, particularly for shorter read lengths.
Fixes a rare uninitialized calculation of the joint alignment score S that could cause suboptimal alignments.
Fixes reporting of the template length field in SAM output when the alignment contains a deletion.

Added features, some improvements in alignment (accuracy), and minor bugfixes.

Added parameter -N [INT] to output secondary alignments
Base-level alignment parameters can now be specified on the command line with -A, -B, -E, and -O.
Improved MAPQ calculation: calculating them from alignments (if alignment mode) instead of from seeds.
Update default base-level alignment parameters for better alignments around indels.
Added Quality values, AS:i and NM:i tags to SAM output.

See INDEL/SNV calling benchmark in README.

Implemented bitpacking of reference ID and strobe offset, which reduces peak memory by about 15-20%. Both values are stored in one int: 24 bits for ref_id (can handle up to 2^24 = 16,777,216 unique references) and 8 bits for the strobe offset (meaning 255+k as the maximum seed length).
Implemented max_sites and max_score_droppoff as parameters instead of hardcoded
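The bitpacking of reference ID and strobe offset can be sketched as below. This is an illustration only: the function names and the choice to put ref_id in the upper 24 bits are assumptions, since the note gives only the bit widths.

```cpp
#include <cassert>
#include <cstdint>

// Bit widths come from the release note; the layout (ref_id in the high bits)
// is an assumption made for this sketch.
constexpr uint32_t OFFSET_BITS = 8;
constexpr uint32_t OFFSET_MASK = (1u << OFFSET_BITS) - 1;  // 255
constexpr uint32_t MAX_REF_ID = (1u << 24) - 1;            // 16,777,215

// Pack a 24-bit reference ID and an 8-bit strobe offset into one 32-bit value.
uint32_t pack(uint32_t ref_id, uint32_t strobe_offset) {
    assert(ref_id <= MAX_REF_ID);
    assert(strobe_offset <= OFFSET_MASK);
    return (ref_id << OFFSET_BITS) | strobe_offset;
}

// Recover the reference ID from the upper 24 bits.
uint32_t unpack_ref_id(uint32_t packed) {
    return packed >> OFFSET_BITS;
}

// Recover the strobe offset from the lower 8 bits.
uint32_t unpack_offset(uint32_t packed) {
    return packed & OFFSET_MASK;
}
```

Storing one 32-bit value instead of two is where the memory saving would come from; the cost is the hard limits of 2^24 references and an offset of at most 255.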

~10% faster indexing by skipping some unnecessary computations of unique seeds.
Several improvements to base-level alignments using ssw. Fixed known bugs that occasionally affected the flag, the CIGAR string, and the NM tag in the SAM file.
Changed to reporting Eq/X (=/X) CIGAR operations instead of M.
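To illustrate the difference between M and =/X CIGAR operations, the sketch below expands one gap-free aligned block (what a single M operation would cover) into runs of = (match) and X (mismatch). The function name is hypothetical and this is not how strobealign or ssw produce their CIGAR strings; it only shows what the two notations encode.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Expand one gap-free aligned block into =/X runs.
// `ref` and `read` must be the aligned segments, of equal length.
std::string expand_match_block(const std::string& ref, const std::string& read) {
    assert(ref.size() == read.size());
    std::string cigar;
    std::size_t i = 0;
    while (i < ref.size()) {
        const bool is_match = (ref[i] == read[i]);
        std::size_t run = 0;
        // Extend the run while the match/mismatch state stays the same.
        while (i < ref.size() && (ref[i] == read[i]) == is_match) {
            ++run;
            ++i;
        }
        cigar += std::to_string(run);
        cigar += is_match ? '=' : 'X';
    }
    return cigar;
}
```

For example, a block that would be reported as 8M with one mismatch at the fourth base becomes 3=1X4=.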

Introduced a max seed size constraint when sampling seeds, only active in few regions where syncmers are sparsely sampled.
Parameter -r can now take any integer value.

Important bugfix [1] and added ssw for rescue alignment [2] since ksw is only for extension.

These fixes improve accuracy in paired-end alignment mode compared to v0.1. I also observe a further speed increase (~15-20%) on all my test datasets when aligning to hg38.

Major update of strobealign. This version improves accuracy (and the number of aligned reads) for reads of around 100-125 nt, and it is also faster than older versions for these lengths. Most notable changes:

Algorithm changes
- Using xxhash instead of no hash for strobes. Gives a better pseudorandom generation of hashes for linking.
- Linking strobes using bitcount( (h_1 ^ h_2) ^ q) which creates a skewed seed length distribution towards shorter seeds in the window. This improves mapping candidate read detection particularly for shorter reads (100nt).
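The linking step can be sketched as below, taking the expression bitcount((h_1 ^ h_2) ^ q) literally as written in the note above. This is an assumption-heavy illustration: the function name, the window arguments, choosing the minimum bitcount, breaking ties toward the earliest position, and treating q as a plain 64-bit constant are all choices made for the example and are not confirmed by the note.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pick the second strobe among candidate positions [w_min, w_max] (inclusive).
// h1 is the hash of the first strobe; hashes[i] is the hash at position i.
// The score is bitcount((h1 ^ hashes[i]) ^ q), copied from the release note.
// Minimizing the score and preferring the earliest position on ties are
// assumptions of this sketch.
std::size_t pick_second_strobe(uint64_t h1,
                               const std::vector<uint64_t>& hashes,
                               std::size_t w_min,
                               std::size_t w_max,
                               uint64_t q) {
    assert(w_min <= w_max && w_max < hashes.size());
    std::size_t best = w_min;
    std::size_t best_score = 65;  // larger than any possible 64-bit bitcount
    for (std::size_t i = w_min; i <= w_max; ++i) {
        const std::size_t score = std::bitset<64>((h1 ^ hashes[i]) ^ q).count();
        if (score < best_score) {  // strict '<' keeps the earliest minimum
            best_score = score;
            best = i;
        }
    }
    return best;
}
```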
Parameters
- Adding the option to customize sampling window of second strobe with -l and -u.
- Adding a parameter -r [INT] for approximate read length (default 150). Strobealign uses it to customize the parameters -l, -u, and -k.
Also cuts the reference accessions at first space, which fixes issue #4

Releases: ksahlin/strobealign

v0.7.1, v0.7, v0.6.1, v0.6, v0.5, v0.4, v0.3, v0.2.1, v0.2, v0.1