8000 Slow cram performance due to hflush · Issue #883 · samtools/htslib · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content 8000
Slow cram performance due to hflush #883
Closed
@jkbonfield

Description

@jkbonfield

Ancient history for @jmarshall to maybe wrack his brain on.

Commit 6416ebf made the CRAM code use hfile instead of FILE pointers. In doing so, an fflush call was replaced by hflush.

Naively this seems like a sensible option and a simple one to one replacement. However, hflush does an fdatasync, which turns it into a synchronous write. The performance hit appears to be OS specific, but on my desktop running Ubuntu Bionic:

With no flush:

time samtools view -o /tmp/v4.cram ~/scratch/data/v4_demo/novaseq.bam

real	0m4.312s
user	0m3.798s
sys	0m0.200s

With fflush:

time samtools view -o /tmp/v4.cram ~/scratch/data/v4_demo/novaseq.bam

real	0m4.263s
user	0m3.759s
sys	0m0.157s

With hflush

time samtools view -o /tmp/v4.cram ~/scratch/data/v4_demo/novaseq.bam

real	0m8.631s
user	0m3.903s
sys	0m0.157s

Basically fflush is pretty much free, but we're half speed with hflush. I don't think this change was ever intended to operate this way.

I can see two solutions.

  1. Drop the flush entirely. I'm not sure what it's intending to be doing, but the original fflush dates way back to pre-htslib when all this code was in io_lib.

  2. Add a few hflush equivalent which is flush only, without the sync bit.

My instinct is to do 1.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions

      0