With cpus=1, handle TSV post-processing safely · Issue #90 · formbio/laava · GitHub
With cpus=1, handle TSV post-processing safely #90
Open
@etal

Description


Observed error:

  Traceback (most recent call last):
    File "/opt/laava/summarize_alignment.py", line 1032, in <module>
      main(args)
    File "/opt/laava/summarize_alignment.py", line 896, in main
      subset_sam_by_readname_list(
    File "/opt/laava/summarize_alignment.py", line 55, in subset_sam_by_readname_list
      for row in csv.DictReader(per_read_f, delimiter="\t"):
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
  gzip.BadGzipFile: Not a gzipped file (b're')

This happens because the output is gzipped only when cpus > 1. With cpus=1, the run follows a different code path that skips the aggregation and gzip steps, so the downstream reader tries to open a plain-text file as gzip.
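
The failure mode can be reproduced in isolation. The sketch below is not LAAVA code: the file name, column names, and helper names are hypothetical, and it assumes the reader opens the file with `gzip.open` in text mode, as the traceback suggests.

```python
import csv
import gzip
import os
import tempfile


def read_rows_as_gzip(path):
    """Read a TSV through gzip.open, as a reader expecting .tsv.gz would."""
    with gzip.open(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def demo():
    """Return the error message raised when plain text is read as gzip."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file: named like a gzip file but written as plain text.
        path = os.path.join(tmp, "per_read.tsv.gz")
        with open(path, "w", newline="") as out:
            out.write("read_id\tvalue\nr1\t1\n")
        try:
            read_rows_as_gzip(path)
        except gzip.BadGzipFile as exc:
            return str(exc)
    return None


if __name__ == "__main__":
    print(demo())
```

The error is raised only when `csv.DictReader` first pulls data from the handle, which matches the traceback pointing at the `for row in csv.DictReader(...)` line rather than at the open call.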

Potential solutions:

  • Always use the multiprocessing path, even when cpus=1. (Least code, though inefficient.)
  • Fix the downstream issue(s) individually by checking for .gz extensions. (Perpetuates the inconsistency.)
  • Gzip the intermediate "chunks" as well, so that they are also valid .tsv.gz, and handle them correctly in the aggregation step when cpus>1. (Requires more code changes with little benefit.)
  • Run gzip directly on the generated .tsv files when cpus=1. (Straightforward but requires more special-case code.)
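
For comparison, the second and fourth options could look roughly like the following. This is a sketch under assumptions, not the project's implementation: `open_maybe_gzip` and `gzip_in_place` are hypothetical helper names, and the sketch assumes the intermediate files are ordinary TSV files on local disk.

```python
import gzip
import os
import shutil

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_maybe_gzip(path):
    """Option 2 (hypothetical helper): open as text, gzipped or not.

    Checks the magic bytes instead of trusting the file extension.
    """
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    if is_gzip:
        return gzip.open(path, "rt", newline="")
    return open(path, "r", newline="")


def gzip_in_place(tsv_path):
    """Option 4 (hypothetical helper): compress a .tsv to .tsv.gz.

    Removes the uncompressed original and returns the new path.
    """
    gz_path = tsv_path + ".gz"
    with open(tsv_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(tsv_path)
    return gz_path
```

With `gzip_in_place`, the cpus=1 path would end with the same `.tsv.gz` outputs as the multiprocessing path, so downstream readers need no changes. `open_maybe_gzip` instead makes each reader tolerant of both forms, which is the inconsistency the second option warns about.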

The first option seems best: all of this chunking and iteration deserves to be rewritten anyway, and having less code to carry makes that rewrite easier.

Labels: bug (Something isn't working)
