With cpus=1, handle TSV post-processing safely

Observed error:

  Traceback (most recent call last):
    File "/opt/laava/summarize_alignment.py", line 1032, in <module>
      main(args)
    File "/opt/laava/summarize_alignment.py", line 896, in main
      subset_sam_by_readname_list(
    File "/opt/laava/summarize_alignment.py", line 55, in subset_sam_by_readname_list
      for row in csv.DictReader(per_read_f, delimiter="\t"):
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
  gzip.BadGzipFile: Not a gzipped file (b're')

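This happens because the per-read TSV is only gzipped when cpus > 1. With cpus=1, the code follows a different path that skips the aggregation and gzip steps, so downstream code that expects a .tsv.gz receives plain text. (The b're' in the message is the first two bytes of the uncompressed file, where gzip expected its magic number.)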
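
The failure mode can be reproduced in isolation. This is a minimal sketch, not the actual laava code: the file name, column names, and the helper `read_per_read_rows` are hypothetical, and it assumes the downstream reader always opens the per-read file with `gzip.open`.

```python
import csv
import gzip
import os
import tempfile


def read_per_read_rows(path):
    """Read a per-read TSV, assuming (as the downstream step does) it is gzipped."""
    with gzip.open(path, "rt") as per_read_f:
        return list(csv.DictReader(per_read_f, delimiter="\t"))


def demo_plain_tsv_failure():
    """Write an uncompressed TSV, read it as gzip, and return the error text."""
    with tempfile.TemporaryDirectory() as tmpdir:
        # Hypothetical name: the content is plain text despite the .gz suffix.
        path = os.path.join(tmpdir, "per_read.tsv.gz")
        with open(path, "w") as f:
            f.write("read_id\tlabel\nr1\tfull\n")
        try:
            read_per_read_rows(path)
        except gzip.BadGzipFile as exc:
            return str(exc)
    return None
```

Calling `demo_plain_tsv_failure()` returns the `BadGzipFile` message rather than any rows, which matches the shape of the traceback above.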

Potential solutions:

1. Always use the multiprocessing path, even when cpus=1. (Least code, though inefficient.)
2. Fix the downstream issue(s) individually by checking for .gz extensions. (Perpetuates the inconsistency.)
3. Gzip the intermediate "chunks" as well, so that they are also valid .tsv.gz, and handle them correctly in the aggregation step when cpus>1. (Requires more code changes with little benefit.)
4. Run gzip directly on the generated .tsv files when cpus=1. (Straightforward but requires more special-case code.)
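
For comparison, the two special-case fixes (handling plain or gzipped input downstream, and gzipping the generated .tsv when cpus=1) could look roughly like the sketch below. Both helper names are hypothetical and nothing here is taken from the laava code. The reader checks the gzip magic bytes instead of the file extension, which is a swapped-in variation on the extension check described above.

```python
import gzip
import os
import shutil

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_maybe_gzip(path):
    """Open path for text reading, whether or not it is gzip-compressed."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    if is_gzip:
        return gzip.open(path, "rt", newline="")
    return open(path, "rt", newline="")


def gzip_in_place(tsv_path):
    """Compress tsv_path to tsv_path + '.gz', remove the original, return the new path."""
    gz_path = tsv_path + ".gz"
    with open(tsv_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(tsv_path)
    return gz_path
```

Either helper adds a branch that exists only to paper over the cpus=1 path, which is the drawback noted for these options.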

The first option seems best: all of this chunking and iteration deserves a rewrite anyway, and having less code to carry will make that rewrite easier.
