Description
Observed error:
Traceback (most recent call last):
  File "/opt/laava/summarize_alignment.py", line 1032, in <module>
    main(args)
  File "/opt/laava/summarize_alignment.py", line 896, in main
    subset_sam_by_readname_list(
  File "/opt/laava/summarize_alignment.py", line 55, in subset_sam_by_readname_list
    for row in csv.DictReader(per_read_f, delimiter="\t"):
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  ...
gzip.BadGzipFile: Not a gzipped file (b're')
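This happens because gzip compression is applied only when cpus > 1. With cpus=1 the code follows a different path that skips the aggregation and gzip steps, so the downstream reader receives a plain-text file where it expects gzip.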
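The failure mode can be reproduced in isolation. This is a minimal sketch, not the code in summarize_alignment.py: the helper names, the file name, and the "read_id" column are hypothetical, and it assumes only that the downstream reader opens the file with gzip.open, as the traceback suggests.

```python
import csv
import gzip
import os
import tempfile


def read_per_read_rows(path):
    """Read a TSV the way a gzip-expecting reader would (hypothetical helper)."""
    with gzip.open(path, "rt") as per_read_f:
        return list(csv.DictReader(per_read_f, delimiter="\t"))


def demo_plain_text_failure():
    """Write an uncompressed TSV, read it as gzip, and return the error text."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file name and columns, chosen only for illustration.
        path = os.path.join(tmp, "example.per_read.tsv")
        with open(path, "w") as f:
            f.write("read_id\tvalue\nr1\t1\n")
        try:
            read_per_read_rows(path)
        except gzip.BadGzipFile as exc:
            return str(exc)
    return None


if __name__ == "__main__":
    print(demo_plain_text_failure())  # Not a gzipped file (b're')
```

The b're' in the message is the first two bytes of the file, which gzip tried to read as its magic number. Here they come from the uncompressed header line.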
Potential solutions:
- Always use the multiprocessing path, even when cpus=1. (Least code, though inefficient.)
- Fix the downstream issue(s) individually by checking for .gz extensions. (Perpetuates the inconsistency.)
- Gzip the intermediate "chunks" as well, so that they are also valid .tsv.gz, and handle them correctly in the aggregation step when cpus>1. (Requires more code changes with little benefit.)
- Run gzip directly on the generated .tsv files when cpus=1. (Straightforward but requires more special-case code.)
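For comparison, the last option could look roughly like the sketch below. It assumes the cpus=1 path leaves an uncompressed .tsv on disk and that downstream code expects the same name with .gz appended; the function name gzip_in_place is hypothetical and does not exist in the current code.

```python
import gzip
import os
import shutil


def gzip_in_place(tsv_path):
    """Compress tsv_path to tsv_path + '.gz' and remove the original file."""
    gz_path = tsv_path + ".gz"
    with open(tsv_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        # Stream the bytes across so large files are not loaded into memory.
        shutil.copyfileobj(src, dst)
    os.remove(tsv_path)
    return gz_path
```

The call would have to be added for each intermediate .tsv written when cpus=1, which is the special-case code that option mentions.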
The first option seems best: all of this chunking and iteration deserves a rewrite anyway, and having less code to carry makes that rewrite easier.