8000 output vcf files from phasing contain many repetitive header lines · Issue #30 · twolinin/longphase · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

output vcf files from phasing contain many repetitive header lines #30

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service an 8000 d privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
ilivyatan opened this issue Nov 12, 2023 · 11 comments
Open

Comments

@ilivyatan
Copy link

Hi,
This is a really great tool and works very fast, thanks!
We've been trying it out in different ways and many times the resulting phased VCF files contain multiple lines of headers in the middle of the files.
For example, these lines appear 913 times in one file!
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set identifier">
##longphaseVersion=1.5.1
##commandline="phase -s data/27909_pass.wf_snp.vcf.gz --sv-file data/27909_pass.wf_sv.vcf.gz --mod-file mod_27909.vcf -b data/27909_pass_merged.bam -r data/GRCh38.no_alt_analysis_set.fasta -t 16 -o phased_27909_mod --indels --ont "

This also occurs with joint phasing of just SNV and SV vcf files.
Why do comments appear in the middle of the output VCF files?

@twolinin
Copy link
Owner

Hi @ilivyatan,

Can you provide the header for the 27909_pass.wf_snp.vcf.gz?

Thanks

@ilivyatan
Copy link
Author

head_27909.vcf.txt

@ilivyatan
Copy link
Author

Any updates? Is the header readable?

@twolinin
Copy link
Owner

Hi @ilivyatan

The header file you provided checks out fine. In the next version, additional conditions will be added to try to prevent issues with repetitive headers.

@ilivyatan
Copy link
Author

The attached file is just the excerpt of the header up until the first column header line:
#chrom start end.... line

AFTER that line and after the data starts there are 913 additional header-like lines INTERLEAVED within, and running all throughout the file.
We just parsed them out and carried on with the VCF, but they shouldn't be there.

@twolinin
Copy link
Owner
  1. Based on your command, which VCF file has repetitive headers? phased_27909_mod.vcf or phased_27909_mod_mod.vcf
  2. Can you use grep -n "#" phased_27909_mod.vcf to provide the problematic headers?

@ilivyatan
Copy link
Author

It's in phased_27909_mod.vcf
The (grep) output you requested is attached :
phased_27909_mod_with_interleaved_comments.vcf.txt

@twolinin
Copy link
Owner

I noticed that the following line is missing in the phased header file you provided. Could you please confirm this for me?

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	27909_pass

By the way, could you test with the latest version, v1.5.2, to see if the same issue?

wget https://github.com/twolinin/longphase/releases/download/v1.5.2/longphase_linux-x64.tar.xz
tar -xJf longphase_linux-x64.tar.xz

@SamStudio8
Copy link

We have also encountered this problem. I've taken a look at SnpParser::writeLine and the conditions required to reach the line that writes out the longphase headers appear to be:

So it looks as though longphase will write out these header lines to resultVcf any time it finds # ANYWHERE in the record. Indeed we've directly observed one longphaseVersion header for each input line that contains a # in the INFO field:

$ zgrep -c 'longphaseVersion' tmp_chr2.vcf.gz
78
$ zgrep -v '^#' tmp_chr2.vcf.gz | grep -c '#'
76

I'd suggest if looking for header lines that these checks specifically look for #/## at the beginning of the input string.

@ythuang0522
Copy link
Collaborator

Hi @SamStudio8

Thank you very much for digging into this issue.
We should limit it to the beginning of each line as you suggested.
It will be fixed in the next version together with other features in two weeks.

Thanks,
Yao-Ting

@ythuang0522
Copy link
Collaborator

@SamStudio8 @ilivyatan The v1.7 release has resolved this repeated header issue. We forgot to mention this issue in the release notes, but the SnpParser::writeLine has been rewritten as suggested by Sam.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants
0