forked from samtools/bcftools
-
Notifications
You must be signed in to change notification settings - Fork 0
sync #1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
starskyzheng
wants to merge
514
commits into
starskyzheng:develop
Choose a base branch
from
samtools:develop
base: develop
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
sync #1
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
ab5de54
to
2191405
Compare
- -i/-e filtering expression containing missing values must not crash when subfields are queried on the fly - parsing of subfield arrays with missing values now done correctly Resolves #2098
When both DP and AD values are present, VAF is calculated as AD/DP. This could result in VAF values bigger than 1 and access beyond array boundaries. Fixes #2102
This also - adds a check for cases where the --file-list is empty - adds a synonynomous option --force-no-index to --no-index Resolves #2100
The main function sets mplp.bsmpl = bam_smpl_init() which allocates memory. This is freed at the end of main_mpileup. Unfortunately there are many ways of exiting this function early which don't do this free. Restructured it slightly so the free happens. It's not a serious problem, but it did prevent make check from working when using -fsanitize=address. This wasn't spotted via the CI as it doesn't have a PTY and hence it avoids running the usage tests.
If the filtering expression queried keys from a file, for example -i 'INFO/TAG=@strings_expected' the program would segfault if INFO/TAG was not present in the first VCF record. This is now fixed by making sure NULL is not passed to khash_str2int_has_key. Fixes #2111
For an alignment that doesn't have an indel but is aligned against reads that do have an indel, the indel quality comes from the BAM quality. However we already have indelQ assigned, so this avoids changing to BAM qual if indelQ is zero as that is a special case for a read aligning to multiple indel "types" (lengths) with equal score. This avoids excess AD numbers for poorly chosen alignments. Fixes #2113
CONS_CUTOFF_DEL and CONS_CUTOFF_INC are now 35% instead of 40% (also DEL is new as it previously shared the score with SNPs). Plus when computing the ratio we no longer apply this as a percentage to the entire depth, but as a percentage of the alignments overlapping any STRs spanning this region (so meaningful alignments) that aren't already allocated to the "type" being diagnosed. Both of these make it slightly more likely to emit a second consensus sequence potentially containing indel variations. This fixes the specific problem raised in #2117, but like all heuristics it benefits some cases and harms others. We will always have some cases that fall just either side of the thresholds and look odd. Overall it seems a slim benefit and the work to track the number of meaningful spanning alignments may have potential value in subsequent updates. Fixes #2117
This applies where there are zero observed indels in a sample (but they are in other samples). It's odd how this appears to be a pretty rare change, as we'd expect a significant change to FP, but it's tiny. Genotype assignment error changes a lot, but only in one of the 3 samples I tested. On HG002, I see the following differences to calling rates. Previous All QUAL>=100 InDel TP 11823 / 11798 InDel FP 5374 / 4898 InDel GT 296 / 291 InDel FN 115 / 140 New All QUAL>=100 InDel TP 11822 / 11795 InDel FP 5313 / 4805 InDel GT 80 / 74 InDel FN 116 / 143 This was HG002 called in conjunction with HG003 and HG004, but not as a trio (so no pedegree supplied). Oddly despite that HG002 is much more accurate than HG003 and HG004, with GT assignment error rates an order of magnitude higher. This PR makes them a bit higher still (maybe another 20%). I cannot explain either of these, but perhaps it's simply down to the accuracy of the truth set as HG002 is by far the most widely curated of the three. Either that or my analysis has a flaw somewhere. Fixes #2130.
This didn't work in the past because update_from_fai() opened, read the header from, and then closed the input file, preventing the reheader_* functions from accessing the rest of the file contents when the input is a stream. As all the reheader_* functions read the header into a kstring, it's possible to make streaming work by passing this kstring into update_from_fai() and adjusting it to work directly on that copy of the data. As update_from_fai() no longer needs to write a temporary file, args_t::rm_tmpfile and args_t::tmp_prefix can be removed. The -T option is ignored as it's no longer needed, but is still accepted for compatibility. The init_tmp_prefix() function is still used by some other bcftools subcommands, so is left in place for now.
When setting value by determining index from the genotype, we face the problem of how to interpret truncating arrays. Say we have TAG defined as Number=. and GT:TAG 1/1:0,1,2 0/0:0 Then when querying we expect the following expression to evaluate for the second sample as -i 'TAG[1:1]="."' .. true -i 'TAG[1:GT]="."' .. false The problem is that the implementation truncates the number of fields, filling usually fewer than the original number of per-sample values. This is fixed by adding an exception that makes the code aware of this. Fixes #2133
Using --write-index defaults to --write-index=csi as before, but we can now do --write-index=tbi to get TBI indices instead. This takes precedence over auto-detection based on filename suffix when using the filename##idx##indexname nomenclature. The exception to defaulting to CSI is bcftools isec, which defaults to TBI. This disparity was there before and this PR doesn't change that behaviour. Fixes #2008
Note this is the only subcommand which defaults to writing TBI for VCF.gz. Everything else defaults to CSI. I'm unsure why this is, but I haven't changed it in this PR.
Previously only the last file had the index written.
This was previously segmentation faulting as not being a string means it didn't set the key field. We now report it as a parse error.
Given --write-index now takes an optional argument, and optional long options are --long-opt=arg and short options are -larg, I chose to also accept -l=arg for the sake of consistency and ease of documentation. The standard -larg still works too. If we wish to stick strictly to the normal conventions, then this is a trivial change in version.c (and a search and replace in the documentation). Also fix formatting bug in bcftools merge man page section leading to a lot of underlined text. Fixes #2139
It stated "Missing the --fa-ref option", but the option is "--fasta-ref".
Red Hat's ExtUtils::Embed returns CLAGS & LDFLAGS that produce a PIE position-independent executable. However libhts.a and the rest of bcftools's *.o files are not compiled as PIC, so linking fails as these other objects use relocations that are invalid for PIE. While bcftools could link against libhts.so and compile the rest of its objects with -fpic, the logistics are non-trivial. So it's easier to omit the redhat-hardened-cc1 and redhat-hardened-ld specs that set up for building a PIE executable. Fixes #1322.
Requires the htslib update samtools/htslib#1912 Resolves #2395
…gins Verbosity values bigger than 3 are passed to the underlying HTSlib library so that the user can investigate network issues and other problems occurring at the library level. Resolves #2235
Now is also supports sample negation as advertised in the manual page, e.g. `-s ^sample1,sample2` to include all samples but sample1 and sample2 Resolves #2380
On little-endian hosts, `int` here caused incorrect results at positions beyond the range of int. On big-endian hosts, it caused incorrect results at all positions.
Calling bgzf_close() (and hts_close()) frees the file pointer, so the errcode field cannot be accessed afterwards. Report errors as per errno instead as is done in vcfconvert.c. Also report the right file pointer's errcode after bgzf_write() failure.
When the gVCF contains overlapping blocks, they would trigger an infinite loop in the program and it would never finish Resolves #2410
Prior to C23, declaring a variable after a case label is an error. Introduce a function encapsulating --verbosity parsing to avoid needing a local variable within this option parsing case. Similarly in vcfmerge.c, avoid declaring a variable after a goto label.
MINGW x64 works fine, but UCRT-x64 fails one of the setGT tests. Specifically, `-n c:././.` gets silently rewritten before main() so the argv element contains `-n c;.\.\.` instead. This is a pain, and arguably we should provide a less problematic CLI alternative for this as c: was always going to trip up Windows boxes. However for now we can work around it by not having it as a separate argument and letting getopt do the tokenisation of argv into option and optarg for us by removing the space. This doesn't fix the bug obviously, but it makes it pass tests if using the UCRT environment.
This is to allow renaming samples from a list of samples on command line, rather than from a file of sample names. Unfortunately, the existing option `-s, --samples` conflicts with the rest of bcftools, therefore this is added as -n, --samples-list LIST New sample names given as a comma-separated list -N, --samples-file FILE New sample names in a file, see the man page for details The old option remains valid but is not advertised in the usage page. Resolves #2383
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
No description provided.