8000 sync by starskyzheng · Pull Request #1 · starskyzheng/bcftools · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

sync #1

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 514 commits into
base: develop
Choose a base branch
from
Open

sync #1

wants to merge 514 commits into from

Conversation

starskyzheng
Copy link
Owner

No description provided.

@pd3 pd3 force-pushed the develop branch 2 times, most recently from ab5de54 to 2191405 Compare February 13, 2023 10:34
pd3 and others added 27 commits February 12, 2024 14:32
- -i/-e filtering expression containing missing values must not crash when
  subfields are queried on the fly

- parsing of subfield arrays with missing values now done correctly

Resolves #2098
When both DP and AD values are present, VAF is calculated as
AD/DP. This could result in VAF values bigger than 1 and access
beyond array boundaries.

Fixes #2102
This also
- adds a check for cases where the --file-list is empty
- adds a synonynomous option --force-no-index to --no-index

Resolves #2100
The main function sets mplp.bsmpl = bam_smpl_init() which allocates
memory.  This is freed at the end of main_mpileup.  Unfortunately
there are many ways of exiting this function early which don't do this
free.  Restructured it slightly so the free happens.

It's not a serious problem, but it did prevent make check from
working when using -fsanitize=address.  This wasn't spotted via the CI
as it doesn't have a PTY and hence it avoids running the usage tests.
If the filtering expression queried keys from a file, for example

    -i 'INFO/TAG=@strings_expected'

the program would segfault if INFO/TAG was not present in the first
VCF record. This is now fixed by making sure NULL is not passed to
khash_str2int_has_key.

Fixes #2111
For an alignment that doesn't have an indel but is aligned against
reads that do have an indel, the indel quality comes from the BAM
quality.  However we already have indelQ assigned, so this avoids
changing to BAM qual if indelQ is zero as that is a special case for a
read aligning to multiple indel "types" (lengths) with equal score.

This avoids excess AD numbers for poorly chosen alignments.

Fixes #2113
CONS_CUTOFF_DEL and CONS_CUTOFF_INC are now 35% instead of 40% (also
DEL is new as it previously shared the score with SNPs).

Plus when computing the ratio we no longer apply this as a percentage
to the entire depth, but as a percentage of the alignments overlapping
any STRs spanning this region (so meaningful alignments) that aren't
already allocated to the "type" being diagnosed.

Both of these make it slightly more likely to emit a second consensus
sequence potentially containing indel variations.  This fixes the
specific problem raised in #2117, but like all heuristics it benefits
some cases and harms others.  We will always have some cases that fall
just either side of the thresholds and look odd.

Overall it seems a slim benefit and the work to track the number of
meaningful spanning alignments may have potential value in subsequent
updates.

Fixes #2117
This applies where there are zero observed indels in a sample (but
they are in other samples).  It's odd how this appears to be a pretty
rare change, as we'd expect a significant change to FP, but it's
tiny.  Genotype assignment error changes a lot, but only in one of the
3 samples I tested.

On HG002, I see the following differences to calling rates.

Previous     All     QUAL>=100
InDel TP   11823 /   11798
InDel FP    5374 /    4898
InDel GT     296 /     291
InDel FN     115 /     140

New          All     QUAL>=100
InDel TP   11822 /   11795
InDel FP    5313 /    4805
InDel GT      80 /      74
InDel FN     116 /     143

This was HG002 called in conjunction with HG003 and HG004, but not as
a trio (so no pedegree supplied).  Oddly despite that HG002 is much
more accurate than HG003 and HG004, with GT assignment error rates an
order of magnitude higher.  This PR makes them a bit higher still
(maybe another 20%).  I cannot explain either of these, but perhaps
it's simply down to the accuracy of the truth set as HG002 is by far
the most widely curated of the three.  Either that or my analysis has
a flaw somewhere.

Fixes #2130.
This didn't work in the past because update_from_fai() opened,
read the header from, and then closed the input file, preventing
the reheader_* functions from accessing the rest of the file
contents when the input is a stream.  As all the reheader_*
functions read the header into a kstring, it's possible to make
streaming work by passing this kstring into update_from_fai() and
adjusting it to work directly on that copy of the data.

As update_from_fai() no longer needs to write a temporary file,
args_t::rm_tmpfile and args_t::tmp_prefix can be removed.  The -T
option is ignored as it's no longer needed, but is still accepted
for compatibility.  The init_tmp_prefix() function is still used
by some other bcftools subcommands, so is left in place for now.
When setting value by determining index from the genotype, we face the problem
of how to interpret truncating arrays. Say we have TAG defined as Number=. and
     GT:TAG   1/1:0,1,2  0/0:0
Then when querying we expect the following expression to evaluate for the second
sample as
     -i 'TAG[1:1]="."'  .. true
     -i 'TAG[1:GT]="."' .. false
The problem is that the implementation truncates the number of fields, filling
usually fewer than the original number of per-sample values. This is fixed by
adding an exception that makes the code aware of this.

Fixes #2133
Using --write-index defaults to --write-index=csi as before, but we
can now do --write-index=tbi to get TBI indices instead.  This takes
precedence over auto-detection based on filename suffix when using the
filename##idx##indexname nomenclature.

The exception to defaulting to CSI is bcftools isec, which defaults to
TBI.  This disparity was there before and this PR doesn't change that
behaviour.

Fixes #2008
Note this is the only subcommand which defaults to writing TBI for
VCF.gz.  Everything else defaults to CSI.  I'm unsure why this is, but
I haven't changed it in this PR.
Previously only the last file had the index written.
This was previously segmentation faulting as not being a string means
it didn't set the key field. We now report it as a parse error.
Given --write-index now takes an optional argument, and optional long
options are --long-opt=arg and short options are -larg, I chose to
also accept -l=arg for the sake of consistency and ease of
documentation.  The standard -larg still works too.

If we wish to stick strictly to the normal conventions, then this is a
trivial change in version.c (and a search and replace in the
documentation).

Also fix formatting bug in bcftools merge man page section leading to
a lot of underlined text.

Fixes #2139
It stated "Missing the --fa-ref option", but the option is "--fasta-ref".
Red Hat's ExtUtils::Embed returns CLAGS & LDFLAGS that produce a
PIE position-independent executable. However libhts.a and the rest
of bcftools's *.o files are not compiled as PIC, so linking fails
as these other objects use relocations that are invalid for PIE.

While bcftools could link against libhts.so and compile the rest of
its objects with -fpic, the logistics are non-trivial. So it's easier
to omit the redhat-hardened-cc1 and redhat-hardened-ld specs that set
up for building a PIE executable. Fixes #1322.
pd3 and others added 30 commits April 10, 2025 09:58
…gins

Verbosity values bigger than 3 are passed to the underlying HTSlib library
so that the user can investigate network issues and other problems occurring
at the library level.

Resolves #2235
Now is also supports sample negation as advertised in the manual
page, e.g. `-s ^sample1,sample2` to include all samples but sample1
and sample2

Resolves #2380
On little-endian hosts, `int` here caused incorrect results at positions
beyond the range of int. On big-endian hosts, it caused incorrect results
at all positions.
Calling bgzf_close() (and hts_close()) frees the file pointer, so the
errcode field cannot be accessed afterwards. Report errors as per errno
instead as is done in vcfconvert.c.

Also report the right file pointer's errcode after bgzf_write() failure.
When the gVCF contains overlapping blocks, they would trigger
an infinite loop in the program and it would never finish

Resolves #2410
Prior to C23, declaring a variable after a case label is an error.
Introduce a function encapsulating --verbosity parsing to avoid needing
a local variable within this option parsing case.

Similarly in vcfmerge.c, avoid declaring a variable after a goto label.
MINGW x64 works fine, but UCRT-x64 fails one of the setGT tests.
Specifically, `-n c:././.` gets silently rewritten before main() so
the argv element contains `-n c;.\.\.` instead.  This is a pain, and
arguably we should provide a less problematic CLI alternative for this
as c: was always going to trip up Windows boxes.

However for now we can work around it by not having it as a separate
argument and letting getopt do the tokenisation of argv into option
and optarg for us by removing the space.

This doesn't fix the bug obviously, but it makes it pass tests if
using the UCRT environment.
This is to allow renaming samples from a list of samples on command line,
rather than from a file of sample names. Unfortunately, the existing
option `-s, --samples` conflicts with the rest of bcftools, therefore
this is added as

    -n, --samples-list LIST    New sample names given as a comma-separated list
    -N, --samples-file FILE    New sample names in a file, see the man page for details

The old option remains valid but is not advertised in the usage page.

Resolves #2383
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

0