Abstract
Variant calling has been widely used for genotyping and for improving the consensus accuracy of long-read assemblies. Variant calls are commonly hard-filtered with user-defined cutoffs. However, it is impossible to define a single set of optimal cutoffs, as the calls heavily depend on the quality of the reads, the variant caller of choice and the quality of the unpolished assembly. Here, we introduce Merfin, a k-mer based variant-filtering algorithm for improved accuracy in genotyping and genome assembly polishing. Merfin evaluates each variant based on the expected k-mer multiplicity in the reads, independently of the quality of the read alignment and variant caller’s internal score. Merfin increased the precision of genotyped calls in several benchmarks, improved consensus accuracy and reduced frameshift errors when applied to human and nonhuman assemblies built from Pacific Biosciences HiFi and continuous long reads or Oxford Nanopore reads, including the first complete human genome. Moreover, we introduce assembly quality and completeness metrics that account for the expected genomic copy numbers.
This is a preview of subscription content, access via your institution
Access options
Access Nature and 54 other Nature Portfolio journals
Get Nature+, our best-value online-access subscription
£14.99 / 30 days
cancel any time
Subscribe to this journal
Receive 12 print issues and online access
£169.00 per year
only £14.08 per issue
Buy this article
- Purchase on SpringerLink
- Instant access to full article PDF
Prices may be subject to local taxes which are calculated during checkout
Similar content being viewed by others
Data availability
HG002 variant-call data were downloaded from https://data.nist.gov/od/id/mds2-2336 (SEX9X, NFT0L, 23O09 and QUE7Q). Sequencing data and assemblies for CHM13, HG002 and VGP genomes are available at https://github.com/marbl/CHM13, https://github.com/human-pangenomics/HG002_Data_Freeze_v1.0 and https://vgp.github.io/.
Source data used for generating all figures in this manuscript are available at https://github.com/gf777/misc/tree/master/merfin/paper/figures.
The K* tracks for HiFi and Illumina of the CHM13 are browsable in the associated UCSC browser (http://genome.ucsc.edu/cgi-bin/hgTracks?db=hub_2395475_t2t-chm13-v1.1). All variant calls used in the genotyping benchmarks, k-mer databases, fitted histogram tables and K* tracks are available to download at https://s3-us-west-2.amazonaws.com/human-pangenomics/index.html?prefix=publications/MERFIN_2021/ with a step-by-step guideline available at https://github.com/arangrhie/merfin/wiki/Best-practices-for-Merfin. All data are publicly open for download with no restrictions.
Code availability
A stable release and the source code for Merfin and examples from this work are available under Apache License 2.0 at GitHub (https://github.com/arangrhie/merfin) and Zenodo (https://doi.org/10.5281/zenodo.5527270)52. The only dependency is the k-mer counter Meryl, which comes with the release. Merfin can be run in five modes (1) the -filter mode scores each variant or variants within distance k and their combinations by error k-mers for improved genotyping; (2) the -completeness mode generates completeness metrics; (3) the -dump mode computes KC, KR, K* for each base in the assembly along with QV and QV* for each sequence; (4) the -hist mode provides a K* histogram and genome-wide QV and QV* averages; and (5) the -polish mode scores each variant or variants within distance k and their combinations by the K* for polishing. Merfin is fully parallelized using OpenMP. A combination of bash and Rscript used for data analysis and visualization is available at https://github.com/gf777/misc/tree/master/merfin/paper/figures. A Code Ocean capsule of the package is provided (https://doi.org/10.24433/CO.2292349.v1)53.
References
Olson, N. D. et al. precisionFDA Truth Challenge v2: calling variants from short- and long-reads in difficult-to-map regions. Preprint at bioRxiv https://doi.org/10.1101/2020.11.13.380741 (2021).
Koboldt, D. C. Best practices for variant calling in clinical sequencing. Genome Med. 12, 91 (2020).
Guo, Y., Ye, F., Sheng, Q., Clark, T. & Samuels, D. C. Three-stage quality control strategies for DNA re-sequencing data. Brief. Bioinform. 15, 879–889 (2014).
Giani, A. M., Gallo, G. R., Gianfranceschi, L. & Formenti, G. Long walk to genomics: history and current approaches to genome sequencing and assembly. Comput. Struct. Biotechnol. J. 18, 9–19 (2020).
Rhie, A. et al. Towards complete and error-free genome assemblies of all vertebrate species. Nature 592, 737–746 (2021).
Wenger, A. M. et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat. Biotechnol. 37, 1155–1162 (2019).
Watson, M. & Warr, A. Errors in long-read assemblies can critically affect protein prediction. Nat. Biotechnol. 37, 124–126 (2019).
Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).
Walker, B. J. et al. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS ONE 9, e112963 (2014).
Chin, C.-S. et al. Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data. Nat. Methods 10, 563–569 (2013).
Vaser, R., Sović, I., Nagarajan, N. & Šikić, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27, 737–746 (2017).
McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
Garrison, E. & Marth, G. Haplotype-based variant detection from short-read sequencing. Preprint at arXiv https://arxiv.org/abs/1207.3907 (2012).
Poplin, R. et al. A universal SNP and small-indel variant caller using deep neural networks. Nat. Biotechnol. 36, 983–987 (2018).
Li, H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987–2993 (2011).
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21, 245 (2020).
Mapleson, D., Garcia Accinelli, G., Kettleborough, G., Wright, J. & Clavijo, B. J. KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies. Bioinformatics 33, 574–576 (2017).
Kundu, R., Casey, J. & Sung, W.-K. HyPo: super fast and accurate polisher for long read genome assemblies. Preprint at bioRxiv. https://doi.org/10.1101/2019.12.19.882506 (2019).
Jain, C. et al. Weighted minimizer sampling improves long read mapping. Bioinformatics 36, i111–i118 (2020).
Jain, C., Rhie, A., Hansen, N., Koren, S. & Phillippy, A. M. Long read mapping to repetitive reference sequences using Winnowmap2. Nat. Methods (2022).
Phillippy, A. M., Schatz, M. C. & Pop, M. Genome assembly forensics: finding the elusive mis-assembly. Genome Biol. 9, R55 (2008).
Nurk, S. et al. The complete sequence of a human genome. Science 376, eabj6987 https://doi.org/10.1126/science.abj6987 (2022).
Vollger, M. R. et al. Segmental duplications and their variation in a complete human genome. Science 376, eabj6965 https://doi.org/10.1126/science.abj6965 (2022).
Gershman, A. et al. Epigenetic patterns in a complete human genome. Science 376, eabj5089 https://doi.org/10.1126/science.abj5089 (2022).
Mc Cartney, A. M. et al. Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies. Nat. Methods (2022).
Jarvis, E.D. et al. Automated assembly of high-quality diploid human reference genomes. Preprint at bioRxiv https://doi.org/10.1101/2022.03.06.483034 (2022).
Krusche, P. et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat. Biotechnol. 37, 555–560 (2019).
Ranallo-Benavidez, T. R., Jaron, K. S. & Schatz, M. C. GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat. Commun. 11, 1432 (2020).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
Huddleston, J. et al. Discovery and genotyping of structural variation from long-read haploid genome sequence data. Genome Res. 27, 677–685 (2017).
Miga, K. H. et al. Telomere-to-telomere assembly of a complete human X chromosome. Nature 585, 79–84 (2020).
Zook, J. M. et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci. Data 3, 160025 (2016).
Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).
Koren, S. et al. De novo assembly of haplotype-resolved genomes with trio binning. Nat. Biotechnol. https://doi.org/10.1038/nbt.4277 (2018).
Li, H. et al. A synthetic-diploid benchmark for accurate variant-calling evaluation. Nat. Methods 15, 595–597 (2018).
Wagner, J. et al. Curated variation benchmarks for challenging medically relevant autosomal genes. Nat. Biotechnol. https://doi.org/10.1038/s41587-021-01158-1 (2022).
Chin, C.-S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050–1054 (2016).
O’Leary, N. A. et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44, D733–D745 (2016).
NCBI. Gnomon - the NCBI eukaryotic gene prediction tool. https://www.ncbi.nlm.nih.gov/genome/annotation_euk/gnomon/ (2017).
Fofanov, Y. et al. How independent are the appearances of n-mers in different genomes? Bioinformatics 20, 2421–2428 (2004).
Zhao, X., Weber, A. M. & Mills, R. E. A recurrence-based approach for validating structural variation using long-read sequencing technology. Gigascience 6, 1–9 (2017).
Benjamini, Y. & Speed, T. P. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic Acids Res. 40, e72 (2012).
Yang, C. et al. Evolutionary and biomedical insights from a marmoset diploid genome assembly. Nature https://doi.org/10.1038/s41586-021-03535-x (2021).
Cheng, H. et al. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
Altemose, N. et al. Complete genomic and epigenetic maps of human centromeres. Science 376, eabl4178 https://doi.org/10.1126/science.abl4178 (2022).
Robinson, J. T. et al. Integrative genomics viewer. Nat. Biotechnol. 29, 24–26 (2011).
Morgulis, A., Gertz, E. M., Schäffer, A. A. & Agarwala, R. WindowMasker: window-based masker for sequenced genomes. Bioinforma. Oxf. Engl. 22, 134–141 (2006).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Kapustin, Y., Souvorov, A., Tatusova, T. & Lipman, D. Splign: algorithms for computing spliced alignments with identification of paralogs. Biol. Direct 3, 20 (2008).
Lowe, T. M. & Eddy, S. R. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964 (1997).
Nawrocki, E. P. et al. RFAM 12.0: updates to the RNA families database. Nucleic Acids Res. 43, D130–D137 (2015).
Formenti, G., Rhie, A. & Walenz, B. arangrhie/merfin: Merfin v1.0. Zenodo https://doi.org/10.5281/zenodo.5527270 (2021).
Formenti, G., Rhie, A. & Walenz, B. Merfin - improved Arrow polishing through vcf filtering and Genomescope2 kmer models. Code Ocean https://doi.org/10.24433/CO.2292349.v1 (2022).
Acknowledgements
We thank T. Rhyker Ranallo-Benavidez and M.C. Schatz for the useful discussion on adapting Genomescope2 models. We also thank the communities of the T2T, HPRC and VGP consortia for their constant support. G.F. and E.D.J. were supported by Rockefeller University and HHMI funds. A.R., B.P.W., S.K. and A.M.P. were supported by the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health (NIH) (1ZIAHG200398). The work of F.T.-N. was supported by the Intramural Research Program of the National Library of Medicine, NIH. K.S. was supported by NIH/NHGRI (R01HG010485, U41HG010972, U01HG010961, U24HG011853 and OT2OD026682). E.W.M. was partially supported by the German Federal Ministry of Education and Research (01IS18026C). Part of this work used the computational resources of the NIH HPC Biowulf cluster (https://hpc.nih.gov).
Author information
Authors and Affiliations
Contributions
A.R., G.F., B.P.W. and E.W.M. implemented Merfin. G.F. and A.R. performed the validation analyses. K.S. performed the GIAB variant-calling analysis on HG002. F.T. generated the gene annotations for the VGP genomes. S.K. contributed to the conceptual development. G.F. and A.R. wrote the manuscript. G.F., A.R., E.D.J. and A.M.P. conceived the study. All authors reviewed, edited and approved the manuscript.
Corresponding authors
Ethics declarations
Competing interests
S.K. has received travel funds to speak at symposia organized by Oxford Nanopore. The remaining authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks Chuan-Le Xiao and the other, anonymous, reviewer for their contribution to the peer review of this work. Primary Handling editor: Lin Tang, in collaboration with the Nature Methods team. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data
Extended Data Fig. 1 Flowchart diagram of each mode in Merfin.
Text inside gray boxes on the top represents input files required (solid) or optional (dashed) for Merfin. a, genotyping (-filter) and polishing (-polish) modes. b, K* histogram (-hist) and K* completeness (-completeness) modes. Steps listed in bullet points are marked in gray if it is only applicable in -polish (a) or -completeness (b) mode.
Extended Data Fig. 2 Genome-wide density distribution of the K* using Illumina k-mers.
When the assembly is in agreement with the raw data, the K* is normally distributed with mean 0 and the smaller the standard deviation the higher the agreement. T2T-CHM13v1.0 shows a less dispersed distribution of the K* compared to v0.7.
Extended Data Fig. 3 A region of negative K* highlighting sequencing bias.
An example of low coverage in both HiFi and Illumina reads associated with high guanine content, and specifically a GA-rich repeat (heatmap). GA bias has been reported in PacBio HiFi data, and results in gaps in the assembly that in CHM13 were filled with Nanopore data22. The K* both from HiFi and Illumina k-mers (top tracks) recapitulate the coverage drop. Nanopore coverage appears less affected. Position Chr. 12:~129,862,000 bp.
Extended Data Fig. 4 The K* can identify issues in the assembly at the base level.
a, 40 bp window with K* close to 0, highlighting perfect agreement of the assembly with the raw reads. Position Chr18:~7,000,000 bp. b, A region of negative K* in coincidence with two heterozygous indels. Position Chr1:~105,008,350 bp.
Extended Data Fig. 5 Coverage titration experiment and impact on QV*.
The QV* is only marginally influenced by the coverage of the dataset being considered.
Extended Data Fig. 6 Haplotype phasing before and after polishing with Merfin.
In both parental assemblies, the haplotypes remained fully phased, and the size of the blocks substantially increased compared to the unpolished version (a,b) after polishing with Merfin (c,d). A theoretical human genome size of 3.1 Gbp was used to normalize NG* values.
Extended Data Fig. 7 VGP assembly pipeline.
Compared to the previous v1.6, the introduction of Merfin in v1.7 (green) resulted in a minimal change of the workflow, but in a generalized improvement in QV scores and gene annotations. Pipeline available at https://github.com/VGP/vgp-assembly.
Extended Data Fig. 8 Phase block analysis of zebra finch pseudo-haploid assembly.
a, Phase blocks in the primary assembly after mapping the reads to both the primary and alternate assemblies. b, Phase blocks in the primary assembly after mapping the reads to the primary only. c, Phase blocks in the alternate assembly after mapping the reads to both the primary and alternate assemblies. d, Phase blocks in the alternate assembly. In all cases, the application of Merfin filtering minor heterozygous variants (green) leads to block sizes better or comparable to prior polishing methods alone (blue). Unpolished assembly in gray. Results of Merfin without filtering in red. A genome size of ~1.03 Gbp derived from Genomescope2 was used to normalize NG* values.
Extended Data Fig. 9 Effect of merfin correction on the kinetochore scaffold 1 (KNL1) annotation.
a, Deleterious presence of an extra A around position 1,321,620 of scaffold_7 (red box) in the polished, non-merfin-corrected sequence is indicated by a 1-base gap in the alignments of zebra finch PacBio IsoSeq SRR8695295.20794.1 and KNL1 transcripts from three other Passeriformes songbirds. This insertion causes a disruption in the frame and a premature stop codon in the translated sequence (see amino acid sequence in red). b, Corresponding span in the merfin-corrected assembly, with gapless alignments of the IsoSeq read and Passeriformes transcripts, and uninterrupted translation.
Supplementary information
Rights and permissions
About this article
Cite this article
Formenti, G., Rhie, A., Walenz, B.P. et al. Merfin: improved variant filtering, assembly evaluation and polishing via k-mer validation. Nat Methods 19, 696–704 (2022). https://doi.org/10.1038/s41592-022-01445-y
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1038/s41592-022-01445-y
This article is cited by
-
Bat genomes illuminate adaptations to viral tolerance and disease resistance
Nature (2025)
-
Constructing telomere-to-telomere diploid genome by polishing haploid nanopore-based assembly
Nature Methods (2024)
-
Genome assembly in the telomere-to-telomere era
Nature Reviews Genetics (2024)
-
Technology-enabled great leap in deciphering plant genomes
Nature Plants (2024)
-
Low-input PacBio sequencing generates high-quality individual fly genomes and characterizes mutational processes
Nature Communications (2024)