- Research
- Open access
- Published:
Mem-based pangenome indexing for k-mer queries
Algorithms for Molecular Biology volume 20, Article number: 3 (2025)
Abstract
Pangenomes are growing in number and size, thanks to the prevalence of high-quality long-read assemblies. However, current methods for studying sequence composition and conservation within pangenomes have limitations. Methods based on graph pangenomes require a computationally expensive multiple-alignment step, which can leave out some variation. Indexes based on k-mers and de Bruijn graphs are limited to answering questions at a specific substring length k. We present Maximal Exact Match Ordered (MEMO), a pangenome indexing method based on maximal exact matches (MEMs) between sequences. A single MEMO index can handle arbitrary-length queries over pangenomic windows. MEMO enables both queries that test k-mer presence/absence (membership queries) and that count the number of genomes containing k-mers in a window (conservation queries). MEMO’s index for a pangenome of 89 human autosomal haplotypes fits in 2.04 GB, 8.8\(\times\) smaller than a comparable KMC3 index and 11.4\(\times\) smaller than a PanKmer index. MEMO indexes can be made smaller by sacrificing some counting resolution, with our decile-resolution HPRC index reaching 0.67 GB. MEMO can conduct a conservation query for 31-mers over the human leukocyte antigen locus in 13.89 s, 2.5\(\times\) faster than other approaches. MEMO’s small index size, lack of k-mer length dependence, and efficient queries make it a flexible tool for studying and visualizing substring conservation in pangenomes.
Introduction
There is a growing availability of pangenomes such as the Human Pangenome Reference Consortium (HPRC, t=94), the Vertebrate Genomes Project (VGP, t=16), and a recent pangenome for Arabidopsis thaliana (t=69) where t is the number of genomes in the pangenome [1–3]. Pangenomes enable new ways of studying and visualizing variation, as well as the degree to which genomic sequences are conserved [4]. K-mers have proven to be a powerful tool for factoring, representing and indexing genomes. They have been used to power genome-wide association studies in plants such as barley and soybean [5–7], to identify single-copy genes in pangenomes [5, 8], and to study sequence conservation [9].
However, to date, indexing methods based on k-mers or de Bruijn Graphs require the value k to be set at index building time, limiting future queries to use that value of k only. PanKmer [9] supports only 31-mer queries, whereas length-31 substrings may not be the correct resolution at which to understand conservation across all genes or pangenomes [10]. Further, k-mer-based indexes can be large; e.g. an index consisting of a separate 31-mer KMC3 database [11] for each haplotype in the HPRC requires 1.26 TB.
Alternative methods for indexing pangenomes also have drawbacks. Graph-based methods start with a computationally difficult reference-graph construction step. Accurate multiple alignments are difficult to create, requiring that some difficult repetitive sequences be masked first (e.g. the “dna-brnn” regions of the HPRC [1]).
Here we depart from the idea of a fixed length-k index by expanding on the notion of a “sequence landscape,” a vector of lengths of half-maximal exact matches between a query and reference text [12–14]. When computed between two sequences, a sequence landscape is equivalent to a vector of matching statistics (MS) lengths, from which maximal exact matches (MEMs) can be derived [15]. While MSs and MEMs have been applied successfully to classification [16, 17], they have not yet been used to index pangenomes.
We present a maximal exact match (MEM)-based compressed indexing approach called MEMO (Maximal Exact Match Ordered), along with new concepts enabling both lossless and lossy pangenome indexes. We first introduced MEMO as a Workshop on Algorithms in Bioinformatics (WABI) 2024 proceedings paper. MEMO indexes MEMs between a “pivot” genome selected from the pangenome with respect to all the other genomes. MEMO can then answer any-length k-mer queries for k-mers drawn from the pivot.
MEMO builds from a few methodological principles. First, the MEMs indexed by MEMO are sufficient for answering any k-mer membership or conservation query for any k as long as the k-mer is from the pivot. Second: it is also sufficient to store only the intervals representing overlaps between consecutive MEMs, helping to reduce index size. Third: we also use a variation on MEMs called “order-MEMs”, obtained by re-sorting the values in the matching statistics vectors. Order-MEMs permute the MSs to a hierarchy in which lower order order-MEMs are contained within higher order order-MEMs. This ordering speeds up conservation queries both by enabling early stopping, the ability to return a correct answer after examining some but not all of the orders, and by enabling lossy compression as described below.
We also introduce two ideas to further reduce index size with lossy compression. The first builds on the use of order-MEMs; once arranged as orders, we can discard some orders—and potentially a large fraction of the MEM intervals—as long as the user is satisfied with coarse-grained answers to conservation queries. A coarse-grained answer does not convey the exact number of genomes in which a substring occurs, but could instead return e.g. the largest percentage such that the substring occurs in at least 10%, 20%, ..., 90% or 100% of the genomes (a “decile conservation query”). The second idea builds on the observation that if we limit the user to making k-mer queries where k is greater than a threshold length l, we can reduce the index size further by discarding MEMs with length \(\le l\). If we operate using MEM overlaps rather than MEMs themselves, we can apply an inverted version of that principle; i.e. we can enable k-mer queries for k less than a threshold length while discarding MEM overlaps greater than that threshold.
Finally, we offer the practical insight that indexes over pangenomic MEMs—as well as the variants discussed here—are quite compressible, due both to the inherent repetitiveness of pangenomes, and to the inherent inefficiency of how offsets and lengths are stored in BED files. This compression is not exploited when intervals are simply placed in BED files, nor is it well exploited by standard compression approaches compressing BED files, such as gzip or tabix [18]. Approaches that use columnar compression such as Apache Parquet [19] are key to achieving the needed degree of compression while still enabling fast queries.
Here we test MEMO by comparing its index size and query speed to existing k-mer-based methods for membership and conservation queries. We find that MEMO consistently yields the smallest index size, sometimes orders of magnitude smaller than those from comparable approaches. We show how MEMO’s index scales well to large pangenomes, and that its lossy and lossless strategies for reducing index size were effective, ultimately fitting the HPRC index for coarse-grained conservation queries in less than 1 GB. Finally, we demonstrate MEMO’s utility in visualizing and exploring sequence conservation in a pangenome by visualizing sequence conservation in the region of the human genome containing the Human Leukocyte Antigen (HLA) genes.
Preliminaries
Basic definitions
A genome is a string sequence \(S = S[1..n] = S[1]S[2] \ldots S[n]\) with length \(|S| = n\) characters drawn from an alphabet \(\Sigma = \{A,T,C,G\}\). A pangenome is a collection of t genomes \(G = [G_1, G_2,..., G_t]\). We refer to a substring of S as \(S[i..j] = S[i]S[i+1] \ldots S[j]\) where \(i \ge 1\), \(j \le n\), and \(i \le j\). We denote a k-mer of S as a k-length substring \(S[i..i+k-1]\) and all k-mers of S as \(S[i..i+k-1]\), \(i \in [1,n-k+1]\), where \(k \le n\).
Genome intervals are zero-indexed, half-open intervals of a string. These intervals, with contig information, are stored in a text file called a Browser Extensible Data format (BED file). For example, substring S[10..15], where S is the DNA sequence of chromosome 1, is represented by the interval [9, 15) and written in a tab-delimited row of a BED file as “chr1 9 15“. The intervals of a BED file can be annotated with additional metadata by adding subsequent values in each row after the offsets.
Membership and conservation queries
Given a pangenome G and string S, a k-mer membership query on G returns t-number of length-n vectors of true/false values indicating the presence/absence of the k-mer in each genome. A k-mer conservation query on G returns a single length-n vector of the number of genomes that the k-mer occurs in, which are integers in [0, t].
Matching statistics and maximal exact matches
MSs are half-maximal exact matches between two strings, a pattern P[1..m] and text T[1..n], that cannot be extended in one direction without introducing a mismatch or reaching the end of a string (Fig. 1A). We define the MSs of string P as an array \(\textrm{MS}[1..m]\) where \(\textrm{MS}[j]\) is the length of the longest suffix of P[1..j] occurring in T. We note that this definition of MS is reversed with respect to how it is defined in some other work. This is because some algorithms for computing MS naturally work in right-to-left direction [15]. For simplicity, we will define and discuss \(\textrm{MS}[1..m]\) as though it is computed left-to-right. By definition, successive values of \(\textrm{MS}[1..m]\) have the sawtooth property:
Lemma 1
\(\textrm{MS}[j] - \textrm{MS}[j-1] \le 1 \,, \,\,j \in (2,m]\,\,\,\)
Proof
Follows from [15]. \(\square\)
For a pangenome G, we define a matching-statistics matrix L[1..t][1..m] such that \(L[i] = \textrm{MS}[1..m]\) with respect to \(P = G_1\) and \(T = G_i\). A full vector of matching statistics can be more concisely represented as a vector of MEMs. A MEM is an exact substring match between a pattern P[1..m] and text T[1..n] that cannot be extended left or right without introducing a mismatch or reaching the end of a string. MEMs can be derived from MSs. The length \(\textrm{MS}[j]\) at position j is the length of a MEM with right-hand extreme at offset j if and only if \(j = m\) or:
In other words, a one-to-one mapping exists between MEM and “peaks“ of the sawtooth (Fig. 1A), where the above expression defines a peak.
Methods
Methods overview
MEMO index outline. A Numbers below the pattern (i.e. pivot genome) are matching statistic (MS) lengths with respect to the Text (i.e. other genome). Triangles represent MS positions and lengths. MS peak lengths are bolded; these correspond to maximal exact matches (MEMs). B Presence/absence of the Pattern’s k-mers depicted as green and red rectangles. Grey triangles represent overlaps between consecutive MEMs. C Order-MEM creation from a pangenome. Top left: Sequences of anchor genome (\(G_1\)) and genomes (\(G_2\)-\(G_5\)). Top right: MSs matrix of match lengths between \(G_1\) and \(G_2\)-\(G_5\). Bottom: Order-MEMs found from MSs. MEM and order triangles of \(G_2\)-\(G_5\) are colored distinctly. Retained match-lengths capturing the landscape are bolded; redundant match-lengths are translucent and discarded in the MEMO indexes. Peaks of a single order are composed of MEMs from varying genomes. K-mer presence/absence of an example query are depicted in green and red. The order-MEM k-mer query enables early stoppage, as depicted by the k-mer in the dotted outline
MEMO is a pangenome index enabling arbitrary-length k-mer membership and conservation queries. Both membership and conservation queries are limited to k-mers that occur in a particular genome, called the “pivot.” The index works by pre-computing and indexing the MEMs between one of the genomes (the “pivot”) and all the others. The user may pose queries using only k-mers from the pivot; i.e. the user specifies the interval within the pivot containing the query k-mers. Thus, for MEMO, a membership query will always return at least one “true” value (for the pivot) and the conservation query result will always return values \(\ge 1\). The index answers such queries by examining whether and how the query intervals overlap the indexed MEM intervals.
Some pangenomes will have a natural choice of pivot. For example, the T2T-CHM13 assembly is the most complete [20]. But this is also a limitation of the MEMO approach; when there is more than one natural pivot, the user may need to build multiple indexes for multiple choices of pivot.
To build the index, we first compute vectors of matching statistics (MSs), also called sequence landscapes [12], between the pivot and the other genomes (Fig. 1A). This yields a matrix of MSs, where the rows are genomes and columns are pivot coordinates (Fig. 1C, bottom). MEMs map one-to-one to peaks in the matrix, i.e. instances where one MS is not less than the MS to its right. The presence or absence of a k-mer from pivot offset i is determined by asking whether it falls entirely within a MEM (Fig. 1C, left middle). MEMO implements a complementary approach, which considers whether a k-mer contains an overlap region between consecutive MEMs (Fig. 1B) with overhangs on either side, in which case the k-mer is not present.
To answer conservation queries, MEMO uses a rearranged version of MSs, whereby the MS matrix is first sorted along its column axis. After this, a row no longer represents MSs with respect to a particular genome, but instead represents “order” MSs with respect to the entire pangenome (Fig. 1C, right). We call MEMs derived from the reordered matrix order-MEMs. A k-mer fully contained in an ith order-MEM occurs in at least i other genomes in the pangenome.
For all index types, MEM intervals are indexed as a columnar-compressed BED file, with an extra column containing an identifier for the genome of origin (for membership queries) or a number indicating the rank of the order statistic (for conservation queries).
MEMO index
Preparing for the MEMO index
A MEMO index is derived from an initial set of matching statistics (MS) vectors. On a collection of genomes \(G = [G_1, G_2,..., G_t]\), MEMO factors the index-building process into \((t-1)\)-pairwise comparisons between the pivot genome \(P=G_1\) and each of the other \((t-1)\)-genomes. MEMO uses MONI to compute these MS vectors [15]. Specifically, MEMO builds a MONI index over each genome and its reverse-complement sequence, appending \(\$\) to the end of each sequence in order to mark boundaries. MEMO then queries the pivot against each index to find MSs. Although MEMO uses MONI, any tool capable of finding all MEMs or matching statistics could be used, such as SPUMONI [16] or MUMmer [21].
MEMO index with genome annotation
MEMO computes and stores the overlaps between consecutive MEMs, which we call “overlap-MEMs” for the conservation query. These are illustrated as the gray triangles in Fig. 1B. Overlap-MEMs are stored in a columnarly-compressed BED file. For example, a row in the BED file “chr1 8 10 3“ specifies positions 8 up to and including 9 on chromosome 1 of pivot genome \(G_1\) is an overlap between consecutive MEMs to genome \(G_3\), as visualized by the 3rd green overlap triangle of \(G_3\) on the left side of Fig. 1C. Overlaps between adjoining consecutive MEMs, where one MEM ends exactly where the next MEM starts, are stored as an interval with the same value for its start and end.
Storing order-overlap-MEMs as a BED file. The overlaps of consecutive MEMs are stored by their zero-based, half-open genomic position relative to the query contig. Columns represent contig, start, end, and order of each order-overlap-MEM. White triangles represent order-MEM intervals. Gray triangles represent the overlapping interval between consecutive order-MEM intervals. Gray points represent adjoining consecutive MEMs that bookend start and end. These points are stored with the same value as the start and end of an interval. Overlap-MEMs are stored in a similar manner as order-overlap-MEMs but with intervals annotated by genome ID instead of order
MEMO index with order annotation
MEMO indexes and performs conservation queries with respect to order maximal exact matches (order-MEMs) found from order matching statistics (order-MSs). We define order-MSs matrix \(\textrm{O}[1..t][1..m]\) as the result of sorting matching statistics matrix L along its columns in descending order, such that \(\textrm{O}[1][j] \ge \textrm{O}[2][j] \ge \ldots \ge \textrm{O}[t][j]\) (Fig. 1C). The rows of order-MSs matrix \(\textrm{O}\) also have a sawtooth property similar to Lemma 1.
Lemma 2
\(\textrm{O}[i][j] - \textrm{O}[i][j-1] \le 1\) for \(i \in [1,t]\), \(j \in (1,m]\,\,\,\)
Proof
By sorted order of \(\textrm{O}\), there can be at most i entries in \(\textrm{O}[1..t][j-1]\) strictly greater than \(\textrm{O}[i][j-1]\). Let \(\pi\) be the permutation that sorted \(\textrm{L}[1..t][j-1]\), such that \(\textrm{L}[i][j-1] = \textrm{O}[\pi (i)][j-1]\). Equivalently, \(\pi ^{-1}\) maps positions in \(\textrm{O}[1..t][j-1]\) back to positions in \(\textrm{L}[1..t][j-1]\). Then for any \(i' < i\), Lemma 1 ensures \(\textrm{O}[i'][j-1] + 1 \ge L[\pi ^{-1}(i)][j]\). Since these values are then sorted in \(\textrm{O}[1..t][j]\) it follows that there are at most i entries in \(\textrm{O}[1..t][j]\) strictly greater than \(\textrm{O}[i][j-1] + 1\), guaranteeing that \(\textrm{O}[i][j] \le \textrm{O}[i][j-1] + 1\), and hence \(\textrm{O}[i][j] - O[i][j-1] \le 1\). \(\square\)
We define order-MEMs as MEMs found from the rows of \(\textrm{O}\). That is, \(\textrm{O}[i][j]\) at position j is a ith order-MEM if and only if \(j = m\) or:
For conservation queries, MEMO computes and indexes overlaps between consecutive order-MEMs. These order-MEM overlaps are encoded as genomic intervals in a BED file, similar to Methods 3.2.2, but with an interval’s annotation set to its order (i.e. row in the \(\textrm{O}\) matrix) instead of its genome ID. For example, a row in the BED file “chr1 4 6 2“ specifies positions 4 up to and including 5 of chromosome 1 of pivot genome \(G_1\) share consecutive 2nd-order overlap-MEMs (Fig. 2).
MEM peaks and order-MEM peaks of \(\textrm{O}\) do not have a one-to-one correspondence. As depicted in the match lengths at the bottom of Fig. 1C, the 10th column of the MS matrix contains one length-3 blue peak while the corresponding 10th column of \(\textrm{O}\) contains two peaks–one length-3 peak and one length-2 peak. For visual explanation, ties of equal length MSs are resolved by genome priority, but the colors of \(\textrm{O}\) in Fig. 1C are not indexed. The originating genome is discarded in the MEMO conservation index. The original MS landscape cannot be reconstructed from the order-MEMs, but the color-originating genome information is superfluous for an exact conservation query.
Quantile-sampled conservation indexes
Conservation queries do not always need to be answered at full resolution. The HPRC pangenome includes around 90 human haplotypes; the difference in conservation level between a k-mer that occurs in exactly 71 genomes versus one that occurs in exactly 72 is not large, and may not be relevant to the scientific question. Also, such small relative differences would be hard to distinguish in a visualization.
For situations where a coarser resolution is sufficient, we propose a lossy-compression strategy called quantile sampling. Say that is sufficient for the user to learn whether a k-mer is present in at least \(x\%\) of genomes, where x is a multiple of 10, i.e. count deciles. We subsample rows of the \(\textrm{O}\) matrix to include only those rows representing thresholds into the next-highest decile. For instance, for the 89 haplotypes of the HPRC, we would sample the 8th row in \(\textrm{O}\) corresponding to the boundary from <10% to over 10% of the haplotypes. Likewise, we would sample rows 17, 26, ..., 89 of \(\textrm{O}\), since these represent order-MEMs present at the boundaries of 20%, 30%, ..., 100% of the haplotypes. Rows of \(\textrm{O}\) not sampled in this way are discarded.
Consecutive order-MEM overlaps are stored in a BED file, similar to Methods 3.2.3, but are annotated by the decile threshold number of genomes. For example, the order-MEMs of the 17th row in \(\textrm{O}\) are annotated with 9, capturing that rows 9–17 represent the 10% decile of the 89 total haplotypes. Note that this scheme is easily adapted to any set of quantiles, e.g. quartiles, percentiles, etc.
Columnar-compressed index
MEMO indexes are compressed and indexed using PyArrow, a Python API for Apache Arrow. The index is stored in an Apache Parquet file [19]. A Parquet file is organized into chunks of rows (MEMs), where rows within a chunk are laid out and compressed in a columnar fashion, i.e. with data items arranged in column-major order. This yields a better compression ratio than if the data were indexed in row-major order, since it brings the values most likely to be redundant (e.g. MEM starting coordinates) into closer proximity.
Parquet supports efficient column-wise queries, compression, and decompression. Columns of a MEM index are compressed using the ZSTD codec. Rows are factored into blocks such that a single block occupies about 0.5 GB (Fig. 3).
While other compression and indexing methods, such as bgzip and tabix [22], could be used instead of Parquet and ZSTD, we found this combination to provide excellent compression and speed in practice, as seen in the Results.
Queries
Say we have already computed the matching statistics for pivot genome \(G_1\) with respect to another genome \(G_2\). Given the matching statistics, we can collect the start and end coordinates of all of the MEMs, storing these in an interval-based data structure. By the definition of a MEM, we know that a k-mer \(G_1[i..i+k]\), is present in \(G_2\) if and only if and only if the interval is entirely contained in a MEM. This is true regardless of k; that is, we can answer arbitrary-length presence/absence queries for \(G_1\)’s substrings with respect to \(G_2\) using only the MEM intervals.
Given an array of all the MEMs in order according to their starting coordinate, we can derive a second array of “overlap-MEMs.” Specifically, from a consecutive pair of MEMs \([i,i+\ell )\), \([j,j+\ell ')\) where \(j > i\), we derive a single overlap-MEM \([j,i+\ell )\). We can perform presence/absence queries with respect to overlap-MEMs:
Lemma 3
Consider a k-mer interval from \(G_1\), \([i,i+k)\). If there exists an overlap-MEM interval \([j,j+\ell )\) such that \(i < j\) and \(i+k>j+\ell\), then the k-mer is not present in \(G_2\).
Proof
By definition of a MEM, we know that the k-mer at interval \([i, i+k)\) is present in \(G_2\) if and only if it is entirely contained within a MEM. Also by definition of a MEM, a MEM interval cannot contain another MEM interval. If there exists an overlap-MEM \([j, j+\ell )\) such that \(i < j\) and \(i+k>j+\ell\), then the k-mer interval \([i, i+k)\) is contained in neither the leftmost nor the rightmost of the two MEMs that created the overlap-MEM. Because these MEMs were consecutive, no other MEM exists that could span the k-mer interval. \(\square\)
Therefore, we have two distinct ways to test for the presence/absence of a k-mer from \(G_1\) that both yield the same answer: (a) we can test whether the k-mer interval is entirely contained in a MEM, in which case it is present, or (b) we can test whether the k-mer interval spans an overlap-MEM including “overhang” on both sides, in which case it is not present. In practice, we generalize the query from pivot \(G_1\) to genome \(G_2\) to all \((t-1)\)-genomes (\(G_2,..., G_t\)) of the pangenome. A single index file contains all the MEM intervals—each interval annotated by document ID or order.
We apply the substring presence/absence query on overlap-MEMs for the k-mer membership query:
Lemma 4
For \(G_1[i..j]\) and length k, yield \(G_1[x..x+k-1]\) in \(G_n\) for \(x \in [i, j-k]\), \(n \in [2,t]\)
MEMO can compute arbitrary k-mer presence/absence on any substring q of the pivot genome \(G_1\) on the t-genome membership index M. Membership intervals m of M are overlap-MEMs defined by a chromosome, start, end, and genome_id: (m.chrm, m.start, m.end, \(m.genome\_id\)). Membership intervals are accessed from the Parquet-compressed BED file by Parquet filters for interval start and end [19]. The query region q is defined by a chromosome, start, and end: (q.chrm, q.start, q.end). \(M_{int}\) is a subset of the membership intervals that intersect with the query region, i.e. intervals such that m.chrm = q.chrm and either m.start \(\le\) q.start < m.end or q.start < m.start < q.end.
Likewise, MEMO applies the substring presence/absence query on order-MEMs for the conservation query problem:
Lemma 5
For \(G_1[i:j]\) and length k, yield \(\sum _{n=2}^{t} (G_1[x:x+k-1]\) in \(G_n)\) for \(x \in [i, j-k]\)
MEMO uses a similar algorithm as its membership query (Algorithm 1) for the conservation query, but outputs the last order a k-mer is present in. Conservation intervals c of C are defined by a chromosome, start, end, and order: (c.chrm, c.start, c.end, c.order). Similar to \(M_{int}\), the intervals \(C_{int}\) are a subset of the conservations intervals that intersect with the query region.
Given the intervals intersecting the query region, both algorithms run in time \(\mathcal {O}(t * q.len)\) as there can be at most one MEM at each position of the query, for each genome.
Interval-length filtered MEMO index
While the MEMO indexes for membership and conservation allow arbitrary-length k-mer queries, not all query k-mer lengths may be used in practice (i.e. many k-mer indexing tools only allow \(k <= 31\)). Constraining the permitted query k-mer lengths allows further reduction in MEMO index size.
MEMs with length \(\le l\) are unnecessary when querying k-mer queries where k is greater than a threshold length l as a k-mer with \(k > l\) will never be entirely contained within an interval of length \(\le l\). Overlap-MEMs follow an inverted scheme—discarding overlap-MEMs with length \(> l\) restricts k-mer queries to \(k \le l+2\). A k-mer must fully span an interval to be marked absent hence the threshold \(l+2\) represents the limit of a k-mer extending beyond the interval start and end. The BED intervals of the uncompressed MEMO membership and conservation indexes can be filtered for index-size reduction and then compressed and queried as usual for the constrained k-mer lengths.
KMC3 index and query
We use KMC3 as a comparison to MEMO’s membership and conservation queries [11]. For the KMC3 membership index, we created a KMC3 database for each genome in the pangenome with the count of each canonical k-mer present transformed to 1. The KMC3 membership query uses samtools faidx to isolate the query substring from the query FASTA and KMC3 API’s GetCountersForRead function to query each k-mer in the substring against each KMC3 database.
The KMC3 database for the conservation query is constructed by taking the union/sum of each of the genome-specific KMC3 databases. That is, the count associated with a k-mer in the joined database is the sum of presence/absence values in each genome. This straightforwardly provides answers to conservation queries.
Results
We compared MEMO index sizes and query speeds to k-mer-based indexes built with PanKmer [9] and KMC3 [11]. PanKmer is a recently published tool for reference-free pangenome analysis and stores presence/absence values of all 31-mers across the total genome collection for each genome [9]. KMC3 is a more generic, standard k-mer counting tool and is very efficient in practice–used as the backbone in kmer-db’s index [23]. We adapted KMC3 for the pangenome membership and conservation queries as described in Methods 3.5. While not the only tools able to compute membership and conservation, these tools capture a recently published approach for pangenome analysis (PanKmer) and an established approach to counting k-mers for membership and conservation queries (KMC3). We abbreviate the KMC3 membership and conservation queries as KMC3-M and KMC3-C, respectively. Likewise, we abbreviate the MEMO’s membership and conservation queries as MEMO-M and MEMO-C. We also indexed and evaluated MEMO-C for approximate conservation counts to the nearest decile threshold and refer to this as MEMO-DC (“DC” standing for “decile conservation”). We refer to PanKmer’s conservation query as PanKmer; PanKmer cannot perform the membership query.
We compared how these methods scale to two pangenomes: a human pangenome and a vertebrate pangenome. The human pangenome is composed of the autosomal chromosomes from 88 haplotypes from the Human Pangenome Reference Consortium (HPRC) and T2T-CHM13 [1, 20]. We refer to this genome collection as the HPRC pangenome, even though T2T-CHM13 is not part of the HPRC Year 1 data freeze release. The HPRC pangenome stored in raw FASTA format is 254.46 GB. We selected T2T-CHM13 as the MEMO pivot as T2T-CHM13 is the most complete genome in the HPRC pangenome. We also replicate analyses with GRCh38 as the pivot and find results to be similar to using T2T-CHM13 as the pivot (Table 7). We refer to the vertebrate pangenome, composed of 16 high-quality vertebrate genomes (26.80 GB) from the Vertebrate Genomes Project’s initial release, as the VGP pangenome [24]. We selected the blenny genome as the MEMO pivot as blenny (Blennioidei), a suborder of fish, is the lexicographically smallest genome name in the VGP pangenome. Finally, we visualized sequence conservation from MEMO-C output across the human leukocyte antigen (HLA) locus of the HPRC pangenome as anchored to T2T-CHM13.
Indexing
The MEMO indexes were substantially smaller than equivalent k-mer-based indexes. The MEMO index for the HPRC pangenome, using T2T-CHM13 as the pivot genome, was roughly 2 GB. The MEMO-M index was by far the smallest: 539.2\(\times\) smaller than the equivalent KMC3-M index. The MEMO-C index was 11.4\(\times\) and 8.8\(\times\) smaller than the equivalent PanKmer and KMC3-C indexes respectively (Table 1).
MEMO index creation can be resource intensive. KMC index creation on the HPRC and VGP datasets was the fastest and used the least memory (Tables 3, 4). PanKmer index creation time was comparable to MEMO. The bulk of MEMO index creation time and memory was from MONI index creation and querying to find MSs. The total elapsed MEMO index creation time can be ameliorated by running MONI in parallel across each genome of the collection.
We separately measured the size of the compressed files produced using the MEMO Parquet strategy versus the strategy of using block-based bgzip compression and tabix indexing [18]. Parquet compression using the ZSTD codec yielded index sizes roughly 4\(\times\) smaller than those produced by bgzip and tabix (Table 2). Columnar compression yielded better compression ratios than traditional row-based compression. The compression of HPRC MEMO-M and MEMO-C indexes was roughly 40\(\times\) with Parquet and 10\(\times\) with bgzip. A columnar compression scheme, like Parquet, can take advantage of the MEMO output BED-file structure—large blocks of contiguous rows having the same contig name in the first column and non-decreasing genomic start positions in the second column. Compressing the interleaved intervals in a single file yielded better overall compression than compressing a file for each order or genome. Additionally, indexing the overlapping intervals between consecutive MEMs and order-MEMs yielded a better compression ratio compared to indexing the MEM intervals themselves.
Pangenome scaling
MEMO enables approaches to reduce index sizes for large pangenomes. Although MEMO has a larger scaling factor than KMC3 and PanKmer for the HPRC pangenome, MEMO has comparable scaling to the VGP pangenome and can incorporate additional subsetting to reduce index size.
Across 9 to 89 HPRC haplotypes, MEMO index sizes roughly increase 6.5\(\times\), but are likely to remain under 4 GB for a large number of haplotypes. For the HPRC pangenome, KMC3 indexes scale 1.2\(\times\) for KMC3-C and 9.9\(\times\) for KMC3-M. A new KMC3 database must be made for each genome for the membership query; these together compose the KMC3-M index. PanKmer scales roughly 1.6\(\times\) (Fig. 4a, Table 5). Across 4 to 16 vertebrate genomes from the VGP pangenome, MEMO indexes scale 3.8\(\times\); KMC3 indexes scale 3.6\(\times\); and PanKmer index scale 4.0\(\times\) (Fig. 4b, Table 6). K-mer-based indexes scale poorly to diverse sets of genomes as a k-mer table must store each k-mer in the union of sequences. On the other hand, MEMO stores genome coordinates that are efficiently compressed.
The MEMO-C index sizes can further be reduced by leveraging the rank-ordered design to yield approximate conservation counts. Subsetting indexed orders to the deciles of 89 haplotypes, reduces the MEMO-C HPRC index to 0.87 GB (Table 1). Order subsetting allows the potential for small index sizes in larger pangenomes while still capturing pangenome sequence divergence. Subsetting the KMC3-C database to yield counts to the nearest decile yields no reduction in the index size since all k-mers must still be stored. PanKmer API does not have any functionality to reduce index size.
Following the overlap-MEM interval filtering scheme described in 3.4, constraining the permitted k-mer lengths for the queries allows further reduction in MEMO index size. In practice for the HPRC pangenome, restricting queries to \(k \le 31\) (and so discarding overlap-MEMs length \(> 29\)) reduces the number of MEM overlaps indexed by MEMO-M by 13.74% and the number of order-MEM overlaps indexed by MEMO-C by 19.91%. Removing these larger intervals allows for better compression and results in index sizes of 1.74 GB and 1.19 GB, respectively. Subsetting order and intervals allow MEMO two opportunities to reduce index size for larger pangenomes—approaches that are incompatible with k-mer-based indexes. Combining interval filtering for k-mer queries \(k \le 31\) and decile subsetting on the HPRC dataset, the MEMO-DC index \(k \le 31\) is 0.66 GB 1.
Querying pangenome membership & conservation
MEMO queries are faster and more memory efficient than equivalent queries on k-mer-based indexes. MEMO queries 31-mer conservation across the human leukocyte antigen (HLA) locus on T2T-CHM13, a highly variable 3.75 Mbp region on Chromosome 6, in 13.89 s–\(-\)2.6\(\times\) and 365.3\(\times\) faster than KMC3-C and PanKmer. KMC3-C and PanKmer peak memory usage is 5.1\(\times\) and 2.2\(\times\) more than MEMO-C (Table 1). The HRPC decile conservation MEMO-DC index exhibits further query speed and memory savings. Compared to KMC3-M, MEMO-M is 107.2\(\times\) faster and uses 5.3\(\times\) less peak memory. As the MEMO query runtime is proportional to the number of overlap-MEM intervals, the runtime is roughly constant across varying-k for the same query region. On the other hand, to vary the length-k k-mer query, KMC3 indexes require re-indexing. PanKmer can only index 31-mers.
Sequence conservation plot from 31-mers anchored to the T2T-CHM13 HLA locus across the HPRC haplotypes (n=88). The user specifies a target region on the pivot, a length-k, and a histogram bin count to visualize the proportion of genomes containing the k-mer at each position of the query. The white area above the stacked bars represents the proportion of k-mers found across all 89 genomes. (Top left) Zoomed-in on the HLA delta block highlights a region of low sequence diversity. (Top right) Decile resolution of the Delta block demonstrates that MEMO-DC yields a plot that’s largely indistinguishable from the full-fidelity plot made by MEMO-C
MEMO allows exploring visualizations of sequence conservation from varying-k. From MEMO-C, we visualized 31-mer conservation of the HLA locus of T2T-CHM13 across the HPRC pangenome (Fig. 5), as inspired by Panagram [25]. The HLA locus sequence conservation plot captures known regions of high single nucleotide polymorphism density [26, 27]. Zooming onto the HLA delta block, we found that the conservation decile count approximation of MEMO-DC yields a similar sequence conservation plot as the full MEMO-C resolution, yet with a 2.3\(\times\) smaller index. MEMO and KMC yield the same membership and conservation counts on the HLA region. PanKmer conservation counts are off by a few but still capture the general conservation trends across the HLA region. While conservation plots can be generated from KMC3-C and PanKmer, visualizations made using these tools will generally be limited to a fixed value of k. Visualizing different resolutions of genome conservation benefits from varying k-mer lengths; chromosome-wide exploration can be sufficiently captured with large values of k, while gene-level visualization may require smaller k. KMC3 and Pankmer’s slower query speeds restrict practical interactive exploration, while MEMO-C’s faster query times allow interactive visualization and exploration.
Conclusion
We developed MEMO, a small MEM-based pangenome index that efficiently answers arbitrary-length k-mer membership and conservation queries. By using matching statistics as the basis for finding MEMs, we derived the related notion of order-MEMs, which are derived from matching statistics that have first been sorted across genomes. These ideas effectively generalize MEMs and matching statistics to the pangenomic context while enabling extremely small indexes.
MEMO’s fast query speed enables visual exploration of sequence conservation, especially in complex regions where the freedom to vary the k-mer length used can help to better understand distinct patterns of sequence conservation.
Indexing the overlapping intervals between consecutive MEMs and order-MEMs yielded a better compression ratio compared to indexing the MEMs themselves. Columnar compression using Parquet and ZSTD yielded roughly 4\(\times\) better compression than commonly used bgzip and tabix. These observation could have wider significance in bioinformatics; switching to columnar compression may yield improved compression in other contexts.
MEMO’s chief limitation is the fact that a single pivot genome must be selected at index construction time. Although pangenomes typically do have a natural pivot—i.e. a genome that has a higher quality assembly or annotation compared to the others—there could also be situations where no natural pivot exists. The VGP project is an example of this. In the future, it will be important to consider designing multi-pivot generalizations of MEMO, which could possibly benefit even more from the inherent redundancy of the pangenome.
While MEMO uses MONI to find matching statistics, MONI is not tailored to our problem. Instead, the profile document array of Ahmed et al. could be used in the future [28]. MEMO demonstrates the potential of MEM-based indexes over k-mer-based indexes for compressed indexes and fast flexible queries on large pangenomes.
Availability of data and materials
No datasets were generated or analysed during the current study.
References
...Wang T, Antonacci-Fulton L, Howe K, Lawson HA, Lucas JK, Phillippy AM, Popejoy AB, Asri M, Carson C, Chaisson MJP, Chang X, Cook-Deegan R, Felsenfeld AL, Fulton RS, Garrison EP, Garrison NA, Graves-Lindsay TA, Ji H, Kenny EE, Koenig BA, Li D, Marschall T, McMichael JF, Novak AM, Purushotham D, Schneider VA, Schultz BI, Smith MW, Sofia HJ, Weissman T, Flicek P, Li H, Miga KH, Paten B, Jarvis ED, Hall IM, Eichler EE, Haussler D. The human pangenome project: a global resource to map genomic diversity. Nature. 2022;604(7906):437–46.
Rhie A, Nurk S, Cechova M, Hoyt SJ, Taylor DJ, Altemose N, Hook PW, Koren S, Rautiainen M, Alexandrov IA, Allen J, Asri M, Bzikadze AV, Chen NC, Chin CS, Diekhans M, Flicek P, Formenti G, Fungtammasan A, Garcia Giron C, Garrison E, Gershman A, Gerton JL, Grady PGS, Guarracino A, Haggerty L, Halabian R, Hansen NF, Harris R, Hartley GA, Harvey WT, Haukness M, Heinz J, Hourlier T, Hubley RM, Hunt SE, Hwang S, Jain M, Kesharwani RK, Lewis AP, Li H, Logsdon GA, Lucas JK, Makalowski W, Markovic C, Martin FJ, Mc Cartney AM, McCoy RC, McDaniel J, McNulty BM, Medvedev P, Mikheenko A, Munson KM, Murphy TD, Olsen HE, Olson ND, Paulin LF, Porubsky D, Potapova T, Ryabov F, Salzberg SL, Sauria MEG, Sedlazeck FJ, Shafin K, Shepelev VA, Shumate A, Storer JM, Surapaneni L, Taravella Oill AM, Thibaud-Nissen F, Timp W, Tomaszkiewicz M, Vollger MR, Walenz BP, Watwood AC, Weissensteiner MH, Wenger AM, Wilson MA, Zarate S, Zhu Y, Zook JM, Eichler EE, O’Neill RJ, Schatz MC, Miga KH, Makova KD, Phillippy AM. The complete sequence of a human Y chromosome. Nature. 2023;621(7978):344–54.
Lian Q, Huettel B, Walkemeier B, Mayjonade B, Lopez-Roques C, Gil L, Roux F, Schneeberger K, Mercier R. A pan-genome of 69 Arabidopsis thaliana accessions reveals a conserved genome structure throughout the global species range. Nat Genet. 2024;56(5):982–91.
Sherman RM, Salzberg SL. Pan-genomics in the human genome era. Nat Rev Genet. 2020;21(4):243–54.
Jayakodi M, Padmarasu S, Haberer G, Bonthala VS, Gundlach H, Monat C, Lux T, Kamal N, Lang D, Himmelbach A, Ens J, Zhang XQ, Angessa TT, Zhou G, Tan C, Hill C, Wang P, Schreiber M, Boston LB, Plott C, Jenkins J, Guo Y, Fiebig A, Budak H, Xu D, Zhang J, Wang C, Grimwood J, Schmutz J, Guo G, Zhang G, Mochida K, Hirayama T, Sato K, Chalmers KJ, Langridge P, Waugh R, Pozniak CJ, Scholz U, Mayer KFX, Spannagl M, Li C, Mascher M, Stein N. The barley pan-genome reveals the hidden legacy of mutation breeding. Nature. 2020;588(7837):284–9.
Lemay MA, Ronne M, langer R, Belzile F. k-mer-based GWAS enhances the discovery of causal variants and candidate genes in soybean. Plant Genome. 2023;16(4):20374.
Kim JH, Park JS, Lee CY, Jeong MG, Xu JL, Choi Y, Jung HW, Choi HK. Dissecting seed pigmentation-associated genomic loci and genes by employing dual approaches of reference-based and k-mer-based GWAS with 438 Glycine accessions. PLoS One. 2020;15(12):0243085.
Gupta PK. GWAS for genetics of complex quantitative traits: genome to pangenome and SNPs to SVs and k-mers. BioEssays. 2021;43(11):2100109.
Aylward AJ, Petrus S, Mamerto A, Hartwick NT, Michael TP. PanKmer: k-mer based and reference-free pangenome analysis. Bioinformatics. 2023. https://doi.org/10.1093/bioinformatics/btad621.
Nasko DJ, Koren S, Phillippy AM, Treangen TJ. RefSeq database growth influences the accuracy of k-mer-based lowest common ancestor species identification. Genome Biol. 2018;19(1):165.
Kokot M, Dlugosz M, Deorowicz S. KMC 3: counting and manipulating k-mer statistics. Bioinformatics. 2017;33(17):2759–61.
Clift B, Haussler D, McConnell R, Schneider TD, Stormo GD. Sequence landscapes. Nucleic Acids Res. 1986;14(1):141–58 (Accessed 2023-07-24).
Chang WI, Lawler EL. Sublinear approximate string matching and biological applications. Algorithmica. 1994;12(4):327–44.
Shariat B, Movahedi NS, Chitsaz H, Boucher C. HyDA-Vista: towards optimal guided selection of k-mer size for sequence assembly. BMC Genomics. 2014;15(Suppl 10):9.
Rossi M, Oliva M, Langmead B, Gagie T, Boucher C. MONI: a pangenomic index for finding maximal exact matches. J Comput Biol. 2022;29(2):169–87.
Ahmed O, Rossi M, Kovaka S, Schatz MC, Gagie T, Boucher C, Langmead B. Pan-genomic matching statistics for targeted nanopore sequencing. iScience. 2021;24(6): 102696.
Ahmed OY, Rossi M, Gagie T, Boucher C, Langmead B. SPUMONI 2: improved classification using a pangenome index of minimizer digests. Genome Biol. 2023;24(1):122.
Li H. Tabix: fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics. 2011;27(5):718–9.
The Apache Software Foundation: Parquet. GitHub 2024.
Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, Vollger MR, Altemose N, Uralsky L, Gershman A, Aganezov S, Hoyt SJ, Diekhans M, Logsdon GA, Alonge M, Antonarakis SE, Borchers M, Bouffard GG, Brooks SY, Caldas GV, Chen NC, Cheng H, Chin CS, Chow W, Lima LG, Dishuck PC, Durbin R, Dvorkina T, Fiddes IT, Formenti G, Fulton RS, Fungtammasan A, Garrison E, Grady PGS, Graves-Lindsay TA, Hall IM, Hansen NF, Hartley GA, Haukness M, Howe K, Hunkapiller MW, Jain C, Jain M, Jarvis ED, Kerpedjiev P, Kirsche M, Kolmogorov M, Korlach J, Kremitzki M, Li H, Maduro VV, Marschall T, McCartney AM, McDaniel J, Miller DE, Mullikin JC, Myers EW, Olson ND, Paten B, Peluso P, Pevzner PA, Porubsky D, Potapova T, Rogaev EI, Rosenfeld JA, Salzberg SL, Schneider VA, Sedlazeck FJ, Shafin K, Shew CJ, Shumate A, Sims Y, Smit AFA, Soto DC, Sović I, Storer JM, Streets A, Sullivan BA, Thibaud-Nissen F, Torrance J, Wagner J, Walenz BP, Wenger A, Wood JMD, Xiao C, Yan SM, Young AC, Zarate S, Surti U, McCoy RC, Dennis MY, Alexandrov IA, Gerton JL, O’Neill RJ, Timp W, Zook JM, Schatz MC, Eichler EE, Miga KH, Phillippy AM. The complete sequence of a human genome. Science. 2022;376(6588):44–53.
Marçais G, Delcher AL, Phillippy AM, Coston R, Salzberg SL, Zimin A. MUMmer4: a fast and versatile genome alignment system. PLoS Comput Biol. 2018;14(1):1005944.
Li H. Tabix: fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics. 2011;27(5):718–9. https://doi.org/10.1093/bioinformatics/btq671. (Accessed 2024-05-08).
Deorowicz S, Gudyś A, Długosz M, Kokot M, Danek A. Kmer-db: instant evolutionary distance estimation. Bioinformatics. 2019;35(1):133–6. https://doi.org/10.1093/bioinformatics/bty610. (Accessed 2023-10-24).
Rhie A, McCarthy SA, Fedrigo O, Damas J, Formenti G, Koren S, Uliano- Silva M, Chow W, Fungtammasan A, Kim J, Lee C, Ko BJ, Chaisson M, Gedman GL, Cantin LJ, Thibaud-Nissen F, Haggerty L, Bista I, Smith M, Haase B, Mountcastle J, Winkler S, Paez S, Howard J, Vernes SC, Lama TM, Grutzner F, Warren WC, Balakrishnan CN, Burt D, George JM, Biegler MT, Iorns D, Digby A, Eason D, Robertson B, Edwards T, Wilkinson M, Turner G, Meyer A, Kautt AF, Franchini P, Detrich HW, Svardal H, Wagner M, Naylor GJP, Pippel M, Malinsky M, Mooney M, Simbirsky M, Hannigan BT, Pesout T, Houck M, Misuraca A, Kingan SB, Hall R, Kronenberg Z, Dunn C, Ning Z, Hastie A, Lee J, Selvaraj S, Green RE, Putnam NH, Gut I, Ghurye J, Garrison E, Sims Y, Collins J, Pelan S, Torrance J, Tracey A, Wood J, Dagnew RE, Guan D, London SE, Clayton DF, Mello CV, Friedrich SR, Lovell PV, Osipova E, Al-Ajli FO, Secomandi S, Kim H, Theofanopoulou C, Hiller M, Zhou Y, Harris RS, Makova KD, Medvedev P, Hoffman J, Masterson P, Clark K, Martin F, Howe K, Flicek P, Walenz BP, Kwak W, Clawson H, Diekhans M, Nassar L, Paten B, Kraus RHS, Crawford AJ, Gilbert MTP, Zhang G, Venkatesh B, Murphy RW, Koepfli KP, Shapiro B, Johnson WE, Di Palma F, Marques-Bonet T, Teeling EC, Warnow T, Graves JM, Ryder OA, Haussler D, O'Brien SJ, Korlach J, Lewin HA, Howe K, Myers EW, Durbin R, Phillippy AM, Jarvis ED. Towards complete and error-free genome assemblies of all vertebrate species. Nature. 2021;592(7856):737–46.
Jenike K, Kovaka S, Oh S, Hwang S, Ramakrishnan S, Langmead B, Lippman Z, Schatz MC. Panagram: interactive, alignment-free pan-genome browser. San Francisco: GitHub; 2023.
Shiina T, Hosomichi K, Inoko H, Kulski JK. The HLA genomic loci map: expression, interaction, diversity and disease. J Hum Genet. 2009;54(1):15–39.
Kulski JK, Suzuki S, Shiina T. Human leukocyte antigen super-locus: nexus of genomic supergenes, SNPs, indels, transcripts, and haplotypes. Hum Genome Var. 2022;9(1):49.
Ahmed OY, Rossi M, Boucher C, Langmead B. Efficient taxa identification using a pangenome index. Genome Res. 2023;33(7):1069–77.
Acknowledgements
We thank Christina Boucher for helpful conversations.
Funding
This work was carried out at the Advanced Research Computing at Hopkins (ARCH) core facility (rockfish.jhu.edu), supported by the National Science Foundation (NSF) grant OAC 1920103. S.H. was funded by the Johns Hopkins University, XDBio Program. N.B. was funded by the Johns Hopkins University Computer Science PhD Fellowship. O.A. and B.L. was funded by R01HG011392 and R35GM139602 to B.L. K.J., S.K., and M.S. was funded by NSF grant 2,216,612, NIH grant U01CA253481, and HFSP award RGP0025/2021 to M.S.
Author information
Authors and Affiliations
Contributions
S.H. and B.L. conceived the method. S.H. wrote the software and ran the experiments. N.K.B. contributed to the theoretical results. S.H. and B.L. wrote and revised the manuscript. All authors contributed applications to pangenome conservation. All authors read, reviewed, and approved the final manuscript.
Corresponding author
Ethics declarations
Competing interests
B.L. is founder of InOrder Labs LLC.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix A
Appendix A
See Fig. 6
MEMO conservation query on \(G_{1}\) [3, 10), half-open interval with query sequence CAGCTCA. (Left) Sequences for pangenome with \(G_{1}\) pivot. A matrix with rows capture presence/absence 3-mer values across each order. A.argmax, the per-column row with the first 1, is the conservation count for the 3-mers in the query region. (Right) Visual representation of 3-mer presence/absence across each order. 3-mers anchored to the same position relative to the query pivot subsequence follow the early stoppage behavior–once a k-mer is marked absent in an order, all k-mers in that position are also absent in subsequent lower orders
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Hwang, S., Brown, N.K., Ahmed, O.Y. et al. Mem-based pangenome indexing for k-mer queries. Algorithms Mol Biol 20, 3 (2025). https://doi.org/10.1186/s13015-025-00272-y
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s13015-025-00272-y