karasikov/cobs (forked from bingmann/cobs)

COBS - Compact Bit-Sliced Signature Index (for Genomic k-Mer Data or q-Grams)

Compact Bit-Sliced Signature Index (COBS)

COBS (COmpact Bit-sliced Signature index) is a cross-over between an inverted index and Bloom filters. Our target application is to index k-mers of DNA samples or q-grams from text documents and to process approximate pattern matching queries on the corpus with a user-chosen coverage threshold. Query results may contain false positives; their number decreases exponentially with the query length and depends on the false positive rate of the index, which is fixed at construction time. COBS' compact but simple data structure outperforms other indexes in construction time and query performance, with Mantis by Pandey et al. in second place. However, unlike Mantis and other previous work, COBS does not need the complete index in RAM and is thus designed to scale to larger document sets.
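To make the idea concrete, here is a minimal toy sketch in Python of a bit-sliced signature index: one Bloom filter per document, stored row by row so that a single row lookup covers all documents at once. It is not the COBS implementation. The class name `ToySignatureIndex`, the SHA-256 based hashing, and the default parameters are all assumptions chosen for illustration.

```python
import hashlib


def kmers(text, k):
    """Return all overlapping substrings of length k."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def bit_positions(term, num_bits, num_hashes):
    """Map a term to num_hashes row indices (hypothetical hashing scheme)."""
    positions = []
    for seed in range(num_hashes):
        digest = hashlib.sha256(f"{seed}:{term}".encode()).digest()
        positions.append(int.from_bytes(digest[:8], "big") % num_bits)
    return positions


class ToySignatureIndex:
    """One Bloom filter per document, stored row-wise (bit-sliced)."""

    def __init__(self, documents, k=5, num_bits=512, num_hashes=3):
        self.k = k
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.names = list(documents)
        # rows[r] is an integer whose bit d is set if document d has bit r set.
        self.rows = [0] * num_bits
        for doc_id, name in enumerate(self.names):
            for term in kmers(documents[name], k):
                for row in bit_positions(term, num_bits, num_hashes):
                    self.rows[row] |= 1 << doc_id

    def query(self, pattern, threshold=0.8):
        """Return (name, hits) for documents matching >= threshold of the k-mers."""
        terms = kmers(pattern, self.k)
        hits = [0] * len(self.names)
        all_docs = (1 << len(self.names)) - 1
        for term in terms:
            # AND the rows of all hash positions: surviving bits are the
            # documents whose filter (possibly falsely) contains this term.
            mask = all_docs
            for row in bit_positions(term, self.num_bits, self.num_hashes):
                mask &= self.rows[row]
            for doc_id in range(len(self.names)):
                hits[doc_id] += (mask >> doc_id) & 1
        needed = threshold * len(terms)
        return [(n, h) for n, h in zip(self.names, hits) if terms and h >= needed]
```

A query such as `ToySignatureIndex({"a": "ACGTACGTTGCAAGCTTAGC"}).query("ACGTACGTTG", threshold=1.0)` is guaranteed to report document `a`, because Bloom filters never produce false negatives. Other documents may appear only through false positives, and a document must collect false positives on enough k-mers to reach the threshold, which is why longer queries are less affected.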

[Figure: COBS architecture]

More information about COBS is presented in our research paper: Timo Bingmann, Phelim Bradley, Florian Gauger, and Zamin Iqbal. "COBS: a Compact Bit-Sliced Signature Index". arXiv:1905.09624, May 2019.

Installation and First Steps

Installation

COBS requires CMake and either a C++17 compiler or the Boost.Filesystem library.

To download and install COBS run:

git clone --recursive https://github.com/bingmann/cobs.git
mkdir cobs/build
cd cobs/build
cmake ..
make -j4

and optionally run make test to check the build.

Building an Index

COBS can read FASTA files (*.fa, *.fasta, *.fa.gz, *.fasta.gz), FASTQ files (*.fq, *.fastq, *.fq.gz, *.fastq.gz), McCortex files (*.ctx), or text files (*.txt).

You can either recursively scan a directory for all files matching any of these file types, or pass a *.list file which lists all paths COBS should index.
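If you prefer to control exactly which files are indexed, a *.list file can be generated with a small script. The Python sketch below collects files by the extensions named above and writes them out. The one-path-per-line layout and the helper names `find_documents` and `write_list_file` are assumptions, not something the COBS documentation confirms here.

```python
from pathlib import Path

# Extensions named in the text above.
EXTENSIONS = (
    ".fa", ".fasta", ".fa.gz", ".fasta.gz",
    ".fq", ".fastq", ".fq.gz", ".fastq.gz",
    ".ctx", ".txt",
)


def find_documents(root):
    """Recursively collect files under root whose name ends in a known extension."""
    return sorted(
        str(path) for path in Path(root).rglob("*")
        if path.is_file() and path.name.endswith(EXTENSIONS)
    )


def write_list_file(root, list_path):
    """Write one path per line (assumed *.list layout) and return the paths."""
    paths = find_documents(root)
    Path(list_path).write_text("".join(p + "\n" for p in paths))
    return paths
```

For example, `write_list_file("tests/data/fasta/", "example.list")` would write the matching paths under that directory to `example.list`, a hypothetical file name.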

To check the document list to be indexed, run for example:

src/cobs doc-list tests/data/fasta/

To construct a compact COBS index from these seven example documents, run:

src/cobs compact-construct tests/data/fasta/ example.cobs_compact

Check --help for the many available options. Perhaps the most important is --canonicalize, which enables canonicalization of DNA k-mers.
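As background, canonicalization treats a DNA k-mer and its reverse complement as the same term, so a sequence is found regardless of which strand was read. The Python sketch below uses the common convention of keeping the lexicographically smaller of the two. Whether COBS uses exactly this rule is an assumption, and the function names are illustrative only.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Complement every base, then reverse the string."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))
```

With this rule, `canonical("ACGTT")` and `canonical("AACGT")` both yield `"AACGT"`, so both strands map to the same index term.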

Query an Index

COBS has a simple command line query tool:

src/cobs query -i example.cobs_compact CTDNIETFYCTNSYRYENVPRPIYVWVLFQEDEWHGYR

or pass a FASTA file of queries with

src/cobs query -i example.cobs_compact -f query.fa
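A query file such as query.fa is plain FASTA: a header line starting with `>` followed by the sequence. The Python sketch below renders a few named queries in that layout. The helper name `format_fasta` and the query name `query1` are hypothetical, and placing each sequence on a single line is a simplifying assumption.

```python
from pathlib import Path


def format_fasta(queries):
    """Render {name: sequence} as FASTA text, one sequence per record."""
    return "".join(f">{name}\n{seq}\n" for name, seq in queries.items())


def write_queries(queries, path="query.fa"):
    """Write the FASTA text to path and return it."""
    text = format_fasta(queries)
    Path(path).write_text(text)
    return text
```

Calling `write_queries({"query1": "CTDNIETFYCTNSYRYENVPRPIYVWVLFQEDEWHGYR"})` writes a single-record file that can then be passed with `-f query.fa`.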

Experimental Results

In our paper we compare COBS against seven other k-mer indexing software packages. These are the main results: the first diagram shows scaling by the number of documents in the index, and the second shows the same results per document.

[Figures: COBS experiments, scaling by number of documents; COBS experiments, scaling per document]
