8000 GitHub - vkayy/vkdb: A time series database engine in C++.
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

vkayy/vkdb

Repository files navigation

logo

A time series database engine built in C++ with minimal dependencies.

C++ contributors last update open issues license stars forks

Note

Development has slowed for now, as I'm a little busier with work and university!

Warning

vkdb is currently in the early stages of development and is not yet ready for daily use!

vkdb is a hobbyist time series database engine built with a focus on simplicity and modernity. Motivated by unfamiliar architectures and endless optimisation opportunities, this project is far from commercial, and is defined by a pursuit of challenge.

Table of contents

Internals

Running locally (not needed)

Usage

Working with vq

License

Authors

Credits

back to top

Internals

Database engine

Architecture

vkdb is built on log-structured merge (LSM) trees. In their simplest form, these have an in-memory layer and a disk layer, paired with a write-ahead log (WAL) for persistence of in-memory changes.

When you instantiate a vkdb::Database, all of the prior in-memory information (in-memory layer, metadata, etc.) will be loaded in if the database already exists, and if not, a new one is set up. This persists on disk until you clear it via vkdb::Database::clear.

It's best to make all interactions via vkdb::Database, or the vkdb::Table type via vkdb::Database::getTable, unless you just want to play around with vq (more on the playground here).

Note

Make sure the $HOME environment variable is set correctly, as all database files will be stored in .vkdb within your home directory. Only tamper with this directory if moving databases between machines!

database engine internals

back to top

Compaction

The LSM tree uses time-window compaction to efficiently organise and merge SSTables across different layers (C0-C7). Each layer has a specific time window size and maximum number of SSTables.

Layer Time Window Max. SSTables
C0 Overlapping 32
C1 1 day 2,048
C2 1 week 1,024
C3 1 month 512
C4 3 months 256
C5 6 months 128
C6 1 year 64
C7 3 years 32

When the memtable fills up, it's flushed to C0 as an SSTable. C0 acts as a buffer for the later layers, and when it exceeds its SSTable limit, all the SSTables are merged into C1 at once, with each SSTable spanning a 1-day window.

When any other layer exceeds its SSTable limit, only its oldest, excess SSTables are merged with the next layer's SSTables based on the layer's time window. For example, if C1 has too many SSTables:

  1. The oldest SSTables from C1 are selected.
  2. Any overlapping SSTables in C2 are identified based on 1-week time windows.
  3. The selected SSTables are merged into new SSTables in C2.
  4. Original SSTables are removed after successful merge.

compaction internals

This time-window compaction strategy enables:

  • Fast queries, as SSTables beyond C0 are disjoint and only intersecting ranges need to be scanned.
  • Efficient storage, as older data is consolidated into larger chunks whilst recent data stays granular.
  • Reduced write amplification, with C0 as a buffer and merges occurring on progressively larger time windows.

back to top

Query processing

Lexing is done quite typically, with enumerated token types and line/column number stored for error messages. Initially, I directly executed queries as string streams, but that was a nightmare for robustness.

In terms of parsing, vq has been constructed to have an LL(1) grammar—this meant I could write a straightforward recursive descent parser for the language. This directly converts queries to an abstract syntax tree (AST) with std::variant.

Finally, the interpreter makes quick use of the AST via the visitor pattern, built into C++ with std::variant (mentioned earlier) and std::visit. This ended up making the interpreter (and pretty-printer) very satisfying to write.

database engine internals

back to top

Running locally (not needed)

Installation

First, clone the project and cd into the directory.

git clone https://github.com/vkayy/vkdb.git && cd vkdb

Then, simply run the build script.

./build.sh

back to top

Tests

You can use the -t flag to run the tests.

./build.sh -t

back to top

Benchmarks

You can also use the -b flag to run the benchmarks.

./build.sh -b

back to top

Examples

Finally, you can use the -e flag to run any of the examples.

./build.sh -e <filename>

back to top

Usage

Setup

Add this to your CMakeLists.txt file—it lets you use vkdb by fetching the most recent version into your project's build.

include(FetchContent)
FetchContent_Declare(
    vkdb
    GIT_REPOSITORY https://github.com/vkayy/vkdb.git
    GIT_TAG        main
)
FetchContent_MakeAvailable(vkdb)
target_link_libraries(${PROJECT_NAME} vkdb)

back to top

Interface

Simply include the database header, and you'll have access to the database API.

#include <vkdb/database.h>

int main()  {
  vkdb::Database db{"example-db"};
  db.createTable("example-table");
  // ...
}

Caution

Do not instantiate multiple databases with the same name, nor a single database with the name interpreter_default (more on this database here). As these instances have in-memory components, this can cause unexpected behaviour if they (and they likely will) become out-of-sync.

back to top

Table management

You can manipulate tables with the database API, both with methods or queries.

db.createTable("sensor_data")
  .addTagColumn("location")
  .addTagColumn("type");

db.run("REMOVE TAGS type FROM sensor_data;")

Important

When a table has been populated, it can no longer have its tag columns modified unless you call vkdb::Table::clear.

back to top

General queries

With the database API, you can run queries via strings, files, and the REPL.

test_db
  .run("CREATE TABLE temp TAGS tag1, tag2;")
  .runFile(std::filesystem::current_path() / "../examples/vq_setup.vq")
  .runPrompt()
  .clear();

back to top

With the table API, you can run queries via the query builder.

auto sum{table_replay.query()
  .whereTimestampBetween(0, 999)
  .whereMetricIs("metric")
  .whereTagsContain({"tag1", "value1"})
  .sum()
};

back to top

Playground

You can also play around with vq by running vkdb::VQ::run...(). This operates on a reserved database called interpreter_default.

#include <vkdb/vq.h>

int main() {
  vkdb::VQ::runPrompt();
}
vq-playground.png
The vq playground REPL.

This is generally for experimental purposes—there's not much to gain from it in practice besides having a playground.

back to top

Mock data

Feel free to use vkdb::random<>. Any arithmetic type (with no cv- or ref-qualifiers) can be passed in as a template argument, and you can optionally pass in a lower and upper bound (inclusive).

auto random_int{vkdb::random<int>(-100'000, 100'000)};
auto random_double{vkdb::random<double>(-10.0, 10.0)};

back to top

Working with vq

Table management

Here are some table management queries.

CREATE TABLE climate TAGS region, season;

DROP TABLE devices;

ADD TAGS host, status TO servers;

REMOVE TAGS host FROM servers;

back to top

Data manipulation

Here are some data manipulation queries.

SELECT DATA status FROM sensors ALL;

SELECT AVG temperature FROM weather BETWEEN 1234 AND 1240 WHERE city=london, unit=celsius;

PUT temperature 1234 23.5 INTO weather TAGS city=paris, unit=celsius;

DELETE rainfall 1234 FROM weather TAGS city=tokyo, unit=millimetres;

back to top

Errors

There are two kinds of errors you can get—parse errors and runtime errors, occurring at the named points in time for self-explanatory reasons.

vq-errors.png
A parse error and a runtime error in the REPL.

back to top

EBNF

Here's the EBNF grammar encapsulating vq.

<expr> ::= {<query> ";"}+

<query> ::= <select_query> | <put_query> | <delete_query> | <create_query>  | <drop_query> | <add_query> | <remove_query> | <tables_query>

<select_query> ::= "SELECT" <select_type> <metric> "FROM" <table_name> <select_clause>

<select_type> ::= "DATA" | "AVG" | "SUM" | "COUNT" | "MIN" | "MAX"

<select_clause> ::= <all_clause> | <between_clause> | <at_clause>

<all_clause> ::= "ALL" {<where_clause>}?

<between_clause> ::= "BETWEEN" <timestamp> "AND" <timestamp> {<where_clause>}?

<at_clause> ::= "AT" <timestamp> {<where_clause>}?

<where_clause> ::= "WHERE" <tag_list>

<put_query> ::= "PUT" <metric> <timestamp> <value> "INTO" <table_name> {"TAGS" <tag_list>}?

<delete_query> ::= "DELETE" <metric> <timestamp> "FROM" <table_name> {"TAGS" <tag_list>}?

<create_query> ::= "CREATE" "TABLE" <table_name> {"TAGS" <tag_list>}?

<drop_query> ::= "DROP" "TABLE" <table_name>

<add_query> ::= "ADD" "TAGS" <tag_columns> "TO" <table_name>

<remove_query> ::= "REMOVE" "TAGS" <tag_columns> "FROM" <table_name>

<tables_query> ::= "TABLES"

<tag_list> ::= <tag> {"," <tag>}*

<tag> ::= <tag_key> "=" <tag_value>

<tag_columns> ::= <tag_key> {"," <tag_key>}*

<tag_key> ::= <identifier>

<tag_value> ::= <identifier>

<metric> ::= <identifier>

<table_name> ::= <identifier>

<timestamp> ::= <number>

<value> ::= <number>

<identifier> ::= <char> {<char> | <digit>}*

<number> ::= {"-"}? <digit>+ {"." <digit>+}?

<char> ::= "A" | ... | "Z" | "a" | ... | "z" | "_"

<digit> ::= "0" | "1" | ... | "9"

back to top

Authors

Vinz Kakilala (me)

back to top

License

Distributed under the MIT License. See LICENSE for more information.

back to top

Credits

Used MurmurHash3 for the Bloom filters. Fast, uniform, and deterministic.

back to top

About

A time series database engine in C++.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

0