8000 feat: experiment - multi database support via Ibis by peter-gy · Pull Request #681 · uwdata/mosaic · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

feat: experiment - multi database support via Ibis #681

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

peter-gy
Copy link
Contributor
@peter-gy peter-gy commented Feb 7, 2025

Important

The sole purpose of this pull request is to serve as a concrete point of discussion regarding issue #399 and to evaluate whether using SQLGlot through Ibis to achieve multi-database support is a viable option while introducing minimal differences in both the number of files and lines of code. It is not at all meant to be a suggestion to be added to the codebase directly.

What problem do we aim to address?

As outlined in #399, it would be beneficial for Mosaic to support backends other than DuckDB. Currently, the only non-DuckDB backends supported by Mosaic are those accessed through DuckDB's ATTACH statement, namely MySQL, PostgreSQL, and SQLite.

Proposal Overview

This pull request demonstrates how Ibis allows the use of backends like ClickHouse, PostgreSQL, DataFusion, and even Polars, which is a purely dataframe-based backend, while preserving 100% native support for the default DuckDB backend.

The main idea is to translate the queries from Mosaic clients into Ibis expressions, allowing Ibis to act as a bridge between Mosaic and approximately 20 backends. This approach eliminates the need to maintain adapters or connectors on a per-backend basis within Mosaic's codebase.

CleanShot 2025-02-07 at 19 50 25@2x

Under the hood, Ibis converts its native internal representation (similar to Mosaic's SQL AST) into backend-specific queries using SQLGlot expressions. It also supports converting arbitrary SQL into Ibis expressions, enabling flows such as:

graph LR
    A[SQL in DuckDB dialect] --> B[Ibis Expression]
    B --> C[SQL in ClickHouse dialect]
    B --> D[SQL in PostgreSQL dialect]
    B --> E[SQL in DataFusion dialect]
    B --> F[Polars expression]
Loading

In fact, this is precisely the flow implemented in this pull request, as Mosaic clients express their data needs using the Mosaic SQL AST, which in turn gets converted into SQL in DuckDB dialect, resulting in:

graph LR
    X[Mosaic Client Data Need] --> Y[Mosaic SQL AST]
    Y --> A

    subgraph Ibis
        A[SQL in DuckDB dialect] --> B[Ibis Expression]
        B --> C[SQL in ClickHouse dialect]
        B --> D[SQL in PostgreSQL dialect]
        B --> E[SQL in DataFusion dialect]
        B --> F[Polars expression]
    end
Loading

More on Ibis internals: https://ibis-project.org/posts/does-ibis-understand-sql/

Testing this experiment

Backend Setup via Docker

PostgreSQL

docker run --rm \
  --name mosaic-postgres \
  -e POSTGRES_PASSWORD=mosaic \
  -e POSTGRES_USER=postgres \
  -e POSTGRES_DB=postgres \
  -p 5432:5432 \
  postgres:15

ClickHouse

docker run --rm \
  --name mosaic-clickhouse \
  -p 8123:8123 \
  clickhouse/clickhouse-server:latest
  1. Switch to this branch.
  2. Run uv sync to pull core Ibis dependencies and database.
  3. Start the Vite development server using npm run dev, and choose Socket or REST connector.
  4. Start the server with a specific backend specified as the connection string.
  • PostgreSQL: npm run server -- postgres://postgres:mosaic@localhost:5432/postgres
  • ClickHouse: npm run server -- clickhouse://
  • DataFusion: npm run server -- datafusion://
  • Polars: npm run server -- polars://
  • DuckDB: npm run server -- duckdb://

Note

Due to the limitations outlined below, many of the example specs will not be fully functional when using backends other than DuckDB

The following screencast demonstrates how various backends can be used to serve Mosaic clients with data.

CleanShot.2025-02-07.at.17.11.44.mp4

Points to Consider

Expression Compatibility

This pull request presents a quick and preliminary proof of concept. Many specifications still only work with DuckDB, as clients express their data needs using DuckDB-specific SQL syntax that is unsupported by SQLGlot, or they use native DuckDB functions that SQLGlot does not recognize. To fully leverage Ibis and the operations it supports (https://ibis-project.org/backends/support/matrix), the Mosaic AST should ideally be translated into a representation that allows for direct reconstruction of Ibis expressions on the Python side. This would simplify the process to:

graph LR
    X[Mosaic Client Data Need] --> Y[Mosaic SQL AST]
    Y --> B[Ibis Expression]

    subgraph Ibis
        B --> C[SQL in ClickHouse dialect]
        B --> D[SQL in PostgreSQL dialect]
        B --> E[SQL in DataFusion dialect]
        B --> F[Polars expression]
    end
Loading

In other words, the current flow from Mosaic SQL AST to DuckDB to Ibis Expression to the respective backend—where the DuckDB to Ibis segment is managed by SQLGlot—is fragile. Directly converting from Mosaic SQL AST to Ibis Expression would enhance multi-backend compatibility. However, this would require a significant refactoring of the Mosaic SQL package, and selecting an Ibis internal representation as the lingua franca for Mosaic queries would restrict usage to the Python server, as there are no Ibis bindings available for Rust or Node.js.

Data Loading

The introduced data loading mechanism is quite basic. Based on the observed use cases, the Mosaic Coordinator calls exec to perform table/schema creation and deletion operations, while the query function retrieves data from existing tables. We take advantage of this by using a dedicated in-memory DuckDB connection to load data according to the specified expression, followed by self.backend.create_table(table_name, table.to_pyarrow(), overwrite=True) to create the same table in the target backend.

Building serialized Ibis expressions on the client side and deserializing them on the server side would be beneficial, as Ibis supports dedicated data loading methods (read_csv, read_parquet, read_json, read_delta) and can execute more complex expressions (including raw backend-native SQL) to create tables.

Ibis as a Meta-Connector

The changes introduced in this pull request present a rough solution for how the default DuckDB backend could coexist with the Ibis meta-backend within the same server package. However, it could also be more aligned with Mosaic's philosophy of composable components to develop this solution as a standalone connector, making less radical adjustments to other packages on an as-needed basis.

@jheer
Copy link
Member
jheer commented Feb 12, 2025

Thanks for preparing this!! It is a great proof-of-concept, and I certainly understand the current workarounds for data loading. I love how minimal the changes were to get this working.

Thinking further ahead, one obvious limitation of this approach is the requirement of using a local Python server as a proxy to other backends. This is viable in some contexts, but ultimately we would like to support multiple databases directly from JavaScript as well. It would be particularly great if there is a SQLGlot equivalent available in either JS or WASM... or, perhaps learning from SQLGlot to expand our existing mosaic-sql package with support for other SQL dialects.

@peter-gy
Copy link
Contributor Author

Thank you for the feedback.

one obvious limitation of this approach is the requirement of using a local Python server as a proxy to other backends [...] ultimately we would like to support multiple databases directly from JavaScript as well

Do you see the dependence on a server written in Python instead of JavaScript as the limitation here or the sheer presence of any serverful proxy component between the browser and backends other than DuckDB? My understanding is that since databases such as Postgres, BigQuery, etc. rely on binary protocols and dedicated drivers, short of a per-backend WASM implementation it is not an option to establish direct connection to them from a browser tab via JavaScript.

It would be particularly great if there is a SQLGlot equivalent available in either JS or WASM...

SQLGlot itself is available in WASM through Pyodide, as is Ibis: https://ibis-project.org/tutorials/browser/repl.
Similar JS-based packages I am aware of include https://github.com/knex/knex and https://github.com/taozhi8833998/node-sql-parser, but SQLGlot seems to be by far the most comprehensive solution for converting between SQL dialects. It strictly focuses on the AST, and is not coupled with backend-specific execution logic.

The way I see it, to support multiple databases, roughly speaking we need to solve two main problems. (1) convert Mosaic AST into a backend-specific expression and (2) execute this expression against the chosen backend. For (1) we could do everything on the client-side, as only symbolic manipulation would be going on for which a browser environment with JS / WASM is suitable. Regarding (2), for backends unlike DuckDB, without native WASM support I see the presence of a proxy as necessary but I might be missing use cases here.

@jheer
Copy link
Member
jheer commented Feb 13, 2025

The way I see it, to support multiple databases, roughly speaking we need to solve two main problems. (1) convert Mosaic AST into a backend-specific expression and (2) execute this expression against the chosen backend. For (1) we could do everything on the client-side, as only symbolic manipulation would be going on for which a browser environment with JS / WASM is suitable. Regarding (2), for backends unlike DuckDB, without native WASM support I see the presence of a proxy as necessary but I might be missing use cases here.

At least for those backends that are (1) network accessible and (2) can return results in a Mosaic-friendly format (for now, that's primarily Apache Arrow), I would want to retain the ability to have the browser connect to the (presumably preexisting) backend directly and not have to setup an intermediate server to proxy between the browser and the backend. (This is not about JS vs. Python per se, it's about the need for an additional intermediate service layer.)

@peter-gy
Copy link
Contributor Author

Thanks for the clarification, I see what you mean. ClickHouse seems to be a database similar to the kind you described, as it provides an HTTP interface and can return results in Arrow format. I'll think about how a generic integration could be designed along those lines.

An additional data point to consider around using Mosaic with multiple databases is that the Apache Arrow Flight protocol is already supported by backends such as Dremio, Apache Doris, Postgres, InfluxDB and hopefully many more in the future. This might be interesting as https://github.com/Query-farm/duckdb-airport-extension enables the use of Arrow Flight in DuckDB, allowing direct connections to Arrow Flight-compatible backends through DuckDB, even if no DuckDB ATTACH support is implemented in the core DuckDB codebase. However, similar to the ATTACH-based approach, this approach using the duckdb-airport-extension would also not be available via DuckDB WASM because the extension is not (yet? Query-farm/airport#12) distributed for WebAssembly architectures.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0