-
-
Notifications
You must be signed in to change notification settings - Fork 75
feat: experiment - multi database support via Ibis #681
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: experiment - multi database support via Ibis #681
Conversation
Thanks for preparing this!! It is a great proof-of-concept, and I certainly understand the current workarounds for data loading. I love how minimal the changes were to get this working. Thinking further ahead, one obvious limitation of this approach is the requirement of using a local Python server as a proxy to other backends. This is viable in some contexts, but ultimately we would like to support multiple databases directly from JavaScript as well. It would be particularly great if there is a SQLGlot equivalent available in either JS or WASM... or, perhaps learning from SQLGlot to expand our existing mosaic-sql package with support for other SQL dialects. |
Thank you for the feedback.
Do you see the dependence on a server written in Python instead of JavaScript as the limitation here or the sheer presence of any serverful proxy component between the browser and backends other than DuckDB? My understanding is that since databases such as Postgres, BigQuery, etc. rely on binary protocols and dedicated drivers, short of a per-backend WASM implementation it is not an option to establish direct connection to them from a browser tab via JavaScript.
SQLGlot itself is available in WASM through Pyodide, as is Ibis: https://ibis-project.org/tutorials/browser/repl. The way I see it, to support multiple databases, roughly speaking we need to solve two main problems. (1) convert Mosaic AST into a backend-specific expression and (2) execute this expression against the chosen backend. For (1) we could do everything on the client-side, as only symbolic manipulation would be going on for which a browser environment with JS / WASM is suitable. Regarding (2), for backends unlike DuckDB, without native WASM support I see the presence of a proxy as necessary but I might be missing use cases here. |
At least for those backends that are (1) network accessible and (2) can return results in a Mosaic-friendly format (for now, that's primarily Apache Arrow), I would want to retain the ability to have the browser connect to the (presumably preexisting) backend directly and not have to setup an intermediate server to proxy between the browser and the backend. (This is not about JS vs. Python per se, it's about the need for an additional intermediate service layer.) |
Thanks for the clarification, I see what you mean. ClickHouse seems to be a database similar to the kind you described, as it provides an HTTP interface and can return results in Arrow format. I'll think about how a generic integration could be designed along those lines. An additional data point to consider around using Mosaic with multiple databases is that the Apache Arrow Flight protocol is already supported by backends such as Dremio, Apache Doris, Postgres, InfluxDB and hopefully many more in the future. This might be interesting as https://github.com/Query-farm/duckdb-airport-extension enables the use of Arrow Flight in DuckDB, allowing direct connections to Arrow Flight-compatible backends through DuckDB, even if no DuckDB |
Important
The sole purpose of this pull request is to serve as a concrete point of discussion regarding issue #399 and to evaluate whether using SQLGlot through Ibis to achieve multi-database support is a viable option while introducing minimal differences in both the number of files and lines of code. It is not at all meant to be a suggestion to be added to the codebase directly.
What problem do we aim to address?
As outlined in #399, it would be beneficial for Mosaic to support backends other than DuckDB. Currently, the only non-DuckDB backends supported by Mosaic are those accessed through DuckDB's
ATTACH
statement, namely MySQL, PostgreSQL, and SQLite.Proposal Overview
This pull request demonstrates how Ibis allows the use of backends like ClickHouse, PostgreSQL, DataFusion, and even Polars, which is a purely dataframe-based backend, while preserving 100% native support for the default DuckDB backend.
The main idea is to translate the queries from Mosaic clients into Ibis expressions, allowing Ibis to act as a bridge between Mosaic and approximately 20 backends. This approach eliminates the need to maintain adapters or connectors on a per-backend basis within Mosaic's codebase.
Under the hood, Ibis converts its native internal representation (similar to Mosaic's SQL AST) into backend-specific queries using SQLGlot expressions. It also supports converting arbitrary SQL into Ibis expressions, enabling flows such as:
In fact, this is precisely the flow implemented in this pull request, as Mosaic clients express their data needs using the Mosaic SQL AST, which in turn gets converted into SQL in DuckDB dialect, resulting in:
More on Ibis internals: https://ibis-project.org/posts/does-ibis-understand-sql/
Testing this experiment
Backend Setup via Docker
PostgreSQL
ClickHouse
uv sync
to pull core Ibis dependencies and database.npm run dev
, and chooseSocket
orREST
connector.npm run server -- postgres://postgres:mosaic@localhost:5432/postgres
npm run server -- clickhouse://
npm run server -- datafusion://
npm run server -- polars://
npm run server -- duckdb://
Note
Due to the limitations outlined below, many of the example specs will not be fully functional when using backends other than DuckDB
The following screencast demonstrates how various backends can be used to serve Mosaic clients with data.
CleanShot.2025-02-07.at.17.11.44.mp4
Points to Consider
Expression Compatibility
This pull request presents a quick and preliminary proof of concept. Many specifications still only work with DuckDB, as clients express their data needs using DuckDB-specific SQL syntax that is unsupported by SQLGlot, or they use native DuckDB functions that SQLGlot does not recognize. To fully leverage Ibis and the operations it supports (https://ibis-project.org/backends/support/matrix), the Mosaic AST should ideally be translated into a representation that allows for direct reconstruction of Ibis expressions on the Python side. This would simplify the process to:
In other words, the current flow from
Mosaic SQL AST
toDuckDB
toIbis Expression
to the respective backend—where theDuckDB to Ibis
segment is managed by SQLGlot—is fragile. Directly converting fromMosaic SQL AST
toIbis Expression
would enhance multi-backend compatibility. However, this would require a significant refactoring of the Mosaic SQL package, and selecting an Ibis internal representation as the lingua franca for Mosaic queries would restrict usage to the Python server, as there are no Ibis bindings available for Rust or Node.js.Data Loading
The introduced data loading mechanism is quite basic. Based on the observed use cases, the Mosaic Coordinator calls
exec
to perform table/schema creation and deletion operations, while thequery
function retrieves data from existing tables. We take advantage of this by using a dedicated in-memory DuckDB connection to load data according to the specified expression, followed byself.backend.create_table(table_name, table.to_pyarrow(), overwrite=True)
to create the same table in the target backend.Building serialized Ibis expressions on the client side and deserializing them on the server side would be beneficial, as Ibis supports dedicated data loading methods (
read_csv
,read_parquet
,read_json
,read_delta
) and can execute more complex expressions (including raw backend-native SQL) to create tables.Ibis as a Meta-Connector
The changes introduced in this pull request present a rough solution for how the default DuckDB backend could coexist with the Ibis meta-backend within the same server package. However, it could also be more aligned with Mosaic's philosophy of composable components to develop this solution as a standalone connector, making less radical adjustments to other packages on an as-needed basis.