dbt-dremio

dbt's adapter for dremio

This project is developed in my spare time, alongside my lead dev position at MAIF-VIE, and aims to provide a competitive alternative to our current ETL stack.

If you are reading this documentation, I assume you already know both dbt and dremio well; please refer to their respective documentation.

Installation

dbt dependencies:

  • dbt-core>=0.18.0
  • pyodbc>=4.0.27

dremio dependency:

  • the latest Dremio ODBC driver

os dependency:

  • odbc (unixodbc-dev on linux)

```shell
pip install dbt-dremio
```

Relation types

In dbt's world, a dremio dataset can be either a view or a table. A dremio reflection - a dataset materialization with a refresh policy - is mapped to dbt's materializedview.

Databases

As Dremio is a federation tool, dbt's queries can span locations, so in dremio's adapter databases are first-class citizens. There are two kinds of dataset locations: sources and spaces. Sources are mostly input locations and spaces output ones, with exceptions:

| location | can create table | can drop table | can create/drop view |
|---|---|---|---|
| source | if CTAS (CREATE TABLE AS) is allowed on this source | if DROP TABLE is allowed on this source | no |
| space | only in the user's home space, by manually uploading files in the UI | only in the UI | yes |
| distributed shared storage ($scratch source) | yes | yes | no |

As you can see, using the SQL-DDL interface, the location type implies the relation type, so materialization implementations do not have to handle relation type mutations.

Note: the UI allows dots in a space's name; the adapter does not handle that correctly.

Schemas

In dremio, schemas are recursive, like filesystem folders: dbt.internal."my very strange folder's name", and dots are not allowed in sub-folders' names. For each database, there is a root schema, known as no_schema by the adapter. So, in order to materialize a model in the root folder of the track17 space, one will configure it as:

```yaml
+database: track17
+schema: no_schema
```

Please note that because dremio has no CREATE SCHEMA command yet, all schemas must be created beforehand in the UI or via the API (and by the way, dremio is loved by all oracle dbas).

Rendering of a relation

Because dremio accepts almost any character in object names, the adapter will double quote each part of the database.schema.identifier triptych, with the following rules concerning schema:

  • if schema is equal to no_schema, the schema will not be included, leading to a simple "database"."identifier" being rendered
  • if schema spans multiple folders, each folder's name will be double quoted, leading to "database"."folder"."sub-folder"."sub-sub-folder"."identifier".
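For illustration, here is how a model named fact_orders (a hypothetical name) in the track17 database would render under these rules:

```sql
-- schema: no_schema  =>  the schema part is omitted
"track17"."fact_orders"

-- schema: dbt.staging  =>  each folder is quoted separately
"track17"."dbt"."staging"."fact_orders"
```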

Sources

Environments

A single dremio installation can host several data environments. In order to group sources by environment, you can use the undocumented target.profile_name or the adapter-specific environment configuration to map environments between dremio and dbt:

  • dremio's side: prefix all the source names of a given environment with the environment's name, for example prd_crm, prd_hr, prd_accounting for a prd environment
  • dbt's side: prefix all sources' database configs accordingly: {{ target.environment }}_crm or {{ target.profile_name }}_crm

That way you can configure input sources and the output target database separately.
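As a sketch, a sources definition following this convention could look like this (the source and table names are hypothetical):

```yaml
version: 2

sources:
  - name: crm
    # resolves to prd_crm, dev_crm... depending on the current target
    database: "{{ target.environment }}_crm"
    tables:
      - name: customers
```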

Materializations

Dremio's SQL specificities

Given that:

  • tables and views cannot coexist in the same schema,
  • there are no transactions (at the DDL level),
  • there is no ALTER TABLE|VIEW RENAME TO command,
  • you can CREATE OR REPLACE a view, but only CREATE a table,
  • data can only be added to tables with a CTAS,

I tried to keep things safe by setting up a kind of logical interface between the dbt model and its implementation in dremio. So:

  • almost all materializations create a view as an interface, so they can coexist in the same space,
  • each new version of the model's underlying data is first stored in a new table, then referenced atomically (via a CREATE OR REPLACE VIEW) by the interface view. The table containing the old version of the underlying data can then be dropped: a kind of pedantic blue/green deployment at the model level.
  • the coexistence of old and new data versions helps overcome the lack of SQL-DML commands; see for example the incremental implementation.

This could change in the near future, as Dremio's CEO Tomer Shiran posted on discourse that Apache Iceberg could be included by the end of the year, bringing INSERT [OVERWRITE] to dremio and challenging a well known cloud data warehouse in the same temperatures...

Seed

| adapter's specific configuration | type | required | default |
|---|---|---|---|
| materialization_database | CTAS/DROP TABLE allowed source's name | no | $scratch |
| materialization_schema | | no | target.environment (we don't want the environments to share the same underlying table) |

```sql
CREATE TABLE ... AS
SELECT *
FROM VALUES (...)[, (...)]
```

As dremio does not support query bindings, each python value is converted to a string, quoted, and cast to the corresponding SQL type.
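For example, assuming a seed row (1, 'foo') with columns typed integer and varchar (the table, column names and values are hypothetical), the generated statement would look something like:

```sql
CREATE TABLE "$scratch"."dev"."my_seed" AS
SELECT *
FROM (VALUES
    (CAST('1' AS INTEGER), CAST('foo' AS VARCHAR))
) AS t ("id", "name")
```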

View

```sql
CREATE OR REPLACE VIEW ... AS
{{ sql }}
```

Table

| adapter's specific configuration | type | required | default |
|---|---|---|---|
| materialization_database | CTAS/DROP TABLE allowed source's name | no | $scratch |
| materialization_schema | | no | target.environment |
| partition | the list of partitioning columns | no | |
| sort | the list of sorting columns | no | |

```sql
CREATE TABLE ... [HASH PARTITION BY] [LOCALSORT BY] AS
{{ sql }}
```

Incremental

| adapter's specific configuration | type | required | default |
|---|---|---|---|
| materialization_database | CTAS/DROP TABLE allowed source's name | no | $scratch |
| materialization_schema | | no | target.environment |
| partition | the list of partitioning columns | no | |
| sort | the list of sorting columns | no | |

As we still have the old data when the new data is created, the new table is filled with:

```jinja
{% if full_refresh_mode or old_relation is none %}
    {% set build_sql = sql %}
{% else %}
    {% set build_sql %}
        with increment as (
            {{ sql }}
        )
        select *
        from increment
        union all
        select *
        from {{ old_table }}
        {%- if unique_key is not none %}
        where {{ unique_key }} not in (
            select {{ unique_key }}
            from increment
        )
        {% endif %}
    {% endset %}
{% endif %}
```
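A model using this materialization might be configured like this (the model, source and column names are hypothetical; is_incremental() is dbt's standard macro):

```sql
{{ config(
    materialized = 'incremental',
    unique_key = 'order_id'
) }}

select *
from {{ source('crm', 'orders') }}
{% if is_incremental() %}
    where order_date > (select max(order_date) from {{ this }})
{% endif %}
```

On an incremental run, the rows selected here form the increment; rows from the previous table version whose order_id is not in the increment are carried over by the union above.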

Raw Reflection

A raw reflection is a dataset materialization, with a refresh policy, handled internally by dremio.

| adapter's specific configuration | type | required | default |
|---|---|---|---|
| dataset | the related model's name | yes | |
| display | the list of columns | no | all columns |
| partition | the list of partitioning columns | no | |
| sort | the list of sorting columns | no | |
| distribute | the list of distributing columns | no | |

Aggregate Reflection

An aggregate reflection is a dataset materialization containing pre-aggregated measures on dimensions, with a refresh policy, handled internally by dremio.

| adapter's specific configuration | type | required | default |
|---|---|---|---|
| dataset | the related model's name | yes | |
| dimensions | the list of dimension columns | no | all columns whose type is not 'float', 'double' or 'decimal' |
| measures | the list of measure columns | no | all columns whose type is 'float', 'double' or 'decimal' |
| partition | the list of partitioning columns | no | |
| sort | the list of sorting columns | no | |
| distribute | the list of distributing columns | no | |

File

This materialization creates a table without a view interface. It's an easy way to automate the export of a dataset (in parquet format).

Connection

Be careful to provide the right ODBC driver name in the adapter-specific driver attribute: the name you gave to your dremio ODBC driver installation.

Managed or unmanaged target

Thanks to Ronald Damhof's article, I wanted a clear separation between managed environments (prod, preprod...) and unmanaged ones (developers' environments). So there are two distinct targets: managed and unmanaged.

In an unmanaged environment, if no target database is provided, all models are materialized in the user's home space, under the target schema.

In a managed environment, target and custom databases and schemas are used as usual. If no target database is provided, target.environment will be used as the default value.

You will find in the macros directory an environment-aware implementation of custom database and schema names.
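As a sketch only (the actual implementation shipped in the macros directory may differ), such an environment-aware macro could look like this, using dbt's standard generate_database_name override:

```jinja
{% macro generate_database_name(custom_database_name, node) -%}
    {%- if custom_database_name is none -%}
        {{ target.database }}
    {%- else -%}
        {#- prefix the custom database with the environment, e.g. prd_crm -#}
        {{ target.environment }}_{{ custom_database_name }}
    {%- endif -%}
{%- endmacro %}
```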

```yaml
track17:
  outputs:
    unmanaged:
      type: dremio
      threads: 2
      driver: Dremio ODBC Driver 64-bit
      host: veniseverte.fr
      port: 31010
      environment: track17
      schema: dbt
      user: fabrice_etanchaud
      password: fabricesecretpassword
    managed:
      type: dremio
      threads: 2
      driver: Dremio ODBC Driver 64-bit
      host: veniseprovencale.fr
      port: 31010
      environment: track17
      database: '@dremio'
      schema: no_schema
      user: dremio
      password: dremiosecretpassword
  target: unmanaged
```
