VeloDB Cloud
Developer Guide
Ecological Integration
DBT VeloDB Adapter

DBT VeloDB Adapter

DBT (Data Build Tool) (opens in a new tab) is a component that focuses on T (Transform) in ELT (extraction, loading, transformation) - "transforming data". The dbt-velodb adapter is developed based on dbt-core 1.3.0 and depends on mysql -connector-python driver performs data conversion on VeloDB.

Version support

VeloDB/Apache Dorispythondbt-core
>=2.0.0 / >=1.1.0>=3.8,<=3.10>=1.3.0

Dbt-velodb adapter uses

Install dbt-velodb adapter

Install using pip:

pip install dbt-velodb

The installation behavior will install all dbt running dependencies by default, you can use the following command to check and verify:

dbt --version

If the system does not recognize the dbt command, you can create a soft link:

ln -s /usr/local/python3/bin/dbt /usr/bin/dbt

Init dbt-velodb adapter

dbt init 

A query command line will appear, enter the corresponding configuration as follows to initialize a dbt project:

nameDefaultsmeaning
projectItem name
projectItem name
databaseEnter the corresponding number to select the adapter (select doris)
hostSelect the host of VeloDB Cloud Warehouse
port9030query_port of VeloDB Cloud Warehouse
schemaIn dbt-velodb, it is equivalent to database, library name
usernameVeloDB username
passwordThe password of VeloDB, if not set, just press Enter
threads1Parallelism in dbt-velodb (setting a parallelism that does not match the cluster capability will increase the risk of dbt running failure)

Run dbt-velodb adapter

For relevant dbt operation documents, please refer to here (opens in a new tab) . Go to the project directory just created and execute the default dbt model:

dbt run 

You can see that two models are running: my_first_dbt_model and my_second_dbt_model

They are materialized table table and view view respectively.

You can log in to VeloDB to view the data results and table creation statements of my_first_dbt_model and my_second_dbt_model.

Materialization types of dbt-velodb adapter

The Materialization method of dbt-velodb supports three types

  1. view
  2. table
  3. incremental

View

Using view as the materialization mode, Models will be rebuilt into views through the create view as statement every time they run. (By default, the materialization method of dbt is view)

Pros: 
No extra data is stored, and views on top of the source data will always contain the latest records.

Cons: 
View queries that perform large transformations or are nested on top of other views are slow.

Recommendation:
Usually start with the view of the model and only change to another materialization if you notice performance issues. Views are best suited for models that do not undergo major transformations, such as renaming, column changes.

Configuration items:

models:
  <resource-path>:
    +materialized: view

Or config in the model file

{{ config(materialized = "view") }}

Table

When using table materialization, your model is rebuilt as a table each time it is run through the create table as statement.

Pros:
table query speed will be faster than view.

Cons: 
The table takes a long time to build or rebuild, additional data will be stored, and incremental data synchronization cannot be performed.

Recommendation: 
It is recommended to use the table materialization method for models queried by BI tools or models with slow operations such as downstream queries and conversions.

Configuration items:

models:
  <resource-path>:
    +materialized: table
    +duplicate_key: [ <column-name>, ... ],
    +partition_by: [ <column-name>, ... ],
    +partition_type: <engine-type>,
    +partition_by_init: [<pertition-init>, ... ]
    +distributed_by: [ <column-name>, ... ],
    +buckets: int,
    +properties: {<key>:<value>,...}

Or config in the model file

{{ config(
    materialized = "table",
    duplicate_key = [ "<column-name>", ... ],
    partition_by = [ "<column-name>", ... ],
    partition_type = "<engine-type>",
    partition_by_init = ["<pertition-init>", ... ]
    distributed_by = [ "<column-name>", ... ],
    buckets = "int",
    properties = {"<key>":"<value>",...}
      ...
    ]
) }}

The details of the above configuration items are as follows:

configuration itemdescribeRequired?
materializedThe materialized form of the table (corresponding to creating a table model as a detailed model (Duplicate))Required
duplicate_keySort Columns for Detail ModelsOptional
partition_bytable partition columnOptional
partition_typeTable partition type, range or list.(default: RANGE)Optional
partition_by_initInitialized table partitionsOptional
distributed_bytable bucket columnOptional
bucketsNumber of bucketsOptional
propertiesOther configurations for creating tablesOptional

Incremental

Based on the incremental model result of the last run of dbt, records are incrementally inserted or updated into the table.

  • Note: The incremental implementation of VeloDB depends on the unique model. If there is an incremental requirement, specify the materialization as incremental when initializing the data of the model
Pros: 
Significantly reduces build time by only converting new records.

Cons: 
incremental mode requires additional configuration, which is an advanced usage of dbt, and requires the support of complex scenarios and the adaptation of corresponding components.

Recommendation: 
The incremental model is best for event-based scenarios or when dbt runs become too slow

Configuration items:

models:
  <resource-path>:
    +materialized: incremental
    +unique_key: [ <column-name>, ... ],
    +partition_by: [ <column-name>, ... ],
    +partition_type: <engine-type>,
    +partition_by_init: [<pertition-init>, ... ]
    +distributed_by: [ <column-name>, ... ],
    +buckets: int,
    +properties: {<key>:<value>,...}

Or config in the model file

{{ config(
    materialized = "incremental",
    unique_key = [ "<column-name>", ... ],
    partition_by = [ "<column-name>", ... ],
    partition_type = "<engine-type>",
    partition_by_init = ["<pertition-init>", ... ]
    distributed_by = [ "<column-name>", ... ],
    buckets = "int",
    properties = {"<key>":"<value>",...}
      ...
    ]
) }}

The details of the above configuration items are as follows:

configuration itemdescribeRequired?
materializedThe materialized form of the table (corresponding to creating a table model as a detailed model (unique))Required
unique_keyThe key column of the unique tableRequired
partition_bytable partition columnOptional
partition_typeTable partition type, range or list.(default: RANGE)Optional
partition_by_initInitialized table partitionsOptional
distributed_bytable bucket columnOptional
bucketsNumber of bucketsOptional
propertiesOther configurations for creating tablesOptional