DBT VeloDB Adapter
DBT (Data Build Tool) (opens in a new tab) is a component that focuses on T (Transform) in ELT (extraction, loading, transformation) - "transforming data". The dbt-velodb adapter is developed based on dbt-core 1.3.0 and depends on mysql -connector-python driver performs data conversion on VeloDB.
Version support
VeloDB/Apache Doris | python | dbt-core |
---|---|---|
>=2.0.0 / >=1.1.0 | >=3.8,<=3.10 | >=1.3.0 |
Dbt-velodb adapter uses
Install dbt-velodb adapter
Install using pip:
pip install dbt-velodb
The installation behavior will install all dbt running dependencies by default, you can use the following command to check and verify:
dbt --version
If the system does not recognize the dbt command, you can create a soft link:
ln -s /usr/local/python3/bin/dbt /usr/bin/dbt
Init dbt-velodb adapter
dbt init
A query command line will appear, enter the corresponding configuration as follows to initialize a dbt project:
name | Defaults | meaning |
---|---|---|
project | Item name | |
project | Item name | |
database | Enter the corresponding number to select the adapter (select doris) | |
host | Select the host of VeloDB Cloud Warehouse | |
port | 9030 | query_port of VeloDB Cloud Warehouse |
schema | In dbt-velodb, it is equivalent to database, library name | |
username | VeloDB username | |
password | The password of VeloDB, if not set, just press Enter | |
threads | 1 | Parallelism in dbt-velodb (setting a parallelism that does not match the cluster capability will increase the risk of dbt running failure) |
Run dbt-velodb adapter
For relevant dbt operation documents, please refer to here (opens in a new tab) . Go to the project directory just created and execute the default dbt model:
dbt run
You can see that two models are running: my_first_dbt_model and my_second_dbt_model
They are materialized table table and view view respectively.
You can log in to VeloDB to view the data results and table creation statements of my_first_dbt_model and my_second_dbt_model.
Materialization types of dbt-velodb adapter
The Materialization method of dbt-velodb supports three types
- view
- table
- incremental
View
Using view as the materialization mode, Models will be rebuilt into views through the create view as statement every time they run. (By default, the materialization method of dbt is view)
Pros:
No extra data is stored, and views on top of the source data will always contain the latest records.
Cons:
View queries that perform large transformations or are nested on top of other views are slow.
Recommendation:
Usually start with the view of the model and only change to another materialization if you notice performance issues. Views are best suited for models that do not undergo major transformations, such as renaming, column changes.
Configuration items:
models:
<resource-path>:
+materialized: view
Or config in the model file
{{ config(materialized = "view") }}
Table
When using table materialization, your model is rebuilt as a table each time it is run through the create table as statement.
Pros:
table query speed will be faster than view.
Cons:
The table takes a long time to build or rebuild, additional data will be stored, and incremental data synchronization cannot be performed.
Recommendation:
It is recommended to use the table materialization method for models queried by BI tools or models with slow operations such as downstream queries and conversions.
Configuration items:
models:
<resource-path>:
+materialized: table
+duplicate_key: [ <column-name>, ... ],
+partition_by: [ <column-name>, ... ],
+partition_type: <engine-type>,
+partition_by_init: [<pertition-init>, ... ]
+distributed_by: [ <column-name>, ... ],
+buckets: int,
+properties: {<key>:<value>,...}
Or config in the model file
{{ config(
materialized = "table",
duplicate_key = [ "<column-name>", ... ],
partition_by = [ "<column-name>", ... ],
partition_type = "<engine-type>",
partition_by_init = ["<pertition-init>", ... ]
distributed_by = [ "<column-name>", ... ],
buckets = "int",
properties = {"<key>":"<value>",...}
...
]
) }}
The details of the above configuration items are as follows:
configuration item | describe | Required? |
---|---|---|
materialized | The materialized form of the table (corresponding to creating a table model as a detailed model (Duplicate)) | Required |
duplicate_key | Sort Columns for Detail Models | Optional |
partition_by | table partition column | Optional |
partition_type | Table partition type, range or list.(default: RANGE ) | Optional |
partition_by_init | Initialized table partitions | Optional |
distributed_by | table bucket column | Optional |
buckets | Number of buckets | Optional |
properties | Other configurations for creating tables | Optional |
Incremental
Based on the incremental model result of the last run of dbt, records are incrementally inserted or updated into the table.
- Note: The incremental implementation of VeloDB depends on the unique model. If there is an incremental requirement, specify the materialization as incremental when initializing the data of the model
Pros:
Significantly reduces build time by only converting new records.
Cons:
incremental mode requires additional configuration, which is an advanced usage of dbt, and requires the support of complex scenarios and the adaptation of corresponding components.
Recommendation:
The incremental model is best for event-based scenarios or when dbt runs become too slow
Configuration items:
models:
<resource-path>:
+materialized: incremental
+unique_key: [ <column-name>, ... ],
+partition_by: [ <column-name>, ... ],
+partition_type: <engine-type>,
+partition_by_init: [<pertition-init>, ... ]
+distributed_by: [ <column-name>, ... ],
+buckets: int,
+properties: {<key>:<value>,...}
Or config in the model file
{{ config(
materialized = "incremental",
unique_key = [ "<column-name>", ... ],
partition_by = [ "<column-name>", ... ],
partition_type = "<engine-type>",
partition_by_init = ["<pertition-init>", ... ]
distributed_by = [ "<column-name>", ... ],
buckets = "int",
properties = {"<key>":"<value>",...}
...
]
) }}
The details of the above configuration items are as follows:
configuration item | describe | Required? |
---|---|---|
materialized | The materialized form of the table (corresponding to creating a table model as a detailed model (unique)) | Required |
unique_key | The key column of the unique table | Required |
partition_by | table partition column | Optional |
partition_type | Table partition type, range or list.(default: RANGE ) | Optional |
partition_by_init | Initialized table partitions | Optional |
distributed_by | table bucket column | Optional |
buckets | Number of buckets | Optional |
properties | Other configurations for creating tables | Optional |