Hive Catalog
By connecting to a Hive Metastore, or to a metadata service compatible with the Hive Metastore, Doris can automatically obtain Hive database and table information and run queries against the data.
Besides Hive itself, many other systems use the Hive Metastore to store metadata. Therefore, through the Hive Catalog you can access not only Hive tables, but also other table formats that use the Hive Metastore for metadata storage, such as Iceberg and Hudi.
Use Cases
| Scenario | Description |
|---|---|
| Query acceleration | Use Doris's distributed compute engine to query Hive data directly and accelerate queries. |
| Data integration | Read Hive data and write it to Doris internal tables, or perform Zero-ETL operations using the Doris compute engine. |
| Data write-back | Process data from any Doris-supported source and write the results back to Hive tables. |
Configuring the Catalog
Syntax
CREATE CATALOG [IF NOT EXISTS] catalog_name PROPERTIES (
'type'='hms', -- required
'hive.metastore.type' = '<hive_metastore_type>', -- optional
'hive.version' = '<hive_version>', -- optional
'fs.defaultFS' = '<fs_defaultfs>', -- optional
{MetaStoreProperties},
{StorageProperties},
{HiveProperties},
{CommonProperties}
);
- `<hive_metastore_type>`: Specifies the type of Hive Metastore.
  - `hms`: a standard Hive Metastore service.
  - `glue`: access the AWS Glue metadata service through its Hive Metastore-compatible interface.
  - `dlf`: access the Alibaba Cloud DLF metadata service through its Hive Metastore-compatible interface.
- `<fs_defaultfs>`: Required when writing data from Doris to tables in this Hive Catalog. Example:
  `'fs.defaultFS' = 'hdfs://namenode:port'`
- `{MetaStoreProperties}`: Connection and authentication information for the Metastore metadata service. See the "Supported Metadata Services" section for details.
- `{StorageProperties}`: Connection and authentication information for the storage system. See the "Supported Storage Systems" section for details.
- `{HiveProperties}`: Properties related to the Hive Catalog.
  - `get_schema_from_table`: Defaults to `false`. By default, Doris obtains table schema information from the Hive Metastore. In some cases this can hit compatibility problems, such as the error `Storage schema reading not supported`. Setting this parameter to `true` makes Doris fetch the table schema directly from the table object instead; note that this approach ignores column default-value information. Supported since versions 2.1.10 and 3.0.6.
  - `hive.recursive_directories`: Whether to recurse into subdirectories when listing partition directories. Supported since version 3.0.2. Before version 4.0 this defaulted to `false`; in later versions it defaults to `true`. Some Hive tables have partition paths that do not match the partition information in the table schema; to pick up data files in subdirectories, set this parameter to `true`.
  - `hive.ignore_absent_partitions`: Whether to ignore partitions that do not exist. Defaults to `true`. When set to `false`, a query reports an error on encountering a nonexistent partition. Supported since version 3.0.2.
- `{CommonProperties}`: Common attributes. See the "Common Properties" section of the Catalog overview.
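Putting these pieces together, a minimal sketch of a catalog definition that fills in the placeholder sections (the catalog name, Metastore URI, and HDFS address below are illustrative assumptions, not values from this document):

```sql
CREATE CATALOG IF NOT EXISTS hive_sketch_catalog PROPERTIES (
    'type' = 'hms',                                     -- required
    'hive.metastore.uris' = 'thrift://127.0.0.1:9083',  -- MetaStoreProperties
    'fs.defaultFS' = 'hdfs://namenode:8020',            -- needed only for writes
    'hive.recursive_directories' = 'true',              -- HiveProperties
    'hive.ignore_absent_partitions' = 'true'
);
```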
Supported Hive Versions
Hive 1.x, 2.x, 3.x, and 4.x are supported.
Hive transactional tables are supported in Hive 3.x and later. See the "Querying Hive Transactional Tables" section for details.
Supported Metadata Services
Note: The supported service types and parameters differ slightly across Doris versions. See the [Examples] section.
Supported Storage Systems
- HDFS
- AWS S3
- Google Cloud Storage
- Azure Blob
- Alibaba Cloud OSS
- Tencent Cloud COS
- Huawei Cloud OBS
- MinIO
To create Hive tables and write data through Doris, you must explicitly add the
fs.defaultFS property to the catalog properties. If the catalog is created for querying only, this parameter can be omitted. The supported service types and parameters differ slightly across Doris versions; see the [Examples] section.
Supported Data Formats
Column Type Mapping
| Hive Type | Doris Type | Comment |
|---|---|---|
| boolean | boolean | |
| tinyint | tinyint | |
| smallint | smallint | |
| int | int | |
| bigint | bigint | |
| date | date | |
| timestamp | datetime(6) | Mapped to datetime with precision 6 |
| float | float | |
| double | double | |
| decimal(P, S) | decimal(P, S) | Defaults to decimal(9, 0) if no precision is specified |
| char(N) | char(N) | |
| varchar(N) | varchar(N) | |
| string | string | |
| binary | string/varbinary | Controlled by the catalog property enable.mapping.varbinary (supported since 4.0.2). Defaults to false, mapping to string; when true, maps to the varbinary type. |
| array | array | |
| map | map | |
| struct | struct | |
| other | unsupported | |
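For example, to map Hive `binary` columns to `varbinary` rather than `string` (4.0.2+), set the property when creating the catalog; the catalog name and Metastore address here are illustrative:

```sql
CREATE CATALOG hive_varbinary_ctl PROPERTIES (
    'type' = 'hms',
    'hive.metastore.uris' = 'thrift://127.0.0.1:9083',
    'enable.mapping.varbinary' = 'true'  -- binary -> varbinary instead of string
);
```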
Examples
Hive Metastore
Version 3.1+
- HDFS
- ViewFs
- S3
- OSS
- COS
- OBS
- GCS
- Azure Blob
- MinIO
Access HMS and HDFS services without Kerberos authentication
CREATE CATALOG hive_hms_hdfs_test_catalog PROPERTIES (
'type' = 'hms',
'hive.metastore.uris' = 'thrift://127.0.0.1:9383',
'fs.defaultFS' = 'hdfs://127.0.0.1:8520',
'hadoop.username' = 'doris'
);
Access HMS and HDFS services with Kerberos authentication enabled
CREATE CATALOG hive_hms_hdfs_kerberos_test_catalog PROPERTIES (
'type' = 'hms',
'hive.metastore.uris' = 'thrift://127.0.0.1:9583',
'hive.metastore.client.principal' = 'hive/presto-master.docker.cluster@LABS.TERADATA.COM',
'hive.metastore.client.keytab' = '/keytabs/hive-presto-master.keytab',
'hive.metastore.service.principal' = 'hive/hadoop-master@LABS.TERADATA.COM',
'hive.metastore.sasl.enabled' = 'true',
'hive.metastore.authentication.type' = 'kerberos',
'fs.defaultFS' = 'hdfs://127.0.0.1:8520',
'hadoop.security.auth_to_local' = 'RULE:[2:\$1@\$0](.*@LABS.TERADATA.COM)s/@.*//
RULE:[2:\$1@\$0](.*@OTHERLABS.TERADATA.COM)s/@.*//
RULE:[2:\$1@\$0](.*@OTHERREALM.COM)s/@.*//
DEFAULT',
'hadoop.security.authentication' = 'kerberos',
'hadoop.kerberos.principal' = 'hive/presto-master.docker.cluster@LABS.TERADATA.COM',
'hadoop.kerberos.keytab' = '/keytabs/hive-presto-master.keytab'
);
CREATE CATALOG hive_viewfs PROPERTIES (
'type' = 'hms',
'hive.metastore.uris' = 'thrift://172.0.0.1:9083',
'fs.defaultFS' = 'viewfs://your-cluster',
'fs.viewfs.mounttable.your-cluster.link./ns1' = 'hdfs://your-nameservice/',
'fs.viewfs.mounttable.your-cluster.homedir' = '/ns1',
'dfs.nameservices' = 'your-nameservice',
'dfs.ha.namenodes.your-nameservice' = 'nn1,nn2',
'dfs.namenode.rpc-address.your-nameservice.nn1' = '172.21.0.2:8088',
'dfs.namenode.rpc-address.your-nameservice.nn2' = '172.21.0.3:8088',
'dfs.client.failover.proxy.provider.your-nameservice' = 'org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider'
);
Access HMS without Kerberos authentication
CREATE CATALOG hive_hms_s3_test_catalog PROPERTIES (
'type' = 'hms',
'hive.metastore.uris' = 'thrift://127.0.0.1:9383',
's3.region' = 'ap-east-1',
's3.access_key' = '<ak>',
's3.secret_key' = '<sk>'
);
Obtain S3 access credentials using an IAM Assumed Role (3.1.2+)
CREATE CATALOG hive_hms_on_s3_iamrole PROPERTIES (
'type' = 'hms',
'hive.metastore.uris' = 'thrift://127.0.0.1:9383',
's3.region' = 'us-east-1',
's3.role_arn' = 'arn:aws:iam::543815668950:role/role'
);
Access HMS without Kerberos authentication
CREATE CATALOG hive_hms_on_oss_catalog PROPERTIES (
'type' = 'hms',
'hive.metastore.uris' = 'thrift://127.0.0.1:9383',
'oss.region' = 'cn-beijing',
'oss.endpoint' = 'oss-cn-beijing.aliyuncs.com',
'oss.access_key' = '<ak>',
'oss.secret_key' = '<sk>'
);
Access HMS with Kerberos authentication enabled
CREATE CATALOG hive_hms_on_oss_kerberos_old_catalog PROPERTIES (
'type' = 'hms',
'hive.metastore.uris' = 'thrift://127.0.0.1:9583',
'hive.metastore.client.principal' = 'hive/presto-master.docker.cluster@LABS.TERADATA.COM',
'hive.metastore.client.keytab' = '/keytabs/hive-presto-master.keytab',
'hive.metastore.service.principal' = 'hive/hadoop-master@LABS.TERADATA.COM',
'hive.metastore.sasl.enabled' = 'true',
'hive.metastore.authentication.type' = 'kerberos',
'hadoop.security.auth_to_local' = 'RULE:[2:\$1@\$0](.*@LABS.TERADATA.COM)s/@.*//
RULE:[2:\$1@\$0](.*@OTHERLABS.TERADATA.COM)s/@.*//
RULE:[2:\$1@\$0](.*@OTHERREALM.COM)s/@.*//
DEFAULT',
'oss.region' = 'cn-beijing',
'oss.endpoint' = 'oss-cn-beijing.aliyuncs.com',
'oss.access_key' = '<ak>',
'oss.secret_key' = '<sk>'
);
Access HMS without Kerberos authentication
CREATE CATALOG hive_hms_cos_test_region_catalog PROPERTIES (
'type' = 'hms',
'hive.metastore.uris' = 'thrift://127.0.0.1:9383',
'cos.region' = 'ap-beijing',
'cos.endpoint' = 'cos.ap-beijing.myqcloud.com',
'cos.access_key' = '<ak>',
'cos.secret_key' = '<sk>'
);
Access HMS with Kerberos authentication enabled
CREATE CATALOG hive_hms_on_cos_kerberos_new_catalog PROPERTIES (
'type' = 'hms',
'hive.metastore.uris' = 'thrift://127.0.0.1:9583',
'hive.metastore.client.principal' = 'hive/presto-master.docker.cluster@LABS.TERADATA.COM',
'hive.metastore.client.keytab' = '/keytabs/hive-presto-master.keytab',
'hive.metastore.service.principal' = 'hive/hadoop-master@LABS.TERADATA.COM',
'hive.metastore.sasl.enabled' = 'true',
'hive.metastore.authentication.type' = 'kerberos',
'hadoop.security.auth_to_local' = 'RULE:[2:\$1@\$0](.*@LABS.TERADATA.COM)s/@.*//
RULE:[2:\$1@\$0](.*@OTHERLABS.TERADATA.COM)s/@.*//
RULE:[2:\$1@\$0](.*@OTHERREALM.COM)s/@.*//
DEFAULT',
'cos.region' = 'ap-beijing',
'cos.endpoint' = 'cos.ap-beijing.myqcloud.com',
'cos.access_key' = '<ak>',
'cos.secret_key' = '<sk>'
);
Access HMS without Kerberos authentication
CREATE CATALOG test_hive_on_hms_obs_catalog PROPERTIES (
'type' = 'hms',
'hive.metastore.uris' = 'thrift://127.0.0.1:9383',
'obs.region' = 'cn-north-4',
'obs.endpoint' = 'obs.cn-north-4.myhuaweicloud.com',
'obs.access_key' = '<ak>',
'obs.secret_key' = '<sk>'
);
Access HMS with Kerberos authentication enabled
CREATE CATALOG hive_hms_on_obs_kerberos_new_catalog PROPERTIES (
'type' = 'hms',
'hive.metastore.uris' = 'thrift://127.0.0.1:9583',
'hive.metastore.service.principal' = 'hive/hadoop-master@LABS.TERADATA.COM',
'hive.metastore.sasl.enabled' = 'true',
'hive.metastore.client.principal' = 'hive/presto-master.docker.cluster@LABS.TERADATA.COM',
'hive.metastore.client.keytab' = '/keytabs/hive-presto-master.keytab',
'hive.metastore.authentication.type' = 'kerberos',
'hadoop.security.auth_to_local' = 'RULE:[2:\$1@\$0](.*@LABS.TERADATA.COM)s/@.*//
RULE:[2:\$1@\$0](.*@OTHERLABS.TERADATA.COM)s/@.*//
RULE:[2:\$1@\$0](.*@OTHERREALM.COM)s/@.*//
DEFAULT',
'obs.region' = 'cn-north-4',
'obs.endpoint' = 'obs.cn-north-4.myhuaweicloud.com',
'obs.access_key' = '<ak>',
'obs.secret_key' = '<sk>'
);
Access HMS without Kerberos authentication
CREATE CATALOG `hive_hms_on_gcs_new_catalog` PROPERTIES (
'type' = 'hms',
'hive.metastore.uris' = 'thrift://127.0.0.1:9383',
'gs.access_key' = '<ak>',
'gs.secret_key' = '<sk>',
'fs.gcs.support'='true'
);
Supported since 3.1.3
CREATE CATALOG `hive_hms_on_azure_blob_new_catalog` PROPERTIES (
'type' = 'hms',
'hive.metastore.uris' = 'thrift://127.0.0.1:9383',
'azure.account_name' = '<account_name>',
'azure.account_key' = '<account_key>',
'azure.endpoint' = 'https://<account_name>.blob.core.windows.net',
'fs.azure.support'='true'
);
CREATE CATALOG test_hive_on_hms_minio_catalog PROPERTIES (
'type' = 'hms',
'hive.metastore.uris' = 'thrift://127.0.0.1:9383',
'fs.minio.support' = 'true',
'minio.use_path_style' = 'true',
'minio.endpoint' = 'http://127.0.0.1:19001',
'minio.access_key' = '<ak>',
'minio.secret_key' = '<sk>'
);
Version 2.1 & 3.0
- HDFS
- S3
- OSS
- COS
- OBS
- GCS
- MinIO
Access HMS without Kerberos authentication
CREATE CATALOG hive_hms_hdfs_test_catalog PROPERTIES (
'type' = 'hms',
'hive.metastore.uris' = 'thrift://127.0.0.1:9383',
'hadoop.username' = 'doris',
'fs.defaultFS' = 'hdfs://127.0.0.1:8320'
);
Access HMS with Kerberos authentication enabled
CREATE CATALOG test_two_hive_kerberos PROPERTIES (
'type' = 'hms',
'hive.metastore.uris' = 'thrift://127.0.0.1:9583',
'hive.metastore.sasl.enabled' = 'true',
'hive.metastore.kerberos.principal' = 'hive/hadoop-master@LABS.TERADATA.COM',
'fs.defaultFS' = 'hdfs://127.0.0.1:8520',
'hadoop.kerberos.min.seconds.before.relogin' = '5',
'hadoop.security.authentication' = 'kerberos',
'hadoop.kerberos.principal' = 'hive/presto-master.docker.cluster@LABS.TERADATA.COM',
'hadoop.kerberos.keytab' = '/keytabs/hive-presto-master.keytab',
'hadoop.security.auth_to_local' = 'RULE:[2:\$1@\$0](.*@LABS.TERADATA.COM)s/@.*//
RULE:[2:\$1@\$0](.*@OTHERLABS.TERADATA.COM)s/@.*//
RULE:[2:\$1@\$0](.*@OTHERREALM.COM)s/@.*//
DEFAULT'
);
CREATE CATALOG hms_on_s3_catalog PROPERTIES (
'type' = 'hms',
'hive.metastore.uris' = 'thrift://127.0.0.1:9383',
's3.region' = 'ap-east-1',
's3.endpoint' = 's3.ap-east-1.amazonaws.com',
's3.access_key' = '<ak>',
's3.secret_key' = '<sk>'
);
CREATE CATALOG hive_hms_on_oss_catalog PROPERTIES (
'type' = 'hms',
'hive.metastore.uris' = 'thrift://127.0.0.1:9383',
'oss.region' = 'cn-beijing',
'oss.endpoint' = 'oss-cn-beijing.aliyuncs.com',
'oss.access_key' = '<ak>',
'oss.secret_key' = '<sk>'
);
CREATE CATALOG hive_hms_cos_test_region_catalog PROPERTIES (
'type' = 'hms',
'hive.metastore.uris' = 'thrift://127.0.0.1:9383',
'cos.region' = 'ap-beijing',
'cos.endpoint' = 'cos.ap-beijing.myqcloud.com',
'cos.access_key' = '<ak>',
'cos.secret_key' = '<sk>'
);
CREATE CATALOG test_hive_on_hms_obs_catalog PROPERTIES (
'type' = 'hms',
'hive.metastore.uris' = 'thrift://127.0.0.1:9383',
'obs.region' = 'cn-north-4',
'obs.endpoint' = 'obs.cn-north-4.myhuaweicloud.com',
'obs.access_key' = '<ak>',
'obs.secret_key' = '<sk>'
);
CREATE CATALOG hive_hms_on_gcs_s3 PROPERTIES (
'type'='hms',
'hive.metastore.uris' = 'thrift://127.0.0.1:9383',
'warehouse' = 's3://bucket/gcs/paimon_warehouse',
'gs.access_key' = '<ak>',
'gs.secret_key' = '<sk>',
'gs.endpoint' = 'storage.googleapis.com'
);
CREATE CATALOG test_hive_on_hms_minio_catalog PROPERTIES (
'type' = 'hms',
'hive.metastore.uris' = 'thrift://127.0.0.1:9383',
's3.endpoint' = 'http://127.0.0.1:19000',
's3.access_key' = '<ak>',
's3.secret_key' = '<sk>',
's3.path.style.access' = 'true'
);
AWS Glue
Version 3.1+
- S3
AWS Glue and the S3 storage service share the same authentication credentials.
CREATE CATALOG hive_glue_on_s3_catalog PROPERTIES (
'type' = 'hms',
'hive.metastore.type' = 'glue',
'glue.region' = 'ap-east-1',
'glue.endpoint' = 'https://glue.ap-east-1.amazonaws.com',
'glue.access_key' = '<ak>',
'glue.secret_key' = '<sk>'
);
If the Glue service credentials differ from the S3 credentials, you can specify the S3 credentials separately as follows:
CREATE CATALOG hive_glue_on_s3_catalog PROPERTIES (
'type' = 'hms',
'hive.metastore.type' = 'glue',
'glue.region' = 'ap-east-1',
'glue.endpoint' = 'https://glue.ap-east-1.amazonaws.com',
'glue.access_key' = '<ak>',
'glue.secret_key' = '<sk>',
's3.region' = 'ap-east-1',
's3.endpoint' = 'https://s3.ap-east-1.amazonaws.com/',
's3.access_key' = '<ak>',
's3.secret_key' = '<sk>'
);
Obtain S3 access credentials using an IAM Assumed Role (3.1.2+)
CREATE CATALOG `glue_hive_iamrole` PROPERTIES (
'type' = 'hms',
'hive.metastore.type' = 'glue',
'glue.region' = 'us-east-1',
'glue.endpoint' = 'https://glue.us-east-1.amazonaws.com',
'glue.role_arn' = '<role_arn>'
);
Version 2.1 & 3.0
- S3
AWS Glue and S3 storage services share the same authentication credentials.
In non-EC2 environments, you need to use [aws configure](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html) to configure Credentials information and generate a credentials file in the ~/.aws directory.
Alternatively, you can explicitly specify the credentials provider factory class when creating the catalog:
`"aws.catalog.credentials.provider.factory.class" = "com.amazonaws.glue.catalog.credentials.ConfigurationAWSCredentialsProviderFactory"`
```sql
create catalog hive_glue PROPERTIES(
'type' = 'hms',
'hive.metastore.type' = 'glue',
'glue.endpoint' = 'https://glue.ap-northeast-1.amazonaws.com',
'glue.region' = 'ap-northeast-1',
'glue.access_key' = '<ak>',
'glue.secret_key' = '<sk>'
);
```
Aliyun DLF
Version 3.1+
- DLF 1.0
CREATE CATALOG hive_dlf_oss_test_catalog PROPERTIES (
'type' = 'hms',
'hive.metastore.type' = 'dlf',
'dlf.uid' = '203225413946383283',
'dlf.catalog_id' = 'p2_regression_case',
'dlf.endpoint' = 'dlf.cn-beijing.aliyuncs.com',
'dlf.region' = 'cn-beijing',
'dlf.access_key' = '<ak>',
'dlf.secret_key' = '<sk>'
);
Version 2.1 & 3.0
- DLF 1.0
CREATE CATALOG hive_dlf_oss_test_catalog PROPERTIES (
'type' = 'hms',
'hive.metastore.type' = 'dlf',
'dlf.uid' = '203225413946383283',
'dlf.catalog.id' = 'p2_regression_case',
'dlf.endpoint' = 'dlf.cn-beijing.aliyuncs.com',
'dlf.region' = 'cn-beijing',
'dlf.access_key' = '<ak>',
'dlf.secret_key' = '<sk>'
);
Query Operations
Basic Query
After configuring the catalog, you can query table data within the catalog as follows:
-- 1. Switch to the catalog, use a database, and run a query
SWITCH hive_ctl;
USE hive_db;
SELECT * FROM hive_tbl LIMIT 10;
-- 2. Use the Hive database directly
USE hive_ctl.hive_db;
SELECT * FROM hive_tbl LIMIT 10;
-- 3. Query using a fully qualified name
SELECT * FROM hive_ctl.hive_db.hive_tbl LIMIT 10;
Querying Hive Partitions
You can query Hive partition information using the following two methods:
* Using the `SHOW PARTITIONS` Statement

  `SHOW PARTITIONS FROM [catalog.][db.]hive_table` lists all partitions and their values for the specified Hive table.
SHOW PARTITIONS FROM hive_table;
+-----------------------------+
| Partition                   |
+-----------------------------+
| pt1=2024-10-10/pt2=beijing  |
| pt1=2024-10-10/pt2=shanghai |
| pt1=2024-10-11/pt2=beijing  |
| pt1=2024-10-11/pt2=shanghai |
| pt1=2024-10-12/pt2=nanjing  |
+-----------------------------+
* Using the `table$partitions` Metadata Table
Starting from versions 2.1.7 and 3.0.3, you can query Hive partition information through the `table$partitions` metadata table. This table is essentially relational, with each partition column represented as a column, allowing it to be used in any SELECT statement.
```sql
SELECT * FROM hive_table$partitions;
+------------+-------------+
| pt1 | pt2 |
+------------+-------------+
| 2024-10-10 | beijing |
| 2024-10-10 | shanghai |
| 2024-10-12 | nanjing |
| 2024-10-11 | beijing |
| 2024-10-11 | shanghai |
+------------+-------------+
```
Querying Hive Transactional Tables
Hive Transactional tables support ACID semantics. For more details, see Hive Transactions.
- Support for Hive Transactional Tables

| Table Type | Operations Supported in Hive | Hive Table Properties | Supported Hive Versions |
|---|---|---|---|
| Full-ACID Transactional Table | Insert, Update, Delete | `'transactional'='true'` | 4.x, 3.x, 2.x (2.x requires running a Major Compaction in Hive before reading) |
| Insert-Only Transactional Table | Insert only | `'transactional'='true', 'transactional_properties'='insert_only'` | 4.x, 3.x, 2.x (specify `hive.version` when creating the catalog) |

- Current Limitations

  Original Files scenarios are not supported. When a table is converted to a transactional table, new data files use the Hive transactional table schema, but existing data files are not converted. These files are referred to as Original Files.
Querying Hive Views
You can query Hive Views, but there are some limitations:
- The Hive View definition (HiveQL) must be compatible with SQL statements supported by Doris; otherwise, a parsing error occurs.
- Some functions supported by HiveQL may share names with Doris functions but behave differently, which can cause discrepancies between results from Hive and Doris. If you encounter such issues, please report them to the community.
Write Operations
Data can be written to Hive tables using the INSERT statement. This is supported for Hive tables created by Doris or existing Hive tables with compatible formats.
For partitioned tables, data will automatically be written to the corresponding partition or a new partition will be created based on the data. Currently, specifying a partition for writing is not supported.
INSERT INTO
The INSERT operation appends data to the target table. Specifying a partition for writing is currently not supported.
INSERT INTO hive_tbl VALUES (val1, val2, val3, val4);
INSERT INTO hive_ctl.hive_db.hive_tbl SELECT col1, col2 FROM internal.db1.tbl1;
INSERT INTO hive_tbl(col1, col2) VALUES (val1, val2);
INSERT INTO hive_tbl(col1, col2, partition_col1, partition_col2) VALUES (1, 2, "beijing", "2023-12-12");
INSERT OVERWRITE
INSERT OVERWRITE completely replaces the existing data in the table with new data. Specifying a partition for writing is currently not supported.
INSERT OVERWRITE TABLE hive_tbl VALUES (val1, val2, val3, val4);
INSERT OVERWRITE TABLE hive_ctl.hive_db.hive_tbl(col1, col2) SELECT col1, col2 FROM internal.db1.tbl1;
The semantics of INSERT OVERWRITE are consistent with Hive, with the following behaviors:

- If the target table is partitioned and the source table is empty, the operation has no effect; the target table remains unchanged.
- If the target table is non-partitioned and the source table is empty, the target table is cleared.
- Since specifying a partition for writing is not supported, INSERT OVERWRITE automatically determines the affected partitions in the target table from the source data. If the target table is partitioned, only the affected partitions are overwritten; unaffected partitions remain unchanged.
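As an illustration of the partition-overwrite behavior (a sketch; the `sales` table and source data are hypothetical), overwriting with source rows that touch only one partition rewrites just that partition:

```sql
-- Suppose sales is partitioned by pt and currently holds pt='2024-01-01'
-- and pt='2024-01-02'. The source query below produces only pt='2024-01-01',
-- so only that partition is overwritten; pt='2024-01-02' is left unchanged.
INSERT OVERWRITE TABLE hive_ctl.hive_db.sales
SELECT id, amount, '2024-01-01' AS pt FROM internal.db1.new_sales;
```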
CTAS
You can create a Hive table and insert data using the CTAS (CREATE TABLE AS SELECT) statement:
CREATE TABLE hive_ctas ENGINE=hive AS SELECT * FROM other_table;
CTAS supports specifying file formats, partitioning methods, and more, as shown below:
CREATE TABLE hive_ctas ENGINE=hive
PARTITION BY LIST (pt1, pt2) ()
AS SELECT col1, pt1, pt2 FROM part_ctas_src WHERE col1 > 0;
CREATE TABLE hive_ctl.hive_db.hive_ctas (col1, col2, pt1) ENGINE=hive
PARTITION BY LIST (pt1) ()
PROPERTIES (
"file_format"="parquet",
"compression"="zstd"
)
AS SELECT col1, pt1 AS col2, pt2 AS pt1 FROM test_ctas.part_ctas_src WHERE col1 > 0;
Related Parameters
- Session variables

| Parameter name | Default value | Description | Since version |
|---|---|---|---|
| hive_parquet_use_column_names | true | When reading Parquet files of a Hive table, Doris by default locates each column in the Parquet file by the column name defined in the Hive table. When this variable is false, Doris reads data from the Parquet file by the column order in the Hive table, regardless of column names, similar to the `parquet.column.index.access` setting in Hive. This parameter applies only to top-level column names and has no effect inside a Struct. | 2.1.6+, 3.0.3+ |
| hive_orc_use_column_names | true | Similar to hive_parquet_use_column_names, but for Hive tables in ORC format. Similar to the `orc.force.positional.evolution` setting in Hive. | 2.1.6+, 3.0.3+ |
- BE Configuration

| Parameter name | Default value | Description |
|---|---|---|
| hive_sink_max_file_size | 1GB | Maximum data file size. When the written data exceeds this limit, the current file is closed and a new file is created for further writing. |
| table_sink_partition_write_max_partition_nums_per_writer | 128 | Maximum number of partitions each instance can write to on a BE node. |
| table_sink_non_partition_write_scaling_data_processed_threshold | 25MB | Data volume threshold for starting scaling-write on non-partitioned tables. A new writer (instance) is used for every additional `table_sink_non_partition_write_scaling_data_processed_threshold` of data. This mechanism adjusts the number of writers (instances) based on data volume to improve concurrent write throughput, while saving resources and minimizing the number of files for small data volumes. |
| table_sink_partition_write_min_data_processed_rebalance_threshold | 25MB | Minimum data volume threshold to trigger rebalancing for partitioned tables. Rebalancing is triggered when `current accumulated data volume - data volume at the last rebalancing (or initial accumulation) >= table_sink_partition_write_min_data_processed_rebalance_threshold`. If the final file sizes vary significantly, lower this threshold to improve balance; a lower threshold, however, increases rebalancing cost and may affect performance. |
| table_sink_partition_write_min_partition_data_processed_rebalance_threshold | 15MB | Minimum partition data volume threshold to trigger rebalancing. Rebalancing occurs when `current partition data volume >= threshold * number of tasks already allocated to the partition`. If the final file sizes vary significantly, lower this threshold to improve balance; a lower threshold, however, increases rebalancing cost and may affect performance. |
Database and Table Management
Users can create and drop databases and tables in the Hive Metastore through Doris. Note that Doris only calls the Hive Metastore API for these operations; it does not store or persist any Hive metadata itself.
Creating and Dropping Databases
You can switch to the target catalog using the SWITCH statement and then execute the CREATE DATABASE statement:
SWITCH hive_ctl;
CREATE DATABASE [IF NOT EXISTS] hive_db;
You can also create a database using a fully qualified name or specify a location, such as:
CREATE DATABASE [IF NOT EXISTS] hive_ctl.hive_db;
CREATE DATABASE [IF NOT EXISTS] hive_ctl.hive_db
PROPERTIES ('location'='hdfs://172.21.16.47:4007/path/to/db/');
You can view the location information of the Database using the SHOW CREATE DATABASE command:
mysql> SHOW CREATE DATABASE hive_db;
+----------+---------------------------------------------------------------------------------------------+
| Database | Create Database |
+----------+---------------------------------------------------------------------------------------------+
| hive_db | CREATE DATABASE hive_db LOCATION 'hdfs://172.21.16.47:4007/usr/hive/warehouse/hive_db.db' |
+----------+---------------------------------------------------------------------------------------------+
To drop a database:
DROP DATABASE [IF EXISTS] hive_ctl.hive_db;
For a Hive Database, you must first delete all tables under that Database before you can delete the Database itself; otherwise, an error will occur. This operation will also delete the corresponding Database in Hive.
Creating and Dropping Tables
- Creating Tables

  Doris supports creating both partitioned and non-partitioned tables in Hive.
-- Create a non-partitioned Hive table
CREATE TABLE unpartitioned_table (
col1 BOOLEAN COMMENT 'col1',
col2 INT COMMENT 'col2',
col3 BIGINT COMMENT 'col3',
col4 CHAR(10) COMMENT 'col4',
col5 FLOAT COMMENT 'col5',
col6 DOUBLE COMMENT 'col6',
col7 DECIMAL(9,4) COMMENT 'col7',
col8 VARCHAR(11) COMMENT 'col8',
col9 STRING COMMENT 'col9'
) ENGINE=hive
PROPERTIES (
'file_format'='parquet'
);
-- Create a partitioned Hive table
-- Partition columns must be included in the table's column definition list
CREATE TABLE partition_table (
col1 BOOLEAN COMMENT 'col1',
col2 INT COMMENT 'col2',
col3 BIGINT COMMENT 'col3',
col4 DECIMAL(2,1) COMMENT 'col4',
pt1 VARCHAR COMMENT 'pt1',
pt2 VARCHAR COMMENT 'pt2'
) ENGINE=hive
PARTITION BY LIST (pt1, pt2) ()
PROPERTIES (
'file_format'='orc',
'compression'='zlib'
);
-- Create a Text-format table (supported since 2.1.7 & 3.0.3)
CREATE TABLE text_table (
id INT,
name STRING
) PROPERTIES (
'file_format'='text',
'compression'='gzip',
'field.delim'='\t',
'line.delim'='\n',
'collection.delim'=';',
'mapkey.delim'=':',
'serialization.null.format'='\N',
'escape.delim'='\'
);
After creating a table, you can view the Hive table creation statement using the `SHOW CREATE TABLE` command.
Note that unlike Hive's table creation syntax, when creating a partitioned table in Doris, partition columns must be included in the table schema. Additionally, partition columns must be placed at the end of the schema and maintain the same order.
:::tip
For Hive clusters where ACID transaction features are enabled by default, tables created by Doris will have the `transactional` property set to `true`. Since Doris only supports certain features of Hive transactional tables, this may lead to issues where Doris cannot read the Hive tables it creates. To avoid this, explicitly set `"transactional" = "false"` in the table properties to create non-transactional Hive tables:
```sql
CREATE TABLE non_acid_table(
`col1` BOOLEAN COMMENT 'col1',
`col2` INT COMMENT 'col2',
`col3` BIGINT COMMENT 'col3'
) ENGINE=hive
PROPERTIES (
'transactional'='false'
);
```
:::
- Dropping Tables

  You can delete a Hive table using the `DROP TABLE` statement. When a table is deleted, all of its data, including partition data, is removed as well.
- Column Type Mapping

  Refer to the [Column Type Mapping] section for details. Note the following restrictions:

  - Columns must use the default nullable type; `NOT NULL` is not supported.
  - Hive 3.0 supports setting default values. To set default values, explicitly add `"hive.version" = "3.0.0"` to the catalog properties.
  - If an inserted value is incompatible with the column type (for example, inserting `'abc'` into a numeric column), the value is converted to `null`.
- Partitioning

  Hive partition types correspond to List partitions in Doris. Therefore, when creating a Hive partitioned table in Doris, use the List partition syntax, but there is no need to explicitly enumerate each partition: Doris automatically creates the corresponding Hive partitions based on the data values during insertion. Both single-column and multi-column partitioned tables are supported.
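For instance, inserting into the partitioned table from the earlier example (the values are illustrative) creates the matching Hive partitions on the fly:

```sql
-- Each distinct (pt1, pt2) value pair becomes a Hive partition on write
INSERT INTO partition_table VALUES
    (true, 1, 100, 1.5, 'beijing', '2024-10-10'),
    (false, 2, 200, 2.5, 'shanghai', '2024-10-10');
```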
- File Formats

  - ORC (default)

  - Parquet

    Note that when the DATETIME type is written to a Parquet file, the physical type used is INT96 rather than INT64, for compatibility with Hive versions prior to 4.0.

  - Text (supported since versions 2.1.7 and 3.0.3)

    The Text format supports the following table properties:

    - `field.delim`: Column delimiter. Default is `\1`.
    - `line.delim`: Line delimiter. Default is `\n`.
    - `collection.delim`: Delimiter for elements in complex types. Default is `\2`.
    - `mapkey.delim`: Delimiter for map key-value pairs. Default is `\3`.
    - `serialization.null.format`: Format for storing `NULL` values. Default is `\N`.
    - `escape.delim`: Escape character. Default is `\`.
- Compression Formats

  - Parquet: snappy (default), zstd, plain (no compression)
  - ORC: snappy, zlib (default), zstd, plain (no compression)
  - Text: gzip, deflate, bzip2, zstd, lz4, lzo, snappy, plain (default, no compression)

- Storage Medium

  - HDFS
  - Object Storage
Subscribing to Hive Metastore Events
By having the FE nodes periodically read Notification Events from the HMS, Doris can detect real-time changes in Hive table metadata, improving metadata timeliness. Currently, the following events are supported:
| Event | Action and Corresponding Behavior |
|---|---|
| CREATE DATABASE | Creates a database in the corresponding data directory. |
| DROP DATABASE | Deletes a database in the corresponding data directory. |
| ALTER DATABASE | Mainly affects changes to database properties, comments, and default storage locations. These changes do not affect Doris's ability to query external data directories, so this event is currently ignored. |
| CREATE TABLE | Creates a table in the corresponding database. |
| DROP TABLE | Deletes a table in the corresponding database and invalidates the table cache. |
| ALTER TABLE | If renamed, deletes the old table and creates a new one with the new name; otherwise, invalidates the table cache. |
| ADD PARTITION | Adds a partition to the cached partition list of the corresponding table. |
| DROP PARTITION | Removes a partition from the cached partition list and invalidates the partition cache. |
| ALTER PARTITION | If renamed, deletes the old partition and creates a new one with the new name; otherwise, invalidates the partition cache. |
- When a data import changes files, partitioned tables trigger an `ALTER PARTITION` event, while non-partitioned tables trigger an `ALTER TABLE` event.
- If you bypass HMS and manipulate the file system directly, HMS does not generate the corresponding events, and Doris cannot detect the metadata changes.
The following parameters in fe.conf are related to this feature:

- `enable_hms_events_incremental_sync`: Enables automatic incremental metadata synchronization. Disabled by default.
- `hms_events_polling_interval_ms`: Interval for reading events; default is 10000 milliseconds.
- `hms_events_batch_size_per_rpc`: Maximum number of events to read per RPC; default is 500.
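As a reference, a fe.conf fragment that enables the subscription, with the defaults written out explicitly (the values shown are the documented defaults), could look like:

```
enable_hms_events_incremental_sync = true
hms_events_polling_interval_ms = 10000
hms_events_batch_size_per_rpc = 500
```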
To use this feature (excluding Huawei Cloud MRS), you need to modify the hive-site.xml of HMS and restart both HMS and HiveServer2.
<property>
<name>hive.metastore.event.db.notification.api.auth</name>
<value>false</value>
</property>
<property>
<name>hive.metastore.dml.events</name>
<value>true</value>
</property>
<property>
<name>hive.metastore.transactional.event.listeners</name>
<value>org.apache.hive.hcatalog.listener.DbNotificationListener</value>
</property>
For Huawei Cloud MRS, you need to modify the hivemetastore-site.xml and restart both HMS and HiveServer2.
<property>
<name>metastore.transactional.event.listeners</name>
<value>org.apache.hive.hcatalog.listener.DbNotificationListener</value>
</property>
Appendix
Transaction Mechanism
Write operations to Hive are placed in a separate transaction. Before the transaction is committed, the data is not visible externally; only after the transaction is committed do the related table operations become visible to others.
Transactions ensure the atomicity of operations: all operations within a transaction either succeed together or fail together.
Transactions cannot fully guarantee isolation of operations, but they strive to minimize the inconsistency window by separating file system operations from Hive Metastore metadata operations.
For example, in a transaction that requires modifying multiple partitions of a Hive table, if the task is divided into two batches, the first batch might be visible externally before the second batch is completed. This means the first batch of partitions can be read, but the second batch cannot.
If any exceptions occur during the transaction commit process, the transaction will be rolled back completely, including modifications to HDFS files and Hive Metastore metadata, without requiring any additional user intervention.
Concurrent Write Mechanism

Apache Doris currently supports concurrent writing through multiple INSERT statements. However, users need to ensure that concurrent writes do not result in potential conflicts.

Since regular non-transactional Hive tables lack a complete transaction mechanism, the Apache Doris transaction mechanism aims to minimize the inconsistency window but cannot guarantee true ACID properties. Therefore, concurrent writes to Hive tables in Apache Doris may lead to data consistency issues.
- Concurrent `INSERT` operations

  `INSERT` operations append data and do not conflict when executed concurrently, producing the expected results.

- Concurrent `INSERT OVERWRITE` operations

  Concurrent `INSERT OVERWRITE` operations on the same table or partition may lead to data loss or corruption, resulting in unpredictable outcomes.

  Common solutions include:

  - For partitioned tables, write data to different partitions. Concurrent operations on different partitions do not conflict.
  - For non-partitioned tables, use `INSERT` instead of `INSERT OVERWRITE` to avoid conflicts.
  - For operations that may conflict, ensure on the business side that only one write operation occurs at a time.
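As a sketch of the partition-based approach, the two statements below (catalog, table, and column names are hypothetical) overwrite different partitions and can therefore run concurrently without conflicting:

```sql
-- Session 1: rewrites only the 2024-01-01 partition
INSERT OVERWRITE TABLE hive_catalog.example_db.table2
SELECT * FROM source_tbl WHERE part_col = '2024-01-01';

-- Session 2: rewrites only the 2024-01-02 partition
INSERT OVERWRITE TABLE hive_catalog.example_db.table2
SELECT * FROM source_tbl WHERE part_col = '2024-01-02';
```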
HDFS File Operations

For Hive table data on HDFS, data is typically first written to a temporary directory and then finalized using file system operations such as rename. Below is a detailed explanation of the specific HDFS file operations for each type of data operation.

The temporary directory format is: `/tmp/.doris_staging/<username>/<uuid>`

The format of the written data file names is: `<query-id>_<uuid>-<index>.<compress-type>.<file-type>`
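To make the naming scheme concrete, here is a small Python sketch that parses a data file name of this form. The regex is a best-effort reading of the documented pattern (the compression suffix is treated as optional, since the ORC example below carries none); the real naming logic lives inside Doris and may differ in edge cases.

```python
import re

# <query-id>_<uuid>-<index>[.<compress-type>].<file-type>
# The query id itself contains a '-'; '_' separates it from the UUID.
FILE_NAME_RE = re.compile(
    r"^(?P<query_id>[0-9a-f]+-[0-9a-f]+)"   # e.g. b35fdbcea3a4e39-86d1f36987ef1492
    r"_(?P<uuid>[0-9a-f\-]{36})"            # e.g. 7e3985bf-9de9-4fc7-b84e-adf11aa08756
    r"-(?P<index>\d+)"                      # writer index, e.g. 0
    r"(?:\.(?P<compress>[a-z0-9]+))?"       # optional compression suffix, e.g. snappy
    r"\.(?P<file_type>[a-z]+)$"             # file format, e.g. orc, parquet
)

def parse_data_file_name(name: str) -> dict:
    """Split a Doris-written Hive data file name into its documented parts."""
    m = FILE_NAME_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized data file name: {name}")
    return m.groupdict()

info = parse_data_file_name(
    "b35fdbcea3a4e39-86d1f36987ef1492_7e3985bf-9de9-4fc7-b84e-adf11aa08756-0.orc"
)
print(info["query_id"])   # b35fdbcea3a4e39-86d1f36987ef1492
print(info["file_type"])  # orc
```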
Here are examples of file operations under various scenarios:
- Non-Partitioned Table

  - Append (Add Data)

    - Target table directory: `hdfs://ns/usr/hive/warehouse/example.db/table1`
    - Temporary file: `hdfs://ns/tmp/.doris_staging/root/f02247cb662846038baae272af5eeb05/b35fdbcea3a4e39-86d1f36987ef1492_7e3985bf-9de9-4fc7-b84e-adf11aa08756-0.orc`

    During the commit phase, all temporary files are moved to the target table directory.

  - Overwrite (Replace Data)

    - Target table directory: `hdfs://ns/usr/hive/warehouse/example.db/table1`
    - Temporary file: `hdfs://ns/tmp/.doris_staging/root/f02247cb662846038baae272af5eeb05/b35fdbcea3a4e39-86d1f36987ef1492_7e3985bf-9de9-4fc7-b84e-adf11aa08756-0.orc`

    Commit phase steps:

    1. Rename the target table directory to a temporary directory: `hdfs://ns/usr/hive/warehouse/example.db/_temp_b35fdbcea3a4e39-86d1f36987ef1492_table1`
    2. Rename the temporary directory to the target table directory.
    3. Delete the temporary target table directory.
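The overwrite commit above is essentially a rename-based swap. The following Python sketch mimics the three steps on a local filesystem; the directory names and the `txn_id` parameter are illustrative only, and Doris itself performs the equivalent steps with HDFS renames rather than `pathlib` calls.

```python
import shutil
import tempfile
from pathlib import Path

def overwrite_commit(target: Path, staging: Path, txn_id: str) -> None:
    """Three-step overwrite commit: set aside, promote, clean up."""
    backup = target.parent / f"_temp_{txn_id}_{target.name}"
    target.rename(backup)    # 1. rename the target directory to a temporary directory
    staging.rename(target)   # 2. rename the staging directory to the target directory
    shutil.rmtree(backup)    # 3. delete the temporary (old) target directory

# Demo on a throwaway directory tree.
root = Path(tempfile.mkdtemp())
old = root / "table1"; old.mkdir(); (old / "old.orc").write_text("old")
new = root / "staging"; new.mkdir(); (new / "new.orc").write_text("new")
overwrite_commit(old, new, "txn123")
print(sorted(p.name for p in (root / "table1").iterdir()))  # ['new.orc']
```

Swapping whole directories this way keeps the window during which the table is missing or mixed as small as possible, which matches the isolation caveat described in the transaction mechanism above.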
- Partitioned Table

  - Add (Write to a New Partition)

    - Target table directory: `hdfs://ns/usr/hive/warehouse/example.db/table2/part_col=2024-01-01`
    - Temporary file: `hdfs://ns/tmp/.doris_staging/root/a7eac7505d7a42fdb06cb9ef1ea3e912/par1=a/d678a74d232345e0-b659e2fb58e86ffd_549ad677-ee75-4fa1-b8a6-3e821e1dae61-0.orc`

    During the commit phase, the temporary directory is renamed to the target partition directory.

  - Append (Add Data to an Existing Partition)

    - Target table directory: `hdfs://ns/usr/hive/warehouse/example.db/table2/part_col=2024-01-01`
    - Temporary file: `hdfs://ns/tmp/.doris_staging/root/a7eac7505d7a42fdb06cb9ef1ea3e912/par1=a/d678a74d232345e0-b659e2fb58e86ffd_549ad677-ee75-4fa1-b8a6-3e821e1dae61-0.orc`

    During the commit phase, files from the temporary directory are moved to the target partition directory.

  - Overwrite (Replace an Existing Partition)

    - Target table directory: `hdfs://ns/usr/hive/warehouse/example.db/table2/part_col=2024-01-01`
    - Temporary file: `hdfs://ns/tmp/.doris_staging/root/a7eac7505d7a42fdb06cb9ef1ea3e912/par1=a/d678a74d232345e0-b659e2fb58e86ffd_549ad677-ee75-4fa1-b8a6-3e821e1dae61-0.orc`

    Commit phase steps:

    1. Rename the target partition directory to a temporary partition directory: `hdfs://ns/usr/hive/warehouse/example.db/table2/_temp_d678a74d232345e0-b659e2fb58e86ffd_part_col=2024-01-01`
    2. Rename the temporary partition directory to the target partition directory.
    3. Delete the temporary target partition directory.
Changelog
| Doris Version | Feature Support |
|---|---|
| 2.1.6 | Support for writing back to Hive tables |
| 3.0.4 | Support for Hive tables in JsonSerDe format. Support for transactional tables in Hive4. |