Version: 4.x

Hive HLL UDF

Hive HLL UDF provides a set of UDFs for generating and operating on HLL values in Hive tables. These HLL values are identical to Doris HLL. Hive HLL can be imported into Doris through Spark HLL Load. For more about HLL, see Approximate Deduplication Using HLL.

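Features:

1. UDAF

· to_hll: an aggregate function that returns a Doris HLL column, similar to the to_bitmap function

· hll_union: an aggregate function that computes the union of each group and returns a Doris HLL column, similar to the bitmap_union function

2. UDF

· hll_cardinality: returns the number of distinct elements added to the HLL, similar to the bitmap_count function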

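Main purposes:

1. Reduce the time needed to import data into Doris by removing the dictionary-building and HLL pre-aggregation steps
2. Save Hive storage by compressing data with HLL, which greatly reduces storage cost compared with Bitmap statistics
3. Provide flexible HLL operations in Hive, including union and cardinality statistics; the resulting HLL can be imported directly into Doris

Note: HLL statistics are approximate, with an error rate of about 1% to 2%.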

Usage

Create a Hive table and insert test data

-- Create a test database, e.g., hive_test
use hive_test;

-- Create a Hive HLL table
CREATE TABLE IF NOT EXISTS `hive_hll_table`(
  `k1`   int       COMMENT '',
  `k2`   String    COMMENT '',
  `k3`   String    COMMENT '',
  `uuid` binary    COMMENT 'hll'
) comment  'comment';

-- Create a normal Hive table and insert test data
CREATE TABLE IF NOT EXISTS `hive_table`(
    `k1`   int       COMMENT '',
    `k2`   String    COMMENT '',
    `k3`   String    COMMENT '',
    `uuid` int       COMMENT ''
) comment  'comment';

insert into hive_table select 1, 'a', 'b', 12345;
insert into hive_table select 1, 'a', 'c', 12345;
insert into hive_table select 2, 'b', 'c', 23456;
insert into hive_table select 3, 'c', 'd', 34567;
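
Before building any HLL values, it can help to record the exact distinct counts from the raw table, so that the approximate results produced later have a baseline to compare against. The query below is a minimal sketch that uses only Hive's built-in count(distinct ...) on the hive_table defined above; grouping by k3 and the alias exact_cnt are illustrative choices, not part of the UDF package.

```sql
-- Exact distinct count of uuid per k3, computed with Hive built-ins only.
-- Serves as a baseline for the approximate HLL results.
select k3, count(distinct uuid) as exact_cnt
from hive_table
group by k3;
```

With the four sample rows, this returns 1 for b, 2 for c, and 1 for d (row order is not guaranteed).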

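Using the Hive HLL UDF:

The Hive HLL UDF must be used in Hive/Spark. First, compile the FE to obtain the hive-udf.jar file. Compilation preparation: if you have already compiled the ldb source code, you can compile the FE directly. Otherwise, you need to install thrift manually. For compilation and installation, see Setting Up Dev Env for FE - IntelliJ IDEA.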

# Clone the Doris source code
git clone https://github.com/apache/doris.git
cd doris
git submodule update --init --recursive

# Install thrift (skip if already installed)
# Enter the FE directory
cd fe

# Execute the Maven packaging command (all FE submodules will be packaged)
mvn package -Dmaven.test.skip=true
# Or package only the hive-udf module
mvn package -pl hive-udf -am -Dmaven.test.skip=true

# The packaged hive-udf.jar file will be generated in the target directory
# Upload the compiled hive-udf.jar file to HDFS, e.g., to the root directory
hdfs dfs -put hive-udf/target/hive-udf.jar /

Next, enter Hive and execute the following SQL statements:

-- Load the hive hll udf jar package, modify the hostname and port according to your actual situation
add jar hdfs://hostname:port/hive-udf.jar;

-- Create UDAF functions
create temporary function to_hll as 'org.apache.doris.udf.ToHllUDAF' USING JAR 'hdfs://hostname:port/hive-udf.jar';
create temporary function hll_union as 'org.apache.doris.udf.HllUnionUDAF' USING JAR 'hdfs://hostname:port/hive-udf.jar';


-- Create UDF functions
create temporary function hll_cardinality as 'org.apache.doris.udf.HllCardinalityUDF' USING JAR 'hdfs://hostname:port/hive-udf.jar';


-- Example: Use the to_hll UDAF to aggregate and generate HLL, and write it to the Hive HLL table
insert into hive_hll_table
select 
    k1,
    k2,
    k3,
    to_hll(uuid) as uuid
from 
    hive_table
group by 
    k1,
    k2,
    k3;

-- Example: Use hll_cardinality to calculate the number of elements in the HLL
select k1, k2, k3, hll_cardinality(uuid) from hive_hll_table;
+-----+-----+-----+------+
| k1  | k2  | k3  | _c3  |
+-----+-----+-----+------+
| 1   | a   | b   | 1    |
| 1   | a   | c   | 1    |
| 2   | b   | c   | 1    |
| 3   | c   | d   | 1    |
+-----+-----+-----+------+

-- Example: Use hll_union to calculate the union of groups, returning 3 rows
select k1, hll_union(uuid) from hive_hll_table group by k1;

-- Example: You can also merge first and then compute the cardinality
select k3, hll_cardinality(hll_union(uuid)) from hive_hll_table group by k3;
+-----+------+
| k3  | _c1  |
+-----+------+
| b   | 1    |
| c   | 2    |
| d   | 1    |
+-----+------+
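
The same two functions can also be combined without a GROUP BY to estimate the number of distinct values across the whole table. This is a sketch that assumes the temporary functions registered above are still available in the current Hive session; the aliases approx_cnt and exact_cnt are illustrative.

```sql
-- Merge every stored HLL value into one, then estimate its cardinality.
select hll_cardinality(hll_union(uuid)) as approx_cnt
from hive_hll_table;

-- Exact count over the raw table, for comparison with the estimate.
select count(distinct uuid) as exact_cnt
from hive_table;
```

The sample data contains three distinct uuid values, so the exact query returns 3. The HLL estimate is expected to be close to that; on larger data sets it may differ by roughly the error rate noted earlier.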

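Hive HLL UDF Description

Importing Hive HLL into Doris

Method 1: Catalog (Recommended)

Create a Hive table stored in TEXT format. For Binary columns, Hive stores the value as a base64-encoded string. You can then use a Hive Catalog together with the hll_from_base64 function to import the HLL data directly into Doris.

A complete example follows:

Create a Hive table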

CREATE TABLE IF NOT EXISTS `hive_hll_table`(
`k1`   int       COMMENT '',
`k2`   String    COMMENT '',
`k3`   String    COMMENT '',
`uuid` binary    COMMENT 'hll'
) stored as textfile;

-- then reuse the previous steps to insert data from a normal table into it using the to_hll function

Create a Doris catalog

CREATE CATALOG hive PROPERTIES (
    'type'='hms',
    'hive.metastore.uris' = 'thrift://127.0.0.1:9083'
);
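
Once the catalog exists, a quick check from the Doris client can confirm that the Hive table is visible and that the uuid column arrives as base64 text. The statements below are a sketch that assumes the catalog name hive and the database hive_test used in this example; the LIMIT value is arbitrary.

```sql
-- List catalogs and confirm that `hive` is present.
SHOW CATALOGS;

-- Switch to the Hive catalog and list its databases.
SWITCH hive;
SHOW DATABASES;

-- Peek at a few rows; uuid should appear as a base64-encoded string.
SELECT k1, k2, k3, uuid FROM hive.hive_test.hive_hll_table LIMIT 5;

-- Return to the built-in catalog before creating Doris tables.
SWITCH internal;
```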

Create a Doris internal table

CREATE TABLE IF NOT EXISTS `doris_test`.`doris_hll_table`(
    `k1`   int                   COMMENT '',
    `k2`   varchar(10)           COMMENT '',
    `k3`   varchar(10)           COMMENT '',
    `uuid` HLL  HLL_UNION  COMMENT 'hll'
)
AGGREGATE KEY(k1, k2, k3)
DISTRIBUTED BY HASH(`k1`) BUCKETS 1
PROPERTIES (
    "replication_allocation" = "tag.location.default: 1"
);

Import data from Hive into Doris

insert into doris_hll_table select k1, k2, k3, hll_from_base64(uuid) from hive.hive_test.hive_hll_table;

-- View the imported data, combining hll_to_base64 for decoding
select *, hll_to_base64(uuid) from doris_hll_table;
+------+------+------+------+---------------------+
| k1   | k2   | k3   | uuid | hll_to_base64(uuid) |
+------+------+------+------+---------------------+
|    1 | a    | b    | NULL | AQFw+a9MhpKhoQ==    |
|    1 | a    | c    | NULL | AQFw+a9MhpKhoQ==    |
|    2 | b    | c    | NULL | AQGyB7kbWBxh+A==    |
|    3 | c    | d    | NULL | AQFYbJB5VpNBhg==    |
+------+------+------+------+---------------------+

-- You can also use Doris's native HLL functions for statistics; the results are consistent with the earlier statistics in Hive
select k3, hll_cardinality(hll_union(uuid)) from doris_hll_table group by k3;
+------+----------------------------------+
| k3   | hll_cardinality(hll_union(uuid)) |
+------+----------------------------------+
| b    |                                1 |
| d    |                                1 |
| c    |                                2 |
+------+----------------------------------+

-- Querying the external table directly, i.e., the data before import, also verifies that the data is correct
select k3, hll_cardinality(hll_union(hll_from_base64(uuid))) from hive.hive_test.hive_hll_table group by k3;
+------+---------------------------------------------------+
| k3   | hll_cardinality(hll_union(hll_from_base64(uuid))) |
+------+---------------------------------------------------+
| d    |                                                 1 |
| b    |                                                 1 |
| c    |                                                 2 |
+------+---------------------------------------------------+
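
For a single overall estimate in Doris, the built-in aggregate hll_union_agg merges the HLL values and returns the cardinality in one step. The queries below are a sketch against the doris_hll_table from this example; the alias approx_cnt is illustrative.

```sql
-- Merge all HLL values in the table and return the estimated distinct count.
SELECT hll_union_agg(uuid) AS approx_cnt FROM doris_hll_table;

-- Two-step form using the functions shown above; it should give the same estimate.
SELECT hll_cardinality(hll_union(uuid)) AS approx_cnt FROM doris_hll_table;
```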

Method 2: Spark Load

For details, see: Spark Load -> Basic operation -> Creating Load (Example 3: when the upstream data source is hive binary type table)
