メインコンテンツまでスキップ
バージョン: 4.x

新しいオープンデータ管理アーキテクチャとして、Data Lakehouseはデータウェアハウスの高性能とリアルタイム機能を、データレイクの低コストと柔軟性と統合し、ユーザーが様々なデータ処理と分析ニーズをより便利に満たすことを支援します。企業のビッグデータシステムにおいて、ますます適用されるようになっています。

最近のバージョンでは、Apache Dorisはデータレイクとの統合を深め、成熟したData Lakehouseソリューションを進化させています。

  • バージョン0.15以降、Apache DorisはHiveとIceberg外部Tableを導入し、データレイク向けのApache Icebergとの組み合わせ機能を探求しています。
  • バージョン1.2以降、Apache Dorisは正式にMulti-Catalog機能を導入し、様々なデータソースの自動メタデータマッピングとデータアクセスを可能にし、外部データの読み取りとクエリ実行に対する多数のパフォーマンス最適化を提供しています。現在では、高速でユーザーフレンドリーなLakehouseアーキテクチャを構築する能力を完全に備えています。
  • バージョン2.1では、Apache DorisのData Lakehouseアーキテクチャが大幅に強化され、主流のデータレイク形式(Hudi、Iceberg、Paimonなど)の読み書き機能を改善し、複数のSQL方言との互換性を導入し、既存システムからApache Dorisへのシームレスな移行を実現しました。データサイエンスと大規模データ読み取りシナリオに対して、DorisはArrow Flight高速読み取りインターフェースを統合し、データ転送効率を100倍向上させることを実現しました。

Building lakehouse using doris and lakesoul

Apache Doris & LakeSoul

LakeSoulは、DMetaSoulによって開発され、2023年5月にLinux Foundation AI & Data Foundationに寄贈されたクラウドネイティブlakehouseフレームワークです。メタデータ管理での高いスケーラビリティ、ACIDトランザクション、効率的で柔軟なupsert操作、スキーマ進化、バッチストリーム統合処理を特徴としています。

Apache Dorisの高性能クエリエンジンとLakeSoulの効率的なデータ管理により、ユーザーは以下を実現できます:

  • レイクへのリアルタイムデータ取り込み:LakeSoulのアーキテクチャを活用して、データを高効率かつ低レイテンシでレイクに取り込むことができ、集約、重複排除、部分列更新を含む様々なデータ更新機能をサポートします。
  • 高性能データ処理と分析:LakeSoulのバッチストリーム統合処理やスキーマ進化などの機能は、Dorisの強力なクエリエンジンとシームレスに統合でき、レイクデータに対する高速なクエリと分析レスポンスを実現します。

将来的に、Apache DorisはCDCストリーム同期や自動スキーマ進化など、LakeSoulのより高度な機能を段階的にサポートし、統一された高性能なリアルタイムlakehouseプラットフォームを共同で構築していきます。

本記事では、Apache Doris + LakeSoulのテストおよびデモンストレーション環境を迅速にセットアップする方法を説明し、様々な機能の使用方法を実演して、lakehouseアーキテクチャにおける両システムの統合と利点を紹介します。

詳細については、LakeSoul カタログを参照してください。

ユーザーガイド

本記事で言及されるすべてのスクリプトとコードは、以下のアドレスから取得できます:https://github.com/apache/doris/tree/master/samples/datalake/lakesoul

01 環境準備

本記事ではDocker Composeを使用してデプロイメントを行い、以下のコンポーネントとバージョンを使用します:

ComponentVersion
Apache DorisDefault 3.0.2
LakeSoul2.6.1
Postgres14.5
Apache Spark3.3.1
Apache Flink1.17
MinIORELEASE.2024-04-29T09-56-05Z

02 環境デプロイメント

  1. すべてのコンポーネントを開始

    bash ./start_all.sh
  2. 起動後、以下のスクリプトを使用してDorisコマンドラインにログインできます:

    -- login doris
    bash ./start_doris_client.sh

03 データクエリ

以下に示すように、Dorisクラスターにlakesoulという名前のCatalogが作成されています(SHOW CATALOGSを使用して確認できます)。以下がこのCatalogの作成文です:

  -- Already created
CREATE CATALOG `lakesoul` PROPERTIES (
'type'='lakesoul',
'lakesoul.pg.username'='lakesoul_test',
'lakesoul.pg.password'='lakesoul_test',
'lakesoul.pg.url'='jdbc:postgresql://lakesoul-meta-pg:5432/lakesoul_test?stringtype=unspecified',
'minio.endpoint'='http://minio:9000',
'minio.access_key'='admin',
'minio.secret_key'='password'
);

LakeSoul Table lakesoul.tpch.customer は既にDorisに読み込まれています。Dorisでデータをクエリしてください。

  • 簡単なクエリ

    Doris> use `lakesoul`.`tpch`;
    Database changed

    Doris> show tables;
    +---------------------+
    | Tables_in_tpch |
    +---------------------+
    | customer_from_spark |
    +---------------------+
    1 row in set (0.00 sec)

    Doris> select * from customer_from_spark where c_nationkey = 1 order by c_custkey limit 4;
    +-----------+--------------------+-----------------------------------------+-------------+-----------------+-----------+--------------+--------------------------------------------------------------------------------------------------------+
    | c_custkey | c_name | c_address | c_nationkey | c_phone | c_acctbal | c_mktsegment | c_comment |
    +-----------+--------------------+-----------------------------------------+-------------+-----------------+-----------+--------------+--------------------------------------------------------------------------------------------------------+
    | 3 | Customer#000000003 | MG9kdTD2WBHm | 1 | 11-719-748-3364 | 7498.12 | AUTOMOBILE | deposits eat slyly ironic, even instructions. express foxes detect slyly. blithely even accounts abov |
    | 14 | Customer#000000014 | KXkletMlL2JQEA | 1 | 11-845-129-3851 | 5266.30 | FURNITURE | , ironic packages across the unus |
    | 30 | Customer#000000030 | nJDsELGAavU63Jl0c5NKsKfL8rIJQQkQnYL2QJY | 1 | 11-764-165-5076 | 9321.01 | BUILDING | lithely final requests. furiously unusual account |
    | 59 | Customer#000000059 | zLOCP0wh92OtBihgspOGl4 | 1 | 11-355-584-3112 | 3458.60 | MACHINERY | ously final packages haggle blithely after the express deposits. furiou |
    +-----------+--------------------+-----------------------------------------+-------------+-----------------+-----------+--------------+--------------------------------------------------------------------------------------------------------+
    4 rows in set (3.14 sec)

    Doris> select * from customer_from_spark where c_nationkey = 1 order by c_custkey desc limit 4;
    +-----------+--------------------+-----------------------------------------+-------------+-----------------+-----------+--------------+-------------------------------------------------------------------------------------------------+
    | c_custkey | c_name | c_address | c_nationkey | c_phone | c_acctbal | c_mktsegment | c_comment |
    +-----------+--------------------+-----------------------------------------+-------------+-----------------+-----------+--------------+-------------------------------------------------------------------------------------------------+
    | 14983 | Customer#000014983 | ERN3vq5Fvt4DL | 1 | 11-424-279-1846 | 841.22 | AUTOMOBILE | furiously slyly special foxes. express theodolites cajole carefully. special dinos haggle pinto |
    | 14968 | Customer#000014968 | ,sykKTZBzVFl7ito1750v2TRYwmkRl2nvqGHwmx | 1 | 11-669-222-9657 | 6106.77 | HOUSEHOLD | ts above the furiously even deposits haggle across |
    | 14961 | Customer#000014961 | JEIORcsBp6RpLYH 9gNdDyWJ | 1 | 11-490-251-5554 | 4006.35 | HOUSEHOLD | quests detect carefully final platelets! quickly final frays haggle slyly blithely final acc |
    | 14940 | Customer#000014940 | bNoyCxPuqSwPLjbqjEUNGN d0mSP | 1 | 11-242-677-1085 | 8829.48 | HOUSEHOLD | ver the quickly express braids. regular dependencies haggle fluffily quickly i |
    +-----------+--------------------+-----------------------------------------+-------------+-----------------+-----------+--------------+-------------------------------------------------------------------------------------------------+
    4 rows in set (0.10 sec)
  • パーティション pruning DorisはLakeSoulに対してpartition pruningを実行でき、ネイティブ読み取りを通じてクエリプロセスを高速化できます。これはexplain verboseを通じて確認できます。

    Doris> explain verbose select * from customer_from_spark where c_nationkey < 3;
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | Explain String(Old Planner) |
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    | PLAN FRAGMENT 0 |
    | OUTPUT EXPRS: |
    | `lakesoul`.`tpch`.`customer_from_spark`.`c_custkey` |
    | `lakesoul`.`tpch`.`customer_from_spark`.`c_name` |
    | `lakesoul`.`tpch`.`customer_from_spark`.`c_address` |
    | `lakesoul`.`tpch`.`customer_from_spark`.`c_nationkey` |
    | `lakesoul`.`tpch`.`customer_from_spark`.`c_phone` |
    | `lakesoul`.`tpch`.`customer_from_spark`.`c_acctbal` |
    | `lakesoul`.`tpch`.`customer_from_spark`.`c_mktsegment` |
    | `lakesoul`.`tpch`.`customer_from_spark`.`c_comment` |
    | PARTITION: UNPARTITIONED |
    | |
    | HAS_COLO_PLAN_NODE: false |
    | |
    | VRESULT SINK |
    | MYSQL_PROTOCAL |
    | |
    | 1:VEXCHANGE |
    | offset: 0 |
    | tuple ids: 0 |
    | |
    | PLAN FRAGMENT 1 |
    | |
    | PARTITION: RANDOM |
    | |
    | HAS_COLO_PLAN_NODE: false |
    | |
    | STREAM DATA SINK |
    | EXCHANGE ID: 01 |
    | UNPARTITIONED |
    | |
    | 0:VplanNodeName |
    | table: customer_from_spark |
    | predicates: (`c_nationkey` < 3) |
    | inputSplitNum=12, totalFileSize=0, scanRanges=12 |
    | partition=0/0 |
    | backends: |
    | 10002 |
    | s3://lakesoul-test-bucket/data/tpch/customer_from_spark/c_nationkey=1/part-00000-0568c817-d6bc-4fa1-bb9e-b311069b131c_00000.c000.parquet start: 0 length: 0 |
    | s3://lakesoul-test-bucket/data/tpch/customer_from_spark/c_nationkey=1/part-00001-d99a8fe6-61ab-4285-94da-2f84f8746a8a_00001.c000.parquet start: 0 length: 0 |
    | s3://lakesoul-test-bucket/data/tpch/customer_from_spark/c_nationkey=1/part-00002-8a8e396f-685f-4b0f-87fa-e2a3fe5be87e_00002.c000.parquet start: 0 length: 0 |
    | ... other 8 files ... |
    | s3://lakesoul-test-bucket/data/tpch/customer_from_spark/c_nationkey=0/part-00003-d5b598cd-5bed-412c-a26f-bb4bc9c937bc_00003.c000.parquet start: 0 length: 0 |
    | numNodes=1 |
    | pushdown agg=NONE |
    | tuple ids: 0 |
    | |
    | Tuples: |
    | TupleDescriptor{id=0, tbl=customer_from_spark} |
    | SlotDescriptor{id=0, col=c_custkey, colUniqueId=0, type=int, nullable=false, isAutoIncrement=false, subColPath=null} |
    | SlotDescriptor{id=1, col=c_name, colUniqueId=1, type=text, nullable=true, isAutoIncrement=false, subColPath=null} |
    | SlotDescriptor{id=2, col=c_address, colUniqueId=2, type=text, nullable=true, isAutoIncrement=false, subColPath=null} |
    | SlotDescriptor{id=3, col=c_nationkey, colUniqueId=3, type=int, nullable=false, isAutoIncrement=false, subColPath=null} |
    | SlotDescriptor{id=4, col=c_phone, colUniqueId=4, type=text, nullable=true, isAutoIncrement=false, subColPath=null} |
    | SlotDescriptor{id=5, col=c_acctbal, colUniqueId=5, type=decimalv3(15,2), nullable=true, isAutoIncrement=false, subColPath=null} |
    | SlotDescriptor{id=6, col=c_mktsegment, colUniqueId=6, type=text, nullable=true, isAutoIncrement=false, subColPath=null} |
    | SlotDescriptor{id=7, col=c_comment, colUniqueId=7, type=text, nullable=true, isAutoIncrement=false, subColPath=null} |
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    57 rows in set (0.03 sec)

EXPLAIN VERBOSE文の結果を調べることで、述語条件c_nationkey < 3が最終的に1つのパーティション(partition=0/0)のみにヒットすることが確認できます。

04 CDC Table Support

Flink CDC JobをLaunchしてmysqlTableを同期します。mysqlTableはdocker composeのLaunch時にロードされます。

bash start_flink_cdc_job.sh
Start flink-cdc job...
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/flink/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/hadoop/share/hadoop/common/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver'. The driver is automatically registered via the SPI and manual loading of the driver class is generally unnecessary.
Job has been submitted with JobID d1b3641dcd1ad85c6b373d49b1867e68

Flink CDC Jobが起動されます。lakesoul catalogを再作成することで、doris clientで起動プロセスを確認できます。Flink CDC Jobが起動された後、doris clientで同期中のLakeSoul CDCTableを確認できます。

Doris> show tables;
+---------------------+
| Tables_in_tpch |
+---------------------+
| customer_from_spark |
+---------------------+
2 rows in set (0.00 sec)


Doris> drop catalog if exists lakesoul;
Query OK, 0 rows affected (0.00 sec)

Doris> create catalog `lakesoul` properties ('type'='lakesoul', 'lakesoul.pg.username'='lakesoul_test', 'lakesoul.pg.password'='lakesoul_test', 'lakesoul.pg.url'='jdbc:postgresql://lakesoul-meta-pg:5432/lakesoul_test?stringtype=unspecified', 'minio.endpoint'='http://minio:9000', 'minio.access_key'='admin', 'minio.secret_key'='password');
Query OK, 0 rows affected (0.01 sec)

Doris> show tables;
+---------------------+
| Tables_in_tpch |
+---------------------+
| customer |
| customer_from_spark |
+---------------------+
2 rows in set (0.00 sec)

Doris> select c_custkey, c_name, c_address, c_nationkey , c_phone, c_acctbal , c_mktsegment , c_comment from lakesoul.tpch.customer where c_custkey < 10;
+-----------+--------------------+---------------------------------------+-------------+-----------------+-----------+--------------+-------------------------------------------------------------------------------------------------------------------+
| c_custkey | c_name | c_address | c_nationkey | c_phone | c_acctbal | c_mktsegment | c_comment |
+-----------+--------------------+---------------------------------------+-------------+-----------------+-----------+--------------+-------------------------------------------------------------------------------------------------------------------+
| 1 | Customer#000000001 | IVhzIApeRb ot,c,E | 15 | 25-989-741-2988 | 711.56 | BUILDING | to the even, regular platelets. regular, ironic epitaphs nag e |
| 3 | Customer#000000003 | MG9kdTD2WBHm | 1 | 11-719-748-3364 | 7498.12 | AUTOMOBILE | deposits eat slyly ironic, even instructions. express foxes detect slyly. blithely even accounts abov |
| 7 | Customer#000000007 | TcGe5gaZNgVePxU5kRrvXBfkasDTea | 18 | 28-190-982-9759 | 9561.95 | AUTOMOBILE | ainst the ironic, express theodolites. express, even pinto beans among the exp |
| 8 | Customer#000000008 | I0B10bB0AymmC, 0PrRYBCP1yGJ8xcBPmWhl5 | 17 | 27-147-574-9335 | 6819.74 | BUILDING | among the slyly regular theodolites kindle blithely courts. carefully even theodolites haggle slyly along the ide |
| 2 | Customer#000000002 | XSTf4,NCwDVaWNe6tEgvwfmRchLXak | 13 | 23-768-687-3665 | 121.65 | AUTOMOBILE | l accounts. blithely ironic theodolites integrate boldly: caref |
| 4 | Customer#000000004 | XxVSJsLAGtn | 4 | 14-128-190-5944 | 2866.83 | MACHINERY | requests. final, regular ideas sleep final accou |
| 5 | Customer#000000005 | KvpyuHCplrB84WgAiGV6sYpZq7Tj | 3 | 13-750-942-6364 | 794.47 | HOUSEHOLD | n accounts will have to unwind. foxes cajole accor |
| 6 | Customer#000000006 | sKZz0CsnMD7mp4Xd0YrBvx,LREYKUWAh yVn | 20 | 30-114-968-4951 | 7638.57 | AUTOMOBILE | tions. even deposits boost according to the slyly bold packages. final accounts cajole requests. furious |
| 9 | Customer#000000009 | xKiAFTjUsCuxfeleNqefumTrjS | 8 | 18-338-906-3675 | 8324.07 | FURNITURE | r theodolites according to the requests wake thinly excuses: pending requests haggle furiousl |
+-----------+--------------------+---------------------------------------+-------------+-----------------+-----------+--------------+-------------------------------------------------------------------------------------------------------------------+
9 rows in set (1.09 sec)

mysql clientに入り、データの変更を試してください。

bash start_mysql_client.sh

mysql clientから行の更新を試してください。

mysql> update customer set c_acctbal=2211.26 where c_custkey=1;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1 Changed: 1 Warnings: 0

doris clientに戻り、データの変更を確認してください。

Doris> select c_custkey, c_name, c_address, c_nationkey , c_phone, c_acctbal , c_mktsegment , c_comment from lakesoul.tpch.customer where c_custkey < 10;
+-----------+--------------------+---------------------------------------+-------------+-----------------+-----------+--------------+-------------------------------------------------------------------------------------------------------------------+
| c_custkey | c_name | c_address | c_nationkey | c_phone | c_acctbal | c_mktsegment | c_comment |
+-----------+--------------------+---------------------------------------+-------------+-----------------+-----------+--------------+-------------------------------------------------------------------------------------------------------------------+
| 2 | Customer#000000002 | XSTf4,NCwDVaWNe6tEgvwfmRchLXak | 13 | 23-768-687-3665 | 121.65 | AUTOMOBILE | l accounts. blithely ironic theodolites integrate boldly: caref |
| 4 | Customer#000000004 | XxVSJsLAGtn | 4 | 14-128-190-5944 | 2866.83 | MACHINERY | requests. final, regular ideas sleep final accou |
| 5 | Customer#000000005 | KvpyuHCplrB84WgAiGV6sYpZq7Tj | 3 | 13-750-942-6364 | 794.47 | HOUSEHOLD | n accounts will have to unwind. foxes cajole accor |
| 6 | Customer#000000006 | sKZz0CsnMD7mp4Xd0YrBvx,LREYKUWAh yVn | 20 | 30-114-968-4951 | 7638.57 | AUTOMOBILE | tions. even deposits boost according to the slyly bold packages. final accounts cajole requests. furious |
| 9 | Customer#000000009 | xKiAFTjUsCuxfeleNqefumTrjS | 8 | 18-338-906-3675 | 8324.07 | FURNITURE | r theodolites according to the requests wake thinly excuses: pending requests haggle furiousl |
| 1 | Customer#000000001 | IVhzIApeRb ot,c,E | 15 | 25-989-741-2988 | 2211.26 | BUILDING | to the even, regular platelets. regular, ironic epitaphs nag e |
| 3 | Customer#000000003 | MG9kdTD2WBHm | 1 | 11-719-748-3364 | 7498.12 | AUTOMOBILE | deposits eat slyly ironic, even instructions. express foxes detect slyly. blithely even accounts abov |
| 7 | Customer#000000007 | TcGe5gaZNgVePxU5kRrvXBfkasDTea | 18 | 28-190-982-9759 | 9561.95 | AUTOMOBILE | ainst the ironic, express theodolites. express, even pinto beans among the exp |
| 8 | Customer#000000008 | I0B10bB0AymmC, 0PrRYBCP1yGJ8xcBPmWhl5 | 17 | 27-147-574-9335 | 6819.74 | BUILDING | among the slyly regular theodolites kindle blithely courts. carefully even theodolites haggle slyly along the ide |
+-----------+--------------------+---------------------------------------+-------------+-----------------+-----------+--------------+-------------------------------------------------------------------------------------------------------------------+
9 rows in set (0.11 sec)

mysql clientから行の削除を試してください。

mysql> delete from customer where c_custkey = 2;
Query OK, 1 row affected (0.01 sec)

doris clientに戻り、データの変更を確認します。

Doris> select c_custkey, c_name, c_address, c_nationkey , c_phone, c_acctbal , c_mktsegment , c_comment from lakesoul.tpch.customer where c_custkey < 10;
+-----------+--------------------+---------------------------------------+-------------+-----------------+-----------+--------------+-------------------------------------------------------------------------------------------------------------------+
| c_custkey | c_name | c_address | c_nationkey | c_phone | c_acctbal | c_mktsegment | c_comment |
+-----------+--------------------+---------------------------------------+-------------+-----------------+-----------+--------------+-------------------------------------------------------------------------------------------------------------------+
| 6 | Customer#000000006 | sKZz0CsnMD7mp4Xd0YrBvx,LREYKUWAh yVn | 20 | 30-114-968-4951 | 7638.57 | AUTOMOBILE | tions. even deposits boost according to the slyly bold packages. final accounts cajole requests. furious |
| 9 | Customer#000000009 | xKiAFTjUsCuxfeleNqefumTrjS | 8 | 18-338-906-3675 | 8324.07 | FURNITURE | r theodolites according to the requests wake thinly excuses: pending requests haggle furiousl |
| 1 | Customer#000000001 | IVhzIApeRb ot,c,E | 15 | 25-989-741-2988 | 2211.26 | BUILDING | to the even, regular platelets. regular, ironic epitaphs nag e |
| 3 | Customer#000000003 | MG9kdTD2WBHm | 1 | 11-719-748-3364 | 7498.12 | AUTOMOBILE | deposits eat slyly ironic, even instructions. express foxes detect slyly. blithely even accounts abov |
| 7 | Customer#000000007 | TcGe5gaZNgVePxU5kRrvXBfkasDTea | 18 | 28-190-982-9759 | 9561.95 | AUTOMOBILE | ainst the ironic, express theodolites. express, even pinto beans among the exp |
| 8 | Customer#000000008 | I0B10bB0AymmC, 0PrRYBCP1yGJ8xcBPmWhl5 | 17 | 27-147-574-9335 | 6819.74 | BUILDING | among the slyly regular theodolites kindle blithely courts. carefully even theodolites haggle slyly along the ide |
| 4 | Customer#000000004 | XxVSJsLAGtn | 4 | 14-128-190-5944 | 2866.83 | MACHINERY | requests. final, regular ideas sleep final accou |
| 5 | Customer#000000005 | KvpyuHCplrB84WgAiGV6sYpZq7Tj | 3 | 13-750-942-6364 | 794.47 | HOUSEHOLD | n accounts will have to unwind. foxes cajole accor |
+-----------+--------------------+---------------------------------------+-------------+-----------------+-----------+--------------+-------------------------------------------------------------------------------------------------------------------+
8 rows in set (0.11 sec)