S3 Import
Currently, VeloDB Cloud supports importing data from object storage in the same region.
Applicable scenarios
- The source data is in a storage system that supports the S3 protocol, such as AWS S3, Alibaba Cloud OSS, etc.
- The amount of data ranges from tens to hundreds of GB.
Preparation
- Obtain the AK (Access Key) and SK (Secret Key) of the object storage. VeloDB Cloud needs the AK and SK for authentication when accessing the object storage.
- Obtain the REGION and ENDPOINT of the object storage. The REGION is chosen when the bucket is created and can also be viewed in the bucket list.
Note: The object storage must be in the same REGION as VeloDB Cloud so that data is imported over the internal network.
For other S3-compatible cloud storage systems, the relevant connection information can be found in their own documentation.
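For example, for a bucket hosted on AWS S3 in the us-east-1 region, the connection information passed to the WITH S3 clause of the import statement might look like the following sketch (the endpoint, region, and credentials below are placeholders; substitute the values of your own bucket):
"AWS_ENDPOINT" = "s3.us-east-1.amazonaws.com",
"AWS_REGION" = "us-east-1",
"AWS_ACCESS_KEY" = "your_access_key",
"AWS_SECRET_KEY" = "your_secret_key"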
Start data import
Syntax
LOAD LABEL load_label
(
data_desc1[, data_desc2, ...]
)
WITH S3
(
"AWS_ENDPOINT" = "AWS_ENDPOINT",
"AWS_ACCESS_KEY" = "AWS_ACCESS_KEY",
"AWS_SECRET_KEY"="AWS_SECRET_KEY",
"AWS_REGION" = "AWS_REGION"
)
PROPERTIES
(
"key1" = "value1",
...
);
- load_label
Each import needs to specify a unique Label. You can later use this Label to view the progress of the job (see the progress-check example in the Example section below).
[database.]label_name
- data_desc1
Used to describe a set of files that need to be imported.
[MERGE|APPEND|DELETE]
DATA INFILE
(
"file_path1"[, file_path2, ...]
)
[NEGATIVE]
INTO TABLE `table_name`
[PARTITION (p1, p2, ...)]
[COLUMNS TERMINATED BY "column_separator"]
[FORMAT AS "file_type"]
[(column_list)]
[COLUMNS FROM PATH AS (c1, c2, ...)]
[PRECEDING FILTER predicate]
[SET (column_mapping)]
[WHERE predicate]
[DELETE ON expr]
[ORDER BY source_sequence]
[PROPERTIES ("key1"="value1", ...)]
- [MERGE|APPEND|DELETE]
Data merge type. The default is APPEND, which means this import is an ordinary append operation. The MERGE and DELETE types are only available for Unique Key model tables. The MERGE type must be used together with the [DELETE ON] statement to mark the Delete Flag column, while the DELETE type means that all data imported in this job is to be deleted (see the MERGE example in the Example section below).
- DATA INFILE
Specifies the file paths to import. Multiple paths can be given and wildcards are supported. A path must ultimately match at least one file; the import fails if it only matches a directory.
- NEGATIVE
This keyword indicates that this import is a batch of "negative" data. It only applies to Aggregate Key model tables with integer columns of the SUM aggregation type. The import negates the values of the SUM aggregation columns in the imported data, and is mainly used to offset erroneous data imported earlier.
- PARTITION(p1, p2, ...)
Specifies that only certain partitions of the table are imported. Data that does not fall within the partition range is ignored.
- COLUMNS TERMINATED BY
Specifies the column separator. Only valid for the CSV format, and only single-byte separators can be specified.
- FORMAT AS
Specifies the file type. CSV, PARQUET and ORC formats are supported. The default is CSV.
- column_list
Specifies the column order in the source file, for example:
(k1, k2, tmpk1)
- COLUMNS FROM PATH AS
Specifies the columns to extract from the import file path (see the partitioned-path example in the Example section below).
- PRECEDING FILTER predicate
Pre-filter condition. The data is first spliced into raw data rows in order according to column_list and COLUMNS FROM PATH AS, and then filtered according to the pre-filter condition.
- SET (column_mapping)
Specifies conversion functions for the columns.
- WHERE predicate
Filters the imported data according to the condition.
- DELETE ON expr
Must be used together with the MERGE import type, and only applies to Unique Key model tables. It specifies the column and the expression representing the Delete Flag in the imported data.
- ORDER BY
Only for Unique Key model tables. Specifies the column in the imported data that represents the Sequence Column. It is mainly used to preserve data ordering during import.
- PROPERTIES ("key1"="value1", ...)
Specifies parameters related to the import format. For example, if the imported file is in json format, parameters such as json_root, jsonpaths and fuzzy_parse can be specified here.
- PROPERTIES
Specifies parameters related to the import. The following parameters are currently supported:
- timeout
Import timeout, in seconds. The default is 4 hours (14400 seconds).
- max_filter_ratio
The maximum ratio of data that may be filtered out (for reasons such as irregular data) and still be tolerated. The default is zero tolerance. The value ranges from 0 to 1.
- exec_mem_limit
Import memory limit, in bytes. The default is 2 GB.
- strict_mode
Whether to apply strict restrictions to the data. The default is false.
- timezone
Specifies the time zone used by time-zone-sensitive functions such as strftime/alignment_timestamp/from_unixtime. If not specified, the "Asia/Shanghai" time zone is used.
- load_parallelism
Import concurrency. The default is 1. Increasing the import concurrency starts multiple execution plans to run the import task at the same time, speeding up the import.
- send_batch_parallelism
Sets the parallelism for sending batch data. If the parallelism value exceeds max_send_batch_parallelism_per_job in the BE configuration, the BE acting as the coordination point uses the value of max_send_batch_parallelism_per_job instead.
- load_to_single_tablet
Boolean type. If true, one task can only import data into a single tablet of the corresponding partition. The default value is false, in which case the number of tasks for the job depends on the overall concurrency. This parameter can only be set when importing data into an OLAP table with random bucketing.
Example
LOAD LABEL example_db.example_label_1
(
DATA INFILE("s3://your_bucket_name/your_file.txt")
INTO TABLE load_test
COLUMNS TERMINATED BY ","
)
WITH S3
(
"AWS_ENDPOINT" = "AWS_ENDPOINT",
"AWS_ACCESS_KEY" = "AWS_ACCESS_KEY",
"AWS_SECRET_KEY"="AWS_SECRET_KEY",
"AWS_REGION" = "AWS_REGION"
)
PROPERTIES
(
"timeout" = "3600"
);
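After an import is submitted, you can use its Label to check the progress of the job, for example for the label used above:
SHOW LOAD WHERE LABEL = "example_label_1";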
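The following sketch illustrates the MERGE type together with DELETE ON on a Unique Key model table. The table name load_test_unique and the columns k1, k2, v1 and delete_flag are placeholders for illustration; rows whose delete_flag evaluates to 1 are treated as deletes, the rest as upserts:
LOAD LABEL example_db.example_label_2
(
MERGE DATA INFILE("s3://your_bucket_name/your_file.txt")
INTO TABLE load_test_unique
COLUMNS TERMINATED BY ","
(k1, k2, v1, delete_flag)
DELETE ON delete_flag = 1
)
WITH S3
(
"AWS_ENDPOINT" = "AWS_ENDPOINT",
"AWS_ACCESS_KEY" = "AWS_ACCESS_KEY",
"AWS_SECRET_KEY"="AWS_SECRET_KEY",
"AWS_REGION" = "AWS_REGION"
);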
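The next sketch shows how COLUMNS FROM PATH AS, SET and WHERE can be combined: the city column is extracted from the partitioned file path, tmp_k1 is a temporary column mapped to id through SET, and rows are filtered with WHERE. The bucket path, the table name load_test_city and all column names are hypothetical:
LOAD LABEL example_db.example_label_3
(
DATA INFILE("s3://your_bucket_name/input/city=beijing/*")
INTO TABLE load_test_city
COLUMNS TERMINATED BY ","
FORMAT AS "csv"
(tmp_k1, v1)
COLUMNS FROM PATH AS (city)
SET (id = tmp_k1 + 1)
WHERE id > 100
)
WITH S3
(
"AWS_ENDPOINT" = "AWS_ENDPOINT",
"AWS_ACCESS_KEY" = "AWS_ACCESS_KEY",
"AWS_SECRET_KEY"="AWS_SECRET_KEY",
"AWS_REGION" = "AWS_REGION"
);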
Frequently asked questions
The S3 SDK uses virtual-hosted style access by default. However, some object storage systems may not have virtual-hosted style access enabled or may not support it. In this case, you can add the use_path_style parameter to force the use of path-style access:
WITH S3
(
"AWS_ENDPOINT" = "AWS_ENDPOINT",
"AWS_ACCESS_KEY" = "AWS_ACCESS_KEY",
"AWS_SECRET_KEY"="AWS_SECRET_KEY",
"AWS_REGION" = "AWS_REGION",
"use_path_style" = "true"
)