Backup and Recovery
Currently, VeloDB Cloud does not support the automatic backup/restore function provided by Doris, but you can manually simulate backup and recovery through export and import. If the data is important, we recommend exporting it as a backup to a storage service compatible with the S3 protocol, and then using the import function to restore it.
Currently, only table-level and partition-level data export is supported, and the consistency of the exported data is not yet guaranteed.
Preparation
- Prepare AK and SK. VeloDB Cloud requires an AK and SK for authentication when accessing object storage.
- Prepare REGION and ENDPOINT. The REGION can be selected when creating a bucket or viewed in the bucket list.
Note: The object storage REGION must be the same as the REGION of VeloDB Cloud for the data to be imported.
For cloud storage systems compatible with the S3 protocol, the relevant information can be found in their respective documentation.
Data Export
Before exporting data, you need to back up the CREATE TABLE statement of the table where the data is located (this statement is required when importing). You can obtain it with the following command:
SHOW CREATE TABLE db.table
- db: database name
- table: name of the table to be exported
The table creation statement needs to be saved by the user.
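For example, assuming a database named example_db containing a table named user_visit (both are placeholder names), the command would be:
SHOW CREATE TABLE example_db.user_visit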
Data is exported to a storage service compatible with the S3 protocol through the EXPORT command. EXPORT is an asynchronous operation: the command submits an EXPORT JOB to VeloDB Cloud and returns immediately after the job is successfully submitted. You can then view the export progress through SHOW EXPORT.
The EXPORT command is used as follows (a filled-in example is shown after the parameter descriptions below):
EXPORT TABLE db.table
TO export_path
[opt_properties]
WITH S3
(
"AWS_ENDPOINT" = "AWS_ENDPOINT",
"AWS_ACCESS_KEY" = "AWS_ACCESS_KEY",
"AWS_SECRET_KEY"="AWS_SECRET_KEY",
"AWS_REGION" = "AWS_REGION"
)
- db: database name
- table: name of the table to be exported
- export_path: the export file path. It can be a directory, or a directory plus a file prefix; the latter is recommended, for example s3://bucket-name/dir/prefix_
- opt_properties: used to specify export parameters, with the syntax PROPERTIES ("key" = "value"). The following parameters can be specified:
  - label: the label of this export job. If not specified, the system generates a label randomly.
  - parallelism: the concurrency of the export job, default 1. The export job opens parallelism threads to perform the export. (If parallelism is greater than the number of tablets, it is automatically set to the number of tablets.)
  - timeout: the timeout of the export job, default 2 hours, in seconds.
- S3-related parameters are filled in according to their meaning. The S3 SDK uses the virtual-hosted style by default, but some object storage services may not enable or support virtual-hosted style access. In this case, add the use_path_style parameter to force path-style access, for example:
WITH S3
(
"AWS_ENDPOINT" = "AWS_ENDPOINT",
"AWS_ACCESS_KEY" = "AWS_ACCESS_KEY",
"AWS_SECRET_KEY" = "AWS_SECRET_KEY",
"AWS_REGION" = "AWS_REGION",
"use_path_style" = "true"
)
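Putting the pieces together, a complete export statement might look like the following sketch. The database and table names, bucket, endpoint, region, label, and credentials are all placeholders to be replaced with your own values:
EXPORT TABLE example_db.user_visit
TO "s3://bucket-name/dir/prefix_"
PROPERTIES
(
"label" = "export_user_visit_001",
"parallelism" = "2",
"timeout" = "7200"
)
WITH S3
(
"AWS_ENDPOINT" = "s3.us-east-1.amazonaws.com",
"AWS_ACCESS_KEY" = "your_access_key",
"AWS_SECRET_KEY" = "your_secret_key",
"AWS_REGION" = "us-east-1"
)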
To view the status of export jobs, refer to the SHOW-EXPORT command. To cancel an export job, use the CANCEL-EXPORT command.
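For example, to list recent export jobs in a database or cancel a job by its label (the database and label names are placeholders, and the exact filter syntax may vary slightly between versions):
SHOW EXPORT FROM example_db ORDER BY JobId DESC LIMIT 5
CANCEL EXPORT FROM example_db WHERE LABEL = "export_user_visit_001"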
Notes
- It is not recommended to export a large amount of data at one time. The maximum recommended export data volume for an export job is tens of GB. Excessive exports will result in more junk files and higher retry costs. If the table data volume is too large, it is recommended to export by partition (see below).
- If the export job fails, the generated files will not be deleted and need to be deleted manually by the user.
- The export job will scan data, occupy IO resources, and may affect the query latency of the system.
- The maximum number of partitions allowed to be exported by an export job is 2000. You can add the parameter maximum_number_of_export_partitions in fe.conf and restart FE to modify the configuration.
Data Import
Before data import, you need to build the target table first. There are several scenarios at this time:
- If the target table does not exist, you need to execute the table creation statement saved in the export phase first
- If the target table already exists, but the table schema is inconsistent with the exported data, you need to delete the table first and then use the saved table creation statement to create the table
- If the target table already exists and the table schema is consistent with the export, you need to clear the data in the table
TRUNCATE TABLE db.table
- db: database name
- table: name of the target table
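For example, with the placeholder names example_db and user_visit, the target table can be prepared as follows; pick the statement that matches your scenario, and run the saved CREATE TABLE statement where the table needs to be recreated:
-- schema matches the exported data: clear the existing data
TRUNCATE TABLE example_db.user_visit
-- schema differs: drop the table, then run the saved CREATE TABLE statement
DROP TABLE IF EXISTS example_db.user_visit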
After the table is built, you can import data from the storage service compatible with the S3 protocol into VeloDB Cloud. Data import is done through S3 import, which is used as follows:
LOAD LABEL [database.]label_name
(
DATA INFILE (file_path)
INTO TABLE table_name
)
WITH S3
(
"AWS_ENDPOINT" = "AWS_ENDPOINT",
"AWS_ACCESS_KEY" = "AWS_ACCESS_KEY",
"AWS_SECRET_KEY"="AWS_SECRET_KEY",
"AWS_REGION" = "AWS_REGION"
)
[opt_properties]
- label_name: each import must specify a unique label, which you can use later to check the progress of the job.
- file_path: the file path to be imported. Just append *.csv to the export path. Assuming the export path is s3://bucket-name/dir/prefix_, the import path is s3://bucket-name/dir/prefix_*.csv.
- opt_properties: specifies the relevant parameters for the import. Currently the following parameters are supported:
  - timeout: import timeout. Default is 4 hours. Unit is seconds.
  - exec_mem_limit: import memory limit. Default is 2GB. Unit is bytes.
  - load_parallelism: import concurrency, default 1. Increasing the concurrency starts multiple execution plans to execute the import task at the same time, speeding up the import.
  - send_batch_parallelism: the parallelism for sending batch data. If this value exceeds max_send_batch_parallelism_per_job in the BE configuration, the coordinating BE uses max_send_batch_parallelism_per_job instead.
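Continuing the export example above, a filled-in import statement might look like the following sketch. The label, table name, path, endpoint, region, and credentials are placeholders:
LOAD LABEL example_db.load_user_visit_001
(
DATA INFILE ("s3://bucket-name/dir/prefix_*.csv")
INTO TABLE user_visit
)
WITH S3
(
"AWS_ENDPOINT" = "s3.us-east-1.amazonaws.com",
"AWS_ACCESS_KEY" = "your_access_key",
"AWS_SECRET_KEY" = "your_secret_key",
"AWS_REGION" = "us-east-1"
)
PROPERTIES
(
"timeout" = "14400",
"load_parallelism" = "2"
)
The progress of the job can then be checked with SHOW LOAD, filtering by the label.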
Partition-level import and export
The import and export methods introduced above do not support resuming from a breakpoint, so if the table contains a very large amount of data, the cost of retrying after a mid-way failure is very high. If data is always imported into the latest partitions, you can use partition-level import and export to implement incremental import and export.
Only a slight modification to the SQL for exporting data at the table level is needed to support data export at the partition level:
EXPORT TABLE db.table
PARTITION (partition)
TO export_path
[opt_properties]
WITH S3
(
"AWS_ENDPOINT" = "AWS_ENDPOINT",
"AWS_ACCESS_KEY" = "AWS_ACCESS_KEY",
"AWS_SECRET_KEY"="AWS_SECRET_KEY",
"AWS_REGION" = "AWS_REGION"
)
- partition: the partition to be exported.
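For example, to export only partition p1 of the hypothetical example_db.user_visit table into a per-partition directory (consistent with the directory layout suggested below; endpoint, region, and credentials are placeholders):
EXPORT TABLE example_db.user_visit
PARTITION (p1)
TO "s3://bucket-name/db/user_visit/p1/data_"
WITH S3
(
"AWS_ENDPOINT" = "s3.us-east-1.amazonaws.com",
"AWS_ACCESS_KEY" = "your_access_key",
"AWS_SECRET_KEY" = "your_secret_key",
"AWS_REGION" = "us-east-1"
)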
Similar to exporting, partition-level data import also requires slightly modifying the SQL for table-level data import:
LOAD LABEL [database.]label_name
(
DATA INFILE (file_path)
INTO TABLE table_name
PARTITION (partition)
)
WITH S3
(
"AWS_ENDPOINT" = "AWS_ENDPOINT",
"AWS_ACCESS_KEY" = "AWS_ACCESS_KEY",
"AWS_SECRET_KEY"="AWS_SECRET_KEY",
"AWS_REGION" = "AWS_REGION"
)
[opt_properties]
- partition: the partition into which the data will be imported. Data outside the partition range is ignored.
Note: There is no direct mapping between the exported partition and the imported partition, but we recommend that the table partition range be kept consistent during the data import and export process.
Since the import process matches files according to wildcards, if you need to support import by partition, you need to organize the data into different directories by partition. For example, for a table named user_visit that needs to be exported, the exported data can be organized according to the following rules:
bucket/
  db/
    user_visit/
      p1/
        data_xxxx_0.csv
        data_xxxx_1.csv
      p2/
        data_yyyy_0.csv
        data_yyyy_1.csv
where p1 and p2 are partition names. If you now need to export the data of p3, the corresponding export path is s3://bucket-name/db/user_visit/p3/data_, and the corresponding partition import path is s3://bucket-name/db/user_visit/p3/data_*.csv.
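Continuing this example, a sketch of importing the exported p3 data back into partition p3 (the label, database name, endpoint, region, and credentials are placeholders):
LOAD LABEL example_db.load_user_visit_p3
(
DATA INFILE ("s3://bucket-name/db/user_visit/p3/data_*.csv")
INTO TABLE user_visit
PARTITION (p3)
)
WITH S3
(
"AWS_ENDPOINT" = "s3.us-east-1.amazonaws.com",
"AWS_ACCESS_KEY" = "your_access_key",
"AWS_SECRET_KEY" = "your_secret_key",
"AWS_REGION" = "us-east-1"
)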