
HLL Approximate Deduplication

In real business scenarios, as the volume of business data grows, so does the pressure of deduplication. Once the data reaches a certain scale, the cost of exact deduplication becomes higher and higher. When some error is acceptable, an approximate algorithm is a very good way to achieve fast deduplication and reduce computing pressure. This article introduces HyperLogLog (HLL for short), an approximate deduplication algorithm provided by VeloDB.

HLL's strength is its excellent space complexity of O(m log log n) and a time complexity of O(n). The error of the computed result can be kept to around 1%-2%; the error depends on the size of the data set and the hash function used.

What is HyperLogLog

HyperLogLog is an upgraded version of the LogLog algorithm. Its purpose is to provide approximate distinct counting. Its mathematical basis is the Bernoulli trial.

Assume a fair coin: each flip comes up heads or tails with probability 50%. Flipping the coin until it comes up heads is recorded as one complete trial.

Now consider running this trial multiple times, say n times, which means heads has appeared n times. Let the number of flips in each trial be k: the first trial takes k1 flips, and so on, up to kn for the nth trial.

Among these n Bernoulli trials, there must be a maximum number of flips. For example, if the longest trial takes 12 flips before a head appears, that value is called k_max.

Two conclusions are easy to draw from these Bernoulli trials:

  • None of the n trials takes more than k_max flips.
  • At least one of the n trials takes exactly k_max flips.

Finally, combining this with the method of maximum likelihood estimation yields an approximate relationship between n and k_max: n = 2^k_max. In other words, by recording only k_max we can estimate how many elements there are in total, i.e. the cardinality.
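The estimator can be simulated directly. The sketch below (illustrative Python, not part of VeloDB) runs repeated coin-flip trials, records only k_max, and returns 2^k_max:

```python
import random

def bernoulli_trial(rng):
    """Flip a fair coin until heads; return the number of flips k."""
    k = 1
    while rng.random() < 0.5:  # tails: keep flipping
        k += 1
    return k

def estimate_n(n_trials, seed=42):
    """Record only k_max over n_trials trials and estimate n as 2**k_max."""
    rng = random.Random(seed)
    k_max = max(bernoulli_trial(rng) for _ in range(n_trials))
    return 2 ** k_max
```

For n_trials = 1000 the estimate is always a power of two, and a single run can be far from 1000 — exactly the error discussed next.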

Suppose the trial results are as follows:

  • Trial 1: it took 3 flips before a head appeared; k = 3, n = 1
  • Trial 2: it took 2 flips before a head appeared; k = 2, n = 2
  • Trial 3: it took 6 flips before a head appeared; k = 6, n = 3
  • Trial n: it took 12 flips before a head appeared; at this point we estimate n = 2^12

Take only the first three trials above: k_max = 6 while n = 3. Plugging into the estimation formula clearly gives 3 ≠ 2^6. That is, when the number of trials is small, the error of this estimation method is very large.

We call such a group of trials one round of estimation. If only one round is performed, the estimated error rate decreases as n grows large enough, but it is still not small enough.
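The idea behind (Hyper)LogLog is therefore to run many rounds and combine them: each element is hashed into one of many buckets, each bucket keeps its own k_max, and the bucket values are averaged before exponentiating. A rough Python simulation of this averaging effect (illustrative only, not VeloDB's implementation):

```python
import random

def round_k_max(n_trials, rng):
    """One round of n_trials Bernoulli trials; return the round's k_max."""
    k_max = 0
    for _ in range(n_trials):
        k = 1
        while rng.random() < 0.5:  # tails: keep flipping
            k += 1
        k_max = max(k_max, k)
    return k_max

def estimate(n_trials, m_rounds, seed=0):
    """Average k_max over m_rounds independent rounds, then take 2**average."""
    rng = random.Random(seed)
    avg = sum(round_k_max(n_trials, rng) for _ in range(m_rounds)) / m_rounds
    return 2 ** avg
```

With m_rounds = 1 the estimate is frequently off by a factor of two or more; averaging over many rounds clusters the estimates much more tightly. HyperLogLog further replaces the arithmetic mean with a harmonic mean plus a bias correction.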

HLL functions

HLL is an engineering implementation of the HyperLogLog algorithm. It stores the intermediate results of the HyperLogLog computation, can only be used as a value column type in a table, and continuously reduces the data volume through aggregation.

This speeds up queries, at the cost of an estimated result with an error of roughly 1%. The HLL column is generated from other columns or from fields in the loaded data; when loading, the hll_hash function specifies which column in the data is used to generate the HLL column. HLL is often used to replace count distinct, and combined with rollup to quickly compute UV in business scenarios.

HLL_UNION_AGG(hll)

This function is an aggregate function that calculates a cardinality estimate for all data that satisfies the criteria.

HLL_CARDINALITY(hll)

This function is used to calculate the cardinality estimate of a single HLL value.

HLL_HASH(column_name)

Generates the HLL column type, used for insert or load. For usage during loading, please refer to the relevant load instructions.
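The reason an HLL column can keep being aggregated is that two HLL sketches with the same register layout merge losslessly: the union is simply the per-register maximum. A minimal Python illustration of this property (the register count, hash choice, and constants here are assumptions for illustration, not VeloDB's actual implementation):

```python
import hashlib

P = 12        # 2**P registers (assumed size, for illustration)
M = 1 << P

def sketch(items):
    """Build an HLL-style register array from an iterable of items."""
    regs = [0] * M
    for item in items:
        # 64-bit hash of the item
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - P)                      # first P bits pick a register
        rest = h & ((1 << (64 - P)) - 1)         # remaining bits
        rank = (64 - P) - rest.bit_length() + 1  # position of leftmost 1-bit
        regs[idx] = max(regs[idx], rank)
    return regs

def merge(a, b):
    """HLL union (what HLL_UNION/HLL_UNION_AGG rely on): per-register max."""
    return [max(x, y) for x, y in zip(a, b)]

def cardinality(regs):
    """Estimate the distinct count from registers (HLL_CARDINALITY analogue)."""
    alpha = 0.7213 / (1 + 1.079 / M)             # standard bias correction
    return alpha * M * M / sum(2.0 ** -r for r in regs)
```

Merging the sketches of two overlapping sets yields exactly the registers of sketching their union directly, which is why partial aggregates can be rolled up without losing accuracy.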

How to use HLL

  1. To use HLL deduplication, set the target column type to HLL and the aggregate function to HLL_UNION in the table creation statement
  2. Columns of HLL type cannot be used as key columns
  3. Users do not need to specify the length or default value; the length is controlled internally by the system according to the degree of data aggregation

Create a table

CREATE TABLE test_hll(
    dt date,
    id int,
    name char(10),
    province char(10),
    os char(10),
    pv hll hll_union
)
AGGREGATE KEY (dt, id, name, province, os)
DISTRIBUTED BY HASH(id) BUCKETS 10
PROPERTIES(
    "replication_num" = "1",
    "in_memory" = "false"
);

Import data

  1. Stream load import

    curl --location-trusted -u root: -H "label:label_test_hll_load" \
        -H "column_separator:," \
        -H "columns:dt,id,name,province,os, pv=hll_hash(id)" -T test_hll.csv http://fe_IP:8030/api/demo/test_hll/_stream_load

    The sample data is as follows (test_hll.csv):

    2022-05-05,10001,test01,beijing,windows
    2022-05-05,10002,test01,beijing,linux
    2022-05-05,10003,test01,beijing,macos
    2022-05-05,10004,test01,hebei,windows
    2022-05-06,10001,test01,shanghai,windows
    2022-05-06,10002,test01,shanghai,linux
    2022-05-06,10003,test01,jiangsu,macos
    2022-05-06,10004,test01,shanxi,windows

    The import results are as follows

    # curl --location-trusted -u root: -H "label:label_test_hll_load"     -H "column_separator:,"     -H "columns:dt,id,name,province,os, pv=hll_hash(id)" -T test_hll.csv http://127.0.0.1:8030/api/demo/test_hll/_stream_load
    
    {
        "TxnId": 693,
        "Label": "label_test_hll_load",
        "TwoPhaseCommit": "false",
        "Status": "Success",
        "Message": "OK",
        "NumberTotalRows": 8,
        "NumberLoadedRows": 8,
        "NumberFilteredRows": 0,
        "NumberUnselectedRows": 0,
        "LoadBytes": 320,
        "LoadTimeMs": 23,
        "BeginTxnTimeMs": 0,
        "StreamLoadPutTimeMs": 1,
        "ReadDataTimeMs": 0,
        "WriteDataTimeMs": 9,
        "CommitAndPublishTimeMs": 11
    }
  2. Broker Load

LOAD LABEL demo.test_hlllabel
 (
    DATA INFILE("hdfs://hdfs_host:hdfs_port/user/doris_test_hll/data/input/file")
    INTO TABLE `test_hll`
    COLUMNS TERMINATED BY ","
    (dt,id,name,province,os)
    SET (
      pv = HLL_HASH(id)
    )
 );

Query data

The original values of an HLL column cannot be queried directly; they can only be queried through HLL aggregate functions.

  1. Find the total PV

    mysql> select HLL_UNION_AGG(pv) from test_hll;
    +---------------------+
    | hll_union_agg(`pv`) |
    +---------------------+
    |                   4 |
    +---------------------+
    1 row in set (0.00 sec)

    Equivalent to:

    mysql> SELECT COUNT(DISTINCT pv) FROM test_hll;
    +----------------------+
    | count(DISTINCT `pv`) |
    +----------------------+
    |                    4 |
    +----------------------+
    1 row in set (0.01 sec)
  2. Find the PV for each day

    mysql> select HLL_UNION_AGG(pv) from test_hll group by dt;
    +---------------------+
    | hll_union_agg(`pv`) |
    +---------------------+
    |                   4 |
    |                   4 |
    +---------------------+
    2 rows in set (0.01 sec)
