Version: 26.x

Analyzing Hugging Face Data

Hugging Face is a popular centralized platform where users can store, share, and collaborate on building machine learning models, datasets, and other resources. Hugging Face Dataset repositories may contain data files in various formats such as CSV, Parquet, and JSONL.

Doris can directly access and analyze data from Hugging Face datasets using SQL through the HTTP Table Valued Function.

note

This feature is supported starting from version 4.0.2.

Features

| Feature | Description |
| --- | --- |
| Access Protocol | Access Hugging Face Datasets via the HTTP protocol |
| Type Inference | Supports automatic type inference |
| Supported File Formats | CSV, JSON, Parquet, ORC |
| Data Operations | Supports CREATE TABLE AS SELECT and INSERT INTO ... SELECT |

The parameters are identical to those of the File Table Valued Function.

URI Syntax

The URI format for accessing Hugging Face datasets is as follows:

hf://datasets/<owner>/<repo>[@<branch>]/<path>
| Component | Description | Required |
| --- | --- | --- |
| owner | Dataset owner | Yes |
| repo | Dataset repository name | Yes |
| branch | Branch name, defaults to main | No |
| path | File path, supports wildcards | Yes |
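As a rough illustration of how such a URI decomposes, the sketch below expands an hf:// URI into Hugging Face's standard "resolve" download URL. How Doris maps URIs internally is an assumption here; only the resolve URL format itself is standard Hugging Face.

```python
def hf_uri_to_url(uri: str) -> str:
    """Expand an hf://datasets/<owner>/<repo>[@<branch>]/<path> URI into the
    Hugging Face resolve URL. Illustrative sketch, not Doris's implementation."""
    prefix = "hf://datasets/"
    assert uri.startswith(prefix), "expected an hf://datasets/ URI"
    rest = uri[len(prefix):]
    owner, repo_and_path = rest.split("/", 1)
    repo, _, path = repo_and_path.partition("/")
    if "@" in repo:
        repo, branch = repo.split("@", 1)  # explicit branch after '@'
    else:
        branch = "main"                    # branch defaults to main
    return f"https://huggingface.co/datasets/{owner}/{repo}/resolve/{branch}/{path}"

print(hf_uri_to_url("hf://datasets/stanfordnlp/imdb@main/plain_text/test.parquet"))
# https://huggingface.co/datasets/stanfordnlp/imdb/resolve/main/plain_text/test.parquet
```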

Wildcard Description:

| Wildcard | Description | Example |
| --- | --- | --- |
| * | Matches any characters within a single directory level | */*.parquet matches all Parquet files in first-level subdirectories |
| ** | Recursively matches across multiple directory levels | **/*.parquet matches Parquet files at all levels |
| [...] | Matches any single character in the character set | test-0000[0-9].parquet matches test-00000 to test-00009 |
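The matching rules above can be sketched as a small translator from wildcard patterns to regular expressions. This is a Python illustration of the semantics, not Doris's actual matcher:

```python
import re

def wildcard_to_regex(pattern: str) -> str:
    """Translate the documented wildcard syntax to a regex:
    '**' crosses directory levels, '*' stays within one level,
    and '[...]' is a character set passed through verbatim."""
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")       # '**': recursive, may cross '/'
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")    # '*': single directory level only
            i += 1
        elif pattern[i] == "[":
            j = pattern.index("]", i)
            out.append(pattern[i:j + 1])  # '[...]': character set
            i = j + 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return "^" + "".join(out) + "$"

def matches(pattern: str, path: str) -> bool:
    return re.match(wildcard_to_regex(pattern), path) is not None

print(matches("*/*.parquet", "plain_text/test.parquet"))        # True
print(matches("*/*.parquet", "a/b/test.parquet"))               # False: '*' stops at '/'
print(matches("**/*.parquet", "a/b/test.parquet"))              # True: '**' recurses
print(matches("test-0000[0-9].parquet", "test-00003.parquet"))  # True
```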

Use Cases

Case 1: Quick Data Query

Query public datasets on Hugging Face directly using SQL without downloading files.

Example: Query CSV data from the fka/awesome-chatgpt-prompts repository:

SELECT COUNT(*)
FROM HTTP(
    "uri" = "hf://datasets/fka/awesome-chatgpt-prompts/blob/main/prompts.csv",
    "format" = "csv"
);

Corresponding data file: https://huggingface.co/datasets/fka/awesome-chatgpt-prompts/blob/main/prompts.csv

Example: Query Parquet files from the stanfordnlp/imdb repository using wildcards to match multiple files:

SELECT *
FROM HTTP(
    "uri" = "hf://datasets/stanfordnlp/imdb@main/*/*.parquet",
    "format" = "parquet"
)
ORDER BY text
LIMIT 1;

Corresponding data file: https://huggingface.co/datasets/stanfordnlp/imdb/blob/main/plain_text/test-00000-of-00001.parquet

Case 2: Import Data to Local Tables

Import Hugging Face datasets into Doris tables for subsequent analysis.

Method 1: Use CREATE TABLE AS SELECT to create a new table and import data:

CREATE TABLE hf_table AS
SELECT *
FROM HTTP(
    "uri" = "hf://datasets/stanfordnlp/imdb@script/dataset_infos.json",
    "format" = "json"
);

Corresponding data file: https://huggingface.co/datasets/stanfordnlp/imdb/blob/script/dataset_infos.json

Method 2: Use INSERT INTO ... SELECT to insert data into an existing table:

INSERT INTO hf_table
SELECT *
FROM HTTP(
    "uri" = "hf://datasets/stanfordnlp/imdb@main/**/test-00000-of-0000[1].parquet",
    "format" = "parquet"
)
ORDER BY text
LIMIT 1;

Corresponding data file: https://huggingface.co/datasets/stanfordnlp/imdb/blob/main/plain_text/test-00000-of-00001.parquet

Case 3: Access Private Datasets

For datasets that require authorization, you must include token authentication in the request.

Steps:

  1. Log in to your Hugging Face account and obtain an Access Token (starts with hf_).
  2. Pass the Token through the http.header.Authorization property in SQL.

Example:

SELECT *
FROM HTTP(
    "uri" = "hf://datasets/gaia-benchmark/GAIA/blob/main/2023/validation/metadata.level1.parquet",
    "format" = "parquet",
    "http.header.Authorization" = "Bearer hf_MWYzOJJoZEymb..."
)
LIMIT 1\G
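For reference, the http.header.Authorization property corresponds to a standard HTTP Bearer header on the underlying request. A minimal Python sketch of the equivalent request (the token is a hypothetical placeholder, and the resolve URL is only illustrative):

```python
import urllib.request

# Hypothetical placeholder; real Hugging Face access tokens start with "hf_".
token = "hf_XXXXXXXX"

# Build (but do not send) the authenticated request Doris would issue.
req = urllib.request.Request(
    "https://huggingface.co/datasets/gaia-benchmark/GAIA/resolve/main/"
    "2023/validation/metadata.level1.parquet",
    headers={"Authorization": f"Bearer {token}"},
)
print(req.get_header("Authorization"))  # Bearer hf_XXXXXXXX
```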