- Installation
- Documentation
- Overview
- Connect
- Data Import
- Overview
- CSV Files
- JSON Files
- Multiple Files
- Parquet Files
- Partitioning
- Appender
- INSERT Statements
- Client APIs
- Overview
- C
- Overview
- Startup
- Configuration
- Query
- Data Chunks
- Values
- Types
- Prepared Statements
- Appender
- Table Functions
- Replacement Scans
- API Reference
- C++
- CLI
- Go
- Java
- Julia
- Node.js
- Python
- Overview
- Data Ingestion
- Conversion between DuckDB and Python
- DB API
- Relational API
- Function API
- Types API
- Expression API
- Spark API
- API Reference
- Known Python Issues
- R
- Rust
- Swift
- Wasm
- ADBC
- ODBC
- Configuration
- SQL
- Introduction
- Statements
- Overview
- ANALYZE
- ALTER TABLE
- ALTER VIEW
- ATTACH/DETACH
- CALL
- CHECKPOINT
- COMMENT ON
- COPY
- CREATE INDEX
- CREATE MACRO
- CREATE SCHEMA
- CREATE SECRET
- CREATE SEQUENCE
- CREATE TABLE
- CREATE VIEW
- CREATE TYPE
- DELETE
- DESCRIBE
- DROP
- EXPORT/IMPORT DATABASE
- INSERT
- PIVOT
- Profiling
- SELECT
- SET/RESET
- SUMMARIZE
- Transaction Management
- UNPIVOT
- UPDATE
- USE
- VACUUM
- Query Syntax
- SELECT
- FROM & JOIN
- WHERE
- GROUP BY
- GROUPING SETS
- HAVING
- ORDER BY
- LIMIT and OFFSET
- SAMPLE
- Unnesting
- WITH
- WINDOW
- QUALIFY
- VALUES
- FILTER
- Set Operations
- Prepared Statements
- Data Types
- Overview
- Array
- Bitstring
- Blob
- Boolean
- Date
- Enum
- Interval
- List
- Literal Types
- Map
- NULL Values
- Numeric
- Struct
- Text
- Time
- Timestamp
- Time Zones
- Union
- Typecasting
- Expressions
- Overview
- CASE Statement
- Casting
- Collations
- Comparisons
- IN Operator
- Logical Operators
- Star Expression
- Subqueries
- Functions
- Overview
- Array Functions
- Bitstring Functions
- Blob Functions
- Date Format Functions
- Date Functions
- Date Part Functions
- Enum Functions
- Interval Functions
- Lambda Functions
- Nested Functions
- Numeric Functions
- Pattern Matching
- Regular Expressions
- Text Functions
- Time Functions
- Timestamp Functions
- Timestamp with Time Zone Functions
- Utility Functions
- Aggregate Functions
- Constraints
- Indexes
- Information Schema
- Metadata Functions
- Keywords and Identifiers
- Samples
- Window Functions
- PostgreSQL Compatibility
- Extensions
- Overview
- Official Extensions
- Working with Extensions
- Versioning of Extensions
- Arrow
- AutoComplete
- AWS
- Azure
- Excel
- Full Text Search
- httpfs (HTTP and S3)
- Iceberg
- ICU
- inet
- jemalloc
- JSON
- MySQL
- PostgreSQL
- Spatial
- SQLite
- Substrait
- TPC-DS
- TPC-H
- VSS
- Guides
- Overview
- Data Viewers
- Database Integration
- File Formats
- Overview
- CSV Import
- CSV Export
- Directly Reading Files
- Excel Import
- Excel Export
- JSON Import
- JSON Export
- Parquet Import
- Parquet Export
- Querying Parquet Files
- Network & Cloud Storage
- Overview
- HTTP Parquet Import
- S3 Parquet Import
- S3 Parquet Export
- S3 Iceberg Import
- S3 Express One
- GCS Import
- Cloudflare R2 Import
- DuckDB over HTTPS/S3
- Meta Queries
- Describe Table
- EXPLAIN: Inspect Query Plans
- EXPLAIN ANALYZE: Profile Queries
- List Tables
- Summarize
- DuckDB Environment
- ODBC
- Performance
- Overview
- Import
- Schema
- Indexing
- Environment
- File Formats
- How to Tune Workloads
- My Workload Is Slow
- Benchmarks
- Python
- Installation
- Executing SQL
- Jupyter Notebooks
- SQL on Pandas
- Import from Pandas
- Export to Pandas
- Import from Numpy
- Export to Numpy
- SQL on Arrow
- Import from Arrow
- Export to Arrow
- Relational API on Pandas
- Multiple Python Threads
- Integration with Ibis
- Integration with Polars
- Using fsspec Filesystems
- SQL Editors
- SQL Features
- Snippets
- Development
- DuckDB Repositories
- Testing
- Overview
- sqllogictest Introduction
- Writing Tests
- Debugging
- Result Verification
- Persistent Testing
- Loops
- Multiple Connections
- Catch
- Profiling
- Release Calendar
- Building
- Overview
- Build Instructions
- Build Configuration
- Building Extensions
- Supported Platforms
- Troubleshooting
- Benchmark Suite
- Internals
- Sitemap
- Why DuckDB
- Media
- FAQ
- Code of Conduct
- Live Demo
The httpfs
extension introduces support for the hf://
protocol to access data sets hosted in Hugging Face repositories.
See the announcement blog post for details.
Usage
Hugging Face repositories can be queried using the following URL pattern:
hf://datasets/⟨my_username⟩/⟨my_dataset⟩/⟨path_to_file⟩
For example, to read a CSV file, you can use the following query:
SELECT *
FROM 'hf://datasets/datasets-examples/doc-formats-csv-1/data.csv';
Where:
datasets-examples
is the name of the user/organizationdoc-formats-csv-1
is the name of the dataset repositorydata.csv
is the file path in the repository
The result of the query is:
kind | sound |
---|---|
dog | woof |
cat | meow |
pokemon | pika |
human | hello |
To read a JSONL file, you can run:
SELECT *
FROM 'hf://datasets/datasets-examples/doc-formats-jsonl-1/data.jsonl';
Finally, for reading a Parquet file, use the following query:
SELECT *
FROM 'hf://datasets/datasets-examples/doc-formats-parquet-1/data/train-00000-of-00001.parquet';
Each of these commands reads the data from the specified file format and displays it in a structured tabular format. Choose the appropriate command based on the file format you are working with.
Creating a local table
To avoid accessing the remote endpoint for every query, you can save the data in a DuckDB table by running a CREATE TABLE ... AS
command. For example:
CREATE TABLE data AS
SELECT *
FROM 'hf://datasets/datasets-examples/doc-formats-csv-1/data.csv';
Then, simply query the data
table as follows:
SELECT *
FROM data;
Multiple files
To query all files under a specific directory, you can use a glob pattern. For example:
SELECT count(*) AS count
FROM 'hf://datasets/cais/mmlu/astronomy/*.parquet';
count |
---|
173 |
By using glob patterns, you can efficiently handle large datasets and perform comprehensive queries across multiple files, simplifying your data inspections and processing tasks. Here, you can see how you can look for questions that contain the word “planet” in astronomy:
SELECT count(*) AS count
FROM 'hf://datasets/cais/mmlu/astronomy/*.parquet'
WHERE question LIKE '%planet%';
count |
---|
21 |
Versioning and revisions
In Hugging Face repositories, dataset versions or revisions are different dataset updates. Each version is a snapshot at a specific time, allowing you to track changes and improvements. In git terms, it can be understood as a branch or specific commit.
You can query different dataset versions/revisions by using the following URL:
hf://datasets/⟨my-username⟩/⟨my-dataset⟩@⟨my_branch⟩/⟨path_to_file⟩
For example:
SELECT *
FROM 'hf://datasets/datasets-examples/doc-formats-csv-1@~parquet/**/*.parquet';
kind | sound |
---|---|
dog | woof |
cat | meow |
pokemon | pika |
human | hello |
The previous query will read all parquet files under the ~parquet
revision. This is a special branch where Hugging Face automatically generates the Parquet files of every dataset to enable efficient scanning.
Authentication
Configure your Hugging Face Token in the DuckDB Secrets Manager to access private or gated datasets. First, visit Hugging Face Settings – Tokens to obtain your access token. Second, set it in your DuckDB session using DuckDB’s Secrets Manager. DuckDB supports two providers for managing secrets:
CONFIG
The user must pass all configuration information into the CREATE SECRET
statement. To create a secret using the CONFIG
provider, use the following command:
CREATE SECRET hf_token (
TYPE HUGGINGFACE,
TOKEN 'your_hf_token'
);
CREDENTIAL_CHAIN
Automatically tries to fetch credentials. For the Hugging Face token, it will try to get it from ~/.cache/huggingface/token
. To create a secret using the CREDENTIAL_CHAIN
provider, use the following command:
CREATE SECRET hf_token (
TYPE HUGGINGFACE,
PROVIDER CREDENTIAL_CHAIN
);