- Installation
- Guides
- Overview
- Data Import & Export
- CSV Import
- CSV Export
- Parquet Import
- Parquet Export
- Query Parquet
- HTTP Parquet Import
- S3 Parquet Import
- S3 Parquet Export
- JSON Import
- JSON Export
- Excel Import
- Excel Export
- SQLite Import
- PostgreSQL Import
- Meta Queries
- ODBC
- Python
- Install
- Execute SQL
- Jupyter Notebooks
- SQL on Pandas
- Import from Pandas
- Export to Pandas
- SQL on Arrow
- Import from Arrow
- Export to Arrow
- Relational API on Pandas
- Multiple Python Threads
- DuckDB with Ibis
- DuckDB with Polars
- DuckDB with Vaex
- DuckDB with DataFusion
- DuckDB with fsspec Filesystems
- SQL Features
- SQL Editors
- Data Viewers
- Documentation
- Connect
- Data Import
- Overview
- CSV Files
- JSON Files
- Multiple Files
- Parquet Files
- Partitioning
- Appender
- Insert Statements
- Client APIs
- Overview
- C
- Overview
- Startup
- Configure
- Query
- Data Chunks
- Values
- Types
- Prepared Statements
- Appender
- Table Functions
- Replacement Scans
- API Reference
- C++
- CLI
- Java
- Julia
- Node.js
- Python
- Overview
- Data Ingestion
- Result Conversion
- DB API
- Relational API
- Function API
- Types API
- Expression API
- Spark API
- API Reference
- Known Python Issues
- R
- Rust
- Scala
- Swift
- Wasm
- ADBC
- ODBC
- SQL
- Introduction
- Statements
- Overview
- Alter Table
- Alter View
- Attach/Detach
- Call
- Checkpoint
- Copy
- Create Macro
- Create Schema
- Create Sequence
- Create Table
- Create View
- Create Type
- Delete
- Drop
- Export
- Insert
- Pivot
- Select
- Set/Reset
- Unpivot
- Update
- Use
- Vacuum
- Query Syntax
- SELECT
- FROM & JOIN
- WHERE
- GROUP BY
- GROUPING SETS
- HAVING
- ORDER BY
- LIMIT
- SAMPLE
- UNNEST
- WITH
- WINDOW
- QUALIFY
- VALUES
- FILTER
- Set Operations
- Data Types
- Overview
- Bitstring
- Blob
- Boolean
- Date
- Enum
- Interval
- List
- Map
- NULL Values
- Numeric
- Struct
- Text
- Time
- Timestamp
- Time Zones
- Union
- Expressions
- Functions
- Overview
- Bitstring Functions
- Blob Functions
- Date Format Functions
- Date Functions
- Date Part Functions
- Enum Functions
- Interval Functions
- Nested Functions
- Numeric Functions
- Pattern Matching
- Text Functions
- Time Functions
- Timestamp Functions
- Timestamp with Time Zone Functions
- Utility Functions
- Aggregates
- Configuration
- Constraints
- Indexes
- Information Schema
- Metadata Functions
- Pragmas
- Rules for Case Sensitivity
- Samples
- Window Functions
- Extensions
- Sitemap
- Why DuckDB
- Media
- FAQ
- Code of Conduct
- Live Demo
Vaex is a high performance DataFrame library in Python. Vaex is a hybrid DataFrame, as it supports both Numpy's and Apache Arrow's data structures. Vaex DataFrames can export data as Apache Arrow Table, which can be directly used by DuckDB. Since DuckDB can output results as an Apache Arrow Table which can be easily turned into a Vaex DataFrame, one can easily alternate between DuckDB and Vaex.
The following example shows how one can use both DuckDB and Vaex DataFrame for a simple exploratory work.
Installation
pip install duckdb
pip install vaex
Vaex DataFrame to DuckDB
A Vaex DataFrame can be exported as an Arrow Table via the to_arrow_table()
method. This operation does not take extra memory if the data being exported is already in memory or memory-mapped. The exported Arrow Table can be queried directly via DuckDB.
Let's use the well known Titanic dataset that also ships with Vaex, to do some operations like filling missing values and creating new columns. Then we will export the DataFrame to an Arrow Table:
import duckdb
import vaex
df = vaex.datasets.titanic()
df['age'] = df.age.fillna(df.age.mean())
df['fare'] = df.age.fillna(df.fare.mean())
df['family_size'] = (df.sibsp + df.parch + 1)
df['fare_per_family_member'] = df.fare / df.family_size
df['name_title'] = df['name'].str.replace('.* ([A-Z][a-z]+)\..*', "\\1", regex=True)
arrow_table = df.to_arrow_table()
Now we can directly query the Arrow Table using DuckDB, the output of which can be another Arrow Table, which can be used for subsequent DuckDB queries, or it can be converted to a Vaex DataFrame:
query_result_arrow_table = duckdb.query('''
SELECT
pclass,
MEAN(age) AS age,
MEAN(family_size) AS family_size,
MEAN(fare_per_family_member) AS fare_per_family_member,
COUNT(DISTINCT(name_title)) AS distinct_titles,
LIST(DISTINCT(name_title))
FROM arrow_table
GROUP BY pclass
ORDER BY pclass
''').arrow()
DuckDB to Vaex DataFrame
The output of a DuckDB query can be an Arrow Table, which can be easily converted to a Vaex DataFrame via the vaex.from_arrow_table()
method. One can also pass data around via Pandas DataFrames, but Arrow is faster.
We can use the query result from above and convert it to a vaex DataFrame:
df_from_duckdb = vaex.from_arrow_table(query_result_arrow_table)
One can then continue to use Vaex, and also export the data or part of it to an Arrow Table to be used with DuckDB as needed.
To learn more about Vaex, feel free to explore their Documentation.