DataFusion is a DataFrame and SQL library built in Rust with bindings for Python. It uses Apache Arrow's columnar format as its memory model. DataFusion can output results as Apache Arrow, and DuckDB can read those results directly. DuckDB can also rapidly output results to Apache Arrow, which can be easily converted to a DataFusion DataFrame. Due to the interoperability of Apache Arrow, workflows can alternate between DuckDB and DataFusion with ease!
This example workflow is also available as a Google Colab notebook.
Installation
pip install --quiet duckdb datafusion pyarrow
DataFusion to DuckDB
To convert from DataFusion to DuckDB, first save DataFusion results into Arrow batches using the collect
function, and then create an Arrow table using PyArrow's Table.from_batches
function. Then include that Arrow Table in the FROM
clause of a DuckDB query.
As a note, Pandas is not required as a first step prior to using DataFusion, but was helpful for generating example data to reuse in the second example below.
Import the libraries, create an example Pandas DataFrame, then convert to DataFusion.
import duckdb
import pyarrow as pa
import pandas as pd
import datafusion as df
from datafusion import functions as f
pandas_df = pd.DataFrame(
{
"A": [1, 2, 3, 4, 5],
"fruits": ["banana", "banana", "apple", "apple", "banana"],
"B": [5, 4, 3, 2, 1],
"cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
}
)
arrow_table = table = pa.Table.from_pandas(pandas_df)
arrow_batches = table.to_batches()
ctx = df.SessionContext()
datafusion_df = ctx.create_dataframe([arrow_batches])
datafusion_df
DataFrame()
+---+--------+---+--------+
| A | fruits | B | cars |
+---+--------+---+--------+
| 1 | banana | 5 | beetle |
| 2 | banana | 4 | audi |
| 3 | apple | 3 | beetle |
| 4 | apple | 2 | beetle |
| 5 | banana | 1 | beetle |
+---+--------+---+--------+
Calculate a new DataFusion DataFrame and output it to a variable as an Apache Arrow table.
arrow_batches = (
datafusion_df
.aggregate(
[df.col("fruits")],
[f.sum(df.col("A")).alias("sum_A_by_fruits")]
)
.sort(df.col("fruits").sort(ascending=True))
.collect()
)
datafusion_to_arrow = (
pa.Table.from_batches(arrow_batches)
)
datafusion_to_arrow
Then query the Apache Arrow table using DuckDB, and output the results as another Apache Arrow table for use in a subsequent DuckDB or DataFusion operation.
output = duckdb.query("""
SELECT
fruits,
first(sum_A_by_fruits) AS sum_A
FROM datafusion_to_arrow
GROUP BY ALL
ORDER BY ALL
""").arrow()
DuckDB to DataFusion
DuckDB can output results as Apache Arrow tables, which can be imported into DataFusion with the DataFusion DataFrame constructor. The same approach could be used with Pandas DataFrames, but Arrow is a faster way to pass data between DuckDB and DataFusion.
This example reuses the original Pandas DataFrame created above as a starting point. As a note, Pandas is not required as a first step, but was only used to generate example data.
After the import statements and example DataFrame creation above, query the Pandas DataFrame using DuckDB and output the results as an Arrow table.
duckdb_to_arrow = duckdb.query("""
SELECT
fruits,
cars,
'fruits' AS literal_string_fruits,
SUM(B) FILTER (cars = 'beetle') OVER () AS B,
SUM(A) FILTER (B > 2) OVER (PARTITION BY cars) AS sum_A_by_cars,
SUM(A) OVER (PARTITION BY fruits) AS sum_A_by_fruits
FROM df
ORDER BY
fruits,
df.B
""").arrow()
Load the Apache Arrow table into DataFusion using the DataFusion DataFrame constructor.
datafusion_df_2 = ctx.create_dataframe([duckdb_to_arrow.to_batches()])
datafusion_df_2
Complete a calculation using DataFusion, then output the results as another Apache Arrow table for use in a subsequent DuckDB or DataFusion operation.
output_2 = (
datafusion_df_2
.aggregate(
[df.col("fruits")],
[f.sum(df.col('sum_A_by_fruits'))]
)
).collect()
output_2
To learn more about DataFusion, feel free to explore their API documentation.