DuckDB 可以查询多种不同类型的 Apache Arrow 对象。
Apache Arrow 表格
Arrow Tables 存储在本地变量中时,可以像在DuckDB中的常规表一样进行查询。
import duckdb
import pyarrow as pa
# connect to an in-memory database
con = duckdb.connect()
my_arrow_table = pa.Table.from_pydict({'i': [1, 2, 3, 4],
'j': ["one", "two", "three", "four"]})
# query the Apache Arrow Table "my_arrow_table" and return as an Arrow Table
results = con.execute("SELECT * FROM my_arrow_table WHERE i = 2").arrow()
Apache Arrow 数据集
Arrow Datasets 存储为变量的数据集也可以像常规表一样进行查询。 数据集对于指向Parquet文件目录以分析大型数据集非常有用。 DuckDB会将列选择和行过滤器下推到数据集扫描操作中,以便仅将必要的数据拉入内存。
import duckdb
import pyarrow as pa
import tempfile
import pathlib
import pyarrow.parquet as pq
import pyarrow.dataset as ds
# connect to an in-memory database
con = duckdb.connect()
my_arrow_table = pa.Table.from_pydict({'i': [1, 2, 3, 4],
'j': ["one", "two", "three", "four"]})
# create example Parquet files and save in a folder
base_path = pathlib.Path(tempfile.gettempdir())
(base_path / "parquet_folder").mkdir(exist_ok = True)
pq.write_to_dataset(my_arrow_table, str(base_path / "parquet_folder"))
# link to Parquet files using an Arrow Dataset
my_arrow_dataset = ds.dataset(str(base_path / 'parquet_folder/'))
# query the Apache Arrow Dataset "my_arrow_dataset" and return as an Arrow Table
results = con.execute("SELECT * FROM my_arrow_dataset WHERE i = 2").arrow()
Apache Arrow 扫描器
Arrow Scanners 存储为变量后,也可以像常规表一样进行查询。扫描器读取数据集并选择特定列或应用行级过滤。这与DuckDB将列选择和过滤器下推到Arrow数据集中的方式类似,但使用的是Arrow计算操作。Arrow可以使用异步IO快速访问文件。
import duckdb
import pyarrow as pa
import tempfile
import pathlib
import pyarrow.parquet as pq
import pyarrow.dataset as ds
import pyarrow.compute as pc
# connect to an in-memory database
con = duckdb.connect()
my_arrow_table = pa.Table.from_pydict({'i': [1, 2, 3, 4],
'j': ["one", "two", "three", "four"]})
# create example Parquet files and save in a folder
base_path = pathlib.Path(tempfile.gettempdir())
(base_path / "parquet_folder").mkdir(exist_ok = True)
pq.write_to_dataset(my_arrow_table, str(base_path / "parquet_folder"))
# link to Parquet files using an Arrow Dataset
my_arrow_dataset = ds.dataset(str(base_path / 'parquet_folder/'))
# define the filter to be applied while scanning
# equivalent to "WHERE i = 2"
scanner_filter = (pc.field("i") == pc.scalar(2))
arrow_scanner = ds.Scanner.from_dataset(my_arrow_dataset, filter = scanner_filter)
# query the Apache Arrow scanner "arrow_scanner" and return as an Arrow Table
results = con.execute("SELECT * FROM arrow_scanner").arrow()
Apache Arrow RecordBatchReaders
Arrow RecordBatchReaders 是用于Arrow流式二进制格式的读取器,也可以像表一样直接查询。这种流式格式在发送Arrow数据用于进程间通信或语言运行时之间的通信等任务时非常有用。
import duckdb
import pyarrow as pa
# connect to an in-memory database
con = duckdb.connect()
my_recordbatch = pa.RecordBatch.from_pydict({'i': [1, 2, 3, 4],
'j': ["one", "two", "three", "four"]})
my_recordbatchreader = pa.ipc.RecordBatchReader.from_batches(my_recordbatch.schema, [my_recordbatch])
# query the Apache Arrow RecordBatchReader "my_recordbatchreader" and return as an Arrow Table
results = con.execute("SELECT * FROM my_recordbatchreader WHERE i = 2").arrow()