By utilizing the Apache DataSketches library this extension can efficiently compute approximate distinct item counts and estimations of quantiles, while allowing the sketches to be serialized.
Installing and Loading
INSTALL datasketches FROM community;
LOAD datasketches;
Example
-- This is just a demonstration of a single sketch type,
-- see the README for more sketches.
--
-- Lets simulate a temperature sensor
CREATE TABLE readings(temp integer);
INSERT INTO readings(temp) select unnest(generate_series(1, 10));
-- Create a sketch by aggregating id over the readings table.
SELECT datasketch_tdigest_rank(datasketch_tdigest(10, temp), 5) from readings;
┌────────────────────────────────────────────────────────────┐
│ datasketch_tdigest_rank(datasketch_tdigest(10, "temp"), 5) │
│ double │
├────────────────────────────────────────────────────────────┤
│ 0.45 │
└────────────────────────────────────────────────────────────┘
-- Put some more readings in at the high end.
INSERT INTO readings(temp) values (10), (10), (10), (10);
-- Now the rank of 5 is moved down.
SELECT datasketch_tdigest_rank(datasketch_tdigest(10, temp), 5) from readings;
┌────────────────────────────────────────────────────────────┐
│ datasketch_tdigest_rank(datasketch_tdigest(10, "temp"), 5) │
│ double │
├────────────────────────────────────────────────────────────┤
│ 0.32142857142857145 │
└────────────────────────────────────────────────────────────┘
-- Lets get the cumulative distribution function from the sketch.
SELECT datasketch_tdigest_cdf(datasketch_tdigest(10, temp), [1,5,9]) from readings;
┌──────────────────────────────────────────────────────────────────────────────────┐
│ datasketch_tdigest_cdf(datasketch_tdigest(10, "temp"), main.list_value(1, 5, 9)) │
│ double[] │
├──────────────────────────────────────────────────────────────────────────────────┤
│ [0.03571428571428571, 0.32142857142857145, 0.6071428571428571, 1.0] │
└──────────────────────────────────────────────────────────────────────────────────┘
-- The sketch can be persisted and updated later when more data
-- arrives without having to rescan the previously aggregated data.
SELECT datasketch_tdigest(10, temp) from readings;
datasketch_tdigest(10, "temp") = \x02\x01\x14\x0A\x00\x04\x00...
About datasketches
This extension provides an interface to the Apache DataSketches library. This extension enables users to efficiently compute approximate results for large datasets directly within DuckDB, using state-of-the-art streaming algorithms for distinct counting, quantile estimation, and more.
Why use this extension?
DuckDB already has great implementations of HyperLogLog via approx_count_distinct(x)
and TDigest via approx_quantile(x, pos)
, but it doesn't expose the internal state of the aggregates nor allow the the user to tune all of the parameters of the sketches. This extension allows data sketches to be serialized as BLOB
s which can be stored and shared across different systems, processes, and environments without loss of fidelity. This makes data sketches highly useful in distributed data processing pipelines.
This extension has implemented these sketches from Apache DataSketches.
- Quantile Estimation
- Approximate Distinct Count
For more information and information regarding usage, see the README.
Added Functions
function_name | function_type | description | comment | example |
---|---|---|---|---|
datasketch_cpc | aggregate | |||
datasketch_cpc_describe | scalar | |||
datasketch_cpc_estimate | scalar | |||
datasketch_cpc_is_empty | scalar | |||
datasketch_cpc_lower_bound | scalar | |||
datasketch_cpc_union | aggregate | |||
datasketch_cpc_upper_bound | scalar | |||
datasketch_hll | aggregate | |||
datasketch_hll_describe | scalar | |||
datasketch_hll_estimate | scalar | |||
datasketch_hll_is_compact | scalar | |||
datasketch_hll_is_empty | scalar | |||
datasketch_hll_lg_config_k | scalar | |||
datasketch_hll_lower_bound | scalar | |||
datasketch_hll_union | aggregate | |||
datasketch_hll_upper_bound | scalar | |||
datasketch_kll | aggregate | |||
datasketch_kll_cdf | scalar | |||
datasketch_kll_describe | scalar | |||
datasketch_kll_is_empty | scalar | |||
datasketch_kll_is_estimation_mode | scalar | |||
datasketch_kll_k | scalar | |||
datasketch_kll_max_item | scalar | |||
datasketch_kll_min_item | scalar | |||
datasketch_kll_n | scalar | |||
datasketch_kll_normalized_rank_error | scalar | |||
datasketch_kll_num_retained | scalar | |||
datasketch_kll_pmf | scalar | |||
datasketch_kll_quantile | scalar | |||
datasketch_kll_rank | scalar | |||
datasketch_quantiles | aggregate | |||
datasketch_quantiles_cdf | scalar | |||
datasketch_quantiles_describe | scalar | |||
datasketch_quantiles_is_empty | scalar | |||
datasketch_quantiles_is_estimation_mode | scalar | |||
datasketch_quantiles_k | scalar | |||
datasketch_quantiles_max_item | scalar | |||
datasketch_quantiles_min_item | scalar | |||
datasketch_quantiles_n | scalar | |||
datasketch_quantiles_normalized_rank_error | scalar | |||
datasketch_quantiles_num_retained | scalar | |||
datasketch_quantiles_pmf | scalar | |||
datasketch_quantiles_quantile | scalar | |||
datasketch_quantiles_rank | scalar | |||
datasketch_req | aggregate | |||
datasketch_req_cdf | scalar | |||
datasketch_req_describe | scalar | |||
datasketch_req_is_empty | scalar | |||
datasketch_req_is_estimation_mode | scalar | |||
datasketch_req_k | scalar | |||
datasketch_req_max_item | scalar | |||
datasketch_req_min_item | scalar | |||
datasketch_req_n | scalar | |||
datasketch_req_num_retained | scalar | |||
datasketch_req_pmf | scalar | |||
datasketch_req_quantile | scalar | |||
datasketch_req_rank | scalar | |||
datasketch_tdigest | aggregate | |||
datasketch_tdigest_cdf | scalar | |||
datasketch_tdigest_describe | scalar | |||
datasketch_tdigest_is_empty | scalar | |||
datasketch_tdigest_k | scalar | |||
datasketch_tdigest_pmf | scalar | |||
datasketch_tdigest_quantile | scalar | |||
datasketch_tdigest_rank | scalar | |||
datasketch_tdigest_total_weight | scalar |
Added Types
type_name | type_size | logical_type | type_category | internal |
---|---|---|---|---|
sketch_cpc | 16 | BLOB | true | |
sketch_hll | 16 | BLOB | true | |
sketch_kll_bigint | 16 | BLOB | true | |
sketch_kll_double | 16 | BLOB | true | |
sketch_kll_float | 16 | BLOB | true | |
sketch_kll_integer | 16 | BLOB | true | |
sketch_kll_smallint | 16 | BLOB | true | |
sketch_kll_tinyint | 16 | BLOB | true | |
sketch_kll_ubigint | 16 | BLOB | true | |
sketch_kll_uinteger | 16 | BLOB | true | |
sketch_kll_usmallint | 16 | BLOB | true | |
sketch_kll_utinyint | 16 | BLOB | true | |
sketch_quantiles_bigint | 16 | BLOB | true | |
sketch_quantiles_double | 16 | BLOB | true | |
sketch_quantiles_float | 16 | BLOB | true | |
sketch_quantiles_integer | 16 | BLOB | true | |
sketch_quantiles_smallint | 16 | BLOB | true | |
sketch_quantiles_tinyint | 16 | BLOB | true | |
sketch_quantiles_ubigint | 16 | BLOB | true | |
sketch_quantiles_uinteger | 16 | BLOB | true | |
sketch_quantiles_usmallint | 16 | BLOB | true | |
sketch_quantiles_utinyint | 16 | BLOB | true | |
sketch_req_bigint | 16 | BLOB | true | |
sketch_req_double | 16 | BLOB | true | |
sketch_req_float | 16 | BLOB | true | |
sketch_req_integer | 16 | BLOB | true | |
sketch_req_smallint | 16 | BLOB | true | |
sketch_req_tinyint | 16 | BLOB | true | |
sketch_req_ubigint | 16 | BLOB | true | |
sketch_req_uinteger | 16 | BLOB | true | |
sketch_req_usmallint | 16 | BLOB | true | |
sketch_req_utinyint | 16 | BLOB | true | |
sketch_tdigest_double | 16 | BLOB | true | |
sketch_tdigest_float | 16 | BLOB | true |