Search Shortcut cmd + k | ctrl + k
datasketches

By utilizing the Apache DataSketches library this extension can efficiently compute approximate distinct item counts and estimations of quantiles, while allowing the sketches to be serialized.

Maintainer(s): rustyconover

Installing and Loading

INSTALL datasketches FROM community;
LOAD datasketches;

Example

-- This is just a demonstration of a single sketch type,
-- see the README for more sketches.
--
-- Lets simulate a temperature sensor
CREATE TABLE readings(temp integer);

INSERT INTO readings(temp) select unnest(generate_series(1, 10));

-- Create a sketch by aggregating id over the readings table.
SELECT datasketch_tdigest_rank(datasketch_tdigest(10, temp), 5) from readings;
┌────────────────────────────────────────────────────────────┐
 datasketch_tdigest_rank(datasketch_tdigest(10, "temp"), 5) 
                           double                           
├────────────────────────────────────────────────────────────┤
                                                       0.45 
└────────────────────────────────────────────────────────────┘

-- Put some more readings in at the high end.
INSERT INTO readings(temp) values (10), (10), (10), (10);

-- Now the rank of 5 is moved down.
SELECT datasketch_tdigest_rank(datasketch_tdigest(10, temp), 5) from readings;
┌────────────────────────────────────────────────────────────┐
 datasketch_tdigest_rank(datasketch_tdigest(10, "temp"), 5) 
                           double                           
├────────────────────────────────────────────────────────────┤
                                        0.32142857142857145 
└────────────────────────────────────────────────────────────┘

-- Lets get the cumulative distribution function from the sketch.
SELECT datasketch_tdigest_cdf(datasketch_tdigest(10, temp), [1,5,9]) from readings;
┌──────────────────────────────────────────────────────────────────────────────────┐
 datasketch_tdigest_cdf(datasketch_tdigest(10, "temp"), main.list_value(1, 5, 9)) 
                                     double[]                                     
├──────────────────────────────────────────────────────────────────────────────────┤
 [0.03571428571428571, 0.32142857142857145, 0.6071428571428571, 1.0]              
└──────────────────────────────────────────────────────────────────────────────────┘

-- The sketch can be persisted and updated later when more data
-- arrives without having to rescan the previously aggregated data.
SELECT datasketch_tdigest(10, temp) from readings;
datasketch_tdigest(10, "temp") = \x02\x01\x14\x0A\x00\x04\x00...

About datasketches

This extension provides an interface to the Apache DataSketches library. This extension enables users to efficiently compute approximate results for large datasets directly within DuckDB, using state-of-the-art streaming algorithms for distinct counting, quantile estimation, and more.

Why use this extension?

DuckDB already has great implementations of HyperLogLog via approx_count_distinct(x) and TDigest via approx_quantile(x, pos), but it doesn't expose the internal state of the aggregates nor allow the the user to tune all of the parameters of the sketches. This extension allows data sketches to be serialized as BLOBs which can be stored and shared across different systems, processes, and environments without loss of fidelity. This makes data sketches highly useful in distributed data processing pipelines.

This extension has implemented these sketches from Apache DataSketches.

For more information and information regarding usage, see the README.

Added Functions

function_name function_type description comment example
datasketch_cpc aggregate      
datasketch_cpc_describe scalar      
datasketch_cpc_estimate scalar      
datasketch_cpc_is_empty scalar      
datasketch_cpc_lower_bound scalar      
datasketch_cpc_union aggregate      
datasketch_cpc_upper_bound scalar      
datasketch_hll aggregate      
datasketch_hll_describe scalar      
datasketch_hll_estimate scalar      
datasketch_hll_is_compact scalar      
datasketch_hll_is_empty scalar      
datasketch_hll_lg_config_k scalar      
datasketch_hll_lower_bound scalar      
datasketch_hll_union aggregate      
datasketch_hll_upper_bound scalar      
datasketch_kll aggregate      
datasketch_kll_cdf scalar      
datasketch_kll_describe scalar      
datasketch_kll_is_empty scalar      
datasketch_kll_is_estimation_mode scalar      
datasketch_kll_k scalar      
datasketch_kll_max_item scalar      
datasketch_kll_min_item scalar      
datasketch_kll_n scalar      
datasketch_kll_normalized_rank_error scalar      
datasketch_kll_num_retained scalar      
datasketch_kll_pmf scalar      
datasketch_kll_quantile scalar      
datasketch_kll_rank scalar      
datasketch_quantiles aggregate      
datasketch_quantiles_cdf scalar      
datasketch_quantiles_describe scalar      
datasketch_quantiles_is_empty scalar      
datasketch_quantiles_is_estimation_mode scalar      
datasketch_quantiles_k scalar      
datasketch_quantiles_max_item scalar      
datasketch_quantiles_min_item scalar      
datasketch_quantiles_n scalar      
datasketch_quantiles_normalized_rank_error scalar      
datasketch_quantiles_num_retained scalar      
datasketch_quantiles_pmf scalar      
datasketch_quantiles_quantile scalar      
datasketch_quantiles_rank scalar      
datasketch_req aggregate      
datasketch_req_cdf scalar      
datasketch_req_describe scalar      
datasketch_req_is_empty scalar      
datasketch_req_is_estimation_mode scalar      
datasketch_req_k scalar      
datasketch_req_max_item scalar      
datasketch_req_min_item scalar      
datasketch_req_n scalar      
datasketch_req_num_retained scalar      
datasketch_req_pmf scalar      
datasketch_req_quantile scalar      
datasketch_req_rank scalar      
datasketch_tdigest aggregate      
datasketch_tdigest_cdf scalar      
datasketch_tdigest_describe scalar      
datasketch_tdigest_is_empty scalar      
datasketch_tdigest_k scalar      
datasketch_tdigest_pmf scalar      
datasketch_tdigest_quantile scalar      
datasketch_tdigest_rank scalar      
datasketch_tdigest_total_weight scalar      

Added Types

type_name type_size logical_type type_category internal
sketch_cpc 16 BLOB   true
sketch_hll 16 BLOB   true
sketch_kll_bigint 16 BLOB   true
sketch_kll_double 16 BLOB   true
sketch_kll_float 16 BLOB   true
sketch_kll_integer 16 BLOB   true
sketch_kll_smallint 16 BLOB   true
sketch_kll_tinyint 16 BLOB   true
sketch_kll_ubigint 16 BLOB   true
sketch_kll_uinteger 16 BLOB   true
sketch_kll_usmallint 16 BLOB   true
sketch_kll_utinyint 16 BLOB   true
sketch_quantiles_bigint 16 BLOB   true
sketch_quantiles_double 16 BLOB   true
sketch_quantiles_float 16 BLOB   true
sketch_quantiles_integer 16 BLOB   true
sketch_quantiles_smallint 16 BLOB   true
sketch_quantiles_tinyint 16 BLOB   true
sketch_quantiles_ubigint 16 BLOB   true
sketch_quantiles_uinteger 16 BLOB   true
sketch_quantiles_usmallint 16 BLOB   true
sketch_quantiles_utinyint 16 BLOB   true
sketch_req_bigint 16 BLOB   true
sketch_req_double 16 BLOB   true
sketch_req_float 16 BLOB   true
sketch_req_integer 16 BLOB   true
sketch_req_smallint 16 BLOB   true
sketch_req_tinyint 16 BLOB   true
sketch_req_ubigint 16 BLOB   true
sketch_req_uinteger 16 BLOB   true
sketch_req_usmallint 16 BLOB   true
sketch_req_utinyint 16 BLOB   true
sketch_tdigest_double 16 BLOB   true
sketch_tdigest_float 16 BLOB   true