Fast Moving Holistic Aggregates
TLDR: DuckDB, a free and Open-Source analytical data management system, has a windowing API that can compute complex moving aggregates like interquartile ranges and median absolute deviation much faster than the conventional approaches.
In a previous post, we described the DuckDB windowing architecture and mentioned the support for some advanced moving aggregates. In this post, we will compare the performance various possible moving implementations of these functions and explain how DuckDB’s performant implementations work.
When people think of aggregate functions, they typically have something simple in mind such as
But more generally, what an aggregate function does is summarise a set of values into a single value.
Such summaries can be arbitrarily complex, and involve any data type.
For example, DuckDB provides aggregates for concatenating strings (
and constructing lists (
In SQL, aggregated sets come from either a
GROUP BY clause or an
OVER windowing specification.
All of the basic SQL aggregate functions like
MAX can be computed
by reading values one at a time and throwing them away.
But there are some functions that potentially need to keep track of all the values before they can produce a result.
These are called holistic aggregates, and they require more care when implementing.
For some aggregates (like
STRING_AGG) the order of the values can change the result.
This is not a problem for windowing because
OVER clauses can specify an ordering,
but in a
GROUP BY clause, the values are unordered.
To handle this, order sensitive aggregates can include a
WITHIN GROUP(ORDER BY <expr>) clause
to specify the order of the values.
Because the values must all be collected and sorted,
aggregates that use the
WITHIN GROUP clause are holistic.
Because sorting the arguments to a windowed aggregate can be specified with the
you might wonder if there are any other kinds of holistic aggregates that do not use sorting,
or which use an ordering different from the one in the
It turns out that there are a number of important statistical functions that
turn into holistic aggregates in SQL.
In particular, here are the statistical holistic aggregates that DuckDB currently supports:
||The most common value in a set|
||The middle value of a set|
||The exact value corresponding to a fractional position.|
||The interpolated value corresponding to a fractional position.|
||A list of the exact values corresponding to a list of fractional positions.|
||A list of the interpolated value corresponding to a list of fractional positions.|
||The median of the absolute values of the differences of each value from the median.|
Where things get really interesting is when we try to compute moving versions of these aggregates.
For example, computing a moving
AVG is fairly straightforward:
You can subtract values that have left the frame and add in the new ones,
or use the segment tree approach from the previous post on windowing.
Computing a moving median is not as easy.
Let’s look at a simple example of how we might implement moving
median in Python
for the following string data, using a frame that includes one element from each side:
For this example we are using strings so we don’t have to worry about interpolating values.
data = ('a', 'b', 'c', 'd', 'c', 'b',) w = len(data) for row in range(w): l = max(row - 1, 0) # First index of the frame r = min(row + 1, w-1) # Last index of the frame frame = list(data[l:r+1]) # Copy the frame values frame.sort() # Sort the frame values n = (r - l) // 2 # Middle index of the frame median = frame[n] # The median is the middle value print(row, data[row], median)
Each frame has a different set of values to aggregate and we can’t change the order in the table, so we have to copy them each time before we sort. Sorting is slow, and there is a lot of repetition.
All of these holistic aggregates have similar problems if we just reuse the simple implementations for moving versions. Fortunately, there are much faster approaches for all of them.
In the previous post on windowing, we explained the component operations used to implement a generic aggregate function (initialize, update, finalize, combine and window). In the rest of this post, we will dig into how they can be implemented for these complex aggregates.
quantile aggregate variants all extract the value(s) at a given fraction (or fractions) of the way
through the ordered list of values in the set.
The simplest variant is the
median function, which we met in the introduction, which uses a fraction of
There are other variants depending on whether the values are
quantitative (i.e., they have a distance and the values can be interpolated)
or merely ordinal (i.e., they can be ordered, but ties have to be broken.)
Still other variants depend on whether the fraction is a single value or a list of values,
but they can all be implemented in similar ways.
A common way to implement
quantile that we saw in the Python example is to collect all the values into the state,
sort them, and then read out the values at the requested positions.
(This is probably why the SQL standard refers to it as an “ordered-set aggregate”.)
States can be combined by concatenation,
which lets us group in parallel and build segment trees for windowing.
This approach is very time-consuming because sorting is
O(N log N),
but happily for
quantile we can use a related algorithm called
which can find a positional value in only
O(N) time by partially sorting the array.
You may have run into this algorithm if you have ever used the
std::nth_element algorithm in the C++ standard library.
This works well for grouped quantiles, but for moving quantiles
the segment tree approach ends up being about 5% slower than just starting from scratch for each value.
To really improve the performance of moving quantiles,
we note that the partial order probably does not change much between frames.
If we maintain a list of indirect indicies into the window and call
nth_element on the indicies,
we can reorder the partially ordered indicies instead of the values themselves.
In the common case where the frame has the same size,
we can even check to see whether the new value disrupts the partial ordering at all,
and skip the reordering!
With this approach, we can obtain a significant performance boost of 1.5-10 times.
In this example, we have a 3-element frame (green) that moves one space to the right for each value:
The median values in orange must be computed from scratch.
Notice that in the example, this only happens at the start of the window.
The median values in white are computed using the existing partial ordering.
In the example, this happens when the frame changes size.
Finally, the median values in blue do not require reordering
because the new value is the same as the old value.
With this algorithm, we can create a faster implementation of single-fraction
quantile without sorting.
We can extend this implementation to lists of fractions by leveraging the fact that each call to
partially orders the values, which further improves performance.
The “reuse” trick can be generalised to distinguish between fractions that are undisturbed
and ones that need to be recomputed.
A common application of multiple fractions is computing
by using the fraction list
[0.25, 0.5, 0.75].
This is the fraction list we use for the multiple fraction benchmarks.
Combined with moving
this moving aggregate can be used to generate the data for a moving box-and-whisker plot.
Maintaining the partial ordering can also be used to boost the performance of the
median absolute deviation
Unfortunately, the second partial ordering can’t use the single value trick
because the “function” being used to partially order the values will have changed if the data median changes.
Still, the values are still probably not far off,
which again improves the performance of
mode aggregate returns the most common value in a set.
One common way to implement it is to accumulate all the values in the state,
sort them and then scan for the longest run.
These states can be combined by merging,
which lets us compute the mode in parallel and build segment trees for windowing.
Once again, this approach is very time-consuming because sorting is
O(N log N).
It may also use more memory than necessary because it keeps all the values
instead of keeping only the unique values.
If there are heavy-hitters in the list,
(which is typically what
mode is being used to find)
this can be significant.
Another way to implement
mode is to use a hash map for the state that maps values to counts.
Hash tables are typically
O(N) for accumulation, which is an improvement on sorting,
and they only need to store unique values.
If the state also tracks the largest value and count seen so far,
we can just return that value when we finalize the aggregate.
States can be combined by merging,
which lets us group in parallel and build segment trees for windowing.
Unfortunately, as the benchmarks below demonstrate, this segment tree approach for windowing is quite slow!
The overhead of merging the hash tables for the segment trees turns out to be about 5% slower
than just building a new hash table for each row in the window.
But for a moving
we can instead make a single hash table and update it every time the frame moves,
removing the old values, adding the new values, and updating the value/count pair.
At times the current mode value may have its count decremented,
but when that happens we can rescan the table to find the new mode.
In this example, the 4-element frame (green) moves one space to the right for each value:
When the mode is unchanged (blue) it can be used directly. When the mode becomes ambiguous (orange), we must recan the table. This approach is much faster, and in the benchmarks it comes in between 15 and 55 times faster than the other two.
To benchmark the various implementations, we run moving window queries against a 10M table of integers:
create table rank100 as select b % 100 as a, b from range(10000000) tbl(b)
The results are then re-aggregated down to one row to remove the impact of streaming the results. The frames are 100 elements wide, and the test is repeated with a fixed trailing frame:
select quantile_cont(a, [0.25, 0.5, 0.75]) over ( order by b asc rows between 100 preceding and current row) as iqr from rank100
and a variable frame that moves pseudo-randomly around the current value:
select quantile_cont(a, [0.25, 0.5, 0.75]) over ( order by b asc rows between mod(b * 47, 521) preceding and 100 - mod(b * 47, 521) following) as iqr from rank100
The two examples here are the interquartile range queries;
the other queries use the single argument aggregates
As a final step, we ran the same query with
which has the same overhead as the other benchmarks, but is trivial to compute
(it just returns the frame size).
That overhead was subtracted from the run times to give the algorithm timings:
As can be seen, there is a substantial benefit from implementing the window operation for all of these aggregates, often on the order of a factor of ten.
An unexpected finding was that the segment tree approach for these complex states is always slower (by about 5%) than simply creating the state for each output row. This suggests that when writing combinable complex aggregates, it is well worth benchmarking the aggregate and then considering providing a window operation instead of deferring to the segment tree machinery.
DuckDB’s aggregate API enables aggregate functions to define a windowing operation that can significantly improve the performance of moving window computations for complex aggregates. This functionality has been used to significantly speed up windowing for several statistical aggregates, such as mode, interquartile ranges and median absolute deviation.
DuckDB is a free and open-source database management system (MIT licensed). It aims to be the SQLite for Analytics, and provides a fast and efficient database system with zero external dependencies. It is available not just for Python, but also for C/C++, R, Java, and more.back to news archive