SQL Gymnastics: Bending SQL into Flexible New Shapes
TL;DR: Combining multiple features of DuckDB’s friendly SQL allows for highly flexible queries that can be reused across tables.
DuckDB's especially friendly SQL dialect simplifies common query operations. However, these features also unlock new and flexible ways to write advanced SQL! In this post we will combine multiple friendly features to both move closer to real-world use cases and stretch your imagination. These queries are useful in their own right, but their component pieces are even more valuable to have in your toolbox.
What is the craziest thing you have built with SQL? We want to hear about it! Tag DuckDB on X (the site formerly known as Twitter) or LinkedIn, and join the DuckDB Discord community.
Traditional SQL Is Too Rigid to Reuse
SQL queries are typically crafted specifically for the unique tables within a database. This limits reusability. For example, have you ever seen a library of high-level SQL helper functions? SQL as a language typically is not flexible enough to build reusable functions. Today, we are flying towards a more flexible future!
Dynamic Aggregates Macro
In SQL, the columns to SELECT and GROUP BY typically must be specified individually.
However, in many business intelligence workloads, groupings and aggregate functions must be easily user-adjustable.
Imagine an interactive charting workflow – first I want to plot total company revenue over time.
Then if I see a dip in revenue in that first plot, I want to adjust the plot to group the revenue by business unit to see which section of the company caused the issue.
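To make that pain concrete, here is a sketch of the static SQL behind those two plots, assuming a hypothetical revenue table with month, business_unit, and revenue columns:
-- Hypothetical revenue table: each new grouping means a new query
SELECT month, sum(revenue) AS total_revenue
FROM revenue
GROUP BY month;
-- Regrouping by business unit requires editing the SQL itself
SELECT month, business_unit, sum(revenue) AS total_revenue
FROM revenue
GROUP BY month, business_unit;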
This typically requires templated SQL, a language that compiles down to SQL (like Malloy), or building up a SQL string in another programming language.
How much can we do with just SQL?
Let's have a look at a flexible SQL-only approach and then break down how it is constructed.
First we will create an example data table. col1 is unique on each row, but the other columns are various groupings of the rows.
CREATE OR REPLACE TABLE example AS
SELECT x % 11 AS col1, x % 5 AS col2, x % 2 AS col3, 1 AS col4
FROM range(1, 11) t(x);
FROM example;
col1 | col2 | col3 | col4 |
---|---|---|---|
1 | 1 | 1 | 1 |
2 | 2 | 0 | 1 |
3 | 3 | 1 | 1 |
4 | 4 | 0 | 1 |
5 | 0 | 1 | 1 |
6 | 1 | 0 | 1 |
7 | 2 | 1 | 1 |
8 | 3 | 0 | 1 |
9 | 4 | 1 | 1 |
10 | 0 | 0 | 1 |
Creating the Macro
The macro below accepts lists of columns to include or exclude, a list of columns to aggregate, and an aggregate function to apply. All of these can be passed in as parameters from the host language that is querying the database.
-- We use a table macro (or function) for reusability
CREATE OR REPLACE MACRO dynamic_aggregates(
included_columns,
excluded_columns,
aggregated_columns,
aggregate_function
) AS TABLE (
FROM example
SELECT
-- Use a COLUMNS expression to only select the columns
-- we include or do not exclude
COLUMNS(c -> (
-- If we are not using an input parameter (list is empty),
-- ignore it
(list_contains(included_columns, c) OR
len(included_columns) = 0)
AND
(NOT list_contains(excluded_columns, c) OR
len(excluded_columns) = 0)
)),
-- Use the list_aggregate function to apply an aggregate
-- function of our choice
list_aggregate(
-- Convert to a list (to enable the use of list_aggregate)
list(
-- Use a COLUMNS expression to choose which columns
-- to aggregate
COLUMNS(c -> list_contains(aggregated_columns, c))
), aggregate_function
)
GROUP BY ALL -- Group by all selected but non-aggregated columns
ORDER BY ALL -- Order by each column from left to right
);
Executing the Macro
Now we can use that macro for many different aggregation operations. For illustrative purposes, the 3 queries below show different ways to achieve identical results.
Select col3 and col4, and take the minimum values of col1 and col2:
FROM dynamic_aggregates(
['col3', 'col4'], [], ['col1', 'col2'], 'min'
);
Select all columns except col1 and col2, and take the minimum values of col1 and col2:
FROM dynamic_aggregates(
[], ['col1', 'col2'], ['col1', 'col2'], 'min'
);
If the same column is in both the included and excluded list, it is excluded (exclusions win ties). If we include col2, col3, and col4, but we exclude col2, then it is as if we only included col3 and col4:
FROM dynamic_aggregates(
['col2', 'col3', 'col4'], ['col2'], ['col1', 'col2'], 'min'
);
Executing any of those three queries will return this result:
col3 | col4 | list_aggregate(list(example.col1), 'min') | list_aggregate(list(example.col2), 'min') |
---|---|---|---|
0 | 1 | 2 | 0 |
1 | 1 | 1 | 0 |
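For comparison, a static query producing these same values (though with simpler column names) would have to name every column explicitly:
SELECT col3, col4, min(col1), min(col2)
FROM example
GROUP BY col3, col4
ORDER BY col3, col4;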
Understanding the Design
The first step of our flexible table macro is to choose a specific table using DuckDB's FROM-first syntax.
Well, that's not very dynamic!
If we wanted to, we could work around this by creating a copy of this macro for each table we want to expose to our application.
However, we will show another approach in our next example, and we will completely solve the issue in a follow-up blog post with an in-development DuckDB feature.
Stay tuned!
Then we SELECT our grouping columns based on the list parameters that were passed in.
The COLUMNS expression will execute a lambda function to decide which columns meet the criteria to be selected.
The first portion of the lambda function checks whether a column name was passed in within the included_columns list.
However, if we choose not to use an inclusion rule (by passing in an empty included_columns list), we want to ignore that parameter.
If the list is empty, len(included_columns) = 0 will evaluate to true and effectively disable the filtering on included_columns.
This is a common pattern for optional filtering that is generically useful across a variety of SQL queries.
(Shout out to my mentor and friend Paul Bloomquist for teaching me this pattern!)
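This pattern is handy outside of macros too. As a minimal sketch, assuming a hypothetical customers table with a country column, an ordinary WHERE clause can make a list filter optional:
-- An empty list disables the filter; a populated list applies it
CREATE OR REPLACE MACRO filter_customers(countries) AS TABLE (
    FROM customers
    WHERE list_contains(countries, country) OR len(countries) = 0
);
FROM filter_customers([]);            -- returns every row
FROM filter_customers(['DE', 'NL']);  -- returns only matching rows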
We repeat that pattern for excluded_columns so that it will be used if populated, but ignored if left empty.
The excluded_columns list will also win ties, so that if a column is in both lists, it will be excluded.
Next, we apply our aggregate function to the columns we want to aggregate.
It is easiest to follow the logic of this part of the query by working from the innermost portion outward.
The COLUMNS expression will acquire the columns that are in our aggregated_columns list.
Then, we do a little bit of gymnastics (it had to happen sometime…).
If we were to apply a typical aggregation function (like sum or min), it would need to be specified statically in our macro.
To pass it in dynamically as a string (potentially all the way from the application code calling this SQL statement), we take advantage of a unique property of the list_aggregate function: it accepts the name of an aggregate function, as a string, in its second parameter.
So, to use this unique property, we use the list aggregate function to transform all the values within each group into a list.
Then we use the list_aggregate function to apply the aggregate_function we passed into the macro to each list.
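As a standalone illustration of that property, list_aggregate applies the aggregate named in its second argument directly to a list value:
SELECT
    list_aggregate([3, 1, 2], 'min') AS min_value,  -- 1
    list_aggregate([3, 1, 2], 'sum') AS sum_value;  -- 6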
Almost done!
Now GROUP BY ALL will automatically group by the columns returned by the first COLUMNS expression (the selected but non-aggregated columns).
The ORDER BY ALL clause will then order by each column in ascending order, moving from left to right.
We made it!
Extra credit! In the next release of DuckDB, version 0.10.1, we will be able to apply a dynamic alias to the result of a COLUMNS expression. For example, each new aggregate column could be renamed in the pattern agg_[the original column name]. This will unlock the ability to chain together these types of macros, as the naming will be predictable.
Takeaways
Several of the approaches used within this macro can be applied in a variety of ways in your SQL workflows.
Using a lambda function in combination with the COLUMNS expression allows you to select any arbitrary list of columns.
The OR len(my_list) = 0 trick allows list parameters to be ignored when empty.
Once you have that arbitrary set of columns, you can even apply a dynamically chosen aggregation function to those columns using list and list_aggregate.
However, we still had to specify a table at the start.
We are also limited to aggregate functions that work with list_aggregate.
Let's relax those two constraints!
Creating Version 2 of the Macro
This approach takes advantage of two key concepts:
- Macros can be used to create temporary aggregate functions
- A macro can query a Common Table Expression (CTE) / WITH clause that is in scope during execution
CREATE OR REPLACE MACRO dynamic_aggregates_any_cte_any_func(
included_columns,
excluded_columns,
aggregated_columns
/* No more aggregate_function */
) AS TABLE (
FROM any_cte -- No longer a fixed table!
SELECT
COLUMNS(c -> (
(list_contains(included_columns, c) OR
len(included_columns) = 0)
AND
(NOT list_contains(excluded_columns, c) OR
len(excluded_columns) = 0)
)),
-- We no longer convert to a list,
-- and we refer to the latest definition of any_func
any_func(COLUMNS(c -> list_contains(aggregated_columns, c)))
GROUP BY ALL
ORDER BY ALL
);
Executing Version 2
When we call this macro, there is additional complexity. We no longer execute a single statement, and our logic is no longer completely parameterizable (so some templating or SQL construction will be needed). However, we can execute this macro against any arbitrary CTE, using any arbitrary aggregation function. Pretty powerful and very reusable!
-- We can define or redefine any_func right before calling the macro
CREATE OR REPLACE TEMP FUNCTION any_func(x)
AS 100.0 * sum(x) / count(x);
-- Any table structure is valid for this CTE!
WITH any_cte AS (
SELECT
x % 11 AS id,
x % 5 AS my_group,
x % 2 AS another_group,
1 AS one_big_group
FROM range(1, 101) t(x)
)
FROM dynamic_aggregates_any_cte_any_func(
['another_group', 'one_big_group'], [], ['id', 'my_group']
);
another_group | one_big_group | any_func(any_cte.id) | any_func(any_cte.my_group) |
---|---|---|---|
0 | 1 | 502.0 | 200.0 |
1 | 1 | 490.0 | 200.0 |
Understanding Version 2
Instead of querying the very boldly named example table, we query the possibly more generically named any_cte.
Note that any_cte has a different schema than our prior example – the columns in any_cte can be anything!
When the macro is created, any_cte doesn't even exist.
When the macro is executed, it searches for a table-like object named any_cte and finds the one defined in the CTE of the calling query.
Similarly, any_func does not exist initially.
It only needs to be created (or recreated) at some point before the macro is executed.
Its only requirement is that it be an aggregate function that operates on a single column.
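For example, any single-column aggregate expression will do; here is a hypothetical redefinition that computes the range of each group instead:
-- The next execution of the macro will pick up this new definition
CREATE OR REPLACE TEMP FUNCTION any_func(x) AS max(x) - min(x);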
FUNCTION and MACRO are synonyms in DuckDB and can be used interchangeably!
Takeaways from Version 2
A macro can act on any arbitrary table by using a CTE at the time it is called. This makes our macro far more reusable – it can work on any table! Not only that, but any custom aggregate function can be used.
Look how far we have stretched SQL – we have made a truly reusable SQL function! The table is dynamic, the grouping columns are dynamic, the aggregated columns are dynamic, and so is the aggregate function. Our daily gymnastics stretches have paid off. However, stay tuned for a way to achieve similar results with a simpler approach in a future post.
Custom Summaries for Any Dataset
Next we have a truly production-grade example! This query powers a portion of the MotherDuck Web UI's Column Explorer component. Hamilton Ulmer led the creation of this component and is the author of this query as well! The purpose of the Column Explorer, and this query, is to get a high-level overview of the data in all columns within a dataset as quickly and easily as possible.
DuckDB has a built-in SUMMARIZE keyword that can calculate similar metrics across an entire table.
However, for larger datasets, SUMMARIZE can take a couple of seconds to load.
This query provides a custom summarization capability that can be tailored to the properties of your data that you are most interested in.
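For comparison, the built-in version is a one-liner (using the spotify_tracks table we create below):
SUMMARIZE spotify_tracks;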
Traditionally, databases require that every column be referred to explicitly, and they work best when data is arranged in separate columns.
This query takes advantage of DuckDB's ability to apply functions to all columns at once, its ability to UNPIVOT (or stack) columns, and its STRUCT data type for storing key/value pairs.
The result is a clean, pivoted summary of all the rows and columns in a table.
Let's take a look at the entire function, then break it down piece by piece.
This example dataset comes from Hugging Face, which hosts DuckDB-accessible Parquet files for many of their datasets. First, we create a local table populated from this remote Parquet file.
Creation
CREATE OR REPLACE TABLE spotify_tracks AS
FROM 'https://huggingface.co/datasets/maharshipandya/spotify-tracks-dataset/resolve/refs%2Fconvert%2Fparquet/default/train/0000.parquet?download=true';
Then we create and execute our custom_summarize macro.
We use the same any_cte trick from above to allow this to be reused on any query result or table.
CREATE OR REPLACE MACRO custom_summarize() AS TABLE (
WITH metrics AS (
FROM any_cte
SELECT
{
name: first(alias(COLUMNS(*))),
type: first(typeof(COLUMNS(*))),
max: max(COLUMNS(*))::VARCHAR,
min: min(COLUMNS(*))::VARCHAR,
approx_unique: approx_count_distinct(COLUMNS(*)),
nulls: count(*) - count(COLUMNS(*)),
}
), stacked_metrics AS (
UNPIVOT metrics
ON COLUMNS(*)
)
SELECT value.* FROM stacked_metrics
);
Execution
The spotify_tracks dataset is effectively renamed to any_cte and then summarized.
WITH any_cte AS (FROM spotify_tracks)
FROM custom_summarize();
The result contains one row for every column in the raw dataset, and several columns of summary statistics.
name | type | max | min | approx_unique | nulls |
---|---|---|---|---|---|
Unnamed: 0 | BIGINT | 113999 | 0 | 114089 | 0 |
track_id | VARCHAR | 7zz7iNGIWhmfFE7zlXkMma | 0000vdREvCVMxbQTkS888c | 89815 | 0 |
artists | VARCHAR | 龍藏Ryuzo | !nvite | 31545 | 1 |
album_name | VARCHAR | 당신이 잠든 사이에 Pt. 4 Original Television Soundtrack | ! ! ! ! ! Whispers ! ! ! ! ! | 47093 | 1 |
track_name | VARCHAR | 행복하길 바래 | !I'll Be Back! | 72745 | 1 |
popularity | BIGINT | 100 | 0 | 99 | 0 |
duration_ms | BIGINT | 5237295 | 0 | 50168 | 0 |
explicit | BOOLEAN | true | false | 2 | 0 |
danceability | DOUBLE | 0.985 | 0.0 | 1180 | 0 |
energy | DOUBLE | 1.0 | 0.0 | 2090 | 0 |
key | BIGINT | 11 | 0 | 12 | 0 |
loudness | DOUBLE | 4.532 | -49.531 | 19436 | 0 |
mode | BIGINT | 1 | 0 | 2 | 0 |
speechiness | DOUBLE | 0.965 | 0.0 | 1475 | 0 |
acousticness | DOUBLE | 0.996 | 0.0 | 4976 | 0 |
instrumentalness | DOUBLE | 1.0 | 0.0 | 5302 | 0 |
liveness | DOUBLE | 1.0 | 0.0 | 1717 | 0 |
valence | DOUBLE | 0.995 | 0.0 | 1787 | 0 |
tempo | DOUBLE | 243.372 | 0.0 | 46221 | 0 |
time_signature | BIGINT | 5 | 0 | 5 | 0 |
track_genre | VARCHAR | world-music | acoustic | 115 | 0 |
So how was this query constructed? Let's break down each CTE step by step.
Step by Step Breakdown
Metrics CTE
First let's have a look at the metrics CTE and the shape of the data that is returned:
FROM any_cte
SELECT
{
name: first(alias(COLUMNS(*))),
type: first(typeof(COLUMNS(*))),
max: max(COLUMNS(*))::VARCHAR,
min: min(COLUMNS(*))::VARCHAR,
approx_unique: approx_count_distinct(COLUMNS(*)),
nulls: count(*) - count(COLUMNS(*)),
};
main.struct_pack("name" := first(alias(subset."Unnamed: 0")), … | main.struct_pack("name" := first(alias(subset.track_id)), … | … | main.struct_pack("name" := first(alias(subset.time_signature)), … | main.struct_pack("name" := first(alias(subset.track_genre)), … |
---|---|---|---|---|
{'name': Unnamed: 0, 'type': BIGINT, 'max': 113999, 'min': 0, 'approx_unique': 114089, 'nulls': 0} | {'name': track_id, 'type': VARCHAR, 'max': 7zz7iNGIWhmfFE7zlXkMma, 'min': 0000vdREvCVMxbQTkS888c, 'approx_unique': 89815, 'nulls': 0} | … | {'name': time_signature, 'type': BIGINT, 'max': 5, 'min': 0, 'approx_unique': 5, 'nulls': 0} | {'name': track_genre, 'type': VARCHAR, 'max': world-music, 'min': acoustic, 'approx_unique': 115, 'nulls': 0} |
This intermediate result maintains the same number of columns as the original dataset, but only returns a single row of summary statistics.
The names of the columns are truncated due to their length.
The default naming of COLUMNS expressions will be improved in DuckDB 0.10.1, so names will be much cleaner!
The data in each column is organized into a STRUCT of key-value pairs.
You can also see that a clean name of the original column is stored within the STRUCT thanks to the use of the alias function.
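As a quick illustration, alias returns the name of the column expression passed to it. Using the example table from earlier:
SELECT alias(col1) AS column_name
FROM example
LIMIT 1;  -- returns 'col1'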
While we have calculated the summary statistics, the format of those statistics is difficult to visually interpret.
The query achieves this structure using the COLUMNS(*) expression to apply multiple summary metrics to all columns, and the {...} syntax to create a STRUCT.
The keys of the struct represent the names of the metrics (and what we want to use as the column names in the final result).
We use this approach since we want to transpose the columns to rows and then split the summary metrics into their own columns.
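On its own, the struct literal syntax looks like this:
SELECT { name: 'col1', type: 'BIGINT' } AS metric;
-- {'name': col1, 'type': BIGINT}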
stacked_metrics CTE
Next, the data is unpivoted to reshape the table from one row and multiple columns to two columns and multiple rows.
UNPIVOT metrics
ON COLUMNS(*);
name | value |
---|---|
main.struct_pack("name" := first(alias(spotify_tracks."Unnamed: 0")), … | {'name': Unnamed: 0, 'type': BIGINT, 'max': 113999, 'min': 0, 'approx_unique': 114089, 'nulls': 0} |
main.struct_pack("name" := first(alias(spotify_tracks.track_id)), … | {'name': track_id, 'type': VARCHAR, 'max': 7zz7iNGIWhmfFE7zlXkMma, 'min': 0000vdREvCVMxbQTkS888c, 'approx_unique': 89815, 'nulls': 0} |
… | … |
main.struct_pack("name" := first(alias(spotify_tracks.time_signature)), … | {'name': time_signature, 'type': BIGINT, 'max': 5, 'min': 0, 'approx_unique': 5, 'nulls': 0} |
main.struct_pack("name" := first(alias(spotify_tracks.track_genre)), … | {'name': track_genre, 'type': VARCHAR, 'max': world-music, 'min': acoustic, 'approx_unique': 115, 'nulls': 0} |
By unpivoting on COLUMNS(*), we take all columns and pivot them downward into two columns: one for the auto-generated name of the column, and one for the value that was within that column.
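Here is a minimal sketch of that behavior, mirroring the macro's structure with a tiny inline CTE:
WITH t AS (SELECT 1 AS a, 2 AS b),
     stacked AS (UNPIVOT t ON COLUMNS(*))
FROM stacked;
-- name | value
-- 'a'  | 1
-- 'b'  | 2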
Return the Results
The final step is the most gymnastics-like portion of this query.
We explode the value column's struct format so that each key becomes its own column, using the STRUCT.* syntax.
This is another way to make a query less reliant on column names – the split occurs automatically based on the keys in the struct.
SELECT value.*
FROM stacked_metrics;
We have now split apart the data into multiple columns, so the summary metrics are nice and interpretable.
name | type | max | min | approx_unique | nulls |
---|---|---|---|---|---|
Unnamed: 0 | BIGINT | 113999 | 0 | 114089 | 0 |
track_id | VARCHAR | 7zz7iNGIWhmfFE7zlXkMma | 0000vdREvCVMxbQTkS888c | 89815 | 0 |
artists | VARCHAR | 龍藏Ryuzo | !nvite | 31545 | 1 |
album_name | VARCHAR | 당신이 잠든 사이에 Pt. 4 Original Television Soundtrack | ! ! ! ! ! Whispers ! ! ! ! ! | 47093 | 1 |
track_name | VARCHAR | 행복하길 바래 | !I'll Be Back! | 72745 | 1 |
popularity | BIGINT | 100 | 0 | 99 | 0 |
duration_ms | BIGINT | 5237295 | 0 | 50168 | 0 |
explicit | BOOLEAN | true | false | 2 | 0 |
danceability | DOUBLE | 0.985 | 0.0 | 1180 | 0 |
energy | DOUBLE | 1.0 | 0.0 | 2090 | 0 |
key | BIGINT | 11 | 0 | 12 | 0 |
loudness | DOUBLE | 4.532 | -49.531 | 19436 | 0 |
mode | BIGINT | 1 | 0 | 2 | 0 |
speechiness | DOUBLE | 0.965 | 0.0 | 1475 | 0 |
acousticness | DOUBLE | 0.996 | 0.0 | 4976 | 0 |
instrumentalness | DOUBLE | 1.0 | 0.0 | 5302 | 0 |
liveness | DOUBLE | 1.0 | 0.0 | 1717 | 0 |
valence | DOUBLE | 0.995 | 0.0 | 1787 | 0 |
tempo | DOUBLE | 243.372 | 0.0 | 46221 | 0 |
time_signature | BIGINT | 5 | 0 | 5 | 0 |
track_genre | VARCHAR | world-music | acoustic | 115 | 0 |
Conclusion
We have shown that it is now possible to build reusable SQL macros in a highly flexible way. You can now build a macro that:
- Operates on any dataset
- Selects any columns
- Groups by any columns
- Aggregates any number of columns with any function
Phew!
Along the way we have covered some useful tricks to have in your toolbox:
- Applying a macro to any dataset using a CTE
- Selecting a dynamic list of columns by combining the COLUMNS expression with a lambda and the list_contains function
- Passing in an aggregate function as a string using list_aggregate
- Applying any custom aggregation function within a macro
- Making list parameters optional using OR len(list_parameter) = 0
- Using the alias function with a COLUMNS expression to store the original name of all columns
- Summarizing all columns and then transposing that summary using UNPIVOT and STRUCT.*
The combination of these friendly SQL features is more powerful than using any one individually. We hope that we have inspired you to take your SQL to new limits!
As always, we welcome your feedback and suggestions. We also have more flexibility in mind that will be demonstrated in future posts. Please share the times you have stretched SQL in imaginative ways!
Happy analyzing!