Creating a SQL-Only Extension for Excel-Style Pivoting in DuckDB

Alex Monahan, 2024-09-27

TL;DR: Easily create shareable extensions using only SQL macros that can apply to any table and any columns. We demonstrate the power of this capability with the pivot_table extension that provides Excel-style pivoting.

The Power of SQL-Only Extensions

SQL is not a new language. As a result, it has historically been missing some of the modern luxuries we take for granted. With version 1.1, DuckDB has launched community extensions, bringing the incredible power of a package manager to the SQL language. A bold goal of ours is for DuckDB to become a convenient way to wrap any C++ library, much the way that Python does today, but across any language with a DuckDB client.

For extension builders, compilation and distribution are much easier. For the user community, installation is as simple as two commands:

INSTALL pivot_table FROM community;
LOAD pivot_table;

The extension can then be used in any query through SQL functions.

However, not all of us are C++ developers! Can we, as a SQL community, build up a set of SQL helper functions? What would it take to build these extensions with just SQL?

Reusability

Traditionally, SQL is highly customized to the schema of the database on which it was written. Can we make it reusable? Some techniques for reusability were discussed in the SQL Gymnastics post, but now we can go even further. With version 1.1, DuckDB's world-class friendly SQL dialect makes it possible to create macros that can be applied:

  • To any tables
  • On any columns
  • Using any functions

The new ability to work on any tables is thanks to the query and query_table functions! The query function is a safe way to execute SELECT statements defined by SQL strings, while query_table is a way to make a FROM clause pull from multiple tables at once. They are very powerful when used in combination with other friendly SQL features like the COLUMNS expression and LIST lambda functions.
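
As a small standalone illustration (separate from any extension), here is roughly how the two functions can be used; the table and numbers are arbitrary:

-- query() is a table function that safely runs a SELECT statement supplied as a string
SELECT * FROM query('SELECT 40 + 2 AS answer');

-- query_table() lets a FROM clause read from a table whose name is supplied as a value;
-- passing a list of names stacks several tables together
SELECT count(*) AS type_count FROM query_table('duckdb_types');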

Community Extensions as a Central Repository

Traditionally, there has been no central repository for SQL functions across databases, let alone across companies! DuckDB's community extensions can be that knowledge base. DuckDB extensions can be used across all languages with a DuckDB client, including Python, NodeJS, Java, Rust, Go, and even WebAssembly (Wasm)!

If you are a DuckDB fan and a SQL user, you can share your expertise back to the community with an extension. This post will show you how! No C++ knowledge is needed – just a little bit of copy/paste and GitHub Actions handles all the compilation. If I can do it, you can do it!

Powerful SQL

All that said, just how valuable can a SQL MACRO be? Can we do more than make small snippets? I'll make the case that you can do quite complex and powerful operations in DuckDB SQL using the pivot_table extension as an example. The pivot_table function allows for Excel-style pivots, including subtotals, grand_totals, and more. It is also very similar to the Pandas pivot_table function, but with all the scalability and speed benefits of DuckDB. It contains over 250 tests, so it is intended to be useful beyond just an example!

To achieve this level of flexibility, the pivot_table extension combines many friendly and advanced SQL features, including the query and query_table functions, the COLUMNS expression, and list lambda functions. DuckDB's innovative syntax makes this extension possible!

So, we now have all 3 ingredients we will need: a central package manager, reusable macros, and enough syntactic flexibility to do valuable work.

Create Your Own SQL Extension

Let's walk through the steps to creating your own SQL-only extension.

Writing the Extension

Extension Setup

The first step is to create your own GitHub repo from the DuckDB Extension Template for SQL by clicking Use this template.

Then clone your new repository onto your local machine using the terminal:

git clone --recurse-submodules https://github.com/⟨you⟩/⟨your-new-extension-repo⟩.git

Note that --recurse-submodules ensures that DuckDB is pulled in, which is required to build the extension.

Next, replace the name of the example extension with the name of your extension in all the right places by running the Python script below.

Note If you don't have Python installed, head to python.org and follow those instructions. This script doesn't require any libraries, so Python is all you need! (No need to set up any environments.)

python3 ./scripts/bootstrap-template.py extension_name_you_want

Initial Extension Test

At this point, you can follow the directions in the README to build and test locally if you would like. However, even easier, you can simply commit your changes to git and push them to GitHub, and GitHub Actions can do the compilation for you! GitHub Actions will also run tests on your extension to validate it is working properly.

Note The instructions are not written for a Windows audience, so we recommend GitHub Actions in that case!

git add -A
git commit -m "Initial commit of my SQL extension!"
git push

Write Your SQL Macros

It is likely a bit faster to iterate if you test your macros directly in DuckDB. After you have written your SQL, we will move it into the extension. The example we will use demonstrates how to pull a dynamic set of columns from a dynamic table name (or a view name!).

CREATE OR REPLACE MACRO select_distinct_columns_from_table(table_name, columns_list) AS TABLE (
    SELECT DISTINCT
        COLUMNS(column_name -> list_contains(columns_list, column_name))
    FROM query_table(table_name)
    ORDER BY ALL
);

FROM select_distinct_columns_from_table('duckdb_types', ['type_category']);
| type_category |
|---------------|
| BOOLEAN       |
| COMPOSITE     |
| DATETIME      |
| NUMERIC       |
| STRING        |
| NULL          |

Add SQL Macros

Technically, this is the C++ part, but we are going to do some copy/paste and use GitHub Actions for compiling so it won't feel that way!

DuckDB supports both scalar and table macros, and they have slightly different syntax. The extension template has an example for each (and code comments too!) inside the file named ⟨your_extension_name⟩.cpp. Let's add a table macro here since it is the more complex one. We will copy the example and modify it!

static const DefaultTableMacro your_extension_name_table_macros[] = {
    {DEFAULT_SCHEMA, "times_two_table", {"x", nullptr}, {{"two", "2"}, {nullptr, nullptr}}, R"(SELECT x * two as output_column;)"},
    {
        DEFAULT_SCHEMA, // Leave the schema as the default
        "select_distinct_columns_from_table", // Function name
        {"table_name", "columns_list", nullptr}, // Parameters
        {{nullptr, nullptr}}, // Optional parameter names and values (we choose not to have any here)
        // The SQL text inside of your SQL macro, wrapped in R"( )", which is a raw string in C++
        R"(
            SELECT DISTINCT
                COLUMNS(column_name -> list_contains(columns_list, column_name))
            FROM query_table(table_name)
            ORDER BY ALL
        )"
    },
    {nullptr, nullptr, {nullptr}, {{nullptr, nullptr}}, nullptr}
};

That's it! All we had to provide were the name of the function, the names of the parameters, and the text of our SQL macro.

Testing the Extension

We also recommend adding some tests for your extension to the ⟨your_extension_name⟩.test file. This uses sqllogictest to test with just SQL! Let's add the example from above.

Note In sqllogictest, query I indicates that there will be 1 column in the result. We then add ---- followed by the result set in tab-separated format, with no column names.

query I
FROM select_distinct_columns_from_table('duckdb_types', ['type_category']);
----
BOOLEAN
COMPOSITE
DATETIME
NUMERIC
STRING
NULL

Now, just add, commit, and push your changes to GitHub like before, and GitHub Actions will compile your extension and test it!

If you would like to do further ad-hoc testing of your extension, you can download the extension from your GitHub Actions run's artifacts and then install it locally using these steps.
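
As a rough sketch of that workflow (the linked steps are the authoritative reference, and the path below is only a placeholder), you start DuckDB with unsigned extensions allowed and load the downloaded file:

-- start DuckDB allowing unsigned extensions, e.g.: duckdb -unsigned
LOAD './path/to/your_extension_name.duckdb_extension';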

Uploading to the Community Extensions Repository

Once you are happy with your extension, it's time to share it with the DuckDB community! Follow the steps in the Community Extensions post. A summary of those steps is:

  1. Send a PR with a metadata file description.yml that contains the description of the extension. For example, the h3 Community Extension uses the following YAML configuration:

    extension:
      name: h3
      description: Hierarchical hexagonal indexing for geospatial data
      version: 1.0.0
      language: C++
      build: cmake
      license: Apache-2.0
      maintainers:
        - isaacbrodsky
    
    repo:
      github: isaacbrodsky/h3-duckdb
      ref: 3c8a5358e42ab8d11e0253c70f7cc7d37781b2ef
    
  2. Wait for approval from the maintainers

And there you have it! You have created a shareable DuckDB Community Extension. Now let's have a look at the pivot_table extension as an example of just how powerful a SQL-only extension can be.

Capabilities of the pivot_table Extension

The pivot_table extension supports advanced pivoting functionality that was previously only available in spreadsheets, dataframe libraries, or custom host-language functions. It uses the Excel pivoting API: values, rows, columns, and filters – handling 0 or more of each of those parameters. It also supports subtotals and grand_totals. If multiple values are passed in, the values_axis parameter allows the user to choose whether each value should get its own column or its own row.

Why is this a good example of how DuckDB moves beyond traditional SQL? The Excel pivoting API requires dramatically different SQL syntax depending on which parameters are in use. If no columns are pivoted outward, a GROUP BY is all that is needed. However, once columns are involved, a PIVOT is required.

This function can operate on one or more table_names that are passed in as a parameter. Any set of tables (or views!) will first be vertically stacked and then pivoted.
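
For example, a call along these lines would stack two sources before pivoting (business_metrics_archive is a hypothetical second table or view with the same columns):

FROM pivot_table(['business_metrics', 'business_metrics_archive'],  -- stacked with UNION ALL BY NAME
                 ['sum(revenue)'],   -- values
                 ['product_line'],   -- rows
                 [],                 -- columns
                 []);                -- filters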

Example Using pivot_table

Check out a live example using the extension in the DuckDB Wasm shell here!

First we will create an example data table. We are a duck product distributor, and we are tracking our fowl finances.

CREATE OR REPLACE TABLE business_metrics (
    product_line VARCHAR,
    product VARCHAR,
    year INTEGER,
    quarter VARCHAR,
    revenue INTEGER,
    cost INTEGER
);

INSERT INTO business_metrics VALUES
    ('Waterfowl watercraft', 'Duck boats', 2022, 'Q1', 100, 100),
    ('Waterfowl watercraft', 'Duck boats', 2022, 'Q2', 200, 100),
    ('Waterfowl watercraft', 'Duck boats', 2022, 'Q3', 300, 100),
    ('Waterfowl watercraft', 'Duck boats', 2022, 'Q4', 400, 100),
    ('Waterfowl watercraft', 'Duck boats', 2023, 'Q1', 500, 100),
    ('Waterfowl watercraft', 'Duck boats', 2023, 'Q2', 600, 100),
    ('Waterfowl watercraft', 'Duck boats', 2023, 'Q3', 700, 100),
    ('Waterfowl watercraft', 'Duck boats', 2023, 'Q4', 800, 100),

    ('Duck Duds', 'Duck suits', 2022, 'Q1', 10, 10),
    ('Duck Duds', 'Duck suits', 2022, 'Q2', 20, 10),
    ('Duck Duds', 'Duck suits', 2022, 'Q3', 30, 10),
    ('Duck Duds', 'Duck suits', 2022, 'Q4', 40, 10),
    ('Duck Duds', 'Duck suits', 2023, 'Q1', 50, 10),
    ('Duck Duds', 'Duck suits', 2023, 'Q2', 60, 10),
    ('Duck Duds', 'Duck suits', 2023, 'Q3', 70, 10),
    ('Duck Duds', 'Duck suits', 2023, 'Q4', 80, 10),

    ('Duck Duds', 'Duck neckties', 2022, 'Q1', 1, 1),
    ('Duck Duds', 'Duck neckties', 2022, 'Q2', 2, 1),
    ('Duck Duds', 'Duck neckties', 2022, 'Q3', 3, 1),
    ('Duck Duds', 'Duck neckties', 2022, 'Q4', 4, 1),
    ('Duck Duds', 'Duck neckties', 2023, 'Q1', 5, 1),
    ('Duck Duds', 'Duck neckties', 2023, 'Q2', 6, 1),
    ('Duck Duds', 'Duck neckties', 2023, 'Q3', 7, 1),
    ('Duck Duds', 'Duck neckties', 2023, 'Q4', 8, 1),
;

FROM business_metrics;
| product_line         | product       | year | quarter | revenue | cost |
|----------------------|---------------|------|---------|---------|------|
| Waterfowl watercraft | Duck boats    | 2022 | Q1      | 100     | 100  |
| Waterfowl watercraft | Duck boats    | 2022 | Q2      | 200     | 100  |
| Waterfowl watercraft | Duck boats    | 2022 | Q3      | 300     | 100  |
| Waterfowl watercraft | Duck boats    | 2022 | Q4      | 400     | 100  |
| Waterfowl watercraft | Duck boats    | 2023 | Q1      | 500     | 100  |
| Waterfowl watercraft | Duck boats    | 2023 | Q2      | 600     | 100  |
| Waterfowl watercraft | Duck boats    | 2023 | Q3      | 700     | 100  |
| Waterfowl watercraft | Duck boats    | 2023 | Q4      | 800     | 100  |
| Duck Duds            | Duck suits    | 2022 | Q1      | 10      | 10   |
| Duck Duds            | Duck suits    | 2022 | Q2      | 20      | 10   |
| Duck Duds            | Duck suits    | 2022 | Q3      | 30      | 10   |
| Duck Duds            | Duck suits    | 2022 | Q4      | 40      | 10   |
| Duck Duds            | Duck suits    | 2023 | Q1      | 50      | 10   |
| Duck Duds            | Duck suits    | 2023 | Q2      | 60      | 10   |
| Duck Duds            | Duck suits    | 2023 | Q3      | 70      | 10   |
| Duck Duds            | Duck suits    | 2023 | Q4      | 80      | 10   |
| Duck Duds            | Duck neckties | 2022 | Q1      | 1       | 1    |
| Duck Duds            | Duck neckties | 2022 | Q2      | 2       | 1    |
| Duck Duds            | Duck neckties | 2022 | Q3      | 3       | 1    |
| Duck Duds            | Duck neckties | 2022 | Q4      | 4       | 1    |
| Duck Duds            | Duck neckties | 2023 | Q1      | 5       | 1    |
| Duck Duds            | Duck neckties | 2023 | Q2      | 6       | 1    |
| Duck Duds            | Duck neckties | 2023 | Q3      | 7       | 1    |
| Duck Duds            | Duck neckties | 2023 | Q4      | 8       | 1    |

Next, we install the extension from the community repository:

INSTALL pivot_table FROM community;
LOAD pivot_table;

Now we can build pivot tables like the one below. There is a little bit of boilerplate required, and the details of how this works will be explained shortly.

DROP TYPE IF EXISTS columns_parameter_enum;

CREATE TYPE columns_parameter_enum AS ENUM (
    FROM build_my_enum(['business_metrics'],    -- table_names
                       ['year', 'quarter'],     -- columns
                       [])                      -- filters
);

FROM pivot_table(['business_metrics'],          -- table_names
                 ['sum(revenue)', 'sum(cost)'], -- values
                 ['product_line', 'product'],   -- rows
                 ['year', 'quarter'],           -- columns
                 [],                            -- filters
                 subtotals := 1,
                 grand_totals := 1,
                 values_axis := 'rows'
                 );
| product_line         | product       | value_names  | 2022_Q1 | 2022_Q2 | 2022_Q3 | 2022_Q4 | 2023_Q1 | 2023_Q2 | 2023_Q3 | 2023_Q4 |
|----------------------|---------------|--------------|---------|---------|---------|---------|---------|---------|---------|---------|
| Duck Duds            | Duck neckties | sum(cost)    | 1       | 1       | 1       | 1       | 1       | 1       | 1       | 1       |
| Duck Duds            | Duck neckties | sum(revenue) | 1       | 2       | 3       | 4       | 5       | 6       | 7       | 8       |
| Duck Duds            | Duck suits    | sum(cost)    | 10      | 10      | 10      | 10      | 10      | 10      | 10      | 10      |
| Duck Duds            | Duck suits    | sum(revenue) | 10      | 20      | 30      | 40      | 50      | 60      | 70      | 80      |
| Duck Duds            | Subtotal      | sum(cost)    | 11      | 11      | 11      | 11      | 11      | 11      | 11      | 11      |
| Duck Duds            | Subtotal      | sum(revenue) | 11      | 22      | 33      | 44      | 55      | 66      | 77      | 88      |
| Waterfowl watercraft | Duck boats    | sum(cost)    | 100     | 100     | 100     | 100     | 100     | 100     | 100     | 100     |
| Waterfowl watercraft | Duck boats    | sum(revenue) | 100     | 200     | 300     | 400     | 500     | 600     | 700     | 800     |
| Waterfowl watercraft | Subtotal      | sum(cost)    | 100     | 100     | 100     | 100     | 100     | 100     | 100     | 100     |
| Waterfowl watercraft | Subtotal      | sum(revenue) | 100     | 200     | 300     | 400     | 500     | 600     | 700     | 800     |
| Grand Total          | Grand Total   | sum(cost)    | 111     | 111     | 111     | 111     | 111     | 111     | 111     | 111     |
| Grand Total          | Grand Total   | sum(revenue) | 111     | 222     | 333     | 444     | 555     | 666     | 777     | 888     |

How the pivot_table Extension Works

The pivot_table extension is a collection of multiple scalar and table SQL macros. This allows the logic to be modularized. You can see below that the functions are used as building blocks to create more complex functions. This is typically difficult to do in SQL, but it is easy in DuckDB!

The functions, along with a brief description of each, are listed below.

Building Block Scalar Functions

  • nq: “No quotes” – Escape semicolons in a string to prevent SQL injection
  • sq: “Single quotes” – Wrap a string in single quotes and escape embedded single quotes
  • dq: “Double quotes” – Wrap in double quotes and escape embedded double quotes
  • nq_list: Escape semicolons for each string in a list. Uses nq.
  • sq_list: Wrap each string in a list in single quotes. Uses sq.
  • dq_list: Wrap each string in a list in double quotes. Uses dq.
  • nq_concat: Concatenate a list of strings together with semicolon escaping. Uses nq_list.
  • sq_concat: Concatenate a list of strings together, wrapping each in single quotes. Uses sq_list.
  • dq_concat: Concatenate a list of strings together, wrapping each in double quotes. Uses dq_list.
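
As a rough sketch of the idea (simplified, and not the extension's exact definitions), the double-quoting helpers could look like this:

-- Hypothetical stand-ins for dq and dq_list
CREATE OR REPLACE MACRO my_dq(s) AS '"' || replace(s, '"', '""') || '"';
CREATE OR REPLACE MACRO my_dq_list(l) AS list_transform(l, x -> '"' || replace(x, '"', '""') || '"');

SELECT my_dq('product_line') AS one_identifier,        -- "product_line"
       my_dq_list(['year', 'quarter']) AS identifiers; -- ["year", "quarter"]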

Functions Created During Refactoring for Modularity

  • totals_list: Build up a list as a part of enabling subtotals and grand_totals.
  • replace_zzz: Rename subtotal and grand_total indicators after sorting so they are more friendly.

Core Pivoting Logic Functions

  • build_my_enum: Determine which new columns to create when pivoting horizontally. Returns a table. See below for details.
  • pivot_table: Based on inputs, decide whether to call no_columns, columns_values_axis_columns, or columns_values_axis_rows. Execute query on the SQL string that is generated. Returns a table. See below for details.
    • no_columns: Build up the SQL string for query to execute when no columns are pivoted out.
    • columns_values_axis_columns: Build up the SQL string for query to execute when pivoting horizontally with each entry in values receiving a separate column.
    • columns_values_axis_rows: Build up the SQL string for query to execute when pivoting horizontally with each entry in values receiving a separate row.
  • pivot_table_show_sql: Return the SQL string that would have been executed by query for debugging purposes.

The build_my_enum Function

The first step in using the pivot_table extension's capabilities is to define an ENUM (a user-defined type) named columns_parameter_enum that contains all of the new column names to create when pivoting horizontally. DuckDB's automatic PIVOT syntax can define this automatically, but in our case, we need 2 explicit steps. The reason is that automatic pivoting runs 2 statements behind the scenes, but a MACRO must consist of only a single statement. If the columns parameter is not in use, this step is essentially a no-op, so it can be omitted or included for consistency (recommended).

The query and query_table functions only support SELECT statements (for security reasons), so the dynamic portion of the ENUM creation occurs in the function build_my_enum. If this type of usage becomes common, features could be added to DuckDB to enable a CREATE OR REPLACE syntax for ENUM types, or possibly even temporary enums. That would reduce this pattern from 3 statements down to 2. Please let us know!

The build_my_enum function uses a combination of query_table to pull from multiple input tables, and the query function so that double quotes (and correct character escaping) can be completed prior to passing in the list of table names. It uses a similar pattern to the core pivot_table function: build up a SQL query as a string, then call it with query. The SQL string is constructed using list lambda functions and the building block functions for quoting.
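
A toy version of that pattern (with hypothetical macro names, not the extension's internals): a scalar macro assembles the SQL text, and query() executes it.

-- Build a SELECT DISTINCT statement over one or more tables as a string
CREATE OR REPLACE MACRO distinct_years_sql(table_names) AS
    'SELECT DISTINCT year FROM query_table(['
    || array_to_string(list_transform(table_names, t -> '''' || t || ''''), ', ')
    || ']) ORDER BY ALL';

-- Execute the generated string
FROM query(distinct_years_sql(['business_metrics']));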

The pivot_table Function

At its core, the pivot_table function determines the SQL required to generate the desired pivot based on which parameters are in use.

Since this SQL statement is just a string at the end of the day, we can use a hierarchy of scalar SQL macros rather than a single large macro. Traditionally, SQL tends not to be very modular or reusable, but we are able to compartmentalize our logic with DuckDB's syntax.

Note If a non-optional parameter is not in use, an empty list ([]) should be passed in.

  • table_names: A list of table or view names to aggregate or pivot. Multiple tables are combined with UNION ALL BY NAME prior to any other processing.
  • values: A list of aggregation metrics in the format ['agg_fn_1(col_1)', 'agg_fn_2(col_2)', ...].
  • rows: A list of column names to SELECT and GROUP BY.
  • columns: A list of column names to PIVOT horizontally into a separate column per value in the original column. If multiple column names are passed in, only unique combinations of data that appear in the dataset are pivoted.
    • Ex: If passing in a columns parameter like ['continent', 'country'], only valid continent / country pairs will be included (no Europe_Canada column would be generated).
  • filters: A list of WHERE clause expressions to be applied to the raw dataset prior to aggregating in the format ['col_1 = 123', 'col_2 LIKE ''woot%''', ...].
    • The filters are combined with AND (see the example after this list).
  • values_axis (Optional): If multiple values are passed in, determine whether to create a separate row or column for each value. Either rows or columns, defaulting to columns.
  • subtotals (Optional): If enabled, calculate the aggregate metric at multiple levels of detail based on the rows parameter. Either 0 or 1, defaulting to 0.
  • grand_totals (Optional): If enabled, calculate the aggregate metric across all rows in the raw data in addition to at the granularity defined by rows. Either 0 or 1, defaulting to 0.
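
Since none of the examples below exercise filters, here is a hedged sketch of a filtered call (with no columns pivoted, the ENUM setup can be skipped):

FROM pivot_table(['business_metrics'],                              -- table_names
                 ['sum(revenue)'],                                  -- values
                 ['product'],                                       -- rows
                 [],                                                -- columns
                 ['year = 2023', 'product_line = ''Duck Duds''']);  -- filters (combined with AND)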

No Horizontal Pivoting (No columns in Use)

If not using the columns parameter, no columns need to be pivoted horizontally. As a result, a GROUP BY statement is used. If subtotals are in use, the ROLLUP expression is used to calculate the values at the different levels of granularity. If grand_totals are in use, but not subtotals, the GROUPING SETS expression is used instead of ROLLUP to evaluate across all rows.
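
As a rough illustration (not the exact SQL the extension generates), a subtotaling query of this shape relies on ROLLUP:

SELECT product_line, product, sum(revenue), sum(cost)
FROM business_metrics
GROUP BY ROLLUP (product_line, product)
ORDER BY ALL;
-- rolled-up rows have NULL group keys here; the extension relabels them as Subtotal / Grand Total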

In this example, we build a summary of the revenue and cost of each product_line and product.

FROM pivot_table(['business_metrics'],
                 ['sum(revenue)', 'sum(cost)'],
                 ['product_line', 'product'],
                 [],
                 [],
                 subtotals := 1,
                 grand_totals := 1,
                 values_axis := 'columns'
                 );
| product_line         | product       | sum(revenue) | sum("cost") |
|----------------------|---------------|--------------|-------------|
| Duck Duds            | Duck neckties | 36           | 8           |
| Duck Duds            | Duck suits    | 360          | 80          |
| Duck Duds            | Subtotal      | 396          | 88          |
| Waterfowl watercraft | Duck boats    | 3600         | 800         |
| Waterfowl watercraft | Subtotal      | 3600         | 800         |
| Grand Total          | Grand Total   | 3996         | 888         |

Pivot Horizontally, One Column per Metric in values

Build up a PIVOT statement that will pivot out all valid combinations of raw data values within the columns parameter. If subtotals or grand_totals are in use, make multiple copies of the input data, but replace appropriate column names in the rows parameter with a string constant. Pass all expressions in values to the PIVOT statement's USING clause so they each receive their own column.
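
As an illustration of the shape involved (again, not the extension's exact generated SQL), a PIVOT with both metrics in the USING clause yields one column per year / value combination:

PIVOT business_metrics
ON year
USING sum(revenue), sum(cost)
GROUP BY product_line, product;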

We enhance our previous example to pivot out a separate column for each year / value combination:

DROP TYPE IF EXISTS columns_parameter_enum;

CREATE TYPE columns_parameter_enum AS ENUM (
    FROM build_my_enum(['business_metrics'],
                       ['year'],
                       [])
);

FROM pivot_table(['business_metrics'],
                 ['sum(revenue)', 'sum(cost)'],
                 ['product_line', 'product'],
                 ['year'],
                 [],
                 subtotals := 1,
                 grand_totals := 1,
                 values_axis := 'columns'
                 );
| product_line         | product       | 2022_sum(revenue) | 2022_sum("cost") | 2023_sum(revenue) | 2023_sum("cost") |
|----------------------|---------------|-------------------|------------------|-------------------|------------------|
| Duck Duds            | Duck neckties | 10                | 4                | 26                | 4                |
| Duck Duds            | Duck suits    | 100               | 40               | 260               | 40               |
| Duck Duds            | Subtotal      | 110               | 44               | 286               | 44               |
| Waterfowl watercraft | Duck boats    | 1000              | 400              | 2600              | 400              |
| Waterfowl watercraft | Subtotal      | 1000              | 400              | 2600              | 400              |
| Grand Total          | Grand Total   | 1110              | 444              | 2886              | 444              |

Pivot Horizontally, One Row per Metric in values

Build up a separate PIVOT statement for each metric in values and combine them with UNION ALL BY NAME. If subtotals or grand_totals are in use, make multiple copies of the input data, but replace appropriate column names in the rows parameter with a string constant.

To simplify the appearance slightly, we adjust one parameter in our previous query and set values_axis := 'rows':

DROP TYPE IF EXISTS columns_parameter_enum;

CREATE TYPE columns_parameter_enum AS ENUM (
    FROM build_my_enum(['business_metrics'],
                       ['year'],
                       [])
);

FROM pivot_table(['business_metrics'],
                 ['sum(revenue)', 'sum(cost)'],
                 ['product_line', 'product'],
                 ['year'],
                 [],
                 subtotals := 1,
                 grand_totals := 1,
                 values_axis := 'rows'
                 );
| product_line         | product       | value_names  | 2022 | 2023 |
|----------------------|---------------|--------------|------|------|
| Duck Duds            | Duck neckties | sum(cost)    | 4    | 4    |
| Duck Duds            | Duck neckties | sum(revenue) | 10   | 26   |
| Duck Duds            | Duck suits    | sum(cost)    | 40   | 40   |
| Duck Duds            | Duck suits    | sum(revenue) | 100  | 260  |
| Duck Duds            | Subtotal      | sum(cost)    | 44   | 44   |
| Duck Duds            | Subtotal      | sum(revenue) | 110  | 286  |
| Waterfowl watercraft | Duck boats    | sum(cost)    | 400  | 400  |
| Waterfowl watercraft | Duck boats    | sum(revenue) | 1000 | 2600 |
| Waterfowl watercraft | Subtotal      | sum(cost)    | 400  | 400  |
| Waterfowl watercraft | Subtotal      | sum(revenue) | 1000 | 2600 |
| Grand Total          | Grand Total   | sum(cost)    | 444  | 444  |
| Grand Total          | Grand Total   | sum(revenue) | 1110 | 2886 |

Conclusion

With DuckDB 1.1, sharing your SQL knowledge with the community has never been easier! DuckDB's community extension repository is truly a package manager for the SQL language. Macros in DuckDB are now highly reusable (thanks to query and query_table), and DuckDB's SQL syntax provides plenty of power to accomplish complex tasks.

Please let us know if the pivot_table extension is helpful to you – we are open to both contributions and feature requests! Together we can write the ultimate pivoting capability just once and use it everywhere.

In the future, we have plans to further simplify the creation of SQL extensions. Of course, we would love your feedback! Join us on Discord in the community-extensions channel.

Happy analyzing!