`USING KEY` in Recursive CTEs
TL;DR: Recursive CTEs in SQL allow for powerful iterative queries like graph traversals but can be memory-intensive and slow due to repeated row accumulation. DuckDB’s new `USING KEY` feature addresses this by treating intermediate results as keyed dictionaries instead of an ever-growing set: existing entries can be updated. This can lead to dramatically better performance and memory efficiency, especially in graph algorithms like shortest path and distance-vector routing. It also simplifies query logic by providing direct access to the entire dictionary.
Assembling SQL Queries from Pieces: CTEs
As SQL queries become more complex, managing their readability, modularity and reusability increasingly becomes a challenge. Common Table Expressions (CTEs) were introduced to address these issues by allowing developers to define temporary, named result sets within a query. Similar to functions in programming, CTEs allow a large query to be broken down into logical building blocks, making it easier to understand, maintain and debug.
CTEs are particularly useful for structuring multi-step transformations that might otherwise require deeply nested subqueries or complex joins. By improving both the clarity and structure of SQL code, CTEs have become an essential tool in modern query writing – enabling the clear, declarative expression of even the most sophisticated logic.
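As a small, self-contained sketch of this style – using a hypothetical table `orders(customer_id, amount)` of our own choosing, not part of the dataset used later in this post – two stacked CTEs can replace a nested subquery:

```sql
-- Two stacked CTEs break the query into named building blocks
-- (hypothetical table orders(customer_id, amount)).
WITH
    totals AS (
        SELECT customer_id, sum(amount) AS total
        FROM orders
        GROUP BY customer_id
    ),
    big_spenders AS (
        SELECT customer_id
        FROM totals
        WHERE total > 1000
    )
SELECT *
FROM big_spenders;
```

Each CTE reads like a named intermediate step, much as a local function would in a program.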
Iterate Like It's 1999: Recursive CTEs
To enhance the expressive power of SQL, recursive CTEs were introduced in the SQL:1999 standard. These allow a query to reference the results from previous iterations within the same expression, enabling SQL to solve more complex problems such as graph traversal and other iterative computations.
This capability pushes SQL beyond basic data retrieval, allowing for the formulation of complex, iterative logic directly in SQL. In fact, recursive CTEs make SQL Turing-complete, meaning it can theoretically express any computation (given sufficient time and memory).
_But how do recursive CTEs work in DuckDB?_
Let’s look at a simple example to break down the mechanism. Suppose we want to calculate the largest power of 2 that is smaller than 100. We can use a recursive CTE `power` to generate powers of 2 iteratively until we reach that limit. For any row `(a, b, c)` in `power` we will have that `a^b = c`:
```sql
WITH RECURSIVE power(a, b, c) AS (
    SELECT 2, 0, 1       -- 2^0 = 1
    UNION
    SELECT a, b+1, a * c -- a^(b+1) = a * a^b
    FROM power           -- reads the working table (contains a single row)
    WHERE a * c < 100
)
FROM power;              -- reads the union table (contains all intermediate results)
```
We can divide a recursive CTE into two parts, separated by the `UNION` keyword. The part above the `UNION` is the non-recursive part (`SELECT 2, 0, 1` in our example), and the part below is the recursive part.
In the recursive part, CTE `power` references itself. This self-reference points to what we call the working table. The working table always holds the rows produced in the immediately preceding iteration – and only those.
Here's how it works step by step:
- First, the non-recursive part is executed, producing initial rows – in our example, just the row `(2,0,1)`. These rows are stored in the working table.
- Then, the recursive part is executed using the rows from the working table. Every new row the recursive part produces is stored in the intermediate table, which holds the results of the current iteration.
- If the intermediate table is empty, the iteration ends.
- Otherwise, we clear the working table and replace it with the contents of the intermediate table – preparing for the next iteration.
- We additionally add the contents of the intermediate table to the union table, which accumulates all intermediate results across iterations.
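This fixpoint loop is easy to watch in action with a one-column counter – a minimal sketch of ours, not part of the power example:

```sql
WITH RECURSIVE counter(n) AS (
    SELECT 1     -- non-recursive part: seeds the working table
    UNION
    SELECT n + 1 -- recursive part: sees only the previous iteration's row
    FROM counter -- working table (exactly one row per iteration)
    WHERE n < 5
)
FROM counter;    -- union table: 1, 2, 3, 4, 5
```

Each iteration’s single output row replaces the working table, while the union table collects all five rows.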
Here you can see the entries of the three involved tables in each iteration of the CTE:
Iteration | Output of Recursive Step | Working Table | Intermediate Table | Union Table |
---|---|---|---|---|
0 | SELECT 2, 0, 1 | ∅ (no rows) | (2,0,1) | (2,0,1) |
1 | 2 * 1 = 2 | (2,0,1) | (2,1,2) | (2,0,1) (2,1,2) |
2 | 2 * 2 = 4 | (2,1,2) | (2,2,4) | (2,0,1) (2,1,2) (2,2,4) |
3 | 2 * 4 = 8 | (2,2,4) | (2,3,8) | (2,0,1) … (2,3,8) |
4 | 2 * 8 = 16 | (2,3,8) | (2,4,16) | (2,0,1) … (2,4,16) |
5 | 2 * 16 = 32 | (2,4,16) | (2,5,32) | (2,0,1) … (2,5,32) |
6 | 2 * 32 = 64 | (2,5,32) | (2,6,64) | (2,0,1) … (2,6,64) |
7 | 2 * 64 = 128 | (2,6,64) | ∅ stop! | (2,0,1) … (2,6,64) |
When the recursive CTE completes, the union table holds the entire result set, including all intermediate rows from every iteration:
```text
┌───────┬───────┬───────┐
│   a   │   b   │   c   │  -- a^b = c
│ int32 │ int32 │ int32 │
├───────┼───────┼───────┤
│     2 │     0 │     1 │
│     2 │     1 │     2 │
│     2 │     2 │     4 │
│     2 │     3 │     8 │
│     2 │     4 │    16 │
│     2 │     5 │    32 │
│     2 │     6 │    64 │
└───────┴───────┴───────┘
```
The union table provides a complete record of the history of the power computation. This can lead to unnecessary overhead, especially if all we need is the last row – the final result of the recursion. Storing every intermediate value can be inefficient, in particular when the intermediate results aren't required or when working with large datasets (that involve many rows or wide rows, e.g., in the presence of array-typed columns).
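To make this concrete: if we only care about the final power, we still pay for the entire history and then throw most of it away – a sketch based on the query above:

```sql
WITH RECURSIVE power(a, b, c) AS (
    SELECT 2, 0, 1
    UNION
    SELECT a, b + 1, a * c
    FROM power
    WHERE a * c < 100
)
FROM power
ORDER BY b DESC -- the union table still accumulated all 7 rows...
LIMIT 1;        -- ...only to return the single row (2, 6, 64)
```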
Recursive CTEs Suffer from Amnesia
While the union table holds this complete history once the computation is done, recursive CTEs suffer from a case of "amnesia" while the iteration is in progress: the recursive part only ever sees the intermediate results of the immediately preceding iteration. This can be limiting, and many of us will have seen how query authors work around this limitation with manually maintained arrays (or similar container structures) that hold information about previous iterations. This may be costly. Then again, enabling the recursive part to access the union table with all its accumulated – potentially sizable – intermediate results may easily incur performance problems while we iterate. A real conundrum.
`USING` a `KEY` to Cure Amnesia
Can we afford to allow the recursive part to access the union table? Yes, but we need a means to control its size. Operating the union table in append-only mode thus is a no-go. Instead, let the recursive part optionally overwrite existing rows in the union table with fresh information that has been computed in the current iteration. This can reduce the union table's size significantly (see our experiments below).
Starting with version 1.3, DuckDB features a `USING KEY` variant of recursive CTEs that incorporates just this idea. If you would like to read more about its origins, check out CIDR 2023. For further information about the implementation, see SIGMOD 2025 (to be published).
The variant introduces two major differences compared to traditional recursive CTEs:
- It provides access to the working table (as always) as well as access to the union table – which we now call the recurring table.
- Instead of simply appending new rows to the recurring table, the table now functions more like a dictionary (similar to a Python `dict`), allowing key-based updates.
To use this new feature, add the `USING KEY (...)` clause to your recursive CTE:
```sql
WITH RECURSIVE power(a, b, c) USING KEY (a)
    -- key: a, payload: b, c
    AS (
        ...
    )
```
With `USING KEY`, the schema of the recursive CTE is divided into key columns and payload columns. The key columns are specified using the `USING KEY (column names)` clause, while the remaining columns are treated as payload.
This division affects how the recurring table behaves. Instead of stubbornly appending new rows on each iteration, it acts more like a dictionary: if the recursive part returns a row that hasn't been seen before, it is added to the recurring table as usual. But if a row shares a key with an existing entry in the recurring table, the payload is updated – replacing the previous values for that key in the recurring table.
If multiple rows with the same key are produced in a single iteration, only the last one is retained. Therefore, you may wish to use the `ORDER BY` clause in the recursive part to control which row is kept.
This approach allows recursive queries to maintain and update state more efficiently, especially for algorithms in which keeping the latest (or “best”) value for a given key is crucial.
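As a contrived sketch of this last-row-wins rule (our own example, not from the queries above): three rows share the key `1` in a single iteration, and the `ORDER BY` decides which payload survives:

```sql
WITH RECURSIVE pick(k, v) USING KEY (k) AS (
    SELECT 1 AS k, NULL::INT AS v
    UNION
    (SELECT 1 AS k, t.v
     FROM pick, (VALUES (5), (3), (9)) AS t(v)
     WHERE pick.v IS NULL -- run the recursive part exactly once
     ORDER BY t.v DESC)   -- rows arrive as 9, 5, 3: the last one wins
)
FROM pick;                -- yields the single row (1, 3)
```

Sorting in descending order makes the smallest `v` arrive last and thus be the value retained for its key.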
Overwriting Old Intermediate Results
Let us pick up our recursive example query from above. Now we compute the powers of bases 2 and 3, as long as these are less than 100:
```sql
WITH RECURSIVE power(a, b, c) USING KEY (a) AS (
    FROM (VALUES (2, 0, 1), (3, 0, 1)) -- 2^0 = 1, 3^0 = 1
    UNION
    SELECT a, b+1, a * c               -- a^(b+1) = a * a^b
    FROM power                         -- reads the working table (contains two rows)
    WHERE a * c < 100
)
FROM power;                            -- reads the recurring table (contains two rows)
```
We start with two rows in the non-recursive part: one with a key (base) of `2` and another with a key of `3`. In the recursive part, we multiply the intermediate power by the base. This produces two rows, again with the keys `2` and `3`.
Unlike traditional recursive CTEs, we do not append these two new rows to the recurring table. Instead, we update the existing rows with keys `2` and `3` in the recurring table, overwriting their payload values. We thus keep only two rows throughout the entire computation, each holding the current power value (column `c`) for its key or base (column `a`).
Iteration | Output of Recursive Step | Working Table | Intermediate Table | Recurring Table |
---|---|---|---|---|
0 | SELECT 2, 0, 1 SELECT 3, 0, 1 | ∅ (no rows) | (2,0,1) (3,0,1) | (2,0,1) (3,0,1) |
1 | 2 * 1 = 2, 3 * 1 = 3 | (2,0,1) (3,0,1) | (2,1,2) (3,1,3) | (2,1,2) (3,1,3) |
2 | 2 * 2 = 4, 3 * 3 = 9 | (2,1,2) (3,1,3) | (2,2,4) (3,2,9) | (2,2,4) (3,2,9) |
3 | 2 * 4 = 8, 3 * 9 = 27 | (2,2,4) (3,2,9) | (2,3,8) (3,3,27) | (2,3,8) (3,3,27) |
4 | 2 * 8 = 16, 3 * 27 = 81 | (2,3,8) (3,3,27) | (2,4,16) (3,4,81) | (2,4,16) (3,4,81) |
5 | 2 * 16 = 32, 3 * 81 = 243 | (2,4,16) (3,4,81) | (2,5,32) | (2,5,32) (3,4,81) |
6 | 2 * 32 = 64 | (2,5,32) | (2,6,64) | (2,6,64) (3,4,81) |
7 | 2 * 64 = 128 | (2,6,64) | ∅ stop! | (2,6,64) (3,4,81) |
As we can see, the size of the recurring table remains constant: irrelevant computation history is overwritten, which reduces memory usage. The final recurring table reads:
```text
┌───────┬───────┬───────┐
│   a   │   b   │   c   │  -- a^b = c
│ int32 │ int32 │ int32 │
├───────┼───────┼───────┤
│     2 │     6 │    64 │
│     3 │     4 │    81 │
└───────┴───────┴───────┘
```
This behavior is especially useful for algorithms in which we are interested in the latest, best, or smallest value for a given key. The maximum number of rows in the recurring table is now bounded by the number of unique keys. Since the number of distinct keys in use is under the control of the recursive part, this can be a powerful advantage when working with sizable datasets.
A Change of Key
Now, should you happen to be interested in the history of the computation and be willing to invest the memory space, all that is required is a change of key. With
```sql
WITH RECURSIVE power(a, b, c)
    USING KEY (a, b) AS ( -- formerly: USING KEY (a)
        ...
    )
FROM power
ORDER BY a, b;
```
the iteration counter (or exponent) in column `b` is considered to be part of the key, too. The recurring table will thus track unique `(a, b)` combinations (i.e., base and exponent) and we are able to trace what was going on during the iteration:
```text
┌───────┬───────┬───────┐
│   a   │   b   │   c   │
│ int32 │ int32 │ int32 │
├───────┼───────┼───────┤
│     2 │     0 │     1 │
│     2 │     1 │     2 │
│     2 │     2 │     4 │
│     2 │     3 │     8 │
│     2 │     4 │    16 │
│     2 │     5 │    32 │
│     2 │     6 │    64 │
│     3 │     0 │     1 │
│     3 │     1 │     3 │
│     3 │     2 │     9 │
│     3 │     3 │    27 │
│     3 │     4 │    81 │
└───────┴───────┴───────┘
```
Accessing Relevant History
Another major difference from vanilla recursive CTEs: now that the size of the recurring table is under control, we can afford to reference it directly in the recursive part of the CTE. This allows us to access any intermediate result that has not been overwritten yet – regardless of which iteration computed it. No more amnesia! To access the recurring table, simply prefix the CTE name with the pseudo-schema name `recurring`:
```sql
WITH RECURSIVE t(...) USING KEY (...) AS (
    ...
    FROM recurring.t -- reads the recurring table while we iterate
)
...
```
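Here is a minimal sketch of this in action (ours, not from the examples above): Fibonacci numbers, where the recursive part needs a value computed two iterations ago – exactly what the recurring table provides:

```sql
WITH RECURSIVE fib(i, f) USING KEY (i) AS (
    FROM (VALUES (0, 0), (1, 1)) -- fib(0) = 0, fib(1) = 1
    UNION
    SELECT fib.i + 1, fib.f + prev.f
    FROM fib                     -- working table: row(s) from the previous iteration
    JOIN recurring.fib AS prev   -- recurring table: all rows kept so far
      ON (prev.i = fib.i - 1)    -- look up fib(i-1) from an earlier iteration
    WHERE fib.i < 10
)
FROM fib
ORDER BY i;                      -- 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55
```

A vanilla recursive CTE would have to carry both predecessor values along in every row; with `recurring.fib`, the older value can simply be looked up.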
`USING KEY` Can Unlock Performance Advantages
To further highlight the differences between vanilla and key-based CTEs, let's consider a more complex example using a graph dataset – a social network graph.
The graphs were derived from the LDBC Social Network Benchmark (SNB), which provides a generator for synthetic social network data. To make the graphs more manageable for use with the `WITH RECURSIVE` query, we further filtered them by person name, reducing their density.
In this dataset, nodes represent people while edges correspond to relationships between them.
The tables we’re working with are `Person(id)`, which contains all existing ids in the network, and `knows(person1id, person2id)`, in which each row holds a pair of people who know each other.
If you would like to try this out for yourself, begin by attaching the database to any DuckDB session:
```sql
ATTACH 'https://blobs.duckdb.org/data/using-key-graph.duckdb';
USE 'using-key-graph';
```
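A quick sanity check – entirely optional, and our addition rather than part of the original walkthrough – confirms the two tables are in place:

```sql
-- Count nodes and edges in the attached graph (sizes depend on the graph used).
SELECT (SELECT count(*) FROM Person) AS nodes,
       (SELECT count(*) FROM knows)  AS edges;
```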
Our goal is to compute the shortest path between all pairs of people in the social network. Since this is a problem with inherent quadratic complexity, we need to closely eye runtime and memory requirements. For each pair, we start by adding a row that has one person as the start node and the other as the target node. We then iteratively explore all individuals known by the start node, continuing the traversal until the target person is reached. Paths through the network are encoded by `via` nodes: to reach the target from the start node, first proceed to the `via` node (an immediate neighbor of the start node) – once there, use that node's `via` entry to continue your traversal.
A recursive CTE that implements this approach reads as follows:
```sql
WITH RECURSIVE paths(here, current, via, len, there, completed, found) AS (
    SELECT n1.id AS here, n1.id AS current, NULL::INT64 AS via,
           0 AS len, n2.id AS there, false AS completed, false AS found
    FROM Person AS n1 JOIN Person AS n2 ON (n2.id <> n1.id)
    UNION ALL
    SELECT paths.here,
           person2id AS current,
           coalesce(paths.via, knows.person2id) AS via,
           paths.len + 1 AS len,
           paths.there,
           bool_or(knows.person2id = paths.there)
               OVER (PARTITION BY paths.here, paths.there
                     ROWS BETWEEN UNBOUNDED PRECEDING
                          AND UNBOUNDED FOLLOWING) AS completed,
           knows.person2id = paths.there AS found
    FROM paths
    JOIN knows ON (paths.current = knows.person1id AND NOT paths.completed)
)
SELECT here, there, via, len
FROM paths WHERE found
ORDER BY len, here, there;
```
The working table of recursive CTE `paths` has seven columns:

- `here` refers to the person where the traversal starts,
- `there` refers to the target person we aim to reach,
- `via` indicates the immediate neighbor of the start node on the current path,
- `len` is the length of the current path, which is incremented by one for each step taken in the traversal.

The remaining three columns control the traversal logic and optimize the computation:

- `current` refers to the node that is currently being explored during the traversal,
- `found` indicates whether the current path has successfully reached the target person,
- `completed` tracks whether any path with the same `here` and `there` has already reached the target. This prevents further traversal for that pair once a shortest path has been found, thus avoiding the exploration of longer paths.
If we search in a large graph with many edges, the union table in a vanilla recursive CTE can grow large, potentially exceeding memory limits. This not only causes significant performance issues but can even lead to query crashes in extreme cases.
The new key-based CTEs can avoid this problem by changing how the state of the search is maintained. This enables a range of new algorithms to be expressed efficiently in SQL, including algorithms for finding the shortest paths in large graphs.
One such algorithm is Distance Vector Routing (DVR), a method for computing the shortest paths in a network based on node-local routing tables that indicate where to “hop next”.
- In DVR, each node maintains a routing table that records the length of the paths that reach other nodes.
- We use the recurring table to store these routing tables for all nodes in the network: a row `(here, there, via, len)` indicates that the first hop on the path from node `here` to node `there` is node `via`; the length of the overall path is `len`.
- When a shorter path to a target node is found, the corresponding routing table entry is updated.
- Routing updates found in the last iteration are distributed to neighboring nodes through the working table.
To check whether a newly incoming routing update improves a currently known path, we perform a lookup in the recurring table. If the new length is smaller than that of the known path, we update the entry and propagate the update to our immediate neighbors. This mechanism allows for efficient path finding even in very large graphs – see below how DVR outperforms the above approach based on a vanilla recursive CTE.
```sql
WITH RECURSIVE dvr(here, there, via, len) USING KEY (here, there) AS (
    -- initialize routing tables for all nodes, only the routes to
    -- immediate neighbors are known at this time
    SELECT n.person1id AS here, n.person2id AS there, n.person2id AS via, 1::DOUBLE AS len
    FROM knows AS n
    UNION
    (SELECT n.person1id AS here, dvr.there, dvr.here AS via, 1 + dvr.len AS len
     FROM dvr                          -- working table: routing updates shared by neighbors
     JOIN knows AS n ON
       (n.person2id = dvr.here AND     -- send update only to immediate neighbors
        n.person1id <> dvr.there)      -- no need to store a route to myself
     LEFT JOIN recurring.dvr AS rec ON -- recurring table (current routing tables)
       (rec.here = n.person1id AND rec.there = dvr.there) -- identify affected routing table entry
     WHERE 1 + dvr.len < coalesce(rec.len, 'Infinity'::DOUBLE) -- does the routing update improve the entry in the routing table?
     ORDER BY len                      -- shortest path first
    )
)
FROM dvr
ORDER BY len, here, there;
```
The larger the social network graph (A to G), the more pronounced the performance gap between vanilla recursive CTEs (REC) and the new `USING KEY` variant (KEY) becomes. The table below reports the number of rows processed across all iterations:
graph | nodes | edges | KEY (rows) | REC (rows) |
---|---|---|---|---|
A | 184 | 233 | 744 | 352,906 |
B | 322 | 903 | 8,232 | 40,732,577 |
C | 424 | 1,446 | 19,213 | 605,859,791 |
D | 484 | 2,049 | 30,871 | ❌ |
E | 1,119 | 8,809 | 255,425 | ❌ |
F | 1,481 | 14,256 | 491,880 | ❌ |
G | 1,618 | 16,619 | 607,926 | ❌ |
Even for the smallest graph, Graph A, the difference is significant: the REC CTE produces around 350,000 rows, while KEY generates only 744. As the graph size increases, the gap becomes even more striking. In Graph C, with 424 nodes and 1,446 edges, the REC approach processes over 600 million rows, while the KEY method handles fewer than 20,000. Although this isn't the largest graph in our benchmark, the REC approach is already approaching out-of-memory conditions (OOM, ❌).
This substantial difference in memory usage is only part of the story. The performance of the REC approach also degrades quickly. While both CTEs perform similarly on small graphs, REC becomes significantly slower as the graph grows – eventually crashing – whereas KEY continues to scale smoothly:
And that’s the beauty of the new `USING KEY` CTEs: they enable the more efficient expression of complex iterative algorithms by providing key-based control over the size of the intermediate results that are passed from iteration to iteration – memory pressure lessens, and runtime performance can go up significantly. If you're working with recursive CTEs in DuckDB, be sure to take advantage of this powerful addition – it may make a significant difference in your queries.