# Range Joins in DuckDB

*TL;DR: DuckDB has fully parallelised range joins that can efficiently join millions of range predicates.*

Range intersection joins are an important operation in areas such as
temporal analytics,
and occur when two inequality conditions are present in a join predicate.
Database implementations often rely on slow `O(N^2)`

algorithms that compare every pair of rows
for these operations.
Instead, DuckDB leverages its fast sorting logic to implement two highly optimized parallel join operators
for these kinds of range predicates, resulting in 20-30x faster queries.
With these operators, DuckDB can be used effectively in more time-series-oriented use cases.

## Introduction

Joining tables row-wise is one of the fundamental and distinguishing operations of the relational model.
A join connects two tables horizontally using some Boolean condition called a *predicate*.
This sounds straightforward, but how fast the join can be performed depends on the expressions in the predicate.
This has lead to the creation of different join algorithms that are optimised for different predicate types.

In this post, we will explain several join algorithms and their capabilities. In particular, we will describe a newly added “range join” algorithm that makes connecting tables on overlapping time intervals or multiple ordering conditions much faster.

### Flight Data

No, this part isn’t about ducks, but about air group flight statistics from the Battlestar Galactica reboot.
We have a couple of tables we will be using: `Pilots`

, `Crafts`

, `Missions`

and `Battles`

.
Some data was lost when the fleet dispersed, but hopefully this is enough to provide some “real life” examples!

The `Pilots`

table contains the pilots and their data that does not change (name, call sign, serial number):

id | callsign | name | serial |
---|---|---|---|

1 | Apollo | Lee Adama | 234567 |

2 | Starbuck | Kara Thrace | 462753 |

3 | Boomer | Sharon Valeri | 312743 |

4 | Kat | Louanne Katraine | 244977 |

5 | Hotdog | Brendan Costanza | 304871 |

6 | Husker | William Adama | 204971 |

… | … | … | … |

The `Crafts`

table contains all the various fighting craft
(ignoring the “Ship Of Theseus” problem of recycled parts!):

id | type | tailno |
---|---|---|

1 | Viper | N7242C |

2 | Viper | 2794NC |

3 | Raptor | 312 |

4 | Blackbird | N9999C |

… | … | … |

The `Missions`

table contains all the missions flown by pilots.
Missions have a `begin`

and `end`

time logged with the flight deck.
We will use some common pairings
(and an unusual mission at the end where Commander Adama flew his old Viper):

pid | cid | begin | end |
---|---|---|---|

2 | 2 | 3004-05-04 13:22:12 | 3004-05-04 15:05:49 |

1 | 2 | 3004-05-04 10:00:00 | 3004-05-04 18:19:12 |

3 | 3 | 3004-05-04 13:33:52 | 3004-05-05 19:12:21 |

6 | 1 | 3008-03-20 08:14:37 | 3008-03-20 10:21:15 |

… | … | … | … |

The `Battles`

table contains the time window of each
battle with the Cylons.

battle | begin | end |
---|---|---|

Fall of the Colonies | 3004-05-04 13:21:45 | 3004-05-05 02:47:16 |

Red Moon | 3004-05-28 07:55:27 | 3004-05-28 08:12:19 |

Tylium Asteroid | 3004-06-09 09:00:00 | 3004-06-09 11:14:29 |

Resurrection Ship | 3004-10-28 22:00:00 | 3004-10-28 23:47:05 |

… | … | … |

These last two tables (`Missions`

and `Battles`

) are examples of *state tables*.
An object in a state table has a state that runs between two time points.
For the battles, the state is just yes/no.
For the missions, the state is a pilot/craft combination.

### Equality Predicates

The most common type of join involves comparing one or more pairs of expressions for equality,
often a primary key and a foreign key.
For example, if we want a list of the craft flown by the pilots,
we can join the `Pilots`

table to the `Craft`

table through the `Missions`

table:

```
SELECT callsign, count(*), tailno
FROM Pilots p, Missions m, Crafts c
WHERE p.id = m.pid
AND c.id = m.cid
GROUP BY ALL
ORDER BY 2 DESC
```

This will give us a table like:

callsign | count(*) | tailno |
---|---|---|

Starbuck | 127 | 2794NC |

Boomer | 55 | R1234V |

Apollo | 3 | N7242C |

Husker | 1 | N7242C |

… | … | … |

### Range Predicates

The thing to notice in this example is that the conditions joining the tables are equalities connected with `AND`

s.
But relational joins can be defined using *any* Boolean predicate - even ones without equality or `AND`

.

One common operation in temporal databases is intersecting two state tables. Suppose we want to find the time intervals when each pilot was engaged in combat so we can compute combat hours for seniority? Vipers are launched quickly, but not before the battle has started, and there can be malfunctions or pilots may be delayed getting to the flight deck.

```
SELECT callsign, battle,
GREATEST(m.begin, b.begin) AS begin,
LEAST(m.end, b.end) AS end
FROM Pilots p, Missions m, Crafts c, Battles b
WHERE m.begin < b.end
AND b.begin < m.end
AND p.id = m.pid
AND c.id = m.cid
```

This join creates a set of records containing the call sign and period in combat for each pilot.
It handles the case where a pilot returns for a new craft, excludes patrol flights,
and even handles the situation when a patrol flight turns into combat!
This is because intersecting state tables this way produces a *joint state table* -
an important temporal database operation.
Here are a few rows from the result:

callsign | battle | begin | end |
---|---|---|---|

Starbuck | Fall of the Colonies | 3004-05-04 13:22:12 | 3004-05-04 15:05:49 |

Apollo | Fall of the Colonies | 3004-05-04 13:21:45 | 3004-05-04 18:19:12 |

Boomer | Fall of the Colonies | 3004-05-04 13:33:52 | 3004-05-05 02:47:16 |

… | … | … | … |

Apollo was already in flight when the first Cylon attack came,
so the query puts his `begin`

time for the battle at the start of the battle,
not when he launched for the decomissioning flyby.
Starbuck and Boomer were scrambled after the battle started,
but Boomer did not return until after the battle was effectively over,
so her `end`

time is moved back to the official end of the battle.

What is important here is that the join condition between the pilot/mission/craft relation and the battle table has no equalities in it. This kind of join is traditionally very expensive to compute, but as we will see, there are ways of speeding it up.

### Infinite Time

One common problem with populating state tables is how to represent the open edges. For example, the begin time for the first state might not be known, or the current state may not have ended yet.

Often such values are represented by `NULL`

s,
but this complicates the intersection query because comparing with `NULL`

yields `NULL`

.
This issue can be worked around by using `COALESE(end, <large timestamp>)`

,
but that adds a computation to every row, most of which don’t need it.
Another approach is to just use `<large timestamp>`

directly instead of the `NULL`

,
which solves the expression computation problem but introduces an arbitrary time value.
This value may give strange results when used in computations.

DuckDB provides a third alterantive from Postgres that can be used for these situations:
infinite time values.
Infinite time values will compare as expected, but arithmetic with them will produce `NULL`

s or infinities,
indicating that the computation is not well defined.

## Common Join Algorithms

To see why these joins can be expensive, let’s start by looking at the two most common join algorithms.

### Hash Joins

Joins with at least one equality condition `AND`

ed to the rest of the conditions are called *equi-joins*.
They are usually implemented using a hash table like this:

```
result = []
hashes = {}
for b in build:
hashes[b.pk] = b
for p in probe:
result.append((p, hashes[p.fk], ))
```

The expressions from one side (the *build* side) are computed and hashed,
then the corresponding expressions from the other side (the *probe* side)
are looked up in the hash table and checked for a match.

We can modify this a bit when only *some* of the `AND`

ed conditions are equalities
by checking the other conditions once we find the equalities in the hash table.
The important point is that we can use a hash table to make the join run time `O(N)`

.
This modification is a general technique that can be used with any join algorithm which reduces the possible matches.

### Nested Loop Joins

Since relational joins can be defined using *any* Boolean predicate - even one without equality or `AND`

,
hash joins do not always work.
The join algorithm of last resort in these situations is called a *Nested Loop Join* (or NLJ for short),
and consists of just comparing every row from the probe side with every row from the build side:

```
result = []
for p in probe:
for b in build
if compare(p, b):
result.append((p, b, ))
```

This is `O(M x N)`

in the number of rows, which can be very slow if the tables are large.
Even worse, most practical analytic queries (such as the combat hours example above)
will not return anything like this many results, so a lot of effort may be wasted.
But without an algorithm that is tuned for a kind of predicate,
this is what we would have to use.

## Range Joins

When we have a range comparison (one of `<`

, `<=`

`>`

, `>=`

) as one of the join conditions,
we can take advantage of the ordering it implies by sorting the input relations on some of the join conditions.
Sorting is `O(N log N)`

, which suggests that this could be faster than an NLJ,
and indeed this turns out to be the case.

### Piecewise Merge Join

Before the advent of hash joins, databases would often sort the join inputs to find matches.
For equi-joins, a repeated binary search would then find the matching values on the build side in `O(M log N)`

time.
This is called a *Merge Join*, and it runs faster than `O(M x N)`

, but not as fast as the `O(N)`

time of a hash join.
Still, in the case where we have a single range comparison,
the binary search lets us find the first match for a probe value.
We can then find all the remaining matches by looking after the first one.

If we also sort the probe side, we can even know where to start the search for the next probe value
because it will be after where we found the previous value.
This is how *Piecewise Merge Join* (PWMJ) works:
We sort the build side so that the values are ordered by the predicate (either `ASC`

or `DESC`

),
then sort each probe chunk the same way so we can quickly scan through sets of values to find possible matches.
This can be significantly faster than NLJ for these types of queries.
If there are more join conditions, we can then check the generated matches to make sure all conditions are met
because once again the sorting has significantly reduced the number of checks that have to be made.

### Inequality Join (IEJoin)

For two range conditions (like the combat pay query), there are even faster algorithms available. We have recently added a new join called IEJoin, which sorts on two predicates to really speed things up.

The way that IEJoin works is to first sort both tables on the values for the first condition and merge the two sort keys into a combined table that tracks the two input tables’ row numbers. Next, it sorts the positions in the combined table on the second range condition. It can then quickly scan for matches that pass both conditions. And just like for hash joins, we can check any remaining conditions because we have hopefully significantly reduced the number pairs we have to test.

#### Walk Through

Because the algorithm is a bit tricky, let’s step through a small example.
(If you are reading the paper, this is a simplified version of the “Union Arrays” optimisation from §4.3,
but I find this version of the algorithm is much easier to understand than the version in §3.1.)
We are going to look at `Qp`

from the paper, which is a self join on the table “West”:

West | t_id | time | cost | cores |
---|---|---|---|---|

s1 | 404 | 100 | 6 | 4 |

s2 | 498 | 140 | 11 | 2 |

s3 | 676 | 80 | 10 | 1 |

s4 | 742 | 90 | 5 | 4 |

We are looking for pairs of billing ids where the second id had a shorter time than the first, but a higher cost:

```
SELECT s1.t_id, s2.t_id AS t_id2
FROM west s1, west s2
WHERE s1.time > s2.time
AND s1.cost < s2.cost
```

There are two pairs that meet this criteria:

t_id | t_id2 |
---|---|

404 | 676 |

742 | 676 |

(This is an example of another kind of double range query where we are looking for anomalies.)

First, we sort both input tables on the first condition key (`time`

).
(We sort `DESC`

because we want the values to satisfy the join condition (`>`

) from left to right.)

Because they are sorted the same way,
we can merge the condition keys from the sorted tables into a new table called `L1`

after marking each row with the table it came from (using negative row numbers to indicate the right table):

L1 | s2 | s2 | s1 | s1 | s4 | s4 | s3 | s3 |
---|---|---|---|---|---|---|---|---|

time | 140 | 140 | 100 | 100 | 90 | 90 | 80 | 80 |

cost | 11 | 11 | 6 | 6 | 5 | 5 | 10 | 10 |

rid | 1 | -1 | 2 | -2 | 3 | -3 | 4 | -4 |

The `rid`

column lets us map rows in `L1`

back to the original table.

Next, we build a second table `L2`

with the second condition key (`cost`

) and the row positions (`P`

) of `L1`

(not the row numbers from the original tables!)
We sort `L2`

on `cost`

(`DESC`

again this time because now we want the join condition to hold from right to left):

L2 | s2 | s2 | s3 | s3 | s1 | s1 | s4 | s4 |
---|---|---|---|---|---|---|---|---|

cost | 11 | 11 | 10 | 10 | 6 | 6 | 5 | 5 |

P | 0 | 1 | 6 | 7 | 2 | 3 | 4 | 5 |

The sorted column of `L1`

row positions is called the *permutation array*,
and we can use it to find the corresponding position of the `time`

value for a given `cost`

.

At this point we have two tables (`L1`

and `L2`

),
each sorted on one of the join conditions and pointing back to the tables it was derived from.
Moreover, the sort orders have been chosen so that the condition holds from left to right
(resp. right to left).
Since the conditions are transitive,
this means that whenever we have a value that satisfies a condition at a point in the table,
it also satisfies it for everything to the right (resp. left)!

With this setup, we can scan `L2`

from left to right
looking for rows that match both conditions using two indexes:

`i`

iterates across`L2`

from left to right;`off2`

tracks`i`

and is used to identify`costs`

that satisfy the join condition compared to`i`

. (Note that for loose inequalities, this could be to the right of`i`

);

We use a bitmap `B`

to track which rows in `L1`

that the `L2`

scan
has already identified as satisfying the `cost`

condition compared to the `L2`

scan position `i`

.

Because we only want matches between one left and one right row, we can skip matches where the `rid`

s have different signs.
To leverage this observation, we only process values of `i`

that are in the left hand table (`rid[P[i]]`

is positive),
and we only mark bits for rows in the right hand table (`rid[P[i]]`

is negative).
In this example, the right side rows are the odd numbered values in `P`

(which are conveniently also the odd values of `i`

),
which makes them easy to track in the example.

For the other rows, here is what happens:

i | off2 | cost[i] | cost[off2] | P[i] | rid[P[i]] | B | Result |
---|---|---|---|---|---|---|---|

0 | 0 | 11 | 11 | 0 | 1 | `00000000` |
[] |

2 | 0..2 | 10 | 11..10 | 6 | 4 | `01000000` |
[] |

4 | 2..4 | 6 | 10..6 | 2 | 2 | `01000001` |
[{s4, s3}] |

6 | 4..6 | 5 | 6..5 | 4 | 3 | `01010001` |
[{s1, s3}] |

Whenever we find `cost`

s that satisfy the condition to the left of the scan location (between `off2`

and `i`

),
we use `P[off2]`

to mark the bits in `B`

corresponding to those positions in `L1`

that reference right side rows.
This records that the `cost`

condition is satisfied for those rows.
Then whenever we have a position `P[i]`

in `L1`

,
we can scan `B`

to the right to find values that also satisfy the `cost`

condition.
This works because everything to the right of `P[i]`

in `L1`

satisfies the `price`

condition
thanks the sort order of `L1`

and the transitivity of the comparison operations.

In more detail:

- When
`i`

and`off2`

are`0`

, the`cost`

condition`<`

is not satisfied, so nothing happens; - When
`i`

is`1`

, we are looking at a row from the right side of the join, so we skip it and move on; - When
`i`

is`2`

, we are now looking at a row from the left side, so we bring`off2`

forward until the`cost`

condition fails, marking`B`

where it succeeds at`P[1] = [1]`

; - We then scan the
`time`

values in`L1`

right from position`P[i=2] = 6`

and find no matches in`B`

; - When
`i`

is`4`

, we bring`off2`

forward again, marking`B`

at`P[3] = [7]`

; - We then scan
`time`

from position`2`

and find matches at`[6,7]`

, one of which (`6`

) is from the right side table; - When
`i`

is`6`

, we bring`off2`

forward again, marking`B`

at`P[5] = [3]`

; - We then scan
`time`

from position`4`

and again find matches at`[6,7]`

; - Finally, when
`i`

runs off the end, we have no new`cost`

values, so nothing happens;

What makes this fast is that we only have to check a few bits to find the matches. When we do need to perform comparisons, we can use the fast radix comparison code from our sorting code, which doesn’t require special templated versions for every data type. This not only reduces the code size and complexity, it “future-proofs” it against new data types.

#### Further Details

That walk through is a slightly simplified, single threaded version of the actual algorithm. There are a few more details that may be of interest:

- Scanning large, mostly empty bit maps can be slow, so we use the Bloom filter optimisation from §4.2.
- The published algorithm assumes that there are no duplicate
`L1`

values in either table. To handle the general case, we use an exponential search to find the first`L1`

value that satisfies the predicate with respect to the current position and scan right from that point; - We also adapted the distributed Algorithm 3 from §5 by joining pairs of the sorted blocks generated by the sort code on separate threads. This allows us to fully parallelise the operator by first using parallel sorting and then by breaking up the join into independent pieces;
- Breaking up the pieces for parallel execution also allows us to spool join blocks that are not being processed to disk, making the join scalable.

## Special Joins

One of the nice things about IEJoin is that it is very general and implements a number of more specialised join types reasonably efficiently.
For example, the state intersection query above is an example of an *interval join*
where we are looking to join on the intersection of two intervals.

Another specialised join that can be accelerated with `IEJoin`

is a *band join*.
This can be used to join values that are “close” to each other

```
SELECT r.id, s.id
FROM r, s
WHERE r.value - s.value BETWEEN a AND b
```

This translates into a double inequality join condition:

```
SELECT r.id, s.id
FROM r, s
WHERE s.value + a <= r.value AND r.value <= s.value + b
```

which is exactly the type of join expression that IEJoin handles.

## Performance

So how fast is the IEJoin? It is so fast that it is difficult to compare it to the previous range join algorithms because the improvements are so large that the other algorithms do not complete in a reasonable amount of time!

### Simple Measurements

To give an example, here are the run times for a 100K self join of some employee tax and salary data, where the goal is to find the 1001 pairs of employees where one has a higher salary but the other has a higher tax rate:

```
SELECT
r.id,
s.id
FROM Employees r
JOIN Employees s
ON r.salary < s.salary
AND r.tax > s.tax;
```

Algorithm | Time (s) |
---|---|

NLJ | 21.440 |

PWMJ | 38.698 |

IEJoin | 0.280 |

Another example is a self join to find 3772 overlapping events in a 30K event table:

```
SELECT
r.id,
s.id
FROM events r
JOIN events s
ON r.start <= s.end
AND r.end >= s.start
AND r.id <> s.id
```

Algorithm | Time (s) |
---|---|

NLJ | 6.985 |

PWMJ | 4.780 |

IEJoin | 0.226 |

In both cases we see performance improvements of 20-100x, which is very helpful when you run a lot of queries like these!

### Optimisation Measurements

A third example demonstrates the importance of the join pair filtering and exponential search optimisations. The data is a state table of library circulation data from another interval join paper, and the query is a point-in-period temporal query used to generate Figure 4d:

```
SELECT x, COUNT(*) AS y
FROM books,
(SELECT x FROM range('2013-01-01'::TIMESTAMP, '2014-01-01'::TIMESTAMP, INTERVAL 1 DAY) tbl(x)) dates
WHERE checkout <= x AND x <= return
GROUP BY ALL
ORDER BY 1
```

The result is a count of the number of books checked out at midnight on each day. These are the runtimes on an 18 core iMac Pro:

Improvement | Time (s) | CPU |
---|---|---|

Unoptimised | > 30m | ~100% |

Filtering | 119.76s | 269% |

Exponential | 11.21s | 571% |

The query joins a 35M row table with a 365 row table, so most of the data comes from the left hand side.
By avoiding setting bits for the matching rows in the left table, we eliminate almost all `L1`

checks.
This dramatically reduces the runtime and improved the CPU utilisation.

The data also has a large number of rows corresponding to books that were checked out at the start of the year,
which all have the same `checkout`

date.
Searching left linearly in the first block to find the first match for the scan
resulted in repeated runs of ~120K comparisons.
This caused the runtime to be completely dominated by processing the first block.
By reducing the number of comparisons for these rows from an average of ~60K to 16,
the runtime dropped by a factor of 10 and the CPU utilisation doubled.

## Conclusion and Feedback

In this blog post, we explained the new DuckDB range join improvements provided by the new IEJoin operator. This should greatly improve the response time of state table joins and anomaly detection joins. We hope this makes your DuckDB experience even better - and please let us know if you run into any problems! Feel free to reach out on our GitHub page, or our Discord server.