Spatial Joins in DuckDB
TL;DR: DuckDB v1.3.0 significantly improved the scalability of geospatial joins with a dedicated SPATIAL_JOIN operator.
Introduction
Spatial joins are join operations that match rows based on the (geo-)spatial relationship between their geometry columns. In practice, they are often used to answer questions like “which of these points fall within which of these polygons?”. Being able to connect datasets based on the physical locations they model is fundamental to the whole field of geospatial data science. At a higher level, it is also an extremely powerful way to correlate and enrich otherwise disparate data sources, anchor insights to the real world, and, more often than not, tell a great story with your analysis.
Ever since its inception, DuckDB's spatial extension has provided a GEOMETRY column type to represent locations, regions and shapes, along with plenty of spatial predicate functions to use when JOINing. However, it wasn't until recently, in DuckDB v1.3.0, that spatial joins really became scalable, thanks to the introduction of the dedicated SPATIAL_JOIN query operator.
We managed to hastily sneak in some information about this new operator towards the end of the v1.3.0 release notes, but we figured we had enough to talk about to justify a separate blog post. So in this post, we will take a closer look at some of the challenges involved in optimizing spatial joins, and how the new SPATIAL_JOIN operator addresses them to achieve new levels of efficiency.
What Does a Spatial Join Look Like Again?
Performing a spatial join in SQL is actually very straightforward. There is no special syntax or magic incantation required. As you probably could've guessed, you just have to JOIN two tables with a GEOMETRY column using some spatial predicate. Below is a simple example that joins two tables, some_table and another_table, on the condition that the geometries in some_table.geom and another_table.geom “intersect”, using the ST_Intersects spatial predicate function:
SELECT *
FROM some_table
JOIN another_table
ON ST_Intersects(some_table.geom, another_table.geom);
Let's try to do something more advanced. We're going to analyze the NYC Citi Bike Trip dataset, which contains around 58 million rows representing rental bike trips in New York City, including the start and end locations of each trip. We want to find the neighborhoods in NYC where the most bike trips start. Therefore, we join the bike trip data with a dataset of NYC neighborhood polygons, group the trips by the neighborhood they start in and count them, and finally sort and limit the results to return the top-3 neighborhoods in which most trips originate. We've compiled the datasets and extracted the relevant columns for this example into two tables, rides and hoods, in a 218 MB DuckDB database file.
Let's make sure the spatial
extension is installed and loaded:
INSTALL spatial; -- Install the spatial extension
LOAD spatial; -- Load the spatial extension
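If you'd rather assemble similar tables yourself from raw files instead of using a pre-built database file, a minimal sketch could look like the following. The file and column names here are hypothetical placeholders, not the exact sources we used; ST_Read can ingest most common vector formats, and the trip CSV is assumed to expose the start coordinates as plain longitude/latitude columns:
-- Hypothetical source files and column names, shown only as a sketch
CREATE TABLE hoods AS
    SELECT neighborhood, geom
    FROM ST_Read('nyc_neighborhoods.geojson');

CREATE TABLE rides AS
    SELECT ST_Point(start_lng, start_lat) AS start_geom
    FROM read_csv('citibike_trips.csv');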
Our query looks like this:
SELECT neighborhood, count(*) AS num_rides
FROM rides
JOIN hoods ON ST_Intersects(rides.start_geom, hoods.geom)
GROUP BY neighborhood
ORDER BY num_rides DESC
LIMIT 3;
┌──────────────┬───────────┐
│ neighborhood │ num_rides │
│ varchar │ int64 │
├──────────────┼───────────┤
│ Midtown │ 6253835 │
│ Chelsea │ 6238693 │
│ East Village │ 4047396 │
└──────────────┴───────────┘
This query joins the 58,033,724 rides with the 310 neighborhood polygons and takes about 30 seconds to run on my laptop (MacBook with an M3 Pro CPU and 36 GB RAM).
While 30 seconds might not seem that impressive at first glance (a HASH_JOIN over similarly sized input would finish an order of magnitude faster), I'm actually pleased DuckDB is able to execute a spatial join at this scale in a time frame that is, if not great, still acceptable for exploratory analysis (on a laptop no less!). To understand the difference in execution time compared to, e.g., a HASH_JOIN, we first need to take a closer look at how DuckDB used to do spatial joins and why spatial joins are so challenging to optimize. Then, we'll dive into how the new SPATIAL_JOIN operator changes the game.
How DuckDB (Used to) Do Spatial Joins
Spatial Predicates
The first thing to understand is that a spatial predicate is really just a function that evaluates some spatial relation between two geometries and returns true or false, e.g., “does a contain b” or “is a within distance x of b”. In theory you could write your own spatial predicate function, but in practice, you will probably use one of the many that come with the spatial extension. The nuances of the different spatial predicates are beyond the scope of this post, but here is a quick overview of the most commonly used ones, taking two geometries a and b:
| Function | Description |
|---|---|
| ST_Intersects(a, b) | Whether a intersects b |
| ST_Contains(a, b) | Whether a contains b |
| ST_ContainsProperly(a, b) | Whether a contains b without b touching a's boundary |
| ST_Within(a, b) | Whether a is within b |
| ST_Overlaps(a, b) | Whether a overlaps b |
| ST_Touches(a, b) | Whether a touches b |
| ST_Equals(a, b) | Whether a is equal to b |
| ST_Crosses(a, b) | Whether a crosses b |
| ST_Covers(a, b) | Whether a covers b |
| ST_CoveredBy(a, b) | Whether a is covered by b |
| ST_DWithin(a, b, x) | Whether a is within distance x of b |
We use ST_Intersects in the example query above to keep it simple, even if ST_Contains or ST_ContainsProperly might be more appropriate for point-in-polygon joins.
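These predicates are ordinary scalar functions, so you can also call them directly on geometry literals outside of any join. Here is a quick sanity check (the coordinates are just made up for illustration):
SELECT ST_Intersects(
    ST_GeomFromText('POINT (1 1)'),
    ST_GeomFromText('POLYGON ((0 0, 2 0, 2 2, 0 2, 0 0))')
) AS is_inside;
-- returns true: the point lies inside the square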
Since all the above spatial predicates live in the spatial extension, DuckDB's query planner doesn't really know anything about them, except their names and that they are functions that take two GEOMETRY arguments and return a boolean value. This means that vanilla DuckDB on its own can't apply any special optimizations to them and has to treat them like any other function in the database.
Optimization Challenges
When DuckDB plans a join where the join condition is an arbitrary function, it normally can't use any of the advanced built-in join strategies like HASH_JOIN or RANGE_JOIN, which are optimized for equality or range comparisons. Instead, DuckDB has to fall back to the simplest join strategy, which is to perform a nested loop join (NLJ), i.e., to evaluate the join condition for every possible pair of rows in the two tables. In pseudo-code, a nested loop join would look something like this:
for row_a in table_a:
for row_b in table_b:
if join_condition(row_a, row_b):
emit(row_a, row_b)
This is the most general way to implement a join, but it is also the least efficient. Since the complexity of the join is O(nm), where n and m are the number of rows in the two tables, the time it takes to perform the join grows quadratically with the size of the input. For our example, that would mean roughly 58 million × 310 ≈ 18 billion spatial predicate evaluations.
Complexity Challenges
Obviously, quadratic complexity quickly becomes infeasible for large joins, but DuckDB's raw execution power usually makes it bearable when joining small- to medium-sized tables.
However, what makes the nested loop join strategy impractical for spatial joins in particular, even at small to medium scale, is that spatial predicates can be very computationally expensive to evaluate in the first place. This is mostly due to the sheer complexity of the algorithms involved, but also due to the denormalized nature of geometries, in which a single geometry can contain a very large number of points. Besides the inherent theoretical complexity, the reality in spatial is that most of the spatial predicates are implemented with the help of third-party libraries and require a deserialization step to convert the internal binary format of the GEOMETRY column into an executable data structure. This usually also requires allocating memory outside of DuckDB's own memory management system, which increases pressure and lock contention on the global memory allocator, limiting the effectiveness of additional parallelism.
To illustrate how this plays out in practice, let's take a look at the query plan for the example query above, but with the SPATIAL_JOIN optimization disabled to force DuckDB to use a nested loop join instead:
Experiment
LOAD spatial; -- Load the spatial extension
-- Disable the spatial join optimizer
SET disabled_optimizers = 'extension';
-- Print the new plan, using EXPLAIN to verify
-- that we are using a nested loop join
EXPLAIN SELECT neighborhood, count(*) AS num_rides
FROM rides
JOIN hoods ON ST_Intersects(rides.start_geom, hoods.geom)
GROUP BY neighborhood
ORDER BY num_rides DESC
LIMIT 3;
Result of EXPLAIN
┌───────────────────────────┐
│ TOP_N │
│ Top: 3 │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ HASH_GROUP_BY │
│ ... │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ... │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ BLOCKWISE_NL_JOIN │
│ ──────────────────── │
│ Join Type: INNER │
│ ├──────────────┐
│ Condition: │ │
│ ST_Intersects(start_geom, │ │
│ geom) │ │
└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ SEQ_SCAN ││ SEQ_SCAN │
│ ──────────────────── ││ ──────────────────── │
│ Table: rides ││ Table: hoods │
│ ... ││ ... │
│ ~58033724 Rows ││ ~310 Rows │
└───────────────────────────┘└───────────────────────────┘
Executing this query with the BLOCKWISE_NL_JOIN (blockwise nested loop join) now takes 30 minutes on my laptop. I've also run the query on subsets of the rides table to see how the performance scales with the size of the input. The results are as follows:
| Number of rows in rides | Execution time (s) | Execution time (min) |
|---|---|---|
| 1,000,000 | 30.8 s | 0.5 min |
| 10,000,000 | 310.3 s | 5.2 min |
| 58,033,724 | 1799.6 s | 30.0 min |
This is obviously not great. Running just 1/58th of the full dataset through the nested loop join already takes longer than the original query with the spatial join optimization enabled, and processing the full dataset takes almost half an hour! This is a clear indication that the nested loop join strategy is not suitable for spatial joins at scale, and we really needed to do something about it.
The IE-Join Optimization
In order to improve the situation, one of the first optimizations implemented in the early versions of the spatial extension was a re-write rule for the query planner.
Since evaluating the spatial predicate function itself is so expensive, it's a good idea to first do some cheap filtering of potential join matches in order to reduce the number of pairs that actually need to go through the precise spatial relationship check. Luckily for us, this is relatively easy to do for most spatial predicates since they share a common property: almost all spatial predicates imply intersection in some way, and if two geometries intersect, their bounding boxes must also intersect.
The bounding box, sometimes called the minimum bounding rectangle (MBR), of a geometry is the smallest rectangle that fully contains the geometry. Because a bounding box is just the minimum and maximum x and y coordinates of all vertices in the geometry, it is relatively efficient to compute in a single pass over the geometry. Additionally, checking whether two bounding boxes intersect is very cheap, as it can be done in constant time with a few simple less-than and greater-than comparisons.
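Written out by hand against the first example query, such a bounding-box pre-filter could look something like the sketch below. This is purely illustrative; as we'll see, the actual re-write uses bounding boxes cached in the geometry's binary representation (exposed via ST_Extent_Approx) rather than these hand-written expressions:
SELECT *
FROM some_table
JOIN another_table
  ON ST_XMin(some_table.geom) <= ST_XMax(another_table.geom)  -- boxes overlap in x ...
 AND ST_XMax(some_table.geom) >= ST_XMin(another_table.geom)
 AND ST_YMin(some_table.geom) <= ST_YMax(another_table.geom)  -- ... and overlap in y
 AND ST_YMax(some_table.geom) >= ST_YMin(another_table.geom)
 AND ST_Intersects(some_table.geom, another_table.geom);      -- precise check on the surviving pairs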
Since bounding box intersection can be expressed as a series of inequality checks, we can make use of DuckDB's existing inequality join functionality to perform a bounding-box join! Explaining the inner workings of inequality joins is beyond the scope of this post because Richard has already done a great job of that in his blog post on range joins, but the gist is that it is a join type that can efficiently join two tables using a series of inequality operators like <, >, <=, >=, etc.
To leverage this, we made the following changes to spatial:

- Cache an approximate bounding box for each geometry in the serialized binary representation of the GEOMETRY column, so that we don't have to recompute it every time we need to evaluate a spatial predicate.
- Introduce a re-write rule to change any inner nested loop join operator with a spatial predicate into two new operators:
  - a PIECEWISE_MERGE_JOIN, an “IE” (inequality) range-join operator joining on the intersection of the bounding boxes, written as a series of BETWEEN clauses on the min/max values of the bounding boxes, and
  - a FILTER operator that filters the resulting rows on the actual spatial predicate function.
To show how this affects the query plan, we're going to go back in time and use DuckDB v1.2.0 (which does contain the PIECEWISE_MERGE_JOIN optimization, but not the SPATIAL_JOIN operator) to re-run the query from above.
LOAD spatial; -- Load the spatial extension
-- Check that we are using DuckDB v1.2.0
PRAGMA version;
┌─────────────────┬────────────┐
│ library_version │ source_id │
│ varchar │ varchar │
├─────────────────┼────────────┤
│ v1.2.0 │ 5f5512b827 │
└─────────────────┴────────────┘
-- Print the new query plan, using EXPLAIN
EXPLAIN SELECT neighborhood, count(*) AS num_rides
FROM rides
JOIN hoods ON ST_Intersects(rides.start_geom, hoods.geom)
GROUP BY neighborhood
ORDER BY num_rides DESC
LIMIT 3;
Result of EXPLAIN
┌───────────────────────────┐
│ TOP_N │
│ ──────────────────── │
│ Top: 3 │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ HASH_GROUP_BY │
│ ──────────────────── │
│ ... │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ... │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ FILTER │
│ ──────────────────── │
│ ST_Intersects(start_geom, │ -- Here we evaluate the actual
│ geom) │ -- precise spatial predicate
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PIECEWISE_MERGE_JOIN │
│ ──────────────────── │
│ Join Type: INNER │
│ │
│ Conditions: │
│ ST_XMin(ST_Extent_Approx │ -- Very complex join condition,
│ (start_geom)) <= ST_XMax │ -- but its all just inequality checks!
│ (ST_Extent_Approx(geom)) │
│ ST_XMax(ST_Extent_Approx │
│ (start_geom)) >= ST_XMin ├──────────────┐
│ (ST_Extent_Approx(geom)) │ │
│ ST_YMin(ST_Extent_Approx │ │
│ (start_geom)) <= ST_YMax │ │
│ (ST_Extent_Approx(geom)) │ │
│ ST_YMax(ST_Extent_Approx │ │
│ (start_geom)) >= ST_YMin │ │
│ (ST_Extent_Approx(geom)) │ │
│ │ │
│ ~58033724 Rows │ │
└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ SEQ_SCAN ││ SEQ_SCAN │
│ ──────────────────── ││ ──────────────────── │
│ ... ││ ... │
│ ~58033724 Rows ││ ~310 Rows │
└───────────────────────────┘└───────────────────────────┘
As you can see, the PIECEWISE_MERGE_JOIN operator has replaced the BLOCKWISE_NL_JOIN operator, and the join condition has been rewritten to use the approximate bounding boxes of the geometries (ST_Extent_Approx) instead of joining directly on the spatial predicate. A FILTER operator is added on top to apply the spatial predicate function only to the rows that pass the initial bounding box check.
This looks a lot more complicated than the original nested loop join, so how does it perform?
| Number of rows in rides | Execution time |
|---|---|
| 1,000,000 | 2.3 s |
| 10,000,000 | 19.6 s |
| 58,033,724 | 107.6 s |
Much better: we managed to reduce the execution time from 30 minutes to just under 2 minutes for the full dataset! This was a huge improvement. Without requiring any custom operator code, we were able to leverage DuckDB's existing inequality join functionality to significantly speed up spatial joins as well.
However, this approach still has some drawbacks:

- This re-write is only possible to do for INNER joins.
- We now have to store a bounding box for each geometry, which increases the memory footprint of the GEOMETRY column, and we also have to recompute it whenever we create or modify a GEOMETRY.
- The PIECEWISE_MERGE_JOIN operator has to sort the two input tables, which requires a lot of memory and can be really slow for large tables once the data no longer fits in memory.
DuckDB has since optimized sorting considerably, and spatial and its functions have also received various optimizations that improve performance slightly. Nonetheless, this was always somewhat of a stop-gap solution to make spatial joins at least usable in the common case. How can we do better?
The Spatial Join Operator
R-Tree Indexing
To quickly recap how an R-tree works: it is a balanced tree data structure where internal nodes contain bounding boxes that cover the bounding boxes of their child nodes, and leaf nodes contain a bounding box for a geometry and a pointer to the row in the table it corresponds to. This makes it possible to quickly look up all geometries that intersect a given bounding box by traversing the tree from the top down, only following branches whose bounding boxes intersect the search box.
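For reference, this is how a persistent R-tree index is created with the spatial extension (the index name here is just illustrative). As we'll see below, the new join operator builds a temporary R-tree on the fly, so no explicit CREATE INDEX is needed to benefit from it:
-- Index the neighborhood polygons by their bounding boxes
CREATE INDEX hoods_geom_idx ON hoods USING RTREE (geom);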
One of the motivations for implementing the RTREE index in the spatial extension last year was to eventually be able to replace the PIECEWISE_MERGE_JOIN optimization with an index-based join strategy. However, after implementing the RTREE, we realized that creating an R-tree index on top of existing data is actually surprisingly fast. Instead of requiring the user to first load data into a table and then index it just to accelerate their spatial joins, what if we just create a temporary R-tree index on-the-fly when executing the join? Kind of like how we create a temporary hash table when performing a HASH_JOIN.
Joining with R-Trees
This is exactly what the new SPATIAL_JOIN operator does. It takes two input tables, buffers all the input of the “right” table (i.e., the table that the join-order optimizer expects to be smaller), builds an R-tree index on top of the collected input, and then performs the join by looking up each row of the “left” table in the R-tree index as it arrives. The idea is exactly the same as with a HASH_JOIN, but instead of using a hash table, we use an R-tree as our acceleration data structure to look up the matching rows.
While being a lot more complex to implement compared to the PIECEWISE_MERGE_JOIN-based approach, this has several advantages:

- The query re-write rule is much simpler: we just have to detect a nested loop join with a spatial predicate and replace it with a SPATIAL_JOIN operator. No other modifications to the query plan are needed.
- Only the smaller side of the input needs to be materialized (and sorted to construct the R-tree). The left side can be streamed in parallel, and matching rows can be emitted immediately as they are found.
- The hierarchical structure of the R-tree allows us to prune the right side input more accurately.
- Since all logic is in one operator, we don't need to rely on a separate FILTER operator and can therefore support LEFT, RIGHT and FULL OUTER joins as well as INNER joins (see the sketch below).
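For example, an outer-join variant of our earlier query now runs through the same operator. The sketch below reuses the rides and hoods tables and keeps trips that don't start in any neighborhood in a NULL group:
SELECT hoods.neighborhood, count(*) AS num_rides
FROM rides
LEFT JOIN hoods ON ST_Intersects(rides.start_geom, hoods.geom)
GROUP BY hoods.neighborhood
ORDER BY num_rides DESC;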
Experiment
Just to show what this plan looks like, this is what you get when you run the same query but on DuckDB v1.3.0 or later with the SPATIAL_JOIN operator enabled:
LOAD spatial; -- Load the spatial extension
-- Print the new query plan, using EXPLAIN
EXPLAIN SELECT neighborhood, count(*) AS num_rides
FROM rides
JOIN hoods ON ST_Intersects(rides.start_geom, hoods.geom)
GROUP BY neighborhood
ORDER BY num_rides DESC
LIMIT 3;
Result of EXPLAIN
┌───────────────────────────┐
│ TOP_N │
│ ──────────────────── │
│ Top: 3 │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ HASH_GROUP_BY │
│ ──────────────────── │
│ ... │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ──────────────────── │
│ ... │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ SPATIAL_JOIN │
│ ──────────────────── │
│ Join Type: INNER │
│ │
│ Conditions: ├──────────────┐
│ ST_Intersects(start_geom, │ │
│ geom) │ │
│ │ │
│ ~58033724 Rows │ │
└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ SEQ_SCAN ││ SEQ_SCAN │
│ ──────────────────── ││ ──────────────────── │
│ ... ││ ... │
│ ~58033724 Rows ││ ~310 Rows │
└───────────────────────────┘└───────────────────────────┘
Nice, we're back to a single join condition again. So how does this perform? We already saw in the first example that DuckDB v1.3.2 was able to execute the original query in about 30 seconds using the SPATIAL_JOIN operator, but here are the precise results for all the different sizes and join strategies for comparison. All queries were run on the same Apple MacBook M3 Pro with 36 GB of memory, and timings were averaged over 3 runs:
| Number of rows in rides | Nested loop join | Piecewise merge join | Spatial join |
|---|---|---|---|
| 1,000,000 | 30.8 s | 2.3 s | 0.5 s |
| 10,000,000 | 310.3 s | 19.6 s | 4.8 s |
| 58,033,724 | 1799.6 s | 107.6 s | 28.7 s |
The SPATIAL_JOIN operator is able to execute the join over the full 58-million-row dataset faster than the original naive nested loop join could handle the 1-million-row subset. That's a 58× larger input, processed in less time!
Limitations and Future work
Even though we've made a lot of improvements, spatial joins in DuckDB are not quite perfect just yet. As always, there's plenty more to do, and we have a few ideas for how to improve the SPATIAL_JOIN operator even further.
Larger-Than-Memory Build Sides
The current implementation of the SPATIAL_JOIN operator buffers the entire right side input in memory to build the R-tree index. This means that it can only handle right side inputs that fit in memory. We plan to add support for larger-than-memory R-trees by partitioning the right side into smaller separate R-trees, and then merging the results of the join from each partition, similar to how the HASH_JOIN operator handles larger-than-memory hash tables.
Increased Parallelism
DuckDB's execution engine dynamically adjusts the number of threads used in a query based on the number of rows scanned, but since the spatial predicates are so expensive to evaluate, we could potentially benefit from more parallelism, as spatial joins are mostly CPU bound. In particular, we are considering buffering and partitioning the left side of the join as well, which would allow us to spawn our own worker tasks within the operator to evaluate the join condition across a number of threads decoupled from the cardinality of the input. This would increase memory usage and prevent the join from being streamed, but we've been able to employ a similar technique when constructing the HNSW (Hierarchical Navigable Small Worlds) index over in the vss extension to great effect.
Faster Predicate Functions
Most predicate functions in the spatial extension are implemented using third-party libraries, which carry some overhead since we are unable to integrate them as tightly with DuckDB's execution engine and memory management, often requiring some sort of deserialization or conversion step. This can be quite expensive, especially for large geometries. To demonstrate the impact of this in practice, we can compare the performance of joining on ST_Intersects(x, y) with the functionally equivalent ST_DWithin(x, y, 0), by comparing the previous results with the timings of the following query:
SELECT neighborhood, count(*) AS num_rides
FROM rides
JOIN hoods ON ST_DWithin(rides.start_geom, hoods.geom, 0)
GROUP BY neighborhood
ORDER BY num_rides DESC
LIMIT 3;
| Number of rows in rides | Spatial join (Intersects) | Spatial join (DWithin) |
|---|---|---|
| 1,000,000 | 0.5 s | 0.1 s |
| 10,000,000 | 4.8 s | 0.7 s |
| 58,033,724 | 28.7 s | 4.3 s |
Even though ST_DWithin is technically doing more work, its implementation in the spatial extension was recently changed to our own highly optimized native implementation, which avoids extra memory allocation and copying. This clearly demonstrates how optimizing the spatial predicate functions themselves can significantly improve the performance of spatial joins. We are actively working on adding optimized versions of the rest of the commonly used spatial predicates (like ST_Intersects, ST_Contains, ST_Within, etc.) to the spatial extension and expect to see similar performance improvements across the board.
Advanced Join Conditions
The SPATIAL_JOIN operator currently only supports the case where the join condition is a single spatial predicate function. We plan to add support for more complex join conditions that can include comparisons, arithmetic operations, and other functions in combination with the spatial predicates. This will allow users to express more complex spatial relationships in their joins, e.g., JOIN ... ON ST_Intersects(a.geom, b.geom) AND a.id = b.id.
ANTI / SEMI Joins
The SPATIAL_JOIN operator currently only supports INNER, LEFT, RIGHT and FULL OUTER joins. We plan to add support for ANTI and SEMI joins in the future.
Conclusion
Spatial joins are a powerful way to connect and enrich datasets based on their geospatial relationships. DuckDB's spatial extension provides a wide assortment of spatial predicate functions that can be used to perform spatial joins in SQL. While the performance of spatial joins has been challenging in the past, due to the complexity of the spatial predicates and the limitations of the available join strategies, the new SPATIAL_JOIN operator introduced in DuckDB v1.3.0 significantly raises the efficiency and scalability of these workloads, making spatial joins a first-class citizen in DuckDB's execution engine.