PostGEESE? Introducing The DuckDB Spatial Extension
TL;DR: DuckDB now has an official Spatial extension to enable geospatial processing.
Geospatial data has become increasingly important and prevalent in modern-day applications and data engineering workflows, with use-cases ranging from location-based services to environmental monitoring.
While there are many great and specialized tools for working with geospatial data, integrating geospatial capabilities directly into DuckDB has multiple advantages. For one, you get to operate, transform and join your geospatial data alongside your regular, unstructured or time-series data using DuckDBs rich type system and extensions like JSON
and ICU
. Secondly, spatial queries involving geometric predicates and relations translate surprisingly well to SQL, which is all about expressing relations after all! Not to mention all the other benefits provided by DuckDB such as transactional semantics, high performance multi-threaded vectorized execution and larger-than-memory data processing.
Therefore, we're very excited to announce that DuckDB now has a Spatial extension packed with features easily installable from the DuckDB CLI and other DuckDB clients. Simply execute INSTALL spatial; LOAD spatial;
from within DuckDB and you're good to go!
No, we're not calling it GeoDuck either, that's just gross
What's in it?
The core of the extension is a GEOMETRY
type based on the "Simple Features" geometry model and accompanying functions such as ST_Area
, ST_Intersects
. It also provides methods for reading and writing geospatial data formats and converting between coordinate reference systems (details later in the post!). While we're not ready to commit to full compliance with the OGC Simple Feature Access and SQL/MM Standards yet, if you've worked with geospatial functionality in other database systems such as PostGIS or SpatiaLite, you should feel right at home.
Most of the implemented functions are based on the trifecta of foundational geospatial libraries, GEOS, GDAL and PROJ, which provide algorithms, format conversions and coordinate reference system transformations respectively. In particular, we leverage GDAL to provide a set of table and copy functions that enable import and export of tables from and to 50+ different geospatial data formats (so far!), including the most common ones such as Shapefiles, GeoJSON, GeoPackage, KML, GML, WKT, WKB, etc.
Check for yourself by running:
SELECT * FROM st_drivers();
short_name long_name can_create can_copy can_open help_url ESRI Shapefile ESRI Shapefile true false true https://gdal.org/drivers/vector/shapefile.html MapInfo File MapInfo File true false true https://gdal.org/drivers/vector/mitab.html UK .NTF UK .NTF false false true https://gdal.org/drivers/vector/ntf.html LVBAG Kadaster LV BAG Extract 2.0 false false true https://gdal.org/drivers/vector/lvbag.html S57 IHO S-57 (ENC) true false true https://gdal.org/drivers/vector/s57.html DGN Microstation DGN true false true https://gdal.org/drivers/vector/dgn.html OGR_VRT VRT - Virtual Datasource false false true https://gdal.org/drivers/vector/vrt.html Memory Memory true false true CSV Comma Separated Value (.csv) true false true https://gdal.org/drivers/vector/csv.html GML Geography Markup Language (GML) true false true https://gdal.org/drivers/vector/gml.html GPX GPX true false true https://gdal.org/drivers/vector/gpx.html KML Keyhole Markup Language (KML) true false true https://gdal.org/drivers/vector/kml.html GeoJSON GeoJSON true false true https://gdal.org/drivers/vector/geojson.html GeoJSONSeq GeoJSON Sequence true false true https://gdal.org/drivers/vector/geojsonseq.html ESRIJSON ESRIJSON false false true https://gdal.org/drivers/vector/esrijson.html TopoJSON TopoJSON false false true https://gdal.org/drivers/vector/topojson.html OGR_GMT GMT ASCII Vectors (.gmt) true false true https://gdal.org/drivers/vector/gmt.html GPKG GeoPackage true true true https://gdal.org/drivers/vector/gpkg.html SQLite SQLite / Spatialite true false true https://gdal.org/drivers/vector/sqlite.html WAsP WAsP .map format true false true https://gdal.org/drivers/vector/wasp.html OpenFileGDB ESRI FileGDB true false true https://gdal.org/drivers/vector/openfilegdb.html DXF AutoCAD DXF true false true https://gdal.org/drivers/vector/dxf.html CAD AutoCAD Driver false false true https://gdal.org/drivers/vector/cad.html FlatGeobuf FlatGeobuf true false true https://gdal.org/drivers/vector/flatgeobuf.html Geoconcept Geoconcept true false true GeoRSS GeoRSS true false true https://gdal.org/drivers/vector/georss.html VFK Czech Cadastral Exchange Data Format false false true https://gdal.org/drivers/vector/vfk.html PGDUMP PostgreSQL SQL dump true false false https://gdal.org/drivers/vector/pgdump.html OSM OpenStreetMap XML and PBF false false true https://gdal.org/drivers/vector/osm.html GPSBabel GPSBabel true false true https://gdal.org/drivers/vector/gpsbabel.html WFS OGC WFS (Web Feature Service) false false true https://gdal.org/drivers/vector/wfs.html OAPIF OGC API - Features false false true https://gdal.org/drivers/vector/oapif.html EDIGEO French EDIGEO exchange format false false true https://gdal.org/drivers/vector/edigeo.html SVG Scalable Vector Graphics false false true https://gdal.org/drivers/vector/svg.html ODS Open Document/ LibreOffice / OpenOffice Spreadsheet true false true https://gdal.org/drivers/vector/ods.html XLSX MS Office Open XML spreadsheet true false true https://gdal.org/drivers/vector/xlsx.html Elasticsearch Elastic Search true false true https://gdal.org/drivers/vector/elasticsearch.html Carto Carto true false true https://gdal.org/drivers/vector/carto.html AmigoCloud AmigoCloud true false true https://gdal.org/drivers/vector/amigocloud.html SXF Storage and eXchange Format false false true https://gdal.org/drivers/vector/sxf.html Selafin Selafin true false true https://gdal.org/drivers/vector/selafin.html JML OpenJUMP JML true false true https://gdal.org/drivers/vector/jml.html PLSCENES Planet Labs Scenes API false false true https://gdal.org/drivers/vector/plscenes.html CSW OGC CSW (Catalog Service for the Web) false false true https://gdal.org/drivers/vector/csw.html VDV VDV-451/VDV-452/INTREST Data Format true false true https://gdal.org/drivers/vector/vdv.html MVT Mapbox Vector Tiles true false true https://gdal.org/drivers/vector/mvt.html NGW NextGIS Web true true true https://gdal.org/drivers/vector/ngw.html MapML MapML true false true https://gdal.org/drivers/vector/mapml.html TIGER U.S. Census TIGER/Line false false true https://gdal.org/drivers/vector/tiger.html AVCBin Arc/Info Binary Coverage false false true https://gdal.org/drivers/vector/avcbin.html AVCE00 Arc/Info E00 (ASCII) Coverage false false true https://gdal.org/drivers/vector/avce00.html
Initially we have prioritized providing a breadth of capabilities by wrapping existing libraries. We're planning to implement more of the core functions and algorithms natively in the future to enable faster performance and more efficient memory management.
As an initial step in this direction, we provide a set of non-standard specialized columnar DuckDB native geometry types such as POINT_2D
, BOX_2D
, etc. that should provide better compression and faster execution in exchange for some flexibility, but work around these are still very much experimental.
Example Usage
The following demonstrates how you can use the spatial extension to read and export multiple geospatial data formats, transform geometries between different coordinate reference systems and work with spatial property and predicate functions. While this example may be slightly contrived, we want to showcase the power of the currently available features. You can find the datasets used in this example in the spatial extension repository.
Let's start by installing and loading the spatial extension and the parquet extension. Now we can import the NYC taxi ride data provided in parquet format, as well as the accompanying taxi zone data from a shapefile, using the ST_Read
table function provided by the spatial extension. These taxi zones break NYC into polygons that represent regions, for example the Newark Airport. We then create a table for the rides and a table for the zones. Note that ST_Read
produces a table with a wkb_geometry
column that contains the geometry data encoded as a WKB (Well-Known Binary) blob, which we then convert to the GEOMETRY
type using the ST_GeomFromWKB
function.
This may all seem a bit much if you are not familiar with the geospatial ecosystem, but rest assured this is all you really need to get started. In short: – Shapefile (.shp, .shx, .dbf) is a common format for storing geometry vector data and auxiliary metadata such as indexes and attributes. – WKB (Well Known Binary), while not really a file format in itself, is a common binary encoding of vector geometry data, used in e.g., GeoParquet. Comes in multiple flavors, but we're only concerned with "standard" WKB for now. –
GEOMETRY
is a DuckDB type that represents a Simple Features geometry object, which is based on a set of standards modeling vector geometry data as points, linestrings, polygons or collections of such. This is the core data type used by the spatial extension, and what most of the provided functions take and return.
INSTALL spatial;
INSTALL parquet;
LOAD spatial;
LOAD parquet;
CREATE TABLE rides AS SELECT *
FROM './spatial/test/data/nyc_taxi/yellow_tripdata_2010-01-limit1mil.parquet';
-- Load the NYC taxi zone data from a shapefile using the gdal-based ST_Read function
CREATE TABLE zones AS SELECT zone, LocationId, borough, ST_GeomFromWKB(wkb_geometry) AS geom
FROM ST_Read('./spatial/test/data/nyc_taxi/taxi_zones/taxi_zones.shx');
Let's compare the trip distance to the linear distance between the pickup and dropoff points to figure out how efficient the taxi drivers are (or how dirty the data is, since some diffs seem to be negative). We transform the coordinates from "WGS84" (given by the identifier EPSG:4326), also commonly known as simply latitude/longitude to the "NAD83 / New York Long Island ftUS" (identified as ESRI:102718) coordinate reference system which is a projection with minimal distortion around New York. We then calculate the distance using the ST_Distance
function. In This case we get the distance in feet since we've converted the coordinates to NAD83 but we can easily convert it into to miles (5280 ft/mile) which is the unit used in the rides dataset so we can compare them correctly.
Wait, what's all this about coordinate reference systems and projections?
The earth is not flat, but sometimes it is useful to pretend it is for the sake of simplicity by "projecting" the coordinates onto a flat surface. The "parameters" of a projection – e.g., where the "origin" is located, what unit coordinates are in, or how the earth's shape is approximated – are encapsulated by a "Spatial Reference System" or "Coordinate Reference System" (CRS) which is usually referenced by a shorthand identifier composed of an authority and a code, e.g., "EPSG:4326" or "ESRI:102718". Projections are always lossy, so its important to use a CRS that is well suited for the "area of interest" your data is in. The spatial extension uses the PROJ library to handle coordinate reference systems and projections.
Trips with a distance shorter than the aerial distance are likely to be erroneous, so we use this query to filter out some bad data. The query below takes advantage of DuckDB's ability to refer to column aliases defined within the same select statement. This is a small example of how DuckDB's rich SQL dialect can simplify geospatial analysis.
CREATE TABLE cleaned_rides AS SELECT
ST_Point(pickup_latitude, pickup_longitude) AS pickup_point,
ST_Point(dropoff_latitude, dropoff_longitude) AS dropoff_point,
dropoff_datetime::TIMESTAMP - pickup_datetime::TIMESTAMP AS time,
trip_distance,
ST_Distance(
ST_Transform(pickup_point, 'EPSG:4326', 'ESRI:102718'),
ST_Transform(dropoff_point, 'EPSG:4326', 'ESRI:102718')) / 5280
AS aerial_distance,
trip_distance - aerial_distance AS diff
FROM rides
WHERE diff > 0
ORDER BY diff DESC;
SELECT * FROM rides LIMIT 10;
vendor_id | pickup_datetime | dropoff_datetime | passenger_count | trip_distance | pickup_longitude | pickup_latitude | rate_code | store_and_fwd_flag | dropoff_longitude | dropoff_latitude | payment_type | fare_amount | surcharge | mta_tax | tip_amount | tolls_amount | total_amount |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
VTS | 2010-01-01 00:00:17 | 2010-01-01 00:00:17 | 3 | 0.0 | -73.87105699999998 | 40.773522 | 1 | -73.871048 | 40.773545 | CAS | 45.0 | 0.0 | 0.5 | 0.0 | 0.0 | 45.5 | |
VTS | 2010-01-01 00:00:20 | 2010-01-01 00:00:20 | 1 | 0.05 | -73.97512999999998 | 40.789973 | 1 | -73.97498799999998 | 40.790598 | CAS | 2.5 | 0.5 | 0.5 | 0.0 | 0.0 | 3.5 | |
CMT | 2010-01-01 00:00:23 | 2010-01-01 00:00:25 | 1 | 0.0 | -73.999431 | 40.71216 | 1 | 0 | -73.99915799999998 | 40.712421 | No | 2.5 | 0.5 | 0.5 | 0.0 | 0.0 | 3.5 |
CMT | 2010-01-01 00:00:33 | 2010-01-01 00:00:55 | 1 | 0.0 | -73.97721699999998 | 40.749633 | 1 | 0 | -73.97732899999998 | 40.749629 | Cas | 2.5 | 0.5 | 0.5 | 0.0 | 0.0 | 3.5 |
VTS | 2010-01-01 00:01:00 | 2010-01-01 00:01:00 | 1 | 0.0 | -73.942313 | 40.784332 | 1 | -73.942313 | 40.784332 | Cre | 10.0 | 0.0 | 0.5 | 2.0 | 0.0 | 12.5 | |
VTS | 2010-01-01 00:01:06 | 2010-01-01 00:01:06 | 2 | 0.38 | -73.97463 | 40.756687 | 1 | -73.979872 | 40.759143 | CAS | 3.7 | 0.5 | 0.5 | 0.0 | 0.0 | 4.7 | |
VTS | 2010-01-01 00:01:07 | 2010-01-01 00:01:07 | 2 | 0.23 | -73.987358 | 40.718475 | 1 | -73.98518 | 40.720468 | CAS | 2.9 | 0.5 | 0.5 | 0.0 | 0.0 | 3.9 | |
CMT | 2010-01-01 00:00:02 | 2010-01-01 00:01:08 | 1 | 0.1 | -73.992807 | 40.741418 | 1 | 0 | -73.995799 | 40.742596 | No | 2.9 | 0.5 | 0.5 | 0.0 | 0.0 | 3.9 |
VTS | 2010-01-01 00:01:23 | 2010-01-01 00:01:23 | 1 | 0.6099999999999999 | -73.98003799999998 | 40.74306 | 1 | -73.974862 | 40.750387 | CAS | 3.7 | 0.5 | 0.5 | 0.0 | 0.0 | 4.7 | |
VTS | 2010-01-01 00:01:34 | 2010-01-01 00:01:34 | 1 | 0.02 | -73.954122 | 40.801173 | 1 | -73.95431499999998 | 40.800897 | CAS | 45.0 | 0.0 | 0.5 | 0.0 | 0.0 | 45.5 |
SELECT * FROM zones LIMIT 10;
zone | LocationID | borough | geom |
---|---|---|---|
Newark Airport | 1 | EWR | POLYGON (…) |
Jamaica Bay | 2 | Queens | MULTIPOLYGON (…) |
Allerton/Pelham Gardens | 3 | Bronx | POLYGON (…) |
Alphabet City | 4 | Manhattan | POLYGON (…) |
Arden Heights | 5 | Staten Island | POLYGON (…) |
Arrochar/Fort Wadsworth | 6 | Staten Island | POLYGON (…) |
Astoria | 7 | Queens | POLYGON (…) |
Astoria Park | 8 | Queens | POLYGON (…) |
Auburndale | 9 | Queens | POLYGON (…) |
Baisley Park | 10 | Queens | POLYGON (…) |
It should be noted that this is not entirely accurate since the
ST_Distance
function we use does not take into account the curvature of the earth. However, we'll accept it as a good enough approximation for our purposes. Spherical and geodesic distance calculations are on the roadmap!
Now let's join the taxi rides with the taxi zones to get the start and end zone for each ride. We use the ST_Within
function as our join condition to check if a pickup or dropoff point is within a taxi zone polygon. Again we need to transform the coordinates from WGS84 to the NAD83 since the taxi zone data also use that projection. Spatial joins like these are the bread and butter of geospatial data processing, but we don't currently have any optimizations in place (such as spatial indexes) to speed up these queries, which is why we only use a subset of the data for the following step.
-- Since we don't have spatial indexes yet, use a smaller dataset for the join.
DELETE FROM cleaned_rides WHERE rowid > 5000;
CREATE TABLE joined AS
SELECT
pickup_point,
dropoff_point,
start_zone.zone AS start_zone,
end_zone.zone AS end_zone,
trip_distance,
time,
FROM cleaned_rides
JOIN zones AS start_zone
ON ST_Within(ST_Transform(pickup_point, 'EPSG:4326', 'ESRI:102718'), start_zone.geom)
JOIN zones AS end_zone
ON ST_Within(ST_Transform(dropoff_point, 'EPSG:4326', 'ESRI:102718'), end_zone.geom);
SELECT * FROM joined USING SAMPLE 10 ROWS;
pickup_point | dropoff_point | start_zone | end_zone | trip_distance | time |
---|---|---|---|---|---|
POINT (40.722223 -73.98385299999998) | POINT (40.715507 -73.992438) | East Village | Lower East Side | 10.3 | 00:19:16 |
POINT (40.648687 -73.783522) | POINT (40.649567 -74.005812) | JFK Airport | Sunset Park West | 23.57 | 00:28:00 |
POINT (40.761603 -73.96661299999998) | POINT (40.760232 -73.96344499999998) | Upper East Side South | Sutton Place/Turtle Bay North | 17.6 | 00:27:05 |
POINT (40.697212 -73.937495) | POINT (40.652377 -73.93983299999998) | Stuyvesant Heights | East Flatbush/Farragut | 13.55 | 00:24:00 |
POINT (40.721462 -73.993583) | POINT (40.774205 -73.90441699999998) | Lower East Side | Steinway | 28.75 | 01:03:00 |
POINT (40.716955 -74.004328) | POINT (40.754688 -73.991612) | TriBeCa/Civic Center | Garment District | 18.4 | 00:46:12 |
POINT (40.740052 -73.994918) | POINT (40.75439 -73.98587499999998) | Flatiron | Garment District | 24.2 | 00:35:25 |
POINT (40.763017 -73.95949199999998) | POINT (40.763615 -73.959182) | Lenox Hill East | Lenox Hill West | 18.4 | 00:33:46 |
POINT (40.865663 -73.927458) | POINT (40.86537 -73.927352) | Washington Heights North | Washington Heights North | 10.47 | 00:27:00 |
POINT (40.738408 -73.980345) | POINT (40.696038 -73.955493) | Gramercy | Bedford | 16.4 | 00:21:47 |
We can export the joined table to a GeoJSONSeq
file using the GDAL copy function, passing in a GDAL layer creation option. Since GeoJSON
only supports a single GEOMETRY
per record, we use the ST_MakeLine
function to combine the pickup and dropoff points into a single line geometry. The default coordinate reference system for GeoJSON
is WGS84, but the coordinate pairs are expected to be in longitude/latitude, so we need to flip the geometry using the ST_FlipCoordinates
function.
COPY (
SELECT
ST_AsWKB(ST_FlipCoordinates(ST_MakeLine(pickup_point, dropoff_point)))
AS wkb_geometry,
start_zone,
end_zone,
time::VARCHAR AS trip_time
FROM joined)
TO 'joined.geojsonseq'
WITH (FORMAT GDAL, DRIVER 'GeoJSONSeq', LAYER_CREATION_OPTIONS 'WRITE_BBOX=YES');
head -n 10 joined.geojsonseq
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:52:00" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.789923, 40.643515 ], [ -73.97608, 40.680395 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:35:00" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.776445, 40.645422 ], [ -73.98427, 40.670782 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:45:42" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.776878, 40.645065 ], [ -73.992153, 40.662571 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:36:00" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.788028, 40.641508 ], [ -73.97584, 40.670927 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:47:58" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.781855, 40.644749 ], [ -73.980129, 40.663663 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:32:10" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.787494, 40.641559 ], [ -73.974694, 40.673479 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:36:59" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.790138, 40.643342 ], [ -73.982721, 40.662379 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:32:00" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.786952, 40.641248 ], [ -73.97421, 40.676237 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:33:21" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.783892, 40.648514 ], [ -73.979283, 40.669721 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:35:45" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.776643, 40.645272 ], [ -73.978873, 40.66723 ] ] } }
And there we have it! We pulled tabular data from parquet, combined it with geospatial data in a shapefile, cleaned and analyzed that combined data, and output it to a human readable geospatial format. The full set of currently supported functions and their implementation status can be found over at the docs in this table
What's next?
While it's probably going to take a while for us to catch up to the full set of functions provided by e.g., PostGIS, we believe that DuckDB's vectorized execution model and columnar storage format will enable a whole new class of optimizations for geospatial processing that we've just begun exploring. Improving the performance of spatial joins and predicates is therefore high on our list of priorities.
There are also some limitations with our GEOMETRY
type that we would eventually like to tackle, such as the fact that we don't support additional Z and M dimensions, or don't support the full range of geometry sub-types that are mandated by the OGC standard, like curves or polyhedral surfaces.
We're also interested in supporting spherical and ellipsoidal calculations in the near future, perhaps in the form of a dedicated GEOGRAPHY
type.
WASM builds are also just around the corner!
Please take a look at the GitHub repository for the full roadmap and to see what we're currently working on. If you would like to help build this capability, please reach out on GitHub!
Conclusion
The DuckDB Spatial extension is another step towards making DuckDB a swiss army knife for data engineering and analytics. This extension provides a flexible and familiar GEOMETRY
type, reprojectable between thousands of coordinate reference systems, coupled with the capability to export and import geospatial data between more than 50 different data sources. All embedded into a single extension with minimal runtime dependencies. This enables DuckDB to fit seamlessly into your existing GIS workflows regardless of which geospatial data formats or projections you're working with.
We are excited to hear what you make of the DuckDB spatial extension. It's still early days but we hope to have a lot more to share in the future as we continue making progress! If you have any questions, suggestions, ideas or issues, please don't hesitate to reach out to us on Discord or GitHub!