PostGEESE? Introducing The DuckDB Spatial Extension

Author Avatar
Max Gabrielsson2023-04-28

TL;DR: DuckDB now has an official Spatial extension to enable geospatial processing.

Geospatial data has become increasingly important and prevalent in modern-day applications and data engineering workflows, with use-cases ranging from location-based services to environmental monitoring.

While there are many great and specialized tools for working with geospatial data, integrating geospatial capabilities directly into DuckDB has multiple advantages. For one, you get to operate, transform and join your geospatial data alongside your regular, unstructured or time-series data using DuckDBs rich type system and extensions like JSON and ICU. Secondly, spatial queries involving geometric predicates and relations translate surprisingly well to SQL, which is all about expressing relations after all! Not to mention all the other benefits provided by DuckDB such as transactional semantics, high performance multi-threaded vectorized execution and larger-than-memory data processing.

Therefore, we're very excited to announce that DuckDB now has a Spatial extension packed with features easily installable from the DuckDB CLI and other DuckDB clients. Simply execute INSTALL spatial; LOAD spatial; from within DuckDB and you're good to go!

No, we're not calling it GeoDuck either, that's just gross

What's in it?

The core of the extension is a GEOMETRY type based on the "Simple Features" geometry model and accompanying functions such as ST_Area, ST_Intersects. It also provides methods for reading and writing geospatial data formats and converting between coordinate reference systems (details later in the post!). While we're not ready to commit to full compliance with the OGC Simple Feature Access and SQL/MM Standards yet, if you've worked with geospatial functionality in other database systems such as PostGIS or SpatiaLite, you should feel right at home.

Most of the implemented functions are based on the trifecta of foundational geospatial libraries, GEOS, GDAL and PROJ, which provide algorithms, format conversions and coordinate reference system transformations respectively. In particular, we leverage GDAL to provide a set of table and copy functions that enable import and export of tables from and to 50+ different geospatial data formats (so far!), including the most common ones such as Shapefiles, GeoJSON, GeoPackage, KML, GML, WKT, WKB, etc.

Check for yourself by running:

SELECT * FROM st_drivers();
short_name long_name can_create can_copy can_open help_url
ESRI Shapefile ESRI Shapefile true false true https://gdal.org/drivers/vector/shapefile.html
MapInfo File MapInfo File true false true https://gdal.org/drivers/vector/mitab.html
UK .NTF UK .NTF false false true https://gdal.org/drivers/vector/ntf.html
LVBAG Kadaster LV BAG Extract 2.0 false false true https://gdal.org/drivers/vector/lvbag.html
S57 IHO S-57 (ENC) true false true https://gdal.org/drivers/vector/s57.html
DGN Microstation DGN true false true https://gdal.org/drivers/vector/dgn.html
OGR_VRT VRT - Virtual Datasource false false true https://gdal.org/drivers/vector/vrt.html
Memory Memory true false true  
CSV Comma Separated Value (.csv) true false true https://gdal.org/drivers/vector/csv.html
GML Geography Markup Language (GML) true false true https://gdal.org/drivers/vector/gml.html
GPX GPX true false true https://gdal.org/drivers/vector/gpx.html
KML Keyhole Markup Language (KML) true false true https://gdal.org/drivers/vector/kml.html
GeoJSON GeoJSON true false true https://gdal.org/drivers/vector/geojson.html
GeoJSONSeq GeoJSON Sequence true false true https://gdal.org/drivers/vector/geojsonseq.html
ESRIJSON ESRIJSON false false true https://gdal.org/drivers/vector/esrijson.html
TopoJSON TopoJSON false false true https://gdal.org/drivers/vector/topojson.html
OGR_GMT GMT ASCII Vectors (.gmt) true false true https://gdal.org/drivers/vector/gmt.html
GPKG GeoPackage true true true https://gdal.org/drivers/vector/gpkg.html
SQLite SQLite / Spatialite true false true https://gdal.org/drivers/vector/sqlite.html
WAsP WAsP .map format true false true https://gdal.org/drivers/vector/wasp.html
OpenFileGDB ESRI FileGDB true false true https://gdal.org/drivers/vector/openfilegdb.html
DXF AutoCAD DXF true false true https://gdal.org/drivers/vector/dxf.html
CAD AutoCAD Driver false false true https://gdal.org/drivers/vector/cad.html
FlatGeobuf FlatGeobuf true false true https://gdal.org/drivers/vector/flatgeobuf.html
Geoconcept Geoconcept true false true  
GeoRSS GeoRSS true false true https://gdal.org/drivers/vector/georss.html
VFK Czech Cadastral Exchange Data Format false false true https://gdal.org/drivers/vector/vfk.html
PGDUMP PostgreSQL SQL dump true false false https://gdal.org/drivers/vector/pgdump.html
OSM OpenStreetMap XML and PBF false false true https://gdal.org/drivers/vector/osm.html
GPSBabel GPSBabel true false true https://gdal.org/drivers/vector/gpsbabel.html
WFS OGC WFS (Web Feature Service) false false true https://gdal.org/drivers/vector/wfs.html
OAPIF OGC API - Features false false true https://gdal.org/drivers/vector/oapif.html
EDIGEO French EDIGEO exchange format false false true https://gdal.org/drivers/vector/edigeo.html
SVG Scalable Vector Graphics false false true https://gdal.org/drivers/vector/svg.html
ODS Open Document/ LibreOffice / OpenOffice Spreadsheet true false true https://gdal.org/drivers/vector/ods.html
XLSX MS Office Open XML spreadsheet true false true https://gdal.org/drivers/vector/xlsx.html
Elasticsearch Elastic Search true false true https://gdal.org/drivers/vector/elasticsearch.html
Carto Carto true false true https://gdal.org/drivers/vector/carto.html
AmigoCloud AmigoCloud true false true https://gdal.org/drivers/vector/amigocloud.html
SXF Storage and eXchange Format false false true https://gdal.org/drivers/vector/sxf.html
Selafin Selafin true false true https://gdal.org/drivers/vector/selafin.html
JML OpenJUMP JML true false true https://gdal.org/drivers/vector/jml.html
PLSCENES Planet Labs Scenes API false false true https://gdal.org/drivers/vector/plscenes.html
CSW OGC CSW (Catalog Service for the Web) false false true https://gdal.org/drivers/vector/csw.html
VDV VDV-451/VDV-452/INTREST Data Format true false true https://gdal.org/drivers/vector/vdv.html
MVT Mapbox Vector Tiles true false true https://gdal.org/drivers/vector/mvt.html
NGW NextGIS Web true true true https://gdal.org/drivers/vector/ngw.html
MapML MapML true false true https://gdal.org/drivers/vector/mapml.html
TIGER U.S. Census TIGER/Line false false true https://gdal.org/drivers/vector/tiger.html
AVCBin Arc/Info Binary Coverage false false true https://gdal.org/drivers/vector/avcbin.html
AVCE00 Arc/Info E00 (ASCII) Coverage false false true https://gdal.org/drivers/vector/avce00.html

Initially we have prioritized providing a breadth of capabilities by wrapping existing libraries. We're planning to implement more of the core functions and algorithms natively in the future to enable faster performance and more efficient memory management.

As an initial step in this direction, we provide a set of non-standard specialized columnar DuckDB native geometry types such as POINT_2D, BOX_2D, etc. that should provide better compression and faster execution in exchange for some flexibility, but work around these are still very much experimental.

Example Usage

The following demonstrates how you can use the spatial extension to read and export multiple geospatial data formats, transform geometries between different coordinate reference systems and work with spatial property and predicate functions. While this example may be slightly contrived, we want to showcase the power of the currently available features. You can find the datasets used in this example in the spatial extension repository.

Let's start by installing and loading the spatial extension and the parquet extension. Now we can import the NYC taxi ride data provided in parquet format, as well as the accompanying taxi zone data from a shapefile, using the ST_Read table function provided by the spatial extension. These taxi zones break NYC into polygons that represent regions, for example the Newark Airport. We then create a table for the rides and a table for the zones. Note that ST_Read produces a table with a wkb_geometry column that contains the geometry data encoded as a WKB (Well-Known Binary) blob, which we then convert to the GEOMETRY type using the ST_GeomFromWKB function.

This may all seem a bit much if you are not familiar with the geospatial ecosystem, but rest assured this is all you really need to get started. In short: – Shapefile (.shp, .shx, .dbf) is a common format for storing geometry vector data and auxiliary metadata such as indexes and attributes. – WKB (Well Known Binary), while not really a file format in itself, is a common binary encoding of vector geometry data, used in e.g., GeoParquet. Comes in multiple flavors, but we're only concerned with "standard" WKB for now. – GEOMETRY is a DuckDB type that represents a Simple Features geometry object, which is based on a set of standards modeling vector geometry data as points, linestrings, polygons or collections of such. This is the core data type used by the spatial extension, and what most of the provided functions take and return.

INSTALL spatial;
INSTALL parquet;
LOAD spatial;
LOAD parquet;

CREATE TABLE rides AS SELECT * 
FROM './spatial/test/data/nyc_taxi/yellow_tripdata_2010-01-limit1mil.parquet';

-- Load the NYC taxi zone data from a shapefile using the gdal-based ST_Read function
CREATE TABLE zones AS SELECT zone, LocationId, borough, ST_GeomFromWKB(wkb_geometry) AS geom 
FROM ST_Read('./spatial/test/data/nyc_taxi/taxi_zones/taxi_zones.shx');

Let's compare the trip distance to the linear distance between the pickup and dropoff points to figure out how efficient the taxi drivers are (or how dirty the data is, since some diffs seem to be negative). We transform the coordinates from "WGS84" (given by the identifier EPSG:4326), also commonly known as simply latitude/longitude to the "NAD83 / New York Long Island ftUS" (identified as ESRI:102718) coordinate reference system which is a projection with minimal distortion around New York. We then calculate the distance using the ST_Distance function. In This case we get the distance in feet since we've converted the coordinates to NAD83 but we can easily convert it into to miles (5280 ft/mile) which is the unit used in the rides dataset so we can compare them correctly.

Wait, what's all this about coordinate reference systems and projections?

The earth is not flat, but sometimes it is useful to pretend it is for the sake of simplicity by "projecting" the coordinates onto a flat surface. The "parameters" of a projection – e.g., where the "origin" is located, what unit coordinates are in, or how the earth's shape is approximated – are encapsulated by a "Spatial Reference System" or "Coordinate Reference System" (CRS) which is usually referenced by a shorthand identifier composed of an authority and a code, e.g., "EPSG:4326" or "ESRI:102718". Projections are always lossy, so its important to use a CRS that is well suited for the "area of interest" your data is in. The spatial extension uses the PROJ library to handle coordinate reference systems and projections.

Trips with a distance shorter than the aerial distance are likely to be erroneous, so we use this query to filter out some bad data. The query below takes advantage of DuckDB's ability to refer to column aliases defined within the same select statement. This is a small example of how DuckDB's rich SQL dialect can simplify geospatial analysis.

CREATE TABLE cleaned_rides AS SELECT 
    ST_Point(pickup_latitude, pickup_longitude) AS pickup_point,
    ST_Point(dropoff_latitude, dropoff_longitude) AS dropoff_point,
    dropoff_datetime::TIMESTAMP - pickup_datetime::TIMESTAMP AS time,
    trip_distance,
    ST_Distance(
        ST_Transform(pickup_point, 'EPSG:4326', 'ESRI:102718'), 
        ST_Transform(dropoff_point, 'EPSG:4326', 'ESRI:102718')) / 5280 
    AS aerial_distance, 
    trip_distance - aerial_distance AS diff 
FROM rides 
WHERE diff > 0
ORDER BY diff DESC;
SELECT * FROM rides LIMIT 10;
vendor_id pickup_datetime dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude rate_code store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount surcharge mta_tax tip_amount tolls_amount total_amount
VTS 2010-01-01 00:00:17 2010-01-01 00:00:17 3 0.0 -73.87105699999998 40.773522 1   -73.871048 40.773545 CAS 45.0 0.0 0.5 0.0 0.0 45.5
VTS 2010-01-01 00:00:20 2010-01-01 00:00:20 1 0.05 -73.97512999999998 40.789973 1   -73.97498799999998 40.790598 CAS 2.5 0.5 0.5 0.0 0.0 3.5
CMT 2010-01-01 00:00:23 2010-01-01 00:00:25 1 0.0 -73.999431 40.71216 1 0 -73.99915799999998 40.712421 No 2.5 0.5 0.5 0.0 0.0 3.5
CMT 2010-01-01 00:00:33 2010-01-01 00:00:55 1 0.0 -73.97721699999998 40.749633 1 0 -73.97732899999998 40.749629 Cas 2.5 0.5 0.5 0.0 0.0 3.5
VTS 2010-01-01 00:01:00 2010-01-01 00:01:00 1 0.0 -73.942313 40.784332 1   -73.942313 40.784332 Cre 10.0 0.0 0.5 2.0 0.0 12.5
VTS 2010-01-01 00:01:06 2010-01-01 00:01:06 2 0.38 -73.97463 40.756687 1   -73.979872 40.759143 CAS 3.7 0.5 0.5 0.0 0.0 4.7
VTS 2010-01-01 00:01:07 2010-01-01 00:01:07 2 0.23 -73.987358 40.718475 1   -73.98518 40.720468 CAS 2.9 0.5 0.5 0.0 0.0 3.9
CMT 2010-01-01 00:00:02 2010-01-01 00:01:08 1 0.1 -73.992807 40.741418 1 0 -73.995799 40.742596 No 2.9 0.5 0.5 0.0 0.0 3.9
VTS 2010-01-01 00:01:23 2010-01-01 00:01:23 1 0.6099999999999999 -73.98003799999998 40.74306 1   -73.974862 40.750387 CAS 3.7 0.5 0.5 0.0 0.0 4.7
VTS 2010-01-01 00:01:34 2010-01-01 00:01:34 1 0.02 -73.954122 40.801173 1   -73.95431499999998 40.800897 CAS 45.0 0.0 0.5 0.0 0.0 45.5
SELECT * FROM zones LIMIT 10;
zone LocationID borough geom
Newark Airport 1 EWR POLYGON (…)
Jamaica Bay 2 Queens MULTIPOLYGON (…)
Allerton/Pelham Gardens 3 Bronx POLYGON (…)
Alphabet City 4 Manhattan POLYGON (…)
Arden Heights 5 Staten Island POLYGON (…)
Arrochar/Fort Wadsworth 6 Staten Island POLYGON (…)
Astoria 7 Queens POLYGON (…)
Astoria Park 8 Queens POLYGON (…)
Auburndale 9 Queens POLYGON (…)
Baisley Park 10 Queens POLYGON (…)

It should be noted that this is not entirely accurate since the ST_Distance function we use does not take into account the curvature of the earth. However, we'll accept it as a good enough approximation for our purposes. Spherical and geodesic distance calculations are on the roadmap!

Now let's join the taxi rides with the taxi zones to get the start and end zone for each ride. We use the ST_Within function as our join condition to check if a pickup or dropoff point is within a taxi zone polygon. Again we need to transform the coordinates from WGS84 to the NAD83 since the taxi zone data also use that projection. Spatial joins like these are the bread and butter of geospatial data processing, but we don't currently have any optimizations in place (such as spatial indexes) to speed up these queries, which is why we only use a subset of the data for the following step.

-- Since we don't have spatial indexes yet, use a smaller dataset for the join.
DELETE FROM cleaned_rides WHERE rowid > 5000;

CREATE TABLE joined AS 
SELECT 
    pickup_point,
    dropoff_point,
    start_zone.zone AS start_zone,
    end_zone.zone AS end_zone, 
    trip_distance,
    time,
FROM cleaned_rides 
JOIN zones AS start_zone 
ON ST_Within(ST_Transform(pickup_point, 'EPSG:4326', 'ESRI:102718'), start_zone.geom) 
JOIN zones AS end_zone 
ON ST_Within(ST_Transform(dropoff_point, 'EPSG:4326', 'ESRI:102718'), end_zone.geom);
SELECT * FROM joined USING SAMPLE 10 ROWS;
pickup_point dropoff_point start_zone end_zone trip_distance time
POINT (40.722223 -73.98385299999998) POINT (40.715507 -73.992438) East Village Lower East Side 10.3 00:19:16
POINT (40.648687 -73.783522) POINT (40.649567 -74.005812) JFK Airport Sunset Park West 23.57 00:28:00
POINT (40.761603 -73.96661299999998) POINT (40.760232 -73.96344499999998) Upper East Side South Sutton Place/Turtle Bay North 17.6 00:27:05
POINT (40.697212 -73.937495) POINT (40.652377 -73.93983299999998) Stuyvesant Heights East Flatbush/Farragut 13.55 00:24:00
POINT (40.721462 -73.993583) POINT (40.774205 -73.90441699999998) Lower East Side Steinway 28.75 01:03:00
POINT (40.716955 -74.004328) POINT (40.754688 -73.991612) TriBeCa/Civic Center Garment District 18.4 00:46:12
POINT (40.740052 -73.994918) POINT (40.75439 -73.98587499999998) Flatiron Garment District 24.2 00:35:25
POINT (40.763017 -73.95949199999998) POINT (40.763615 -73.959182) Lenox Hill East Lenox Hill West 18.4 00:33:46
POINT (40.865663 -73.927458) POINT (40.86537 -73.927352) Washington Heights North Washington Heights North 10.47 00:27:00
POINT (40.738408 -73.980345) POINT (40.696038 -73.955493) Gramercy Bedford 16.4 00:21:47

We can export the joined table to a GeoJSONSeq file using the GDAL copy function, passing in a GDAL layer creation option. Since GeoJSON only supports a single GEOMETRY per record, we use the ST_MakeLine function to combine the pickup and dropoff points into a single line geometry. The default coordinate reference system for GeoJSON is WGS84, but the coordinate pairs are expected to be in longitude/latitude, so we need to flip the geometry using the ST_FlipCoordinates function.

COPY (
    SELECT 
        ST_AsWKB(ST_FlipCoordinates(ST_MakeLine(pickup_point, dropoff_point))) 
        AS wkb_geometry,
        start_zone,
        end_zone,
        time::VARCHAR AS trip_time 
    FROM joined) 
TO 'joined.geojsonseq' 
WITH (FORMAT GDAL, DRIVER 'GeoJSONSeq', LAYER_CREATION_OPTIONS 'WRITE_BBOX=YES');
head -n 10 joined.geojsonseq
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:52:00" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.789923, 40.643515 ], [ -73.97608, 40.680395 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:35:00" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.776445, 40.645422 ], [ -73.98427, 40.670782 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:45:42" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.776878, 40.645065 ], [ -73.992153, 40.662571 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:36:00" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.788028, 40.641508 ], [ -73.97584, 40.670927 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:47:58" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.781855, 40.644749 ], [ -73.980129, 40.663663 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:32:10" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.787494, 40.641559 ], [ -73.974694, 40.673479 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:36:59" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.790138, 40.643342 ], [ -73.982721, 40.662379 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:32:00" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.786952, 40.641248 ], [ -73.97421, 40.676237 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:33:21" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.783892, 40.648514 ], [ -73.979283, 40.669721 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:35:45" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.776643, 40.645272 ], [ -73.978873, 40.66723 ] ] } }

And there we have it! We pulled tabular data from parquet, combined it with geospatial data in a shapefile, cleaned and analyzed that combined data, and output it to a human readable geospatial format. The full set of currently supported functions and their implementation status can be found over at the docs in this table

What's next?

While it's probably going to take a while for us to catch up to the full set of functions provided by e.g., PostGIS, we believe that DuckDB's vectorized execution model and columnar storage format will enable a whole new class of optimizations for geospatial processing that we've just begun exploring. Improving the performance of spatial joins and predicates is therefore high on our list of priorities.

There are also some limitations with our GEOMETRY type that we would eventually like to tackle, such as the fact that we don't support additional Z and M dimensions, or don't support the full range of geometry sub-types that are mandated by the OGC standard, like curves or polyhedral surfaces.

We're also interested in supporting spherical and ellipsoidal calculations in the near future, perhaps in the form of a dedicated GEOGRAPHY type.

WASM builds are also just around the corner!

Please take a look at the GitHub repository for the full roadmap and to see what we're currently working on. If you would like to help build this capability, please reach out on GitHub!

Conclusion

The DuckDB Spatial extension is another step towards making DuckDB a swiss army knife for data engineering and analytics. This extension provides a flexible and familiar GEOMETRY type, reprojectable between thousands of coordinate reference systems, coupled with the capability to export and import geospatial data between more than 50 different data sources. All embedded into a single extension with minimal runtime dependencies. This enables DuckDB to fit seamlessly into your existing GIS workflows regardless of which geospatial data formats or projections you're working with.

We are excited to hear what you make of the DuckDB spatial extension. It's still early days but we hope to have a lot more to share in the future as we continue making progress! If you have any questions, suggestions, ideas or issues, please don't hesitate to reach out to us on Discord or GitHub!