PostGEESE? Introducing The DuckDB Spatial Extension
TL;DR: DuckDB now has an official Spatial extension to enable geospatial processing.
Geospatial data has become increasingly important and prevalent in modern-day applications and data engineering workflows, with use-cases ranging from location-based services to environmental monitoring.
While there are many great and specialized tools for working with geospatial data, integrating geospatial capabilities directly into DuckDB has multiple advantages. For one, you get to operate on, transform and join your geospatial data alongside your regular, unstructured or time-series data using DuckDB's rich type system and extensions like JSON and ICU. Secondly, spatial queries involving geometric predicates and relations translate surprisingly well to SQL, which is all about expressing relations after all! Not to mention all the other benefits provided by DuckDB, such as transactional semantics, high-performance multi-threaded vectorized execution and larger-than-memory data processing.
Therefore, we're very excited to announce that DuckDB now has a Spatial extension packed with features, easily installable from the DuckDB CLI and other DuckDB clients. Simply execute:
INSTALL spatial;
LOAD spatial;
And you're good to go!
No, we're not calling it GeoDuck either, that's just gross.
What's in It?
The core of the extension is a GEOMETRY type based on the "Simple Features" geometry model, along with accompanying functions such as ST_Area and ST_Intersects. It also provides methods for reading and writing geospatial data formats and converting between coordinate reference systems (details later in the post!). While we're not ready to commit to full compliance with the OGC Simple Feature Access and SQL/MM standards yet, if you've worked with geospatial functionality in other database systems such as PostGIS or SpatiaLite, you should feel right at home.
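For a quick taste of what this looks like in practice, here is a tiny self-contained query (the geometries below are made up purely for illustration):

-- Compute the area of a unit square and check whether a point intersects it
SELECT
    ST_Area(ST_GeomFromText('POLYGON ((0 0, 0 1, 1 1, 1 0, 0 0))')) AS area,
    ST_Intersects(
        ST_GeomFromText('POLYGON ((0 0, 0 1, 1 1, 1 0, 0 0))'),
        ST_Point(0.5, 0.5)
    ) AS intersects_point;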
Most of the implemented functions are based on the trifecta of foundational geospatial libraries, GEOS, GDAL and PROJ, which provide algorithms, format conversions and coordinate reference system transformations respectively. In particular, we leverage GDAL to provide a set of table and copy functions that enable import and export of tables from and to 50+ different geospatial data formats (so far!), including the most common ones such as Shapefiles, GeoJSON, GeoPackage, KML, GML, WKT, WKB, etc.
Check for yourself by running:
SELECT * FROM st_drivers();
short_name | long_name | can_create | can_copy | can_open | help_url |
---|---|---|---|---|---|
ESRI Shapefile | ESRI Shapefile | true | false | true | https://gdal.org/drivers/vector/shapefile.html |
MapInfo File | MapInfo File | true | false | true | https://gdal.org/drivers/vector/mitab.html |
UK .NTF | UK .NTF | false | false | true | https://gdal.org/drivers/vector/ntf.html |
LVBAG | Kadaster LV BAG Extract 2.0 | false | false | true | https://gdal.org/drivers/vector/lvbag.html |
S57 | IHO S-57 (ENC) | true | false | true | https://gdal.org/drivers/vector/s57.html |
DGN | Microstation DGN | true | false | true | https://gdal.org/drivers/vector/dgn.html |
OGR_VRT | VRT - Virtual Datasource | false | false | true | https://gdal.org/drivers/vector/vrt.html |
Memory | Memory | true | false | true | |
CSV | Comma Separated Value (.csv) | true | false | true | https://gdal.org/drivers/vector/csv.html |
GML | Geography Markup Language (GML) | true | false | true | https://gdal.org/drivers/vector/gml.html |
GPX | GPX | true | false | true | https://gdal.org/drivers/vector/gpx.html |
KML | Keyhole Markup Language (KML) | true | false | true | https://gdal.org/drivers/vector/kml.html |
GeoJSON | GeoJSON | true | false | true | https://gdal.org/drivers/vector/geojson.html |
GeoJSONSeq | GeoJSON Sequence | true | false | true | https://gdal.org/drivers/vector/geojsonseq.html |
ESRIJSON | ESRIJSON | false | false | true | https://gdal.org/drivers/vector/esrijson.html |
TopoJSON | TopoJSON | false | false | true | https://gdal.org/drivers/vector/topojson.html |
OGR_GMT | GMT ASCII Vectors (.gmt) | true | false | true | https://gdal.org/drivers/vector/gmt.html |
GPKG | GeoPackage | true | true | true | https://gdal.org/drivers/vector/gpkg.html |
SQLite | SQLite / Spatialite | true | false | true | https://gdal.org/drivers/vector/sqlite.html |
WAsP | WAsP .map format | true | false | true | https://gdal.org/drivers/vector/wasp.html |
OpenFileGDB | ESRI FileGDB | true | false | true | https://gdal.org/drivers/vector/openfilegdb.html |
DXF | AutoCAD DXF | true | false | true | https://gdal.org/drivers/vector/dxf.html |
CAD | AutoCAD Driver | false | false | true | https://gdal.org/drivers/vector/cad.html |
FlatGeobuf | FlatGeobuf | true | false | true | https://gdal.org/drivers/vector/flatgeobuf.html |
Geoconcept | Geoconcept | true | false | true | |
GeoRSS | GeoRSS | true | false | true | https://gdal.org/drivers/vector/georss.html |
VFK | Czech Cadastral Exchange Data Format | false | false | true | https://gdal.org/drivers/vector/vfk.html |
PGDUMP | PostgreSQL SQL dump | true | false | false | https://gdal.org/drivers/vector/pgdump.html |
OSM | OpenStreetMap XML and PBF | false | false | true | https://gdal.org/drivers/vector/osm.html |
GPSBabel | GPSBabel | true | false | true | https://gdal.org/drivers/vector/gpsbabel.html |
WFS | OGC WFS (Web Feature Service) | false | false | true | https://gdal.org/drivers/vector/wfs.html |
OAPIF | OGC API - Features | false | false | true | https://gdal.org/drivers/vector/oapif.html |
EDIGEO | French EDIGEO exchange format | false | false | true | https://gdal.org/drivers/vector/edigeo.html |
SVG | Scalable Vector Graphics | false | false | true | https://gdal.org/drivers/vector/svg.html |
ODS | Open Document/ LibreOffice / OpenOffice Spreadsheet | true | false | true | https://gdal.org/drivers/vector/ods.html |
XLSX | MS Office Open XML spreadsheet | true | false | true | https://gdal.org/drivers/vector/xlsx.html |
Elasticsearch | Elastic Search | true | false | true | https://gdal.org/drivers/vector/elasticsearch.html |
Carto | Carto | true | false | true | https://gdal.org/drivers/vector/carto.html |
AmigoCloud | AmigoCloud | true | false | true | https://gdal.org/drivers/vector/amigocloud.html |
SXF | Storage and eXchange Format | false | false | true | https://gdal.org/drivers/vector/sxf.html |
Selafin | Selafin | true | false | true | https://gdal.org/drivers/vector/selafin.html |
JML | OpenJUMP JML | true | false | true | https://gdal.org/drivers/vector/jml.html |
PLSCENES | Planet Labs Scenes API | false | false | true | https://gdal.org/drivers/vector/plscenes.html |
CSW | OGC CSW (Catalog Service for the Web) | false | false | true | https://gdal.org/drivers/vector/csw.html |
VDV | VDV-451/VDV-452/INTREST Data Format | true | false | true | https://gdal.org/drivers/vector/vdv.html |
MVT | Mapbox Vector Tiles | true | false | true | https://gdal.org/drivers/vector/mvt.html |
NGW | NextGIS Web | true | true | true | https://gdal.org/drivers/vector/ngw.html |
MapML | MapML | true | false | true | https://gdal.org/drivers/vector/mapml.html |
TIGER | U.S. Census TIGER/Line | false | false | true | https://gdal.org/drivers/vector/tiger.html |
AVCBin | Arc/Info Binary Coverage | false | false | true | https://gdal.org/drivers/vector/avcbin.html |
AVCE00 | Arc/Info E00 (ASCII) Coverage | false | false | true | https://gdal.org/drivers/vector/avce00.html |
Initially we have prioritized providing a breadth of capabilities by wrapping existing libraries. We're planning to implement more of the core functions and algorithms natively in the future to enable faster performance and more efficient memory management.
As an initial step in this direction, we provide a set of non-standard, specialized, columnar DuckDB-native geometry types such as POINT_2D, BOX_2D, etc. These should provide better compression and faster execution in exchange for some flexibility, but the work around them is still very much experimental.
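As a rough sketch of how these might be used today (experimental, so the exact set of available casts and functions may still change):

-- Cast a GEOMETRY point into the compact POINT_2D representation and back again
SELECT
    ST_Point(1.0, 2.0)::POINT_2D AS compact_point,
    (ST_Point(1.0, 2.0)::POINT_2D)::GEOMETRY AS roundtripped;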
Example Usage
The following demonstrates how you can use the spatial extension to read and export multiple geospatial data formats, transform geometries between different coordinate reference systems and work with spatial property and predicate functions. While this example may be slightly contrived, we want to showcase the power of the currently available features. You can find the datasets used in this example in the spatial extension repository.
Let's import the NYC taxi ride data provided in Parquet format, as well as the accompanying taxi zone data from a shapefile, using the ST_Read table function provided by the spatial extension. These taxi zones break NYC into polygons that represent regions, for example the Newark Airport. We then create a table for the rides and a table for the zones. Note that ST_Read produces a table with a wkb_geometry column that contains the geometry data encoded as a WKB (Well-Known Binary) blob, which we then convert to the GEOMETRY type using the ST_GeomFromWKB function.
This may all seem a bit much if you are not familiar with the geospatial ecosystem, but rest assured this is all you really need to get started. In short:
– Shapefile (.shp, .shx, .dbf) is a common format for storing vector geometry data and auxiliary metadata such as indexes and attributes.
– WKB (Well-Known Binary), while not really a file format in itself, is a common binary encoding of vector geometry data, used in e.g., GeoParquet. It comes in multiple flavors, but we're only concerned with "standard" WKB for now.
– GEOMETRY is a DuckDB type that represents a Simple Features geometry object, which is based on a set of standards modeling vector geometry data as points, linestrings, polygons or collections of such. This is the core data type used by the spatial extension, and what most of the provided functions take and return.
INSTALL spatial;
LOAD spatial;
CREATE TABLE rides AS
SELECT *
FROM 'yellow_tripdata_2010-01-limit1mil.parquet';
-- Load the NYC taxi zone data from a shapefile using the gdal-based ST_Read function
CREATE TABLE zones AS
SELECT zone, LocationId, borough, geom
FROM ST_Read('taxi_zones/taxi_zones.shx');
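As an aside, if you ever need to do the WKB conversion by hand, it is a one-liner. Here is a minimal sketch that round-trips a made-up point through WKB and back to GEOMETRY:

-- Encode a point as WKB, then decode it back into a GEOMETRY value
SELECT ST_GeomFromWKB(ST_AsWKB(ST_Point(1.0, 2.0))) AS geom;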
SELECT * FROM rides LIMIT 10;
vendor_id | pickup_datetime | dropoff_datetime | passenger_count | trip_distance | pickup_longitude | pickup_latitude | rate_code | store_and_fwd_flag | dropoff_longitude | dropoff_latitude | payment_type | fare_amount | surcharge | mta_tax | tip_amount | tolls_amount | total_amount |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
VTS | 2010-01-01 00:00:17 | 2010-01-01 00:00:17 | 3 | 0.0 | -73.87105699999998 | 40.773522 | 1 | | -73.871048 | 40.773545 | CAS | 45.0 | 0.0 | 0.5 | 0.0 | 0.0 | 45.5
VTS | 2010-01-01 00:00:20 | 2010-01-01 00:00:20 | 1 | 0.05 | -73.97512999999998 | 40.789973 | 1 | | -73.97498799999998 | 40.790598 | CAS | 2.5 | 0.5 | 0.5 | 0.0 | 0.0 | 3.5
CMT | 2010-01-01 00:00:23 | 2010-01-01 00:00:25 | 1 | 0.0 | -73.999431 | 40.71216 | 1 | 0 | -73.99915799999998 | 40.712421 | No | 2.5 | 0.5 | 0.5 | 0.0 | 0.0 | 3.5 |
CMT | 2010-01-01 00:00:33 | 2010-01-01 00:00:55 | 1 | 0.0 | -73.97721699999998 | 40.749633 | 1 | 0 | -73.97732899999998 | 40.749629 | Cas | 2.5 | 0.5 | 0.5 | 0.0 | 0.0 | 3.5 |
VTS | 2010-01-01 00:01:00 | 2010-01-01 00:01:00 | 1 | 0.0 | -73.942313 | 40.784332 | 1 | | -73.942313 | 40.784332 | Cre | 10.0 | 0.0 | 0.5 | 2.0 | 0.0 | 12.5
VTS | 2010-01-01 00:01:06 | 2010-01-01 00:01:06 | 2 | 0.38 | -73.97463 | 40.756687 | 1 | | -73.979872 | 40.759143 | CAS | 3.7 | 0.5 | 0.5 | 0.0 | 0.0 | 4.7
VTS | 2010-01-01 00:01:07 | 2010-01-01 00:01:07 | 2 | 0.23 | -73.987358 | 40.718475 | 1 | | -73.98518 | 40.720468 | CAS | 2.9 | 0.5 | 0.5 | 0.0 | 0.0 | 3.9
CMT | 2010-01-01 00:00:02 | 2010-01-01 00:01:08 | 1 | 0.1 | -73.992807 | 40.741418 | 1 | 0 | -73.995799 | 40.742596 | No | 2.9 | 0.5 | 0.5 | 0.0 | 0.0 | 3.9 |
VTS | 2010-01-01 00:01:23 | 2010-01-01 00:01:23 | 1 | 0.6099999999999999 | -73.98003799999998 | 40.74306 | 1 | | -73.974862 | 40.750387 | CAS | 3.7 | 0.5 | 0.5 | 0.0 | 0.0 | 4.7
VTS | 2010-01-01 00:01:34 | 2010-01-01 00:01:34 | 1 | 0.02 | -73.954122 | 40.801173 | 1 | | -73.95431499999998 | 40.800897 | CAS | 45.0 | 0.0 | 0.5 | 0.0 | 0.0 | 45.5
SELECT * FROM zones LIMIT 10;
zone | LocationID | borough | geom |
---|---|---|---|
Newark Airport | 1 | EWR | POLYGON (…) |
Jamaica Bay | 2 | Queens | MULTIPOLYGON (…) |
Allerton/Pelham Gardens | 3 | Bronx | POLYGON (…) |
Alphabet City | 4 | Manhattan | POLYGON (…) |
Arden Heights | 5 | Staten Island | POLYGON (…) |
Arrochar/Fort Wadsworth | 6 | Staten Island | POLYGON (…) |
Astoria | 7 | Queens | POLYGON (…) |
Astoria Park | 8 | Queens | POLYGON (…) |
Auburndale | 9 | Queens | POLYGON (…) |
Baisley Park | 10 | Queens | POLYGON (…) |
Let's compare the trip distance to the linear distance between the pickup and dropoff points to figure out how efficient the taxi drivers are (or how dirty the data is, since some diffs seem to be negative). We transform the coordinates from "WGS84" (given by the identifier EPSG:4326), also commonly known as simply latitude/longitude, to the "NAD83 / New York Long Island ftUS" coordinate reference system (identified as ESRI:102718), which is a projection with minimal distortion around New York. We then calculate the distance using the ST_Distance function. In this case we get the distance in feet, since we've converted the coordinates to NAD83, but we can easily convert it to miles (5280 ft/mile), which is the unit used in the rides dataset, so we can compare them correctly.

Wait, what's all this about coordinate reference systems and projections?
The earth is not flat, but sometimes it is useful to pretend it is for the sake of simplicity by "projecting" the coordinates onto a flat surface. The "parameters" of a projection – e.g., where the "origin" is located, what unit coordinates are in, or how the earth's shape is approximated – are encapsulated by a "Spatial Reference System" or "Coordinate Reference System" (CRS), which is usually referenced by a shorthand identifier composed of an authority and a code, e.g., "EPSG:4326" or "ESRI:102718". Projections are always lossy, so it's important to use a CRS that is well suited for the "area of interest" your data is in. The spatial extension uses the PROJ library to handle coordinate reference systems and projections.

Trips with a distance shorter than the aerial distance are likely to be erroneous, so we use this query to filter out some bad data. The query below takes advantage of DuckDB's ability to refer to column aliases defined within the same SELECT statement. This is a small example of how DuckDB's rich SQL dialect can simplify geospatial analysis.

CREATE TABLE cleaned_rides AS
SELECT
ST_Point(pickup_latitude, pickup_longitude) AS pickup_point,
ST_Point(dropoff_latitude, dropoff_longitude) AS dropoff_point,
dropoff_datetime::TIMESTAMP - pickup_datetime::TIMESTAMP AS time,
trip_distance,
ST_Distance(
ST_Transform(pickup_point, 'EPSG:4326', 'ESRI:102718'),
ST_Transform(dropoff_point, 'EPSG:4326', 'ESRI:102718')) / 5280
AS aerial_distance,
trip_distance - aerial_distance AS diff
FROM rides
WHERE diff > 0
ORDER BY diff DESC;
It should be noted that this is not entirely accurate, since the ST_Distance function we use does not take into account the curvature of the earth. However, we'll accept it as a good enough approximation for our purposes. Spherical and geodesic distance calculations are on the roadmap!
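Since reprojection with ST_Transform shows up in most of the queries here, it may help to see what it does to a single coordinate pair in isolation (the point below is made up, roughly midtown Manhattan):

-- Reproject one lat/lon point from WGS84 into NAD83 / New York Long Island (ftUS)
SELECT ST_Transform(ST_Point(40.7484, -73.9857), 'EPSG:4326', 'ESRI:102718') AS projected_point;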
Now let's join the taxi rides with the taxi zones to get the start and end zone for each ride, using the ST_Within function as our join condition to check if a pickup or dropoff point is within a taxi zone polygon. Again we need to transform the coordinates from WGS84 to NAD83, since the taxi zone data also uses that projection. Spatial joins like these are the bread and butter of geospatial data processing, but we don't currently have any optimizations in place (such as spatial indexes) to speed up these queries, which is why we only use a subset of the data for the following step.
-- Since we don't have spatial indexes yet, use a smaller dataset for the join.
DELETE FROM cleaned_rides WHERE rowid > 5000;
CREATE TABLE joined AS
SELECT
pickup_point,
dropoff_point,
start_zone.zone AS start_zone,
end_zone.zone AS end_zone,
trip_distance,
time,
FROM cleaned_rides
JOIN zones AS start_zone
ON ST_Within(ST_Transform(pickup_point, 'EPSG:4326', 'ESRI:102718'), start_zone.geom)
JOIN zones AS end_zone
ON ST_Within(ST_Transform(dropoff_point, 'EPSG:4326', 'ESRI:102718'), end_zone.geom);
SELECT * FROM joined USING SAMPLE 10 ROWS;
pickup_point | dropoff_point | start_zone | end_zone | trip_distance | time |
---|---|---|---|---|---|
POINT (40.722223 -73.98385299999998) | POINT (40.715507 -73.992438) | East Village | Lower East Side | 10.3 | 00:19:16 |
POINT (40.648687 -73.783522) | POINT (40.649567 -74.005812) | JFK Airport | Sunset Park West | 23.57 | 00:28:00 |
POINT (40.761603 -73.96661299999998) | POINT (40.760232 -73.96344499999998) | Upper East Side South | Sutton Place/Turtle Bay North | 17.6 | 00:27:05 |
POINT (40.697212 -73.937495) | POINT (40.652377 -73.93983299999998) | Stuyvesant Heights | East Flatbush/Farragut | 13.55 | 00:24:00 |
POINT (40.721462 -73.993583) | POINT (40.774205 -73.90441699999998) | Lower East Side | Steinway | 28.75 | 01:03:00 |
POINT (40.716955 -74.004328) | POINT (40.754688 -73.991612) | TriBeCa/Civic Center | Garment District | 18.4 | 00:46:12 |
POINT (40.740052 -73.994918) | POINT (40.75439 -73.98587499999998) | Flatiron | Garment District | 24.2 | 00:35:25 |
POINT (40.763017 -73.95949199999998) | POINT (40.763615 -73.959182) | Lenox Hill East | Lenox Hill West | 18.4 | 00:33:46 |
POINT (40.865663 -73.927458) | POINT (40.86537 -73.927352) | Washington Heights North | Washington Heights North | 10.47 | 00:27:00 |
POINT (40.738408 -73.980345) | POINT (40.696038 -73.955493) | Gramercy | Bedford | 16.4 | 00:21:47 |
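From here on, plain SQL takes over. For example, a (hypothetical) follow-up aggregation to find the most common zone-to-zone trips in our joined sample might look like this:

-- Count the most frequent pickup/dropoff zone pairs in the joined sample
SELECT start_zone, end_zone, count(*) AS rides, avg(trip_distance) AS avg_distance
FROM joined
GROUP BY start_zone, end_zone
ORDER BY rides DESC
LIMIT 5;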
We can export the joined table to a GeoJSONSeq file using the GDAL copy function, passing in a GDAL layer creation option. Since GeoJSON only supports a single GEOMETRY per record, we use the ST_MakeLine function to combine the pickup and dropoff points into a single line geometry. The default coordinate reference system for GeoJSON is WGS84, but the coordinate pairs are expected to be in longitude/latitude, so we need to flip the geometry using the ST_FlipCoordinates function.
COPY (
SELECT
ST_MakeLine(pickup_point, dropoff_point)
.ST_FlipCoordinates()
.ST_AsWKB()
AS wkb_geometry,
start_zone,
end_zone,
time::VARCHAR AS trip_time
FROM joined)
TO 'joined.geojsonseq'
WITH (
FORMAT GDAL,
DRIVER 'GeoJSONSeq',
LAYER_CREATION_OPTIONS 'WRITE_BBOX=YES'
);
head -n 10 joined.geojsonseq
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:52:00" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.789923, 40.643515 ], [ -73.97608, 40.680395 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:35:00" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.776445, 40.645422 ], [ -73.98427, 40.670782 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:45:42" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.776878, 40.645065 ], [ -73.992153, 40.662571 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:36:00" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.788028, 40.641508 ], [ -73.97584, 40.670927 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:47:58" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.781855, 40.644749 ], [ -73.980129, 40.663663 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:32:10" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.787494, 40.641559 ], [ -73.974694, 40.673479 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:36:59" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.790138, 40.643342 ], [ -73.982721, 40.662379 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:32:00" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.786952, 40.641248 ], [ -73.97421, 40.676237 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:33:21" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.783892, 40.648514 ], [ -73.979283, 40.669721 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:35:45" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.776643, 40.645272 ], [ -73.978873, 40.66723 ] ] } }
And there we have it! We pulled tabular data from Parquet, combined it with geospatial data in a shapefile, cleaned and analyzed that combined data, and output it to a human-readable geospatial format. The full set of currently supported functions and their implementation status can be found over at the docs in this table.
What's Next?
While it's probably going to take a while for us to catch up to the full set of functions provided by e.g., PostGIS, we believe that DuckDB's vectorized execution model and columnar storage format will enable a whole new class of optimizations for geospatial processing that we've just begun exploring. Improving the performance of spatial joins and predicates is therefore high on our list of priorities.
There are also some limitations of our GEOMETRY type that we would eventually like to tackle, such as the fact that we don't support the additional Z and M dimensions, or the full range of geometry sub-types mandated by the OGC standard, like curves or polyhedral surfaces.
We're also interested in supporting spherical and ellipsoidal calculations in the near future, perhaps in the form of a dedicated GEOGRAPHY type.
WASM builds are also just around the corner!
Please take a look at the GitHub repository for the full roadmap and to see what we're currently working on. If you would like to help build this capability, please reach out on GitHub!
Conclusion
The DuckDB Spatial extension is another step towards making DuckDB a Swiss Army knife for data engineering and analytics. This extension provides a flexible and familiar GEOMETRY type, reprojectable between thousands of coordinate reference systems, coupled with the capability to export and import geospatial data to and from more than 50 different data formats, all embedded into a single extension with minimal runtime dependencies. This enables DuckDB to fit seamlessly into your existing GIS workflows regardless of which geospatial data formats or projections you're working with.
We are excited to hear what you make of the DuckDB spatial extension. It's still early days but we hope to have a lot more to share in the future as we continue making progress! If you have any questions, suggestions, ideas or issues, please don't hesitate to reach out to us on Discord or GitHub!