DuckLake 0.3 with Iceberg Interoperability and Geometry Support
TL;DR: We are releasing version 0.3 of the DuckLake Specification and the ducklake DuckDB extension.
The focus of the team in the last two months has been on making DuckLake as robust as possible.
Besides this, we have implemented a bunch of features that will make the experience of running DuckLake even smoother.
In this post, we will discuss both the updates in the DuckDB ducklake extension and those in the DuckLake v0.3 specification.
New Features in the ducklake Extension
The ducklake extension in DuckDB v1.4.0 ships several new features. To try out the features described in this blog post, make sure that you have DuckDB v1.4.0 installed and that all of your extensions are up to date. This can be achieved by running:
UPDATE EXTENSIONS;
Interoperability with Iceberg
Powered by the DuckDB iceberg extension, it is possible to copy data between DuckLake and Iceberg.
Setup for Iceberg and DuckLake
INSTALL iceberg; LOAD iceberg;
INSTALL httpfs; LOAD httpfs;
-- Basic Iceberg setup from https://github.com/duckdb/duckdb-iceberg/blob/main/scripts/start-rest-catalog.sh
CREATE SECRET (
TYPE S3,
KEY_ID 'admin',
SECRET 'password',
ENDPOINT '127.0.0.1:9000',
URL_STYLE 'path',
USE_SSL false
);
ATTACH '' AS iceberg_datalake (
TYPE iceberg,
CLIENT_ID 'admin',
CLIENT_SECRET 'password',
ENDPOINT 'http://127.0.0.1:8181'
);
ATTACH 'ducklake:my_ducklake.ducklake' AS ducklake (DATA_PATH 'data/');
Now you can copy from Iceberg to DuckLake.
CREATE SCHEMA iceberg_datalake.default;
CREATE TABLE iceberg_datalake.default.iceberg_table AS
SELECT a FROM range(4) t(a);
COPY FROM DATABASE iceberg_datalake TO ducklake;
Copying from DuckLake to Iceberg also works, provided that the schemas have already been created in Iceberg.
-- Assuming the Iceberg catalog is empty, since the COPY command does not replace tables
CREATE SCHEMA iceberg_datalake.main;
CREATE TABLE ducklake.main.ducklake_table AS
SELECT a FROM range(4) t(a);
COPY FROM DATABASE ducklake TO iceberg_datalake;
These examples are data copies (i.e., deep copies) of the latest snapshot: only the data of that snapshot is ported from Iceberg to DuckLake and vice versa. Metadata-only copies are also supported from Iceberg to DuckLake. The main difference is that a metadata-only copy does not copy the underlying data files, only the metadata, including the entire snapshot history. This means that you can query previous snapshots of an Iceberg table as if it were a DuckLake table.
CREATE SCHEMA iceberg_datalake.default;
CREATE TABLE iceberg_datalake.default.iceberg_table AS
SELECT a FROM range(4) t(a); -- version 0
INSERT INTO iceberg_datalake.default.iceberg_table
SELECT a FROM range(4, 10) t(a); -- version 1
CALL iceberg_to_ducklake('iceberg_datalake', 'ducklake');
Now you can query any version of the Iceberg table as if it were a DuckLake table.
FROM ducklake.default.iceberg_table AT (VERSION => 0);
┌───────┐
│ a │
│ int64 │
├───────┤
│ 0 │
│ 1 │
│ 2 │
│ 3 │
└───────┘
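Time travel by timestamp works as well. As a small sketch, assuming a snapshot had already been committed at the given point in time:
-- Query the table as it was one minute ago (adjust to a time at
-- which a snapshot existed).
FROM ducklake.default.iceberg_table AT (TIMESTAMP => NOW() - INTERVAL 1 MINUTE);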
MERGE INTO Statement
Since DuckDB v1.4.0 supports MERGE INTO, this functionality is also available in the ducklake extension as of this release. MERGE INTO is a very common statement in OLAP systems that do not support primary keys but still want to offer upsert (i.e., UPDATE plus INSERT) functionality.
CREATE TABLE Stock(item_id INTEGER, balance INTEGER);
INSERT INTO Stock VALUES (10, 2200), (20, 1900);
WITH new_stocks(item_id, volume) AS (VALUES (20, 2200), (30, 1900))
MERGE INTO Stock USING new_stocks USING (item_id)
WHEN MATCHED THEN UPDATE SET balance = balance + volume
WHEN NOT MATCHED THEN INSERT VALUES (new_stocks.item_id, new_stocks.volume);
FROM Stock;
┌─────────┬─────────┐
│ item_id │ balance │
│ int32 │ int32 │
├─────────┼─────────┤
│ 10 │ 2200 │
│ 20 │ 4100 │
│ 30 │ 1900 │
└─────────┴─────────┘
MERGE INTO also supports more complex conditions and DELETE statements.
WITH deletes(item_id, delete_threshold) AS (VALUES (10, 3000))
MERGE INTO Stock USING deletes USING (item_id)
WHEN MATCHED AND balance < delete_threshold THEN DELETE;
FROM Stock;
┌─────────┬─────────┐
│ item_id │ balance │
│ int32 │ int32 │
├─────────┼─────────┤
│ 20 │ 4100 │
│ 30 │ 1900 │
└─────────┴─────────┘
CHECKPOINT Statement
DuckLake now also supports the CHECKPOINT statement. In DuckLake, this statement runs a series of maintenance operations in sequential order: flushing inlined data, expiring snapshots, compacting files, rewriting heavily deleted files, and cleaning up old or orphaned files. The CHECKPOINT statement can be configured via global options, which can be set with the ducklake.set_option function.
ATTACH 'ducklake:my_ducklake.ducklake' AS my_ducklake;
USE my_ducklake;
CHECKPOINT;
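If you need finer-grained control, the maintenance steps that CHECKPOINT bundles can also be run individually. The following is a sketch using the maintenance functions from the DuckLake documentation; the thresholds used here are illustrative:
-- Flush data that was inlined in the metadata catalog to data files.
CALL ducklake_flush_inlined_data('my_ducklake');
-- Expire snapshots older than one week.
CALL ducklake_expire_snapshots('my_ducklake', older_than => NOW() - INTERVAL 7 DAYS);
-- Compact adjacent small data files.
CALL ducklake_merge_adjacent_files('my_ducklake');
-- Remove files that are no longer referenced by any snapshot.
CALL ducklake_cleanup_old_files('my_ducklake', cleanup_all => true);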
Faster Inserts
In cases where the output of each thread in a given query is of a reasonable size (i.e., not too small), setting the per_thread_output option will most likely speed up inserts into DuckLake. If you have a large number of threads and the output of each thread is small, per_thread_output will likely not improve performance much and will instead generate an undesirable number of small files that hinder read performance. You can benefit a lot from this option if you are running a setup where DuckDB runs on an EC2 instance with very high network bandwidth (up to 100 Gbps) for writing to S3.
In the following benchmark, we found a ~25% improvement when enabling per_thread_output:
.timer on
ATTACH 'ducklake:my_ducklake.ducklake' AS ducklake;
USE ducklake;
CREATE TABLE sample_table AS SELECT * FROM range(1_000_000_000);
-- 4.5 seconds
CREATE TABLE slow_copy AS SELECT * FROM sample_table;
-- enable the option
CALL ducklake.set_option('per_thread_output', true);
-- 3.4 seconds
CREATE TABLE fast_copy AS SELECT * FROM sample_table;
This feature was contributed by community member Julian Meyers.
Updates in the DuckLake Specification
Support for Geometry Types
Geometry types are now supported in DuckLake. This means that you can now use most of the functionality of DuckDB's spatial extension with DuckLake as a backend. See the documentation on supported geometry primitives.
INSTALL spatial;
LOAD spatial;
ATTACH 'ducklake:my_ducklake.ducklake' AS ducklake;
CREATE TABLE geometry_table (polygons GEOMETRY);
INSERT INTO geometry_table VALUES ('POLYGON((0 0, 0 1, 1 1, 1 0, 0 0))');
SELECT
polygons,
ST_Area(polygons) AS area
FROM geometry_table;
┌─────────────────────────────────────┬────────┐
│ polygons │ area │
│ geometry │ double │
├─────────────────────────────────────┼────────┤
│ POLYGON ((0 0, 0 1, 1 1, 1 0, 0 0)) │ 1.0 │
└─────────────────────────────────────┴────────┘
Please note that geometry support is currently experimental and is still missing features such as filter pushdown, data inlining, and coordinate system tracking.
Author Commits
We have also added the ability to attach commit metadata (an author, a commit message, and extra commit information) to the snapshot created by a DuckLake transaction. This functionality helps greatly with audit needs.
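This commit metadata is stored in the ducklake_snapshot_changes table of the metadata catalog (see the author, commit_message, and commit_extra_info columns in the v0.3 migration script in the next section). As a sketch for inspecting it, assuming a DuckDB-backed metadata catalog stored in my_ducklake.ducklake:
-- Attach the metadata catalog directly and list the commit metadata
-- recorded for each snapshot.
ATTACH 'my_ducklake.ducklake' AS ducklake_metadata (READ_ONLY);
SELECT snapshot_id, author, commit_message, commit_extra_info
FROM ducklake_metadata.ducklake_snapshot_changes;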
Migration from v0.2 to v0.3
You don’t have to worry about manually migrating between different versions of the DuckLake specification.
The moment you attach to your existing DuckLake with the new ducklake extension installed, the extension updates the metadata catalog to version 0.3 using the following SQL script:
ALTER TABLE ducklake_name_mapping ADD COLUMN IF NOT EXISTS is_partition BOOLEAN DEFAULT false;
ALTER TABLE ducklake_snapshot_changes ADD COLUMN IF NOT EXISTS author VARCHAR DEFAULT NULL;
ALTER TABLE ducklake_snapshot_changes ADD COLUMN IF NOT EXISTS commit_message VARCHAR DEFAULT NULL;
ALTER TABLE ducklake_snapshot_changes ADD COLUMN IF NOT EXISTS commit_extra_info VARCHAR DEFAULT NULL;
UPDATE ducklake_metadata SET value = '0.3' WHERE key = 'version';
CREATE TABLE IF NOT EXISTS ducklake_schema_versions(begin_snapshot BIGINT, schema_version BIGINT);
INSERT INTO ducklake_schema_versions SELECT * FROM (SELECT MIN(snapshot_id), schema_version FROM ducklake_snapshot GROUP BY schema_version ORDER BY schema_version) t WHERE NOT EXISTS (SELECT 1 FROM ducklake_schema_versions);
ALTER TABLE IF EXISTS ducklake_file_column_statistics RENAME TO ducklake_file_column_stats;
ALTER TABLE ducklake_file_column_stats ADD COLUMN IF NOT EXISTS extra_stats VARCHAR DEFAULT NULL;
ALTER TABLE ducklake_table_column_stats ADD COLUMN IF NOT EXISTS extra_stats VARCHAR DEFAULT NULL;
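To verify which specification version a metadata catalog is on after the migration, you can check its ducklake_metadata table, whose key/value layout the script above relies on. Again a sketch, assuming a DuckDB-backed metadata catalog:
-- Check the specification version recorded in the metadata catalog.
ATTACH 'my_ducklake.ducklake' AS meta_catalog (READ_ONLY);
SELECT value FROM meta_catalog.ducklake_metadata WHERE key = 'version';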
Guides and Roadmap
In the last month, we have also been preparing guides to help you adopt DuckLake, covering topics that are not features of DuckLake itself but will help you run your DuckLake smoothly.
Finally, we have also released a roadmap showing the features planned for upcoming DuckLake versions.