DuckLake 0.2
TL;DR: We released the DuckLake v0.2 standard, which is supported by DuckDB v1.3.1's ducklake extension.
We released DuckLake a little over a month ago, and we were overwhelmed by the positive response from the community. Naturally, we also received a bunch of feature requests. With DuckLake version 0.2, we are happy to present the improvements made since!
We would like to start with a quick reminder: the term DuckLake refers to both the DuckLake open standard and the ducklake DuckDB extension. In this blog post, we discuss the updates affecting both of these.
Updates in the DuckDB ducklake Extension
Secrets for Managing DuckLake Credentials
Secrets can now be used to manage DuckLake credentials. The secrets can be created using DuckDB's standard secret infrastructure, and contain all of the information required to connect to a given DuckLake instance. When connecting to DuckLake, the secret name can be passed in as a parameter; the connection information is then loaded from the secret.
Named Secrets
You can pass a named secret to DuckLake as follows:
CREATE SECRET my_secret (
TYPE ducklake,
METADATA_PATH 'metadata.db',
DATA_PATH 's3://my-s3-bucket/'
);
ATTACH 'ducklake:my_secret';
Unnamed Secrets
In addition to named secrets, there can also be one unnamed secret. When connecting using an empty connection string, this unnamed secret is used to connect:
CREATE SECRET (
TYPE ducklake,
METADATA_PATH 'metadata.db',
DATA_PATH 's3://my-s3-bucket/'
);
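For example, after creating the unnamed secret above, attaching with nothing after the ducklake: prefix should pick it up (a minimal sketch; the alias is illustrative):

-- The empty connection string falls back to the unnamed DuckLake secret.
ATTACH 'ducklake:' AS my_ducklake;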
Settings
This release adds support for many more settings for DuckLake, in particular around the writing of Parquet files. Settings can be scoped globally for the entire DuckLake, or set at a per-schema or per-table level, and are persisted to the DuckLake in the ducklake_metadata table. The following settings are now supported:

- parquet_compression: Compression algorithm for Parquet files (uncompressed, snappy, gzip, zstd, brotli, lz4).
- parquet_version: Parquet format version (1 or 2).
- parquet_compression_level: Compression level for Parquet files.
- parquet_row_group_size: Number of rows per row group in Parquet files.
- parquet_row_group_size_bytes: Number of bytes per row group in Parquet files.
- target_file_size: The target data file size for insertion and compaction operations.
Settings can be set using the set_option method:
CALL ducklake.set_option('parquet_compression', 'zstd');
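Given the scoping described above, setting an option for a single table could look like the following sketch; the named parameters used here (schema, table_name) are hypothetical and only illustrate the idea:

-- A sketch of setting a Parquet option at table scope rather than globally.
-- NOTE: the schema/table_name parameter names are assumptions, not confirmed API.
CALL ducklake.set_option('parquet_compression', 'zstd', schema => 'main', table_name => 'my_table');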
The settings for a given DuckLake can be read using the new options function:
FROM ducklake.options();
List Files
We added the ducklake_list_files function. This function returns the list of data files, together with their corresponding delete files, that must be read to scan a given table at a given snapshot. This can be helpful when integrating DuckLake into other systems.
FROM ducklake_list_files('ducklake', 'file');
Directly Attaching DuckLake
This release also adds support for directly attaching DuckLake using the duckdb CLI and other clients:
duckdb ducklake:my_file.db
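This should behave like starting the CLI and attaching manually (a sketch; the alias is illustrative):

-- Roughly equivalent to the direct-attach invocation above.
ATTACH 'ducklake:my_file.db' AS my_file;
USE my_file;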
Updates in the DuckLake 0.2 Standard
Relative Schema/Table Paths
In the old DuckLake standard, paths were always relative to the global data path. In the DuckLake v0.2 standard, the locations to which data and delete files are written now have three layers:
- Data paths are relative to the table path
- Table paths are relative to the schema path
- Schema paths are relative to the global data path
This allows data files to be written in a more structured manner. By default, the schema and table name are set as the path to which the files are written. For example:
main/
    my_table/
        ducklake-uuid1.parquet
        ducklake-uuid2.parquet
The paths are stored in the ducklake_table and ducklake_schema tables.
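As a sketch, the resulting per-table prefix can be inspected directly from these metadata tables (the column names follow the DuckLake specification; the concatenation assumes relative paths stored with trailing slashes):

-- List the relative directory prefix of each live table by combining
-- the schema path with the table path.
SELECT s.schema_name, t.table_name, s.path || t.path AS relative_prefix
FROM ducklake_schema s
JOIN ducklake_table t USING (schema_id)
WHERE s.end_snapshot IS NULL
  AND t.end_snapshot IS NULL;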
By writing all files for a given table in a given subdirectory, it is now possible to use prefix-based access control at the object store level to grant users access to only specific schemas or tables in the database.
Name Mapping and Adding Existing Parquet Files
DuckLake by default uses field ids to perform column mapping. When writing Parquet files through DuckLake, each column contains a field id that indicates to which top-level column it belongs. This allows DuckLake to perform metadata-only alter operations such as renaming and dropping fields.
When registering existing Parquet files written through other means or by other writers, these files generally do not have field identifiers written to them. In order to facilitate using these Parquet files in DuckLake, v0.2 adds a new way of mapping columns in the form of name mapping. This allows registering Parquet files as follows:
ATTACH 'ducklake:my_ducklake.db' AS my_ducklake;
CREATE TABLE my_ducklake.people(id BIGINT, name VARCHAR);
COPY (SELECT 42 AS id, 'Mark' AS name) TO 'people.parquet';
CALL ducklake_add_data_files('my_ducklake', 'people', 'people.parquet');
FROM my_ducklake.people;
┌───────┬─────────┐
│ id │ name │
│ int64 │ varchar │
├───────┼─────────┤
│ 42 │ Mark │
└───────┴─────────┘
Every file that is added to DuckLake has an optional mapping_id that tells the system how column names map to field ids. As this is done on a per-file basis, Parquet files can always be added to DuckLake without restriction. All DuckLake operations are supported on added files, including schema evolution and data change feeds.
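The mapping itself is stored in the new ducklake_column_mapping and ducklake_name_mapping tables (created by the migration shown below). As a sketch, assuming the metadata database is queried directly, the recorded name-to-field-id mappings can be inspected as follows:

-- Show which source column names map to which field ids, per mapping.
SELECT m.mapping_id, m.table_id, n.source_name, n.target_field_id
FROM ducklake_column_mapping m
JOIN ducklake_name_mapping n USING (mapping_id);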
Settings
DuckLake v0.2 adds support for scoped settings. The settings stored in the ducklake_metadata table now have an optional scope field, together with a scope_id. This allows settings to be scoped at either the schema or table level, instead of requiring all settings to be scoped globally.
Partition Transforms
This release adds support for the year/month/day/hour partition transforms, allowing these functions to be used to partition data directly instead of having to create separate columns containing these fields.
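As a sketch, assuming the SET PARTITIONED BY syntax of the ducklake extension and an illustrative table and column, partitioning by these transforms could look as follows:

-- Partition by year and month derived from a timestamp column,
-- without materializing separate year/month columns.
ALTER TABLE my_ducklake.events SET PARTITIONED BY (year(event_time), month(event_time));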
Migration Guide
When connecting DuckDB to a DuckLake that uses the v0.1 standard, a migration is automatically triggered that updates the DuckLake to v0.2. The following SQL queries are issued to perform the migration.
ALTER TABLE ducklake_schema ADD COLUMN path VARCHAR DEFAULT '';
ALTER TABLE ducklake_schema ADD COLUMN path_is_relative BOOLEAN DEFAULT true;
ALTER TABLE ducklake_table ADD COLUMN path VARCHAR DEFAULT '';
ALTER TABLE ducklake_table ADD COLUMN path_is_relative BOOLEAN DEFAULT true;
ALTER TABLE ducklake_metadata ADD COLUMN scope VARCHAR;
ALTER TABLE ducklake_metadata ADD COLUMN scope_id BIGINT;
ALTER TABLE ducklake_data_file ADD COLUMN mapping_id BIGINT;
CREATE TABLE ducklake_column_mapping(mapping_id BIGINT, table_id BIGINT, type VARCHAR);
CREATE TABLE ducklake_name_mapping(mapping_id BIGINT, column_id BIGINT, source_name VARCHAR, target_field_id BIGINT, parent_column BIGINT);
UPDATE ducklake_partition_column
SET column_id = (SELECT list(column_id ORDER BY column_order)
FROM ducklake_column
WHERE table_id = ducklake_partition_column.table_id AND parent_column IS NULL AND end_snapshot IS NULL)[ducklake_partition_column.column_id + 1];
UPDATE ducklake_metadata
SET value = '0.2'
WHERE key = 'version';