The Duck Lineage extension automatically captures query lineage events and sends them to an OpenLineage backend.
Maintainer(s):
thijs-s
Installing and Loading
INSTALL duck_lineage FROM community;
LOAD duck_lineage;
Example
-- Point to a running OpenLineage backend
SET duck_lineage_url = 'http://localhost:5000/api/v1/lineage';
-- Run some queries
CREATE TABLE greetings (id INTEGER, message VARCHAR);
INSERT INTO greetings VALUES (1, 'Hello'), (2, 'World');
SELECT * FROM greetings;
-- Lineage events for the queries above are sent to the OpenLineage backend automatically
About duck_lineage
The duck_lineage extension automatically captures data lineage from every DuckDB query and emits OpenLineage events to any compatible backend (e.g., Marquez, Atlan, DataHub).
Features:
- Automatic lineage capture — no query modification required
- OpenLineage START/COMPLETE/FAIL event lifecycle for every query
- Input and output dataset extraction from logical query plans
- Schema facets with column names and types for all tracked datasets
- SQL query facet attached to every event
- Output statistics facet (row count) on COMPLETE events
- Lifecycle state change tracking (CREATE, DROP, ALTER, OVERWRITE, RENAME, TRUNCATE)
- Symlinks facet for dataset identity resolution
- File-based source tracking (read_csv, read_parquet, COPY TO)
- DuckLake catalog support with automatic namespace resolution from DATA_PATH
- Asynchronous event delivery via background worker thread
- Exponential backoff retry with configurable max retries
- Configurable event queue with overflow protection
- API key authentication for OpenLineage backends
- Parent run facet via OPENLINEAGE_PARENT_* environment variables
- Debug logging mode
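As a rough sketch of the DuckLake case, the statements below attach a DuckLake catalog with a DATA_PATH and write a table into it. This assumes the separate ducklake extension is available. The catalog name, metadata file, and data directory are hypothetical. How the namespace is derived from DATA_PATH is not described here, so no resulting namespace is shown.

```sql
-- Assumption: the ducklake extension is installable in this environment
INSTALL ducklake;
LOAD ducklake;
-- Hypothetical metadata file and data directory
ATTACH 'ducklake:metadata.ducklake' AS my_lake (DATA_PATH 'lake_data/');
USE my_lake;
-- Writes to tables in the attached catalog are ordinary tracked operations
CREATE TABLE events (id INTEGER, payload VARCHAR);
INSERT INTO events VALUES (1, 'first');
```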
Tracked operations:
- INSERT, UPDATE, DELETE, MERGE
- CREATE TABLE, CREATE TABLE AS, CREATE VIEW, CREATE INDEX
- DROP, ALTER
- COPY TO
- SELECT (read-only lineage)
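The statements below are a small sketch that touches several of the tracked operation types in one session. The table names and the output file name are made up for illustration, and the script assumes duck_lineage_url already points at a reachable backend.

```sql
-- CREATE TABLE and INSERT
CREATE TABLE orders (id INTEGER, amount DECIMAL(10, 2));
INSERT INTO orders VALUES (1, 19.99), (2, 5.00);
-- CREATE TABLE AS: reads from orders, writes large_orders
CREATE TABLE large_orders AS SELECT * FROM orders WHERE amount > 10;
-- UPDATE and DELETE
UPDATE orders SET amount = amount * 2 WHERE id = 2;
DELETE FROM orders WHERE id = 1;
-- COPY TO: writes a file-based output
COPY large_orders TO 'large_orders.parquet' (FORMAT parquet);
-- SELECT over a file-based source (read-only lineage)
SELECT count(*) FROM read_parquet('large_orders.parquet');
-- DROP
DROP TABLE large_orders;
```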
Configuration (via SET statements):
- duck_lineage_url — OpenLineage backend endpoint
- duck_lineage_namespace — default dataset namespace
- duck_lineage_api_key — authentication key
- duck_lineage_debug — enable debug logging
- duck_lineage_max_retries — retry attempts for failed HTTP requests (default: 3)
- duck_lineage_max_queue_size — max queued events before dropping (default: 10000)
- duck_lineage_timeout — HTTP request timeout in seconds (default: 10)
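As a sketch of how these options fit together, the statements below set each one explicitly. The namespace and API key are placeholder values, not values defined by the extension. The numeric values simply restate the documented defaults.

```sql
-- Endpoint of the OpenLineage backend
SET duck_lineage_url = 'http://localhost:5000/api/v1/lineage';
-- Placeholder namespace and key: replace with your own values
SET duck_lineage_namespace = 'my_namespace';
SET duck_lineage_api_key = 'my-api-key';
-- Verbose logging while setting things up
SET duck_lineage_debug = true;
-- Documented defaults, written out explicitly
SET duck_lineage_max_retries = 3;
SET duck_lineage_max_queue_size = 10000;
SET duck_lineage_timeout = 10;
```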
Limitations:
- Lineage captured from prepared statements is less detailed
- No column-level lineage (dataset-level granularity only)
- Requires an external OpenLineage-compatible backend for event storage
Added Settings
| name | description | input_type | scope | aliases |
|---|---|---|---|---|
| duck_lineage_api_key | API Key for OpenLineage backend | VARCHAR | GLOBAL | [] |
| duck_lineage_debug | Enable debug logging for OpenLineage events | BOOLEAN | GLOBAL | [] |
| duck_lineage_max_queue_size | Maximum number of events to queue before dropping | BIGINT | GLOBAL | [] |
| duck_lineage_max_retries | Maximum retry attempts for failed HTTP requests | BIGINT | GLOBAL | [] |
| duck_lineage_namespace | Namespace for OpenLineage events | VARCHAR | GLOBAL | [] |
| duck_lineage_timeout | HTTP request timeout in seconds | BIGINT | GLOBAL | [] |
| duck_lineage_url | URL of the OpenLineage backend | VARCHAR | GLOBAL | [] |
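To check the current values of these settings after loading the extension, one option is DuckDB's built-in duckdb_settings() table function. This is a minimal sketch that filters on the duck_lineage_ name prefix used in the table above.

```sql
-- List the extension's settings with their current values
SELECT name, value, input_type, scope
FROM duckdb_settings()
WHERE starts_with(name, 'duck_lineage_')
ORDER BY name;
```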