duck_lineage

The Duck Lineage extension automatically captures query lineage events and sends them to an OpenLineage backend.

Maintainer(s): thijs-s

Installing and Loading

INSTALL duck_lineage FROM community;
LOAD duck_lineage;

Example

-- Point to a running OpenLineage backend
SET duck_lineage_url = 'http://localhost:5000/api/v1/lineage';

-- Run some queries
CREATE TABLE greetings (id INTEGER, message VARCHAR);
INSERT INTO greetings VALUES (1, 'Hello'), (2, 'World');

SELECT * FROM greetings;

-- Lineage for the queries above is sent to the OpenLineage backend automatically.

About duck_lineage

The duck_lineage extension automatically captures data lineage from every DuckDB query and emits OpenLineage events to any compatible backend (e.g., Marquez, Atlan, DataHub).

Features:

  • Automatic lineage capture — no query modification required
  • OpenLineage START/COMPLETE/FAIL event lifecycle for every query
  • Input and output dataset extraction from logical query plans
  • Schema facets with column names and types for all tracked datasets
  • SQL query facet attached to every event
  • Output statistics facet (row count) on COMPLETE events
  • Lifecycle state change tracking (CREATE, DROP, ALTER, OVERWRITE, RENAME, TRUNCATE)
  • Symlinks facet for dataset identity resolution
  • File-based source tracking (read_csv, read_parquet, COPY TO)
  • DuckLake catalog support with automatic namespace resolution from DATA_PATH
  • Asynchronous event delivery via background worker thread
  • Exponential backoff retry with configurable max retries
  • Configurable event queue with overflow protection
  • API key authentication for OpenLineage backends
  • Parent run facet via OPENLINEAGE_PARENT_* environment variables
  • Debug logging mode

Tracked operations:

  • INSERT, UPDATE, DELETE, MERGE
  • CREATE TABLE, CREATE TABLE AS, CREATE VIEW, CREATE INDEX
  • DROP, ALTER
  • COPY TO
  • SELECT (read-only lineage)
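
As a rough illustration of the tracked operations listed above, the following sketch runs one statement of several kinds in sequence. It assumes the extension is already loaded and duck_lineage_url points at a reachable backend; the table, view, and file names are hypothetical, and the comments only label the statement kind rather than the exact events the extension emits.

```sql
-- Hypothetical names throughout; assumes duck_lineage is loaded and configured.
CREATE TABLE greetings_src (id INTEGER, message VARCHAR);            -- CREATE TABLE
INSERT INTO greetings_src VALUES (1, 'Hello'), (2, 'World');         -- INSERT

CREATE TABLE greetings_copy AS SELECT * FROM greetings_src;          -- CREATE TABLE AS
UPDATE greetings_copy SET message = upper(message) WHERE id = 1;     -- UPDATE
DELETE FROM greetings_copy WHERE id = 2;                             -- DELETE

CREATE VIEW greetings_view AS
    SELECT id, message FROM greetings_copy;                          -- CREATE VIEW

-- File-based targets and sources
COPY greetings_copy TO 'greetings_copy.parquet' (FORMAT parquet);    -- COPY TO
SELECT count(*) FROM read_parquet('greetings_copy.parquet');         -- SELECT from a file

DROP VIEW greetings_view;                                            -- DROP
```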

Configuration (via SET statements):

  • duck_lineage_url — OpenLineage backend endpoint
  • duck_lineage_namespace — default dataset namespace
  • duck_lineage_api_key — authentication key
  • duck_lineage_debug — enable debug logging
  • duck_lineage_max_retries — retry attempts for failed HTTP requests (default: 3)
  • duck_lineage_max_queue_size — max queued events before dropping (default: 10000)
  • duck_lineage_timeout — HTTP request timeout in seconds (default: 10)
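
A minimal configuration sketch using the settings above. The namespace and API key values are placeholders chosen for illustration, and the numeric values are arbitrary overrides of the documented defaults, not recommendations.

```sql
-- Endpoint of the OpenLineage backend (same URL as in the Example section)
SET duck_lineage_url = 'http://localhost:5000/api/v1/lineage';

-- Placeholder values: substitute your own namespace and key
SET duck_lineage_namespace = 'analytics_dev';
SET duck_lineage_api_key = 'replace-with-your-key';

-- Optional tuning; defaults are 3 retries, 10000 queued events, 10 seconds
SET duck_lineage_max_retries = 5;
SET duck_lineage_max_queue_size = 5000;
SET duck_lineage_timeout = 30;

-- Turn on debug logging while verifying the setup
SET duck_lineage_debug = true;
```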

Limitations:

  • Lineage captured from prepared statements is less detailed
  • No column-level lineage (dataset-level granularity only)
  • Requires an external OpenLineage-compatible backend for event storage
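
To make the first limitation concrete, the sketch below runs the same lookup once as a prepared statement and once as a plain query. The table and statement names are hypothetical. Going by the limitation above, the event for the EXECUTE call would be expected to carry less detail than the one for the plain SELECT; exactly what is omitted is not specified here.

```sql
-- Hypothetical table for illustration
CREATE TABLE greetings_ps (id INTEGER, message VARCHAR);
INSERT INTO greetings_ps VALUES (1, 'Hello'), (2, 'World');

-- Prepared statement: lineage is captured, but with less detail
PREPARE find_greeting AS
    SELECT message FROM greetings_ps WHERE id = $1;
EXECUTE find_greeting(1);

-- Equivalent plain query: full dataset-level lineage
SELECT message FROM greetings_ps WHERE id = 1;

DEALLOCATE find_greeting;
```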

Added Settings

name                         description                                        input_type  scope   aliases
duck_lineage_api_key         API Key for OpenLineage backend                    VARCHAR     GLOBAL  []
duck_lineage_debug           Enable debug logging for OpenLineage events        BOOLEAN     GLOBAL  []
duck_lineage_max_queue_size  Maximum number of events to queue before dropping  BIGINT      GLOBAL  []
duck_lineage_max_retries     Maximum retry attempts for failed HTTP requests    BIGINT      GLOBAL  []
duck_lineage_namespace       Namespace for OpenLineage events                   VARCHAR     GLOBAL  []
duck_lineage_timeout         HTTP request timeout in seconds                    BIGINT      GLOBAL  []
duck_lineage_url             URL of the OpenLineage backend                     VARCHAR     GLOBAL  []
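
Once the extension is loaded, the current values of these settings can be inspected with DuckDB's built-in duckdb_settings() table function. This is a small sketch using only standard DuckDB features; it does not rely on anything specific to duck_lineage beyond the setting names listed above.

```sql
-- List the extension's settings with their current values
SELECT name, value, input_type, scope
FROM duckdb_settings()
WHERE starts_with(name, 'duck_lineage_')
ORDER BY name;

-- Restore a single setting to its default
RESET duck_lineage_debug;
```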