This is a practical, step-by-step guide to using DuckDB’s MERGE statement (introduced in v1.4.0) to perform upserts and build Slowly Changing Dimension Type 2 (SCD Type 2) tables. Type 2 SCDs let you keep full historical versions of records while clearly identifying the current version, perfect for audit trails, data warehousing, and analytical workloads. Type 2 SCDs are practical when you want to know previous values of your primary key data, when it changed and for how long it was in a particular state.
Why Use MERGE in DuckDB?
- Single SQL statement for
INSERT,UPDATE, and softDELETE(upsert and expire). - Much cleaner and faster than equivalent Python/Pandas logic.
- Full history tracking without hard deletes.
- Works directly on Parquet, CSV, databases, thanks to DuckDB's connectivity!
Prerequisites
- DuckDB ≥ 1.4.0
- Basic SQL knowledge
Key Terminology
| Term | Meaning |
|---|---|
| Target table | The main/master table you are updating (e.g., master_ducks) |
| Source table | The incoming/new data (e.g., incoming_ducks) |
| MERGE INTO | Specifies the target table |
| USING | Specifies the source table/query |
| ON | Join condition (usually primary/business key + current flag) |
| WHEN MATCHED | Row exists in both → typically UPDATE (or DELETE) |
| WHEN NOT MATCHED BY TARGET | New row (insert) |
| WHEN NOT MATCHED BY SOURCE | Row disappeared → soft-delete/expire old version |
| RETURNING merge_action | Optional: shows what happened to each row (INSERT/UPDATE/DELETE) |
Build an SCD Type 2 Dimension Table
We’ll track ducks and preserve history whenever their name, breed, or location changes.
DuckDB has a frontend notebook UI, this is great for managing several SQL statements and segmenting your code. The UI ships with the DuckDB CLI, so if you have the CLI installed you can use the front end. To start the notebook front end just run:
duckdb -uiand you can navigate to http://localhost:4213/ to start writing your SQL code inside of your notebooks. Just copy and paste the following code blocks to follow this guide.
Step 1: Create the Incoming (source) Table
This table represents today’s transactional data.
CREATE TABLE IF NOT EXISTS incoming_ducks (
duck_id INTEGER,
duck_name VARCHAR,
breed VARCHAR,
location VARCHAR,
begin_date DATE,
end_date DATE,
is_current BOOLEAN
);
INSERT INTO incoming_ducks VALUES
(101, 'Quackers', 'Mallard', 'Pond B', CURRENT_DATE - INTERVAL '1 day', NULL, true),
(102, 'Waddles', 'Pekin', 'Pond A', CURRENT_DATE - INTERVAL '1 day', NULL, true),
(104, 'Splash', 'Muscovy', 'Pond C', CURRENT_DATE - INTERVAL '1 day', NULL, true),
(105, 'Puddles', 'Indian Runner', 'Relocated', CURRENT_DATE - INTERVAL '1 day', NULL, true);
Step 2: Create the Master (target) Table
This table represents the type 2 SCD data, (i.e. transaction data with history).
CREATE TABLE IF NOT EXISTS master_ducks (
record_id INTEGER PRIMARY KEY,
duck_id INTEGER NOT NULL,
duck_name VARCHAR,
breed VARCHAR,
location VARCHAR,
begin_date DATE NOT NULL,
end_date DATE,
is_current BOOLEAN NOT NULL DEFAULT true
);
CREATE SEQUENCE IF NOT EXISTS duck_record_seq START 1;
INSERT INTO master_ducks VALUES
(nextval('duck_record_seq'), 101, 'Quackers', 'Mallard', 'Pond A', CURRENT_DATE - INTERVAL '2 days', NULL, true),
(nextval('duck_record_seq'), 102, 'Waddles', 'Pekin', 'Pond A', CURRENT_DATE - INTERVAL '2 days', NULL, true),
(nextval('duck_record_seq'), 103, 'Feathers', 'Rouen', 'Pond B', CURRENT_DATE - INTERVAL '2 days', NULL, true),
(nextval('duck_record_seq'), 105, 'Puddles', 'Indian Runner', 'Pond A', CURRENT_DATE - INTERVAL '2 days', NULL, true);
Step 3: Perform the Merge Statement
This statement will perform the merge, it will check for differences between the data of target and source and follow the WHEN MATCHED or WHEN NOT MATCHED logic specifed.
MERGE INTO master_ducks AS target
USING incoming_ducks AS source
ON target.duck_id = source.duck_id AND target.is_current = true
WHEN MATCHED AND (
target.duck_name <> source.duck_name OR
target.breed <> source.breed OR
target.location <> source.location
) THEN UPDATE SET
end_date = CURRENT_DATE - INTERVAL '1 day',
is_current = false
WHEN NOT MATCHED BY SOURCE AND target.is_current = true THEN UPDATE SET
end_date = CURRENT_DATE - INTERVAL '1 day',
is_current = false
WHEN NOT MATCHED BY TARGET THEN INSERT (
record_id, duck_id, duck_name, breed, location,
begin_date, end_date, is_current
) VALUES (
nextval('duck_record_seq'),
source.duck_id, source.duck_name, source.breed, source.location,
source.begin_date, source.end_date, source.is_current
)
RETURNING merge_action, *;
Step 4: Insert New Current Versions for Changed Records
This statement inserts the new current records into the master table. While it's possible to achieve the same result using the MERGE statement's RETURNING clause, this two-step approach is more straightforward and easier to understand.
INSERT INTO master_ducks (
record_id, duck_id, duck_name, breed, location,
begin_date, end_date, is_current
)
SELECT
nextval('duck_record_seq'),
source.duck_id,
source.duck_name,
source.breed,
source.location,
CURRENT_DATE AS begin_date,
NULL AS end_date,
true AS is_current
FROM incoming_ducks AS source
INNER JOIN master_ducks AS target
ON source.duck_id = target.duck_id
WHERE target.is_current = false
AND target.end_date = CURRENT_DATE - INTERVAL '1 day';
Step 5: Query The Results
The following queries can be used to examine the data resulting from the MERGE statement.
-- All history
SELECT * FROM master_ducks ORDER BY duck_id, begin_date DESC;
-- Only current records
SELECT * FROM master_ducks WHERE is_current = true;
-- Only expired historical records
SELECT * FROM master_ducks WHERE is_current = false ORDER BY duck_id, begin_date DESC;
Step 6: Examine a Single Duck
To better illustrate the concept lets examine a single duck, to drive home the value add for type 2 SCDs.
If we select from the master table after running the merge statement and the post update insert statement, we can see the individual rows for Quackers.
To view the original row of data that is historical:
SELECT * FROM master_ducks where duck_name = 'Quackers' and is_current = false;
Returns:
| record_id | duck_id | duck_name | breed | location | begin_date | end_date | is_current |
|---|---|---|---|---|---|---|---|
| 1 | 101 | Quackers | Mallard | Pond A | 2025-11-24 | 2025-11-25 | false |
Note:
- The
end dateis NOT NULL, it has the date when this ducks data was updated. - The
is_currentisfalseindicating this is a historical record. - The field that will change is
location, it is currentlyPond Aand will be updated toPond B.
To view the current row of data that is current:
SELECT * FROM master_ducks where duck_name = 'Quackers' and is_current = true;
| record_id | duck_id | duck_name | breed | location | begin_date | end_date | is_current |
|---|---|---|---|---|---|---|---|
| 10 | 101 | Quackers | Mallard | Pond B | 2025-11-26 | NULL | true |
Note:
- The
end dateis NULL, the NULL in this context indicates this is the latest record for thisduck_id. - The
is_currentistruealso indicating this is a current record. - The
locationis nowPond B.
To view all of Quackers data, which will contain both current and non-current rows:
SELECT * FROM master_ducks where duck_name = 'Quackers';
| record_id | duck_id | duck_name | breed | location | begin_date | end_date | is_current |
| 1 | 101 | Quackers | Mallard | Pond A | 2025-11-24 | 2025-11-25 | false |
| 10 | 101 | Quackers | Mallard | Pond B | 2025-11-26 | NULL | true |
Common Patterns and Variations
| Use Case | Clause to Use |
|---|---|
| Simple upsert (no history) | WHEN MATCHED THEN UPDATE and WHEN NOT MATCHED BY TARGET THEN INSERT |
| Upsert and delete missing rows | Add WHEN NOT MATCHED BY SOURCE THEN DELETE |
| Only insert new, never update | Omit WHEN MATCHED |
| Return affected rows | Add RETURNING merge_action, * |
Best Practices
- Remember that
TARGETis the master table andSOURCEis the incoming table or query. - Keep end_date NULL for current rows (makes queries faster).
- Wrap
MERGEandINSERTstatements in a transaction when needed. - Use a primary key or a surrogate key for uniqueness.
- Test with RETURNING first.