DuckDB offers several advanced SQL features and syntactic sugar to make SQL queries more concise. We refer to these colloquially as “friendly SQL”.
Several of these features are also supported in other systems while some are (currently) exclusive to DuckDB.
Clauses
- Creating tables and inserting data:
CREATE OR REPLACE TABLE
: avoidDROP TABLE IF EXISTS
statements in scripts.CREATE TABLE ... AS SELECT
(CTAS): create a new table from the output of a table without manually defining a schema.INSERT INTO ... BY NAME
: this variant of theINSERT
statement allows using column names instead of positions.INSERT OR IGNORE INTO ...
: insert the rows that do not result in a conflict due toUNIQUE
orPRIMARY KEY
constraints.INSERT OR REPLACE INTO ...
: insert the rows that do not result in a conflict due toUNIQUE
orPRIMARY KEY
constraints. For those that result in a conflict, replace the columns of the existing row to the new values of the to-be-inserted row.
- Describing tables and computing statistics:
- Making SQL clauses more compact:
FROM
-first syntax with an optionalSELECT
clause: DuckDB allows queries in the form ofFROM tbl
which selects all columns (performing aSELECT *
statement).GROUP BY ALL
: omit the group-by columns by inferring them from the list of attributes in theSELECT
clause.ORDER BY ALL
: shorthand to order on all columns (e.g., to ensure deterministic results).SELECT * EXCLUDE
: theEXCLUDE
option allows excluding specific columns from the*
expression.SELECT * REPLACE
: theREPLACE
option allows replacing specific columns with different expressions in a*
expression.UNION BY NAME
: perform theUNION
operation along the names of columns (instead of relying on positions).
- Transforming tables:
- Defining SQL-level variables:
Query Features
- Column aliases in
WHERE
,GROUP BY
, andHAVING
. (Note that column aliases cannot be used in theON
clause ofJOIN
clauses.) COLUMNS()
expression can be used to execute the same expression on multiple columns:- Reusable column aliases, e.g.:
SELECT i + 1 AS j, j + 2 AS k FROM range(0, 3) t(i)
- Advanced aggregation features for analytical (OLAP) queries:
count()
shorthand forcount(*)
Literals and Identifiers
- Case-insensitivity while maintaining case of entities in the catalog
- Deduplicating identifiers
- Underscores as digit separators in numeric literals
Data Types
Data Import
- Auto-detecting the headers and schema of CSV files
- Directly querying CSV files and Parquet files
- Loading from files using the syntax
FROM 'my.csv'
,FROM 'my.csv.gz'
,FROM 'my.parquet'
, etc. - Filename expansion (globbing), e.g.:
FROM 'my-data/part-*.parquet'
Functions and Expressions
- Dot operator for function chaining:
SELECT ('hello').upper()
- String formatters:
the
format()
function with thefmt
syntax and theprintf() function
- List comprehensions
- List slicing
- String slicing
STRUCT.*
notation- Simple
LIST
andSTRUCT
creation
Join Types
Trailing Commas
DuckDB allows trailing commas,
both when listing entities (e.g., column and table names) and when constructing LIST
items.
For example, the following query works:
SELECT
42 AS x,
['a', 'b', 'c',] AS y,
'hello world' AS z,
;
"Top-N in Group" Queries
Computing the "top-N rows in a group" ordered by some criteria is a common task in SQL that unfortunately often requires a complex query involving window functions and/or subqueries.
To aid in this, DuckDB provides the aggregate functions max(arg, n)
, min(arg, n)
, arg_max(arg, val, n)
, arg_min(arg, val, n)
, max_by(arg, val, n)
and min_by(arg, val, n)
to efficiently return the "top" n
rows in a group based on a specific column in either ascending or descending order.
For example, let's use the following table:
SELECT * FROM t1;
┌─────────┬───────┐
│ grp │ val │
│ varchar │ int32 │
├─────────┼───────┤
│ a │ 2 │
│ a │ 1 │
│ b │ 5 │
│ b │ 4 │
│ a │ 3 │
│ b │ 6 │
└─────────┴───────┘
We want to get a list of the top-3 val
values in each group grp
. The conventional way to do this is to use a window function in a subquery:
SELECT array_agg(rs.val), rs.grp
FROM
(SELECT val, grp, row_number() OVER (PARTITION BY grp ORDER BY val DESC) AS rid
FROM t1 ORDER BY val DESC) AS rs
WHERE rid < 4
GROUP BY rs.grp;
┌───────────────────┬─────────┐
│ array_agg(rs.val) │ grp │
│ int32[] │ varchar │
├───────────────────┼─────────┤
│ [3, 2, 1] │ a │
│ [6, 5, 4] │ b │
└───────────────────┴─────────┘
But in DuckDB, we can do this much more concisely (and efficiently!):
SELECT max(val, 3) FROM t1 GROUP BY grp;
┌─────────────┐
│ max(val, 3) │
│ int32[] │
├─────────────┤
│ [3, 2, 1] │
│ [6, 5, 4] │
└─────────────┘
Related Blog Posts
- “Friendlier SQL with DuckDB” blog post
- “Even Friendlier SQL with DuckDB” blog post
- “SQL Gymnastics: Bending SQL into Flexible New Shapes” blog post