Even Friendlier SQL with DuckDB
TL;DR: DuckDB continues to push the boundaries of SQL syntax to both simplify queries and make more advanced analyses possible. Highlights include dynamic column selection, queries that start with the FROM clause, function chaining, and list comprehensions. We boldly go where no SQL engine has gone before! For more details, see the documentation for friendly SQL features.
Who says that SQL should stay frozen in time, chained to a 1999 version of the specification? As a comparison, do folks remember what JavaScript felt like before Promises? Those didn’t launch until 2012! It’s clear that innovation at the programming syntax layer can have a profoundly positive impact on an entire language ecosystem.
We believe there are many valid reasons for innovation in the SQL language, among them opportunities to simplify basic queries and also to make more dynamic analyses possible. Many of these features arose from community suggestions! Please let us know your SQL pain points on Discord or GitHub and join us as we change what it feels like to write SQL!
If you have not had a chance to read the first installment in this series, please take a quick look to the prior blog post, “Friendlier SQL with DuckDB”.
The future is now
The first few enhancements in this list were included in the “Ideas for the Future” section of the prior post.
Reusable column aliases
When working with incremental calculated expressions in a select statement, traditional SQL dialects force you to either write out the full expression for each column or create a common table expression (CTE) around each step of the calculation. Now, any column alias can be reused by subsequent columns within the same select statement. Not only that, but these aliases can be used in the where and order by clauses as well.
Old way 1: Repeat yourself
SELECT
'These are the voyages of the starship Enterprise...' AS intro,
instr('These are the voyages of the starship Enterprise...', 'starship')
AS starship_loc
substr('These are the voyages of the starship Enterprise...',
instr('These are the voyages of the starship Enterprise...', 'starship')
+ len('starship') + 1) AS trimmed_intro;
Old way 2: All the CTEs
WITH intro_cte AS (
SELECT
'These are the voyages of the starship Enterprise...' AS intro
), starship_loc_cte AS (
SELECT
intro,
instr(intro, 'starship') AS starship_loc
FROM intro_cte
)
SELECT
intro,
starship_loc,
substr(intro, starship_loc + len('starship') + 1) AS trimmed_intro
FROM starship_loc_cte;
New way
SELECT
'These are the voyages of the starship Enterprise...' AS intro,
instr(intro, 'starship') AS starship_loc,
substr(intro, starship_loc + len('starship') + 1) AS trimmed_intro;
intro | starship_loc | trimmed_intro |
---|---|---|
These are the voyages of the starship Enterprise… | 30 | Enterprise… |
Dynamic column selection
Databases typically prefer strictness in column definitions and flexibility in the number of rows. This can help by enforcing data types and recording column level metadata. However, in data science workflows and elsewhere, it is very common to dynamically generate columns (for example during feature engineering).
No longer do you need to know all of your column names up front! DuckDB can select and even modify columns based on regular expression pattern matching, EXCLUDE
or REPLACE
modifiers, and even lambda functions (see the section on lambda functions below for details!).
Let’s take a look at some facts gathered about the first season of Star Trek. Using DuckDB’s httpfs
extension, we can query a csv dataset directly from GitHub. It has several columns so let’s DESCRIBE
it.
INSTALL httpfs;
LOAD httpfs;
CREATE TABLE trek_facts AS
SELECT *
FROM 'https://blobs.duckdb.org/data/Star_Trek-Season_1.csv';
DESCRIBE trek_facts;
column_name | column_type | null | key | default | extra |
---|---|---|---|---|---|
season_num | BIGINT | YES | NULL | NULL | NULL |
episode_num | BIGINT | YES | NULL | NULL | NULL |
aired_date | DATE | YES | NULL | NULL | NULL |
cnt_kirk_hookups | BIGINT | YES | NULL | NULL | NULL |
cnt_downed_redshirts | BIGINT | YES | NULL | NULL | NULL |
bool_aliens_almost_took_over_planet | BIGINT | YES | NULL | NULL | NULL |
bool_aliens_almost_took_over_enterprise | BIGINT | YES | NULL | NULL | NULL |
cnt_vulcan_nerve_pinch | BIGINT | YES | NULL | NULL | NULL |
cnt_warp_speed_orders | BIGINT | YES | NULL | NULL | NULL |
highest_warp_speed_issued | BIGINT | YES | NULL | NULL | NULL |
bool_hand_phasers_fired | BIGINT | YES | NULL | NULL | NULL |
bool_ship_phasers_fired | BIGINT | YES | NULL | NULL | NULL |
bool_ship_photon_torpedos_fired | BIGINT | YES | NULL | NULL | NULL |
cnt_transporter_pax | BIGINT | YES | NULL | NULL | NULL |
cnt_damn_it_jim_quote | BIGINT | YES | NULL | NULL | NULL |
cnt_im_givin_her_all_shes_got_quote | BIGINT | YES | NULL | NULL | NULL |
cnt_highly_illogical_quote | BIGINT | YES | NULL | NULL | NULL |
bool_enterprise_saved_the_day | BIGINT | YES | NULL | NULL | NULL |
COLUMNS() with regular expressions
The COLUMNS
expression can accept a string parameter that is a regular expression and will return all column names that match the pattern. How did warp change over the first season? Let’s examine any column name that contains the word warp
.
SELECT
episode_num,
COLUMNS('.*warp.*')
FROM trek_facts;
episode_num | cnt_warp_speed_orders | highest_warp_speed_issued |
---|---|---|
0 | 1 | 1 |
1 | 0 | 0 |
2 | 1 | 1 |
3 | 1 | 0 |
… | … | … |
27 | 1 | 1 |
28 | 0 | 0 |
29 | 2 | 8 |
The COLUMNS
expression can also be wrapped by other functions to apply those functions to each selected column. Let’s simplify the above query to look at the maximum values across all episodes:
SELECT
max(COLUMNS('.*warp.*'))
FROM trek_facts;
max(trek_facts.cnt_warp_speed_orders) | max(trek_facts.highest_warp_speed_issued) |
---|---|
5 | 8 |
We can also create a WHERE
clause that applies across multiple columns. All columns must match the filter criteria, which is equivalent to combining them with AND
. Which episodes had at least 2 warp speed orders and at least a warp speed level of 2?
SELECT
episode_num,
COLUMNS('.*warp.*')
FROM trek_facts
WHERE
COLUMNS('.*warp.*') >= 2;
-- cnt_warp_speed_orders >= 2
-- AND
-- highest_warp_speed_issued >= 2
episode_num | cnt_warp_speed_orders | highest_warp_speed_issued |
---|---|---|
14 | 3 | 7 |
17 | 2 | 7 |
18 | 2 | 8 |
29 | 2 | 8 |
COLUMNS() with EXCLUDE and REPLACE
Individual columns can also be either excluded or replaced prior to applying calculations on them. For example, since our dataset only includes season 1, we do not need to find the max
of that column. It would be highly illogical.
SELECT
max(COLUMNS(* EXCLUDE season_num))
FROM trek_facts;
max(trek_facts. episode_num) |
max(trek_facts. aired_date) |
max(trek_facts. cnt_kirk_hookups) |
… | max(trek_facts. bool_enterprise_saved_the_day) |
---|---|---|---|---|
29 | 1967-04-13 | 2 | … | 1 |
The REPLACE
syntax is also useful when applied to a dynamic set of columns. In this example, we want to convert the dates into timestamps prior to finding the maximum value in each column. Previously this would have required an entire subquery or CTE to pre-process just that single column!
SELECT
max(COLUMNS(* REPLACE aired_date::timestamp AS aired_date))
FROM trek_facts;
max(trek_facts. season_num) |
max(trek_facts. episode_num) |
max(aired_date := CAST(aired_date AS TIMESTAMP)) |
… | max(trek_facts. bool_enterprise_saved_the_day) |
---|---|---|---|---|
1 | 29 | 1967-04-13 00:00:00 | … | 1 |
COLUMNS() with lambda functions
The most flexible way to query a dynamic set of columns is through a lambda function. This allows for any matching criteria to be applied to the names of the columns, not just regular expressions. See more details about lambda functions below.
For example, if using the LIKE
syntax is more comfortable, we can select columns matching a LIKE
pattern rather than with a regular expression.
SELECT
episode_num,
COLUMNS(col -> col LIKE '%warp%')
FROM trek_facts
WHERE
COLUMNS(col -> col LIKE '%warp%') >= 2;
episode_num | cnt_warp_speed_orders | highest_warp_speed_issued |
---|---|---|
14 | 3 | 7 |
17 | 2 | 7 |
18 | 2 | 8 |
29 | 2 | 8 |
Automatic JSON to nested types conversion
The first installment in the series mentioned JSON dot notation references as future work. However, the team has gone even further! Instead of referring to JSON-typed columns using dot notation, JSON can now be automatically parsed into DuckDB’s native types for significantly faster performance, compression, as well as that friendly dot notation!
First, install and load the httpfs
and json
extensions if they don't come bundled with the client you are using. Then query a remote JSON file directly as if it were a table!
INSTALL httpfs;
LOAD httpfs;
INSTALL json;
LOAD json;
SELECT
starfleet[10].model AS starship
FROM 'https://raw.githubusercontent.com/vlad-saling/star-trek-ipsum/master/src/content/content.json';
starship |
---|
USS Farragut - NCC-1647 - Ship on which James Kirk served as a phaser station operator. Attacked by the Dikironium Cloud Creature, killing half the crew. ad. |
Now for some new SQL capabilities beyond the ideas from the prior post!
FROM first in SELECT statements
When building a query, the first thing you need to know is where your data is coming FROM
. Well then why is that the second clause in a SELECT
statement?? No longer! DuckDB is building SQL as it should have always been – putting the FROM
clause first! This addresses one of the longest standing complaints about SQL, and the DuckDB team implemented it in 2 days.
FROM my_table SELECT my_column;
Not only that, the SELECT
statement can be completely removed and DuckDB will assume all columns should be SELECT
ed. Taking a look at a table is now as simple as:
FROM my_table;
-- SELECT * FROM my_table
Other statements like COPY
are simplified as well.
COPY (FROM trek_facts) TO 'phaser_filled_facts.parquet';
This has an additional benefit beyond saving keystrokes and staying in a development flow state: autocomplete will have much more context when you begin to choose columns to query. Give the AI a helping hand!
Note that this syntax is completely optional, so your SELECT * FROM
keyboard shortcuts are safe, even if they are obsolete… 🙂
Function chaining
Many SQL blogs advise the use of CTEs instead of subqueries. Among other benefits, they are much more readable. Operations are compartmentalized into discrete chunks and they can be read in order top to bottom instead of forcing the reader to work their way inside out.
DuckDB enables the same interpretability improvement for every scalar function! Use the dot operator to chain functions together, just like in Python. The prior expression in the chain is used as the first argument to the subsequent function.
SELECT
('Make it so')
.upper()
.string_split(' ')
.list_aggr('string_agg','.')
.concat('.') AS im_not_messing_around_number_one;
im_not_messing_around_number_one |
---|
MAKE.IT.SO. |
Now compare that with the old way…
SELECT
concat(
list_aggr(
string_split(
upper('Make it stop'),
' '),
'string_agg','.'),
'.') AS oof;
oof |
---|
MAKE.IT.STOP. |
Union by name
DuckDB aims to blend the best of databases and dataframes. This new syntax is inspired by the concat function in Pandas. Rather than vertically stacking tables based on column position, columns are matched by name and stacked accordingly. Simply replace UNION
with UNION BY NAME
or UNION ALL
with UNION ALL BY NAME
.
For example, we had to add some new alien species proverbs in The Next Generation:
CREATE TABLE proverbs AS
SELECT
'Revenge is a dish best served cold' AS klingon_proverb
UNION ALL BY NAME
SELECT
'You will be assimilated' AS borg_proverb,
'If winning is not important, why keep score?' AS klingon_proverb;
FROM proverbs;
klingon_proverb | borg_proverb |
---|---|
Revenge is a dish best served cold | NULL |
If winning is not important, why keep score? | You will be assimilated |
This approach has additional benefits. As seen above, not only can tables with different column orders be combined, but so can tables with different numbers of columns entirely. This is helpful as schemas migrate, and is particularly useful for DuckDB’s multi-file reading capabilities.
Insert by name
Another common situation where column order is strict in SQL is when inserting data into a table. Either the columns must match the order exactly, or all of the column names must be repeated in two locations within the query.
Instead, add the keywords BY NAME
after the table name when inserting. Any subset of the columns in the table in any order can be inserted.
INSERT INTO proverbs BY NAME
SELECT 'Resistance is futile' AS borg_proverb;
SELECT * FROM proverbs;
klingon_proverb | borg_proverb |
---|---|
Revenge is a dish best served cold | NULL |
If winning is not important, why keep score? | You will be assimilated |
NULL | Resistance is futile |
Dynamic PIVOT and UNPIVOT
Historically, databases are not well-suited for pivoting operations. However, DuckDB’s PIVOT
and UNPIVOT
clauses can create or stack dynamic column names for a truly flexible pivoting capability! In addition to that flexibility, DuckDB also provides both the SQL standard syntax and a friendlier shorthand.
For example, let’s take a look at some procurement forecast data just as the Earth-Romulan war was beginning:
CREATE TABLE purchases (item VARCHAR, year INT, count INT);
INSERT INTO purchases
VALUES ('phasers', 2155, 1035),
('phasers', 2156, 25039),
('phasers', 2157, 95000),
('photon torpedoes', 2155, 255),
('photon torpedoes', 2156, 17899),
('photon torpedoes', 2157, 87492);
FROM purchases;
item | year | count |
---|---|---|
phasers | 2155 | 1035 |
phasers | 2156 | 25039 |
phasers | 2157 | 95000 |
photon torpedoes | 2155 | 255 |
photon torpedoes | 2156 | 17899 |
photon torpedoes | 2157 | 87492 |
It is easier to compare our phaser needs to our photon torpedo needs if each year’s data is visually close together. Let’s pivot this into a friendlier format! Each year should receive its own column (but each year shouldn’t need to be specified in the query!), we want to sum up the total count
, and we still want to keep a separate group (row) for each item
.
CREATE TABLE pivoted_purchases AS
PIVOT purchases
ON year
USING sum(count)
GROUP BY item;
FROM pivoted_purchases;
item | 2155 | 2156 | 2157 |
---|---|---|---|
phasers | 1035 | 25039 | 95000 |
photon torpedoes | 255 | 17899 | 87492 |
Looks like photon torpedoes went on sale…
Now imagine the reverse situation. Scotty in engineering has been visually analyzing and manually constructing his purchases forecast. He prefers things pivoted so it’s easier to read. Now you need to fit it back into the database! This war may go on for a bit, so you may need to do this again next year. Let’s write an UNPIVOT
query to return to the original format that can handle any year.
The COLUMNS
expression will use all columns except item
. After stacking, the column containing the column names from pivoted_purchases
should be renamed to year
, and the values within those columns represent the count
. The result is the same dataset as the original.
UNPIVOT pivoted_purchases
ON COLUMNS(* EXCLUDE item)
INTO
NAME year
VALUE count;
item | year | count |
---|---|---|
phasers | 2155 | 1035 |
phasers | 2156 | 25039 |
phasers | 2157 | 95000 |
photon torpedoes | 2155 | 255 |
photon torpedoes | 2156 | 17899 |
photon torpedoes | 2157 | 87492 |
More examples are included as a part of our DuckDB 0.8.0 announcement post, and the PIVOT
and UNPIVOT
documentation pages highlight more complex queries.
Stay tuned for a future post to cover what is happening behind the scenes!
List lambda functions
List lambdas allow for operations to be applied to each item in a list. These do not need to be pre-defined – they are created on the fly within the query.
In this example, a lambda function is used in combination with the list_transform
function to shorten each official ship name.
SELECT
(['Enterprise NCC-1701', 'Voyager NCC-74656', 'Discovery NCC-1031'])
.list_transform(x -> x.string_split(' ')[1]) AS short_name;
ship_name |
---|
[Enterprise, Voyager, Discovery] |
Lambdas can also be used to filter down the items in a list. The lambda returns a list of booleans, which is used by the list_filter
function to select specific items. The contains
function is using the function chaining described earlier.
SELECT
(['Enterprise NCC-1701', 'Voyager NCC-74656', 'Discovery NCC-1031'])
.list_filter(x -> x.contains('1701')) AS the_original;
the_original |
---|
[Enterprise NCC-1701] |
List comprehensions
What if there was a simple syntax to both modify and filter a list? DuckDB takes inspiration from Python’s approach to list comprehensions to dramatically simplify the above examples. List comprehensions are syntactic sugar – these queries are rewritten into lambda expressions behind the scenes!
Within brackets, first specify the transformation that is desired, then indicate which list should be iterated over, and finally include the filter criteria.
SELECT
[x.string_split(' ')[1]
FOR x IN ['Enterprise NCC-1701', 'Voyager NCC-74656', 'Discovery NCC-1031']
IF x.contains('1701')] AS ready_to_boldly_go;
ready_to_boldly_go |
---|
[Enterprise] |
Exploding struct.*
A struct in DuckDB is a set of key/value pairs. Behind the scenes, a struct is stored with a separate column for each key. As a result, it is computationally easy to explode a struct into separate columns, and now it is also syntactically simple as well! This is another example of allowing SQL to handle dynamic column names.
WITH damage_report AS (
SELECT {'gold_casualties':5, 'blue_casualties':15, 'red_casualties': 10000} AS casualties
)
FROM damage_report
SELECT
casualties.*;
gold_casualties | blue_casualties | red_casualties |
---|---|---|
5 | 15 | 10000 |
Automatic struct creation
DuckDB exposes an easy way to convert any table into a single-column struct. Instead of SELECT
ing column names, SELECT
the table name itself.
WITH officers AS (
SELECT 'Captain' AS rank, 'Jean-Luc Picard' AS name
UNION ALL
SELECT 'Lieutenant Commander', 'Data'
)
FROM officers
SELECT officers;
officers |
---|
{'rank': Captain, 'name': Jean-Luc Picard} |
{'rank': Lieutenant Commander, 'name': Data} |
Union data type
DuckDB utilizes strong typing to provide high performance and enforce data quality. However, DuckDB is also as forgiving as possible using approaches like implicit casting to avoid always having to cast between data types.
Another way DuckDB enables flexibility is the new UNION
data type. A UNION
data type allows for a single column to contain multiple types of values. This can be thought of as an “opt-in” to SQLite’s flexible data typing rules (the opposite direction of SQLite’s recently announced strict tables).
By default DuckDB will seek the common denominator of data types when combining tables together. The below query results in a VARCHAR
column:
SELECT 'The Motion Picture' AS movie UNION ALL
SELECT 2 UNION ALL
SELECT 3 UNION ALL
SELECT 4 UNION ALL
SELECT 5 UNION ALL
SELECT 6 UNION ALL
SELECT 'First Contact';
movie |
---|
varchar |
The Motion Picture |
First Contact |
6 |
5 |
4 |
3 |
2 |
However, if a UNION
type is used, each individual row retains its original data type. A UNION
is defined using key-value pairs with the key as a name and the value as the data type. This also allows the specific data types to be pulled out as individual columns:
CREATE TABLE movies (
movie UNION(num INT, name VARCHAR)
);
INSERT INTO movies VALUES
('The Motion Picture'), (2), (3), (4), (5), (6), ('First Contact');
FROM movies
SELECT
movie,
union_tag(movie) AS type,
movie.name,
movie.num;
movie | type | name | num |
---|---|---|---|
union(num integer, name varchar) | varchar | varchar | int32 |
The Motion Picture | name | The Motion Picture | |
2 | num | 2 | |
3 | num | 3 | |
4 | num | 4 | |
5 | num | 5 | |
6 | num | 6 | |
First Contact | name | First Contact |
Additional friendly features
Several other friendly features are worth mentioning and some are powerful enough to warrant their own blog posts.
DuckDB takes a nod from the describe
function in Pandas and implements a SUMMARIZE
keyword that will calculate a variety of statistics about each column in a dataset for a quick, high-level overview. Simply prepend SUMMARIZE
to any table or SELECT
statement.
Have a look at the correlated subqueries post to see how to use subqueries that refer to each others’ columns. DuckDB’s advanced optimizer improves correlated subquery performance by orders of magnitude, allowing for queries to be expressed as naturally as possible. What was once an anti-pattern for performance reasons can now be used freely!
DuckDB has added more ways to JOIN
tables together that make expressing common calculations much easier. Some like LATERAL
, ASOF
, SEMI
, and ANTI
joins are present in other systems, but have high-performance implementations in DuckDB. DuckDB also adds a new POSITIONAL
join that combines by the row numbers in each table to match the commonly used Pandas capability of joining on row number indexes. See the JOIN
documentation for details, and look out for a blog post describing DuckDB’s state of the art ASOF
joins!
Summary and future work
DuckDB aims to be the easiest database to use. Fundamental architectural decisions to be in-process, have zero dependencies, and have strong typing contribute to this goal, but the friendliness of its SQL dialect has a strong impact as well. By extending the industry-standard PostgreSQL dialect, DuckDB aims to provide the simplest way to express the data transformations you need. These changes range from altering the ancient clause order of the SELECT
statement to begin with FROM
, allowing a fundamentally new way to use functions with chaining, to advanced nested data type calculations like list comprehensions. Each of these features are available in the 0.8.1 release.
Future work for friendlier SQL includes:
- Lambda functions with more than 1 argument, like
list_zip
- Underscores as digit separators (Ex:
1_000_000
instead of1000000
) - Extension user experience, including autoloading
- Improvements to file globbing
- Your suggestions!
Please let us know what areas of SQL can be improved! We welcome your feedback on Discord or GitHub.
Live long and prosper! 🖖