DuckDB's AsOf Joins: Fuzzy Temporal Lookups
TLDR: DuckDB supports AsOf Joins – a way to match nearby values. They are especially useful for searching event tables for temporal analytics.
Do you have time series data that you want to join, but the timestamps don’t quite match? Or do you want to look up a value that changes over time using the times in another table? And did you end up writing convoluted (and slow) inequality joins to get your results? Then this post is for you!
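As a tiny taste before the full post: below is a minimal sketch of the syntax, assuming hypothetical trades and prices tables where each trade should pick up the most recent price at or before its timestamp.
SELECT t.symbol, t.trade_time, p.price
FROM trades t
ASOF JOIN prices p
ON t.symbol = p.symbol AND t.trade_time >= p.price_time;
-- each trade row is matched to the latest prices row whose price_time is <= trade_time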
Even Friendlier SQL with DuckDB
TLDR; DuckDB continues to push the boundaries of SQL syntax to both simplify queries and make more advanced analyses possible. Highlights include dynamic column selection, queries that start with the FROM clause, function chaining, and list comprehensions. We boldly go where no SQL engine has gone before!
Who says that SQL should stay frozen in time, chained to a 1999 version of the specification? As a comparison, do folks remember what JavaScript felt like before Promises? Those didn’t launch until 2012! It’s clear that innovation at the programming syntax layer can have a profoundly positive impact on an entire language ecosystem.
We believe there are many valid reasons for innovation in the SQL language, among them opportunities to simplify basic queries and also to make more dynamic analyses possible. Many of these features arose from community suggestions! Please let us know your SQL pain points on Discord or GitHub and join us as we change what it feels like to write SQL!
If you have not had a chance to read the first installment in this series, please take a quick look here.
The future is now
The first few enhancements in this list were included in the “Ideas for the Future” section of the prior post.
Reusable column aliases
When working with incremental calculated expressions in a select statement, traditional SQL dialects force you to either write out the full expression for each column or create a common table expression (CTE) around each step of the calculation. Now, any column alias can be reused by subsequent columns within the same select statement. Not only that, but these aliases can be used in the where and order by clauses as well.
Old way 1: Repeat yourself
SELECT
'These are the voyages of the starship Enterprise...' AS intro,
instr('These are the voyages of the starship Enterprise...', 'starship') AS starship_loc,
substr('These are the voyages of the starship Enterprise...', instr('These are the voyages of the starship Enterprise...', 'starship') + len('starship') + 1) AS trimmed_intro;
Old way 2: All the CTEs
WITH intro_cte AS (
SELECT
'These are the voyages of the starship Enterprise...' AS intro
), starship_loc_cte AS (
SELECT
intro,
instr(intro, 'starship') AS starship_loc
FROM intro_cte
)
SELECT
intro,
starship_loc,
substr(intro, starship_loc + len('starship') + 1) AS trimmed_intro
FROM starship_loc_cte;
New way
SELECT
'These are the voyages of the starship Enterprise...' AS intro,
instr(intro, 'starship') AS starship_loc,
substr(intro, starship_loc + len('starship') + 1) AS trimmed_intro;
intro | starship_loc | trimmed_intro |
---|---|---|
These are the voyages of the starship Enterprise… | 30 | Enterprise… |
Dynamic column selection
Databases typically prefer strictness in column definitions and flexibility in the number of rows. This can help by enforcing data types and recording column level metadata. However, in data science workflows and elsewhere, it is very common to dynamically generate columns (for example during feature engineering).
No longer do you need to know all of your column names up front! DuckDB can select and even modify columns based on regular expression pattern matching, EXCLUDE or REPLACE modifiers, and even lambda functions (see the section on lambda functions below for details!).
Let's take a look at some facts gathered about the first season of Star Trek. Using DuckDB's httpfs extension, we can query a CSV dataset directly from GitHub. It has several columns, so let's DESCRIBE it.
INSTALL httpfs;
LOAD httpfs;
CREATE TABLE trek_facts AS
SELECT * FROM 'https://raw.githubusercontent.com/Alex-Monahan/example_datasets/main/Star_Trek-Season_1.csv';
DESCRIBE trek_facts;
column_name | column_type | null | key | default | extra |
---|---|---|---|---|---|
season_num | BIGINT | YES | NULL | NULL | NULL |
episode_num | BIGINT | YES | NULL | NULL | NULL |
aired_date | DATE | YES | NULL | NULL | NULL |
cnt_kirk_hookups | BIGINT | YES | NULL | NULL | NULL |
cnt_downed_redshirts | BIGINT | YES | NULL | NULL | NULL |
bool_aliens_almost_took_over_planet | BIGINT | YES | NULL | NULL | NULL |
bool_aliens_almost_took_over_enterprise | BIGINT | YES | NULL | NULL | NULL |
cnt_vulcan_nerve_pinch | BIGINT | YES | NULL | NULL | NULL |
cnt_warp_speed_orders | BIGINT | YES | NULL | NULL | NULL |
highest_warp_speed_issued | BIGINT | YES | NULL | NULL | NULL |
bool_hand_phasers_fired | BIGINT | YES | NULL | NULL | NULL |
bool_ship_phasers_fired | BIGINT | YES | NULL | NULL | NULL |
bool_ship_photon_torpedos_fired | BIGINT | YES | NULL | NULL | NULL |
cnt_transporter_pax | BIGINT | YES | NULL | NULL | NULL |
cnt_damn_it_jim_quote | BIGINT | YES | NULL | NULL | NULL |
cnt_im_givin_her_all_shes_got_quote | BIGINT | YES | NULL | NULL | NULL |
cnt_highly_illogical_quote | BIGINT | YES | NULL | NULL | NULL |
bool_enterprise_saved_the_day | BIGINT | YES | NULL | NULL | NULL |
COLUMNS() with regular expressions
The COLUMNS expression can accept a string parameter that is a regular expression and will return all column names that match the pattern. How did warp usage change over the first season? Let's examine any column name that contains the word warp.
SELECT
episode_num,
COLUMNS('.*warp.*')
FROM trek_facts;
episode_num | cnt_warp_speed_orders | highest_warp_speed_issued |
---|---|---|
0 | 1 | 1 |
1 | 0 | 0 |
2 | 1 | 1 |
3 | 1 | 0 |
… | … | … |
27 | 1 | 1 |
28 | 0 | 0 |
29 | 2 | 8 |
The COLUMNS expression can also be wrapped by other functions to apply those functions to each selected column. Let's simplify the above query to look at the maximum values across all episodes:
SELECT
MAX(COLUMNS('.*warp.*'))
FROM trek_facts;
max(trek_facts.cnt_warp_speed_orders) | max(trek_facts.highest_warp_speed_issued) |
---|---|
5 | 8 |
We can also create a WHERE clause that applies across multiple columns. All columns must match the filter criteria, which is equivalent to combining them with AND. Which episodes had at least 2 warp speed orders and a highest warp speed of at least 2?
SELECT
episode_num,
COLUMNS('.*warp.*')
FROM trek_facts
WHERE
COLUMNS('.*warp.*') >= 2;
-- cnt_warp_speed_orders >= 2
-- AND
-- highest_warp_speed_issued >= 2
episode_num | cnt_warp_speed_orders | highest_warp_speed_issued |
---|---|---|
14 | 3 | 7 |
17 | 2 | 7 |
18 | 2 | 8 |
29 | 2 | 8 |
COLUMNS() with EXCLUDE and REPLACE
Individual columns can also be either excluded or replaced prior to applying calculations on them. For example, since our dataset only includes season 1, we do not need to find the MAX of that column. It would be highly illogical.
SELECT
MAX(COLUMNS(* EXCLUDE season_num))
FROM trek_facts;
max(trek_facts.episode_num) | max(trek_facts.aired_date) | max(trek_facts.cnt_kirk_hookups) | … | max(trek_facts.bool_enterprise_saved_the_day) |
---|---|---|---|---|
29 | 1967-04-13 | 2 | … | 1 |
The REPLACE syntax is also useful when applied to a dynamic set of columns. In this example, we want to convert the dates into timestamps prior to finding the maximum value in each column. Previously this would have required an entire subquery or CTE to pre-process just that single column!
SELECT
MAX(COLUMNS(* REPLACE aired_date::timestamp AS aired_date))
FROM trek_facts;
max(trek_facts.season_num) | max(trek_facts.episode_num) | max(aired_date := CAST(aired_date AS TIMESTAMP)) | … | max(trek_facts.bool_enterprise_saved_the_day) |
---|---|---|---|---|
1 | 29 | 1967-04-13 00:00:00 | … | 1 |
COLUMNS() with lambda functions
The most flexible way to query a dynamic set of columns is through a lambda function. This allows for any matching criteria to be applied to the names of the columns, not just regular expressions. See more details about lambda functions below.
For example, if using the LIKE syntax is more comfortable, we can select columns matching a LIKE pattern rather than a regular expression.
SELECT
episode_num,
COLUMNS(col -> col LIKE '%warp%')
FROM trek_facts
WHERE
COLUMNS(col -> col LIKE '%warp%') >= 2;
episode_num | cnt_warp_speed_orders | highest_warp_speed_issued |
---|---|---|
14 | 3 | 7 |
17 | 2 | 7 |
18 | 2 | 8 |
29 | 2 | 8 |
Automatic JSON to nested types conversion
The first installment in the series mentioned JSON dot notation references as future work. However, the team has gone even further! Instead of referring to JSON-typed columns using dot notation, JSON can now be automatically parsed into DuckDB's native types, enabling significantly faster performance and better compression, as well as that friendly dot notation!
First, install and load the httpfs and json extensions if they don't come bundled with the client you are using. Then query a remote JSON file directly as if it were a table!
INSTALL httpfs;
LOAD httpfs;
INSTALL json;
LOAD json;
SELECT
starfleet[10].model AS starship
FROM 'https://raw.githubusercontent.com/vlad-saling/star-trek-ipsum/master/src/content/content.json';
starship |
---|
USS Farragut - NCC-1647 - Ship on which James Kirk served as a phaser station operator. Attacked by the Dikironium Cloud Creature, killing half the crew. ad. |
Now for some new SQL capabilities beyond the ideas from the prior post!
FROM first in SELECT statements
When building a query, the first thing you need to know is where your data is coming FROM. Well then why is that the second clause in a SELECT statement? No longer! DuckDB is building SQL as it should have always been: putting the FROM clause first! This addresses one of the longest-standing complaints about SQL, and the DuckDB team implemented it in 2 days.
FROM my_table SELECT my_column;
Not only that, the SELECT statement can be completely removed and DuckDB will assume all columns should be SELECTed. Taking a look at a table is now as simple as:
FROM my_table;
-- SELECT * FROM my_table
Other statements like COPY are simplified as well.
COPY (FROM trek_facts) TO 'phaser_filled_facts.parquet';
This has an additional benefit beyond saving keystrokes and staying in a development flow state: autocomplete will have much more context when you begin to choose columns to query. Give the AI a helping hand!
Note that this syntax is completely optional, so your SELECT * FROM keyboard shortcuts are safe, even if they are obsolete… 🙂
Function chaining
Many SQL blogs advise the use of CTEs instead of subqueries. Among other benefits, they are much more readable. Operations are compartmentalized into discrete chunks and they can be read in order top to bottom instead of forcing the reader to work their way inside out.
DuckDB enables the same interpretability improvement for every scalar function! Use the dot operator to chain functions together, just like in Python. The prior expression in the chain is used as the first argument to the subsequent function.
SELECT
('Make it so')
.UPPER()
.string_split(' ')
.list_aggr('string_agg','.')
.concat('.') AS im_not_messing_around_number_one;
im_not_messing_around_number_one |
---|
MAKE.IT.SO. |
Now compare that with the old way…
SELECT
concat(
list_aggr(
string_split(
UPPER('Make it stop'),
' '),
'string_agg','.'),
'.') AS oof;
oof |
---|
MAKE.IT.STOP. |
Union by name
DuckDB aims to blend the best of databases and dataframes. This new syntax is inspired by the concat function in Pandas. Rather than vertically stacking tables based on column position, columns are matched by name and stacked accordingly. Simply replace UNION with UNION BY NAME, or UNION ALL with UNION ALL BY NAME.
For example, we had to add some new alien species proverbs in The Next Generation:
CREATE TABLE proverbs AS
SELECT
'Revenge is a dish best served cold' AS klingon_proverb
UNION ALL BY NAME
SELECT
'You will be assimilated' AS borg_proverb,
'If winning is not important, why keep score?' AS klingon_proverb;
FROM proverbs;
klingon_proverb | borg_proverb |
---|---|
Revenge is a dish best served cold | NULL |
If winning is not important, why keep score? | You will be assimilated |
This approach has additional benefits. As seen above, not only can tables with different column orders be combined, but so can tables with different numbers of columns entirely. This is helpful as schemas migrate, and is particularly useful for DuckDB’s multi-file reading capabilities.
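As a sketch of that multi-file case (assuming a hypothetical folder of CSV files whose schemas drifted over time), the same by-name matching can be requested when reading many files at once:
SELECT *
FROM read_csv_auto('proverbs/*.csv', union_by_name = true);
-- columns are matched by name across files; columns missing from a file are filled with NULL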
Insert by name
Another common situation where column order is strict in SQL is when inserting data into a table. Either the columns must match the order exactly, or all of the column names must be repeated in two locations within the query.
Instead, add the keywords BY NAME after the table name when inserting. Any subset of the columns in the table, in any order, can be inserted.
INSERT INTO proverbs BY NAME
SELECT 'Resistance is futile' AS borg_proverb;
SELECT * FROM proverbs;
klingon_proverb | borg_proverb |
---|---|
Revenge is a dish best served cold | NULL |
If winning is not important, why keep score? | You will be assimilated |
NULL | Resistance is futile |
Dynamic PIVOT and UNPIVOT
Historically, databases have not been well-suited for pivoting operations. However, DuckDB's PIVOT and UNPIVOT clauses can create or stack dynamic column names for a truly flexible pivoting capability! In addition to that flexibility, DuckDB also provides both the SQL standard syntax and a friendlier shorthand.
For example, let’s take a look at some procurement forecast data just as the Earth-Romulan war was beginning:
CREATE TABLE purchases (item VARCHAR, year INT, count INT);
INSERT INTO purchases
VALUES ('phasers', 2155, 1035), ('phasers', 2156, 25039), ('phasers', 2157, 95000),
('photon torpedoes', 2155, 255), ('photon torpedoes', 2156, 17899), ('photon torpedoes', 2157, 87492);
FROM purchases;
item | year | count |
---|---|---|
phasers | 2155 | 1035 |
phasers | 2156 | 25039 |
phasers | 2157 | 95000 |
photon torpedoes | 2155 | 255 |
photon torpedoes | 2156 | 17899 |
photon torpedoes | 2157 | 87492 |
It is easier to compare our phaser needs to our photon torpedo needs if each year's data is visually close together. Let's pivot this into a friendlier format! Each year should receive its own column (but each year shouldn't need to be specified in the query!), we want to sum up the total count, and we still want to keep a separate group (row) for each item.
CREATE TABLE pivoted_purchases AS
PIVOT purchases
ON year
USING SUM(count)
GROUP BY item;
FROM pivoted_purchases;
item | 2155 | 2156 | 2157 |
---|---|---|---|
phasers | 1035 | 25039 | 95000 |
photon torpedoes | 255 | 17899 | 87492 |
Looks like photon torpedoes went on sale…
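For comparison, DuckDB also accepts a more SQL-standard style of PIVOT. A rough sketch of the same pivot in that style is below; note that it requires enumerating the year values explicitly rather than detecting them dynamically:
FROM purchases
PIVOT (
SUM(count)
FOR year IN (2155, 2156, 2157)
GROUP BY item
);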
Now imagine the reverse situation. Scotty in engineering has been visually analyzing and manually constructing his purchases forecast. He prefers things pivoted so it's easier to read. Now you need to fit it back into the database! This war may go on for a bit, so you may need to do this again next year. Let's write an UNPIVOT query that returns the data to the original format and can handle any year.
The COLUMNS expression will use all columns except item. After stacking, the column containing the column names from pivoted_purchases should be renamed to year, and the values within those columns represent the count. The result is the same dataset as the original.
UNPIVOT pivoted_purchases
ON COLUMNS(* EXCLUDE item)
INTO
NAME year
VALUE count;
item | year | count |
---|---|---|
phasers | 2155 | 1035 |
phasers | 2156 | 25039 |
phasers | 2157 | 95000 |
photon torpedoes | 2155 | 255 |
photon torpedoes | 2156 | 17899 |
photon torpedoes | 2157 | 87492 |
More examples are included as part of our DuckDB 0.8.0 announcement post, and the PIVOT and UNPIVOT documentation pages highlight more complex queries.
Stay tuned for a future post to cover what is happening behind the scenes!
List lambda functions
List lambdas allow for operations to be applied to each item in a list. These do not need to be pre-defined - they are created on the fly within the query.
In this example, a lambda function is used in combination with the list_transform function to shorten each official ship name.
SELECT
(['Enterprise NCC-1701', 'Voyager NCC-74656', 'Discovery NCC-1031'])
.list_transform(x -> x.string_split(' ')[1]) AS short_name;
short_name |
---|
[Enterprise, Voyager, Discovery] |
Lambdas can also be used to filter down the items in a list. The lambda returns a list of booleans, which is used by the list_filter function to select specific items. The contains function uses the function chaining described earlier.
SELECT
(['Enterprise NCC-1701', 'Voyager NCC-74656', 'Discovery NCC-1031'])
.list_filter(x -> x.contains('1701')) AS the_original;
the_original |
---|
[Enterprise NCC-1701] |
List comprehensions
What if there was a simple syntax to both modify and filter a list? DuckDB takes inspiration from Python’s approach to list comprehensions to dramatically simplify the above examples. List comprehensions are syntactic sugar - these queries are rewritten into lambda expressions behind the scenes!
Within brackets, first specify the transformation that is desired, then indicate which list should be iterated over, and finally include the filter criteria.
SELECT
[x.string_split(' ')[1]
FOR x IN ['Enterprise NCC-1701', 'Voyager NCC-74656', 'Discovery NCC-1031']
IF x.contains('1701')] AS ready_to_boldly_go;
ready_to_boldly_go |
---|
[Enterprise] |
Exploding struct.*
A struct in DuckDB is a set of key/value pairs. Behind the scenes, a struct is stored with a separate column for each key. As a result, it is computationally easy to explode a struct into separate columns, and now it is also syntactically simple as well! This is another example of allowing SQL to handle dynamic column names.
WITH damage_report AS (
SELECT {'gold_casualties':5, 'blue_casualties':15, 'red_casualties': 10000} AS casualties
)
FROM damage_report
SELECT
casualties.*;
gold_casualties | blue_casualties | red_casualties |
---|---|---|
5 | 15 | 10000 |
Automatic struct creation
DuckDB exposes an easy way to convert any table into a single-column struct. Instead of SELECTing column names, SELECT the table name itself.
WITH officers AS (
SELECT 'Captain' AS rank, 'Jean-Luc Picard' AS name
UNION ALL
SELECT 'Lieutenant Commander', 'Data'
)
FROM officers
SELECT officers;
officers |
---|
{‘rank’: Captain, ‘name’: Jean-Luc Picard} |
{‘rank’: Lieutenant Commander, ‘name’: Data} |
Union data type
DuckDB utilizes strong typing to provide high performance and enforce data quality. However, DuckDB is also as forgiving as possible, using approaches like implicit casting to avoid always having to cast between data types explicitly.
Another way DuckDB enables flexibility is the new UNION data type. A UNION data type allows for a single column to contain multiple types of values. This can be thought of as an "opt-in" to SQLite's flexible data typing rules (the opposite direction of SQLite's recently announced strict tables).
By default, DuckDB will seek the common denominator of data types when combining tables together. The below query results in a VARCHAR column:
SELECT 'The Motion Picture' AS movie UNION ALL
SELECT 2 UNION ALL
SELECT 3 UNION ALL
SELECT 4 UNION ALL
SELECT 5 UNION ALL
SELECT 6 UNION ALL
SELECT 'First Contact';
movie |
---|
varchar |
The Motion Picture |
First Contact |
6 |
5 |
4 |
3 |
2 |
However, if a UNION type is used, each individual row retains its original data type. A UNION is defined using key-value pairs with the key as a name and the value as the data type. This also allows the specific data types to be pulled out as individual columns:
CREATE TABLE movies (
movie UNION(num INT, name VARCHAR)
);
INSERT INTO movies
VALUES ('The Motion Picture'), (2), (3), (4), (5), (6), ('First Contact');
FROM movies
SELECT
movie,
union_tag(movie) AS type,
movie.name,
movie.num;
movie | type | name | num |
---|---|---|---|
union(num integer, name varchar) | varchar | varchar | int32 |
The Motion Picture | name | The Motion Picture | |
2 | num | 2 | |
3 | num | 3 | |
4 | num | 4 | |
5 | num | 5 | |
6 | num | 6 | |
First Contact | name | First Contact |
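Because the tag is exposed through union_tag, it can also be used to filter rows down to a single member type. A small sketch that keeps only the numbered movies:
SELECT movie.num
FROM movies
WHERE union_tag(movie) = 'num';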
Additional friendly features
Several other friendly features are worth mentioning and some are powerful enough to warrant their own blog posts.
DuckDB takes a nod from the describe function in Pandas and implements a SUMMARIZE keyword that will calculate a variety of statistics about each column in a dataset for a quick, high-level overview. Simply prepend SUMMARIZE to any table or SELECT statement.
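As a quick sketch using the table from earlier in this post, both of the following forms should work:
SUMMARIZE trek_facts;
SUMMARIZE SELECT episode_num, cnt_warp_speed_orders FROM trek_facts;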
Have a look at the correlated subqueries post to see how to use subqueries that refer to each others’ columns. DuckDB’s advanced optimizer improves correlated subquery performance by orders of magnitude, allowing for queries to be expressed as naturally as possible. What was once an anti-pattern for performance reasons can now be used freely!
DuckDB has added more ways to JOIN tables together that make expressing common calculations much easier. Some, like LATERAL, ASOF, SEMI, and ANTI joins, are present in other systems, but have high-performance implementations in DuckDB. DuckDB also adds a new POSITIONAL join that combines tables by the row numbers in each table, matching the commonly used Pandas capability of joining on row number indexes. See the JOIN documentation for details, and look out for a blog post describing DuckDB's state-of-the-art ASOF joins!
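As a small illustrative sketch of the POSITIONAL join (using throwaway example tables rather than anything from this post), rows are simply paired up by their position:
CREATE TABLE ranks (officer_rank VARCHAR);
CREATE TABLE names (officer_name VARCHAR);
INSERT INTO ranks VALUES ('Captain'), ('Commander'), ('Lieutenant');
INSERT INTO names VALUES ('Kirk'), ('Spock'), ('Sulu');
SELECT * FROM ranks POSITIONAL JOIN names;
-- row 1 pairs with row 1, row 2 with row 2, and so on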
Summary and future work
DuckDB aims to be the easiest database to use. Fundamental architectural decisions to be in-process, have zero dependencies, and have strong typing contribute to this goal, but the friendliness of its SQL dialect has a strong impact as well. By extending the industry-standard PostgreSQL dialect, DuckDB aims to provide the simplest way to express the data transformations you need. These changes range from altering the ancient clause order of the SELECT statement to begin with FROM, to allowing a fundamentally new way to use functions with chaining, to advanced nested data type calculations like list comprehensions. Each of these features is available in the 0.8.1 release.
Future work for friendlier SQL includes:
- Lambda functions with more than 1 argument, like list_zip
- Underscores as digit separators (e.g., 1_000_000 instead of 1000000)
- Extension user experience, including autoloading
- Improvements to file globbing
- Your suggestions!
Please let us know what areas of SQL can be improved! We welcome your feedback on Discord or GitHub.
Live long and prosper! 🖖
DuckDB ADBC - Zero-Copy data transfer via Arrow Database Connectivity
TLDR: DuckDB has added support for Arrow Database Connectivity (ADBC), an API standard that enables efficient data ingestion and retrieval from database systems, similar to the Open Database Connectivity (ODBC) interface. However, unlike ODBC, ADBC specifically caters to the columnar storage model, facilitating fast data transfers between a columnar database and an external application.
Database interface standards allow developers to write application code that is independent of the underlying database management system (DBMS) being used. DuckDB has supported two standards that have gained popularity in the past few decades: the core interface of ODBC and Java Database Connectivity (JDBC). Both interfaces are designed to fully support database connectivity and management, with JDBC catering to the Java environment. With these APIs, developers can query a DBMS agnostically, retrieve query results, run prepared statements, and manage connections.
These interfaces were designed in the early 90s, when row-wise database systems reigned supreme. As a result, they were primarily intended for transferring data in a row-wise format. However, in the mid-2000s, columnar database systems started gaining a lot of traction due to their drastic performance advantages for data analysis (you can find me giving a brief demonstration of this difference at EuroPython). This means that these APIs offer no support for transferring data in a columnar format (or, in the case of ODBC, some support with a lot of added complexity). In practice, when analytical, columnar systems like DuckDB make use of these APIs, converting the data between these representation formats becomes a major bottleneck.
The figure below depicts how a developer can use these APIs to query a DuckDB database. For example, developers can submit SQL queries via the API, which then uses a DuckDB driver to internally call the proper functions. A query result is then produced in DuckDB’s internal columnar representation, and the driver takes care of transforming it to the JDBC or ODBC row-wise result format. This transformation has significant costs for rearranging and copying the data, quickly becoming a major bottleneck.
To overcome this transformation cost, ADBC has been proposed, with a generic API to support database operations while using the Apache Arrow memory format to send data in and out of the DBMS. DuckDB now supports the ADBC specification. Due to DuckDB’s zero-copy integration with the Arrow format, using ADBC as an interface is rather efficient, since there is only a small constant cost to transform DuckDB query results to the Arrow format.
The figure below depicts the query execution flow when using ADBC. Note that the main difference compared to ODBC/JDBC is that the result does not need to be transformed to a row-wise format.
Quick Tour
For our quick tour, we will illustrate an example of round-tripping data using DuckDB-ADBC via Python. Please note that DuckDB-ADBC can also be utilized with other programming languages. Specifically, you can find C++ DuckDB-ADBC examples, tests, and usage examples in the DuckDB GitHub repository. For convenience, you can also find a ready-to-run version of this tour in a Colab notebook. If you would like to see a more detailed explanation of the DuckDB-ADBC API or view a C++ example, please refer to our documentation page.
Setup
For this example, you need a dynamic library from the latest bleeding-edge version of DuckDB, along with the pyarrow and adbc-driver-manager Python packages. The ADBC driver manager is a Python package developed by Voltron Data. The driver manager is compliant with DB-API 2.0. It wraps ADBC, making its usage more straightforward. You can find the documentation of the ADBC Driver Manager here.
Note: While DuckDB is already DB-API compliant in Python, what sets ADBC apart is that you do not need a DuckDB module installed and loaded. Additionally, unlike the DB-API, ADBC does not use a row-wise data transfer format.
pip install pyarrow
pip install adbc-driver-manager
Insert Data
First, we need to include the necessary libraries that will be used in this tour. Mainly, PyArrow and the DBAPI from the ADBC Driver Manager.
import pyarrow
from adbc_driver_manager import dbapi
Next, we can create a connection via ADBC with DuckDB. This connection simply requires the path to DuckDB's driver and the entrypoint function name. DuckDB's entrypoint is duckdb_adbc_init.
By default, connections are established with an in-memory database. However, if desired, you have the option to specify the path variable and connect to a local DuckDB instance, allowing you to store the data on disk.
Note that these are the only variables in ADBC that are not DBMS agnostic; instead, they are set by the user, often through a configuration file.
con = dbapi.connect(driver="path/to/duckdb.lib", entrypoint="duckdb_adbc_init", db_kwargs={"path": "test.db"})
To insert the data, we can simply call the adbc_ingest function with a cursor from our connection. It requires the name of the table we want to perform the ingestion into and the Arrow Python object we want to ingest. This function also has two modes: append, where data is appended to an existing table, and create, where the table does not exist yet and will be created with the input data. By default, it's set to create, so we don't need to define it here.
table = pyarrow.table(
[
["Tenacious D", "Backstreet Boys", "Wu Tang Clan"],
[4, 10, 7]
],
names=["Name", "Albums"],
)
with con.cursor() as cursor:
cursor.adbc_ingest("Bands", table)
After calling adbc_ingest, the table is created in the DuckDB connection and the data is fully inserted.
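If more rows need to be added to the same table later, the mode argument of adbc_ingest can be switched to append. The snippet below is a hypothetical follow-up to the example above (the extra band is made up purely for illustration):
# Hypothetical follow-up: append additional rows to the existing Bands table
more_bands = pyarrow.table(
    [["Spice Girls"], [3]],
    names=["Name", "Albums"],
)
with con.cursor() as cursor:
    cursor.adbc_ingest("Bands", more_bands, mode="append")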
Read Data
To read data from DuckDB, one simply needs to use the execute function with a SQL query and then return the cursor's result in the desired Arrow format, such as a PyArrow Table in this example.
with con.cursor() as cursor:
cursor.execute("SELECT * FROM Bands")
cursor.fetch_arrow_table()
Benchmark ADBC vs ODBC
In our benchmark section, we aim to evaluate the differences in data reading from DuckDB via ADBC and ODBC. This benchmark was executed on an Apple M1 Max with 32GB of RAM and involves outputting and inserting the Lineitem table of TPC-H SF 1. You can find the repository with the code used to run this benchmark here.
Name | Time (s) |
---|---|
ODBC | 28.149 |
ADBC | 0.724 |
ADBC is roughly 38x faster than ODBC in this benchmark. This significant contrast results from the extra allocations and copies that exist in ODBC.
Conclusions
DuckDB now supports the ADBC standard for database connection. ADBC is particularly efficient when combined with DuckDB, thanks to its use of the Arrow zero-copy integration.
ADBC is particularly interesting because it can drastically decrease interactions between analytic systems compared to ODBC. For example, if software that already supports ODBC, such as MS Excel, were to implement ADBC, its integrations with columnar systems like DuckDB could benefit from this significant difference in performance.
DuckDB-ADBC is currently supported via the C Interface and through the Python ADBC Driver Manager. We will add more extensive tutorials for other languages to our documentation webpage. Please feel free to let us know your preferred language for interacting with DuckDB via ADBC!
As always, we are happy to hear your thoughts! Feel free to drop us an email if you have any suggestions, comments or questions!
Last but not least, if you encounter any problems using ADBC, please open an issue in DuckDB's issue tracker.
From Waddle to Flying: Quickly expanding DuckDB's functionality with Scalar Python UDFs
TLDR: DuckDB now supports vectorized Scalar Python User Defined Functions (UDFs). By implementing Python UDFs, users can easily expand the functionality of DuckDB while taking advantage of DuckDB’s fast execution model, SQL and data safety.
User Defined Functions (UDFs) enable users to extend the functionality of a Database Management System (DBMS) to perform domain-specific tasks that are not implemented as built-in functions. For instance, users who frequently need to export private data can benefit from an anonymization function that masks the local part of an email while preserving the domain. Ideally, this function would be executed directly in the DBMS. This approach offers several advantages:
1) Performance. The function could be executed using the same execution model (e.g., streaming results, beyond-memory/out-of-core execution) of the DBMS, and without any unnecessary transformations.
2) Easy Use. UDFs can be seamlessly integrated into SQL queries, allowing users to leverage the power of SQL to call the functions. This eliminates the need for passing data through a separate database connector and executing external code. The functions can be utilized in various SQL contexts (e.g., subqueries, join conditions).
3) Safety. The sensitive data never leaves the DBMS process.
There are two main reasons users often refrain from implementing UDFs.
1) Security concerns. Since UDFs are custom code created by users and executed within the DBMS process, there is a potential risk of crashing the server. However, when it comes to DuckDB, an embedded database, this concern is mitigated, since each analyst runs their own DuckDB process separately. Therefore, the impact on server stability is not a significant worry.
2) Difficulty of implementation. High-performance UDFs are typically only supported in low-level languages, while UDFs in higher-level languages like Python incur significant performance costs. Consequently, many users cannot quickly implement their UDFs without investing a significant amount of time in learning a low-level language and understanding the internal details of the DBMS.
DuckDB followed a similar approach. As a DBMS tailored for analytical tasks, DuckDB treats performance as a key consideration, which led to its core being implemented in C++. Consequently, the initial focus of extensibility efforts was centered around C++. However, this duck is not limited to just waddling; it can also fly. So we are delighted to announce the recent addition of Scalar Python UDFs to DuckDB.
DuckDB provides support for two distinct types of Python UDFs, differing in the Python object used for communication between DuckDB’s native data types and the Python process. These communication layers include support for Python built-in types and PyArrow Tables.
The two approaches exhibit two key differences:
1) Zero-Copy. PyArrow Tables leverage our zero-copy integration with Arrow, enabling efficient translation of data types to Python-Land with zero-copy cost.
2) Vectorization. PyArrow Table functions operate on a chunk level, processing chunks of data containing up to 2048 rows. This approach maximizes cache locality and leverages vectorization. On the other hand, the built-in types UDF implementation operates on a per-row basis.
This blog post aims to demonstrate how you can extend DuckDB using Python UDFs, with a particular emphasis on PyArrow-powered UDFs. In our quick-tour section, we will provide examples using the PyArrow UDF types. For those interested in benchmarks, you can jump ahead to the benchmark section below. If you want to see a detailed description of the Python UDF API, please refer to our documentation.
Python UDFs
This section depicts several practical examples of using Python UDFs. Each example uses a different type of Python UDF.
Quick-Tour
To demonstrate the usage of Python UDFs in DuckDB, let's consider the following example. We have a dictionary called world_cup_titles that maps countries to the number of World Cups they have won. We want to create a Python UDF that takes a country name as input, searches for the corresponding value in the dictionary, and returns the number of World Cups won by that country. If the country is not found in the dictionary, the UDF will return NULL.
Here’s an example implementation:
import duckdb
from duckdb.typing import *
con = duckdb.connect()
# Dictionary that maps countries and world cups they won
world_cup_titles = {
"Brazil": 5,
"Germany": 4,
"Italy": 4,
"Argentina": 2,
"Uruguay": 2,
"France": 2,
"England": 1,
"Spain": 1
}
# Function that will be registered as an UDF, simply does a lookup in the python dictionary
def world_cups(x):
return world_cup_titles.get(x)
# We register the function
con.create_function("wc_titles", world_cups, [VARCHAR], INTEGER)
That’s it, the function is then registered and ready to be called through SQL.
# Let's create an example countries table with the countries we are interested in using
con.execute("CREATE TABLE countries(country VARCHAR)")
con.execute("INSERT INTO countries VALUES ('Brazil'), ('Germany'), ('Italy'), ('Argentina'), ('Uruguay'), ('France'), ('England'), ('Spain'), ('Netherlands')")
# We can simply call the function through SQL, and even use the function return to eliminate the countries that never won a world cup
con.sql("SELECT country, wc_titles(country) as world_cups from countries").fetchall()
# [('Brazil', 5), ('Germany', 4), ('Italy', 4), ('Argentina', 2), ('Uruguay', 2), ('France', 2), ('England', 1), ('Spain', 1), ('Netherlands', None)]
Generating Fake Data with Faker (Built-In Type UDF)
Here is an example that demonstrates the usage of the Faker library to generate a scalar function in DuckDB, which returns randomly generated dates. The function, named random_date, does not require any inputs and outputs a DATE column. Since Faker utilizes built-in Python types, the function directly returns them.
One important thing to notice is that a function that is not deterministic based on its input must be marked as having side_effects.
import duckdb
# By importing duckdb.typing we can specify DuckDB Types directly without using strings
from duckdb.typing import *
from faker import Faker
# Our Python UDF generates a random date every time it's called
def random_date():
fake = Faker()
return fake.date_between()
We then have to register the Python function in DuckDB using create_function. Since our function doesn't require any inputs, we can pass an empty list as the argument_type_list. As the function returns a date, we specify DATE from duckdb.typing as the return_type. Note that since our random_date() function returns a built-in Python type (datetime.date), we don't need to specify the UDF type.
# To exemplify the effect of side-effect, let's first run the function without marking it.
duckdb.create_function('random_date', random_date, [], DATE)
# After registration, we can use the function directly via SQL
# Notice that without side_effect=True, it's not guaranteed that the function will be re-evaluated.
res = duckdb.sql('select random_date() from range (3)').fetchall()
# [(datetime.date(2003, 8, 3),), (datetime.date(2003, 8, 3),), (datetime.date(2003, 8, 3),)]
# Now let's re-add the function with side-effects marked as true.
duckdb.remove_function('random_date')
duckdb.create_function('random_date', random_date, [], DATE, side_effects=True)
res = duckdb.sql('select random_date() from range (3)').fetchall()
# [(datetime.date(2020, 11, 29),), (datetime.date(2009, 5, 18),), (datetime.date(2018, 5, 24),)]
Swap String Case (PyArrow Type UDF)
One issue with using built-in types is that you don’t benefit from zero-copy, vectorization and cache locality. Using PyArrow as a UDF type should be favored to leverage these optimizations.
To demonstrate a PyArrow function, let's consider a simple example where we want to transform lowercase characters to uppercase and uppercase characters to lowercase. Fortunately, PyArrow already has a function for this in its compute engine, and it's as simple as calling pc.utf8_swapcase(x).
import duckdb
# By importing duckdb.typing we can specify DuckDB Types directly without using strings
from duckdb.typing import *
import pyarrow as pa
import pyarrow.compute as pc
def swap_case(x):
# Swap the case of the 'column' using utf8_swapcase and return the result
return pc.utf8_swapcase(x)
con = duckdb.connect()
# To register the function, we must define its type to be 'arrow'
con.create_function('swap_case', swap_case, [VARCHAR], VARCHAR, type='arrow')
res = con.sql("select swap_case('PEDRO HOLANDA')").fetchall()
# [('pedro holanda',)]
Predicting Taxi Fare costs (Ibis + PyArrow UDF)
Python UDFs offer significant power as they enable users to leverage the extensive Python ecosystem and tools, including libraries like PyTorch and Tensorflow that efficiently implement machine learning operations.
Additionally the Ibis project offers a DataFrame API with great DuckDB integration and supports both of DuckDB’s native Python and PyArrow UDFs.
In this example, we demonstrate the usage of a pre-built PyTorch model to estimate taxi fare costs based on the traveled distance. You can find a complete example in this blog post by the Ibis team.
import torch
import pyarrow as pa
import ibis
import ibis.expr.datatypes as dt
from ibis.expr.operations import udf
# The code to generate the model is not specified in this snippet, please refer to the provided link for more information
model = ...
# Function that uses the model and a traveled distance input tensor to predict values, please refer to the provided link for more information
def predict_linear_regression(model, tensor: torch.Tensor) -> torch.Tensor:
...
# Indicate to ibis that this is a scalar user-defined function whose input format is pyarrow
@udf.scalar.pyarrow
def predict_fare(x: dt.float64) -> dt.float32:
# `x` is a pyarrow.ChunkedArray; the `dt.float64` annotation indicates the element type of the ChunkedArray.
# Transform the data from PyArrow to the required torch tensor format and dimension.
tensor = torch.from_numpy(x.to_numpy()[:, None]).float()
# Call the actual prediction function, which also returns a torch tensor.
predicted = predict_linear_regression(model, tensor).ravel()
return pa.array(predicted.numpy())
# Execute a query on the NYC Taxi parquet file to showcase our model's predictions, the actual fare amount, and the distance.
expr = (
ibis.read_parquet('yellow_tripdata_2016-02.parquet')
.mutate(
"fare_amount",
"trip_distance",
predicted_fare=lambda t: predict_fare(t.trip_distance),
)
)
df = expr.execute()
By utilizing Python UDFs in DuckDB with Ibis, you can seamlessly incorporate machine learning models and perform predictions directly within your Ibis code and SQL queries. The example demonstrates how to predict taxi fare costs based on distance using a PyTorch model, showcasing the integration of machine learning capabilities within DuckDB’s SQL environment driven by Ibis.
Benchmarks
In this section, we perform simple benchmark comparisons to demonstrate the performance differences between the two types of Python UDFs. The benchmarks measure execution time and peak memory consumption. Each benchmark is executed 5 times, and the median value is reported. The benchmarks are conducted on an Apple M1 Mac with 16GB of RAM.
Built-In Python Vs PyArrow
To benchmark these UDF types, we create UDFs that take an integral column as input, add one to each value, and return the result. The code used for this benchmark section can be found here.
import pyarrow.compute as pc
import duckdb
import pyarrow as pa
# Built-In UDF
def add_built_in_type(x):
return x + 1
#Arrow UDF
def add_arrow_type(x):
return pc.add(x,1)
con = duckdb.connect()
# Registration
con.create_function('built_in_types', add_built_in_type, ['BIGINT'], 'BIGINT', type='native')
con.create_function('add_arrow_type', add_arrow_type, ['BIGINT'], 'BIGINT', type='arrow')
# Integer View with 10,000,000 elements.
con.sql("""
select
i
from range(10000000) tbl(i);
""").to_view("numbers")
# Calls for both UDFs
native_res = con.sql("select sum(add_built_in_type(i)) from numbers").fetchall()
arrow_res = con.sql("select sum(add_arrow_type(i)) from numbers").fetchall()
Name | Time (s) |
---|---|
Built-In | 5.37 |
PyArrow | 0.35 |
We can observe a performance difference of more than one order of magnitude between the two UDFs. The difference in performance is primarily due to three factors:
1) In Python, object construction and general use is rather slow. This is due to several reasons, including automatic memory management, interpretation, and dynamic typing.
2) The PyArrow UDF does not require any data copying.
3) The PyArrow UDF is executed in a vectorized fashion, processing chunks of data instead of individual rows.
Python UDFs Vs External Functions
Here we compare the usage of a Python UDF with an external function. In this case, we have a function that calculates the sum of the lengths of all strings in a column. You can find the code used for this benchmark section here.
import duckdb
import pyarrow as pa
# Function used in UDF
def string_length_arrow(x):
tuples = len(x)
values = [len(i.as_py()) if i.as_py() is not None else 0 for i in x]
array = pa.array(values, type=pa.int32(), size=tuples)
return array
# Same Function but external to the database
def exec_external(con):
arrow_table = con.sql("select i from strings tbl(i)").arrow()
arrow_column = arrow_table['i']
tuples = len(arrow_column)
values = [len(i.as_py()) if i.as_py() is not None else 0 for i in arrow_column]
array = pa.array(values, type=pa.int32(), size=tuples)
arrow_tbl = pa.Table.from_arrays([array], names=['i'])
return con.sql("select sum(i) from arrow_tbl").fetchall()
con = duckdb.connect()
con.create_function('strlen_arrow', string_length_arrow, ['VARCHAR'], int, type='arrow')
con.sql("""
select
case when i != 0 and i % 42 = 0
then
NULL
else
repeat(chr((65 + (i % 26))::INTEGER), (4 + (i % 12))) end
from range(10000000) tbl(i);
""").to_view("strings")
con.sql("select sum(strlen_arrow(i)) from strings tbl(i)").fetchall()
exec_external(con)
Name | Time (s) | Peak Memory Consumption (MB) |
---|---|---|
External | 5.65 | 584.032 |
UDF | 5.63 | 112.848 |
Here we can see that there is no significant regression in performance when utilizing UDFs. However, you still have the benefits of safer execution and the utilization of SQL. In our example, we can also notice that the external function materializes the entire query, resulting in a 5x higher peak memory consumption compared to the UDF approach.
Conclusions and Further Development
Scalar Python UDFs are now supported in DuckDB, marking a significant milestone in extending the functionality of the database. This enhancement empowers users to perform complex computations using a high-level language. Additionally, Python UDFs can leverage DuckDB’s zero-copy integration with Arrow, eliminating data transfer costs and ensuring efficient query execution.
While the introduction of Python UDFs is a major step forward, our work in this area is ongoing. Our roadmap includes the following focus areas:
- Aggregate/Table-Producing UDFs: Currently, users can create Scalar UDFs, but we are actively working on supporting Aggregation Functions (which perform calculations on a set of values and return a single result) and Table-Producing Functions (which return tables without limitations on the number of columns and rows).
- Types: Scalar Python UDFs currently support most DuckDB types, with the exception of ENUM types and BIT types. We are working towards expanding the type support to ensure comprehensive functionality.
As always, we are happy to hear your thoughts! Feel free to drop us an email if you have any suggestions, comments or questions.
Last but not least, if you encounter any problems using our Python UDFs, please open an issue in DuckDB’s issue tracker.
Correlated Subqueries in SQL
Subqueries in SQL are a powerful abstraction that allow simple queries to be used as composable building blocks. They allow you to break down complex problems into smaller parts, and subsequently make it easier to write, understand and maintain large and complex queries.
DuckDB uses a state-of-the-art subquery decorrelation optimizer that allows subqueries to be executed very efficiently. As a result, users can freely use subqueries to create expressive queries without having to worry about manually rewriting subqueries into joins. For more information, skip to the Performance section.
Types of Subqueries
SQL subqueries exist in two main forms: subqueries as expressions and subqueries as tables. Subqueries that are used as expressions can be used in the SELECT or WHERE clauses. Subqueries that are used as tables can be used in the FROM clause. In this blog post we will focus on subqueries used as expressions. A future blog post will discuss subqueries as tables.
Subqueries as expressions exist in three forms:
- Scalar subqueries
- EXISTS
- IN / ANY / ALL
All of the subqueries can be either correlated or uncorrelated. An uncorrelated subquery is a query that is independent from the outer query. A correlated subquery is a subquery that contains expressions from the outer query. Correlated subqueries can be seen as parameterized subqueries.
Uncorrelated Scalar Subqueries
Uncorrelated scalar subqueries can only return a single value. That constant value is then substituted and used in the query. As an example of why this is useful - imagine that we want to select all of the shortest flights in our dataset. We could run the following query to obtain the shortest flight distance:
SELECT MIN(distance)
FROM ontime;
min(distance) |
---|
31.0 |
We could manually take this distance and use it in the WHERE clause to obtain all flights on this route.
SELECT uniquecarrier, origincityname, destcityname, flightdate
FROM ontime
WHERE distance=31.0;
uniquecarrier | origincityname | destcityname | flightdate |
---|---|---|---|
AS | Petersburg, AK | Wrangell, AK | 2017-01-15 |
AS | Wrangell, AK | Petersburg, AK | 2017-01-15 |
AS | Petersburg, AK | Wrangell, AK | 2017-01-16 |
However - this requires us to hardcode the constant inside the query. By using the first query as a subquery we can compute the minimum distance as part of the query.
SELECT uniquecarrier, origincityname, destcityname, flightdate
FROM ontime
WHERE distance=(
SELECT MIN(distance)
FROM ontime
);
Correlated Scalar Subqueries
While uncorrelated subqueries are powerful, they come with a hard restriction: only a single value can be returned. Often, what we want to do is parameterize the query, so that we can return different values per row.
For example, suppose that we want to find all of the shortest flights for each carrier. We can find the shortest flight for a specific carrier using the following parameterized query:
PREPARE min_distance_per_carrier AS
SELECT MIN(distance)
FROM ontime
WHERE uniquecarrier=?;
We can execute this prepared statement to obtain the minimum distance for a specific carrier.
EXECUTE min_distance_per_carrier('UA');
min(distance) |
---|
67.0 |
If we want to use this parameterized query as a subquery, we need to use a correlated subquery. Correlated subqueries allow us to use parameterized queries as scalar subqueries by referencing columns from the outer query. We can obtain the set of shortest flights per carrier using the following query:
SELECT uniquecarrier, origincityname, destcityname, flightdate, distance
FROM ontime AS ontime_outer
WHERE distance=(
SELECT MIN(distance)
FROM ontime
WHERE uniquecarrier=ontime_outer.uniquecarrier
);
uniquecarrier | origincityname | destcityname | flightdate | distance |
---|---|---|---|---|
AS | Wrangell, AK | Petersburg, AK | 2017-01-01 | 31.0 |
NK | Fort Lauderdale, FL | Orlando, FL | 2017-01-01 | 177.0 |
VX | Las Vegas, NV | Los Angeles, CA | 2017-01-01 | 236.0 |
Notice how the column from the outer relation (ontime_outer) is used inside the query. This is what turns the subquery into a correlated subquery. The column from the outer relation (ontime_outer.uniquecarrier) is a parameter for the subquery. Logically, the subquery is executed once for every row that is present in ontime, where the value for the column at that row is substituted as a parameter.
In order to make it more clear that the correlated subquery is in essence a parameterized query, we can create a scalar macro that contains the query using DuckDB’s macros.
CREATE MACRO min_distance_per_carrier(param) AS (
SELECT MIN(distance)
FROM ontime
WHERE uniquecarrier=param
);
We can then use the macro in our original query as if it is a function.
SELECT uniquecarrier, origincityname, destcityname, flightdate, distance
FROM ontime AS ontime_outer
WHERE distance=min_distance_per_carrier(ontime_outer.uniquecarrier);
This gives us the same result as placing the correlated subquery inside of the query, but is cleaner as we can decompose the query into multiple segments more effectively.
EXISTS
EXISTS
can be used to check if a given subquery has any results. This is powerful when used as a correlated subquery. For example, we can use EXISTS
if we want to obtain the last flight that has been flown on each route.
We can obtain a list of all flights on a given route past a certain date using the following query:
PREPARE flights_after_date AS
SELECT uniquecarrier, origincityname, destcityname, flightdate, distance
FROM ontime
WHERE origin=? AND dest=? AND flightdate>?;
EXECUTE flights_after_date('LAX', 'JFK', DATE '2017-05-01');
uniquecarrier | origincityname | destcityname | flightdate | distance |
---|---|---|---|---|
AA | Los Angeles, CA | New York, NY | 2017-08-01 | 2475.0 |
AA | Los Angeles, CA | New York, NY | 2017-08-02 | 2475.0 |
AA | Los Angeles, CA | New York, NY | 2017-08-03 | 2475.0 |
Now in order to obtain the last flight on a route, we need to find flights for which no later flight exists.
SELECT uniquecarrier, origincityname, destcityname, flightdate, distance
FROM ontime AS ontime_outer
WHERE NOT EXISTS (
SELECT uniquecarrier, origincityname, destcityname, flightdate, distance
FROM ontime
WHERE origin=ontime_outer.origin AND dest=ontime_outer.dest AND flightdate>ontime_outer.flightdate
);
uniquecarrier | origincityname | destcityname | flightdate | distance |
---|---|---|---|---|
AA | Daytona Beach, FL | Charlotte, NC | 2017-02-27 | 416.0 |
EV | Abilene, TX | Dallas/Fort Worth, TX | 2017-02-15 | 158.0 |
EV | Dallas/Fort Worth, TX | Durango, CO | 2017-02-13 | 674.0 |
IN / ANY / ALL
IN can be used to check if a given value exists within the result returned by the subquery. For example, we can obtain a list of all carriers that have performed more than 250 000 flights in the dataset using the following query:
SELECT uniquecarrier
FROM ontime
GROUP BY uniquecarrier
HAVING COUNT(*) > 250000;
We can then use an IN clause to obtain all flights performed by those carriers.
SELECT *
FROM ontime
WHERE uniquecarrier IN (
SELECT uniquecarrier
FROM ontime
GROUP BY uniquecarrier
HAVING COUNT(*) > 250000
);
A correlated subquery can be useful here if we want to count not the total number of flights performed by each carrier, but the number of flights for the given route. We can select all flights performed by carriers that have performed at least 1000 flights on a given route using the following query.
SELECT *
FROM ontime AS ontime_outer
WHERE uniquecarrier IN (
SELECT uniquecarrier
FROM ontime
WHERE ontime.origin=ontime_outer.origin AND ontime.dest=ontime_outer.dest
GROUP BY uniquecarrier
HAVING COUNT(*) > 1000
);
ANY and ALL are generalizations of IN. IN checks if the value is present in the set returned by the subquery; this is equivalent to = ANY(...). The ANY and ALL operators can be used with other comparison operators (such as >, <, <>). The above query can be rewritten using ANY in the following form.
SELECT *
FROM ontime AS ontime_outer
WHERE uniquecarrier = ANY (
SELECT uniquecarrier
FROM ontime
WHERE ontime.origin=ontime_outer.origin AND ontime.dest=ontime_outer.dest
GROUP BY uniquecarrier
HAVING COUNT(*) > 1000
);
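ALL works the same way, but requires the comparison to hold for every value returned by the subquery. As a sketch, the earlier NOT EXISTS example (the last flight flown on each route) could also be phrased with >= ALL, keeping only flights whose date is at least as late as every other flight on the same route:
SELECT uniquecarrier, origincityname, destcityname, flightdate
FROM ontime AS ontime_outer
WHERE flightdate >= ALL (
SELECT flightdate
FROM ontime
WHERE ontime.origin=ontime_outer.origin AND ontime.dest=ontime_outer.dest
);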
Performance
Whereas scalar subqueries are logically executed once, correlated subqueries are logically executed once per row. As such, it is natural to think that correlated subqueries are very expensive and should be avoided for performance reasons.
While that is true in many SQL systems, it is not the case in DuckDB. In DuckDB, subqueries are always decorrelated. DuckDB uses a state-of-the-art subquery decorrelation algorithm as described in the Unnesting Arbitrary Queries paper. This allows all subqueries to be decorrelated and executed as a single, much more efficient, query.
In DuckDB, correlation does not imply performance degradation.
If we look at the query plan for the correlated scalar subquery using EXPLAIN, we can see that the query has been transformed into a hash aggregate followed by a hash join. This allows the query to be executed very efficiently.
EXPLAIN SELECT uniquecarrier, origincityname, destcityname, flightdate, distance
FROM ontime AS ontime_outer
WHERE distance=(
SELECT MIN(distance)
FROM ontime
WHERE uniquecarrier=ontime_outer.uniquecarrier
);
┌───────────────────────────┐
│ HASH_JOIN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ uniquecarrier = │
│ uniquecarrier ├──────────────┐
└─────────────┬─────────────┘ │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│ SEQ_SCAN ││ HASH_GROUP_BY │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ││ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ ontime ││ uniquecarrier │
└───────────────────────────┘│ min(distance) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ SEQ_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ ontime │
└───────────────────────────┘
We can see the drastic performance difference that subquery decorrelation makes when we compare the run-time of this query in DuckDB with the run-time in Postgres and SQLite. When running the above query on the ontime dataset for 2017, with roughly 4 million rows, we get the following performance results:
DuckDB | Postgres | SQLite |
---|---|---|
0.06s | >48 Hours | >48 Hours |
As Postgres and SQLite do not de-correlate the subquery, the query is not just logically, but actually executed once for every row. As a result, the subquery is executed 4 million times in those systems, which takes an immense amount of time.
In this case, it is possible to manually decorrelate the query and generate the following SQL:
SELECT ontime.uniquecarrier, origincityname, destcityname, flightdate, distance
FROM ontime
JOIN (
SELECT uniquecarrier, MIN(distance) AS min_distance
FROM ontime
GROUP BY uniquecarrier
) AS subquery
ON (ontime.uniquecarrier=subquery.uniquecarrier AND distance=min_distance);
By performing the de-correlation manually, the performance of SQLite and Postgres improves significantly. However, both systems remain over 30x slower than DuckDB.
DuckDB | Postgres | SQLite |
---|---|---|
0.06s | 1.98s | 2.81s |
Note that while it is possible to manually decorrelate certain subqueries by rewriting the SQL, it is not always possible to do so. As described in the Unnesting Arbitrary Queries paper, special join types that are not present in SQL are necessary to decorrelate arbitrary queries.
In DuckDB, these special join types will be automatically generated by the system to decorrelate all subqueries. In fact, DuckDB does not have support for executing subqueries that are not decorrelated. All subqueries will be decorrelated before DuckDB executes them.
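As a quick sanity check, you can prepend EXPLAIN to any of the correlated queries above (here the ANY variant from earlier) and observe that the resulting plan contains aggregations and joins rather than a per-row subquery execution.
EXPLAIN SELECT *
FROM ontime AS ontime_outer
WHERE uniquecarrier = ANY (
    SELECT uniquecarrier
    FROM ontime
    WHERE ontime.origin=ontime_outer.origin AND ontime.dest=ontime_outer.dest
    GROUP BY uniquecarrier
    HAVING COUNT(*) > 1000
);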
Conclusion
Subqueries are a very powerful tool that allow you to take arbitrary queries and convert them into ad-hoc functions. When used in combination with DuckDB’s powerful subquery decorrelation, they can be executed extremely efficiently, making previously intractable queries not only possible, but fast.
Announcing DuckDB 0.8.0
The DuckDB team is happy to announce the latest DuckDB release (0.8.0). This release is named “Fulvigula” after the Mottled Duck (Anas Fulvigula) native to the Gulf of Mexico.
To install the new version, please visit the installation guide. The full release notes can be found here.
10 000 Stars on GitHub
Today, DuckDB reached 10 000 stars on GitHub. We would like to pause for a second to express our gratitude to everyone who contributed to DuckDB and of course all its users. When we started working on DuckDB back in 2018, we would have never dreamt of getting this kind of adoption in such a short time.
From those brave souls who were early adopters of DuckDB back in 2019 to the many today, we are happy you’re part of our community. Thank you for your feedback, feature requests and for your enthusiasm in adopting new features and integrations. Thank you for helping each other on our Discord server or in GitHub discussions. Thank you for spreading the word, too.
We also would like to extend special thanks to the DuckDB foundation supporters, who through their generous donations keep DuckDB independent.
For us, the maintainers of DuckDB, the past few years have also been quite eventful: We spun off from the research group where DuckDB originated to a successful company with close to 20 employees and many excellent partnerships.
We are very much looking forward to what the future will hold for DuckDB. Things are looking bright!
PostGEESE? Introducing The DuckDB Spatial Extension
TL;DR: DuckDB now has an official Spatial extension to enable geospatial processing
Geospatial data has become increasingly important and prevalent in modern-day applications and data engineering workflows, with use-cases ranging from location-based services to environmental monitoring.
While there are many great and specialized tools for working with geospatial data, integrating geospatial capabilities directly into DuckDB has multiple advantages. For one, you get to operate, transform and join your geospatial data alongside your regular, unstructured or time-series data using DuckDB's rich type system and extensions like JSON
and ICU
. Secondly, spatial queries involving geometric predicates and relations translate surprisingly well to SQL, which is all about expressing relations after all! Not to mention all the other benefits provided by DuckDB such as transactional semantics, high performance multi-threaded vectorized execution and larger-than-memory data processing.
Therefore, we’re very excited to announce that DuckDB now has a Spatial extension packed with features easily installable from the DuckDB CLI and other DuckDB clients. Simply execute INSTALL spatial; LOAD spatial;
from within DuckDB and you’re good to go!
No, we’re not calling it GeoDuck either, that’s just gross
What’s in it?
The core of the extension is a GEOMETRY
type based on the “Simple Features” geometry model and accompanying functions such as ST_Area
and ST_Intersects
. It also provides methods for reading and writing geospatial data formats and converting between coordinate reference systems (details later in the post!). While we’re not ready to commit to full compliance with the OGC Simple Feature Access and SQL/MM Standards yet, if you’ve worked with geospatial functionality in other database systems such as PostGIS or SpatiaLite, you should feel right at home.
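As a tiny, hedged sketch of what that feels like (ST_GeomFromText is assumed to be available here; it is not shown elsewhere in this post):
-- Area and intersection test on geometries built from WKT literals
SELECT ST_Area(ST_GeomFromText('POLYGON((0 0, 0 1, 1 1, 1 0, 0 0))')) AS area;
SELECT ST_Intersects(
    ST_GeomFromText('POLYGON((0 0, 0 2, 2 2, 2 0, 0 0))'),
    ST_GeomFromText('POINT(1 1)')) AS intersects;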
Most of the implemented functions are based on the trifecta of foundational geospatial libraries, GEOS, GDAL and PROJ, which provide algorithms, format conversions and coordinate reference system transformations respectively. In particular, we leverage GDAL to provide a set of table and copy functions that enable import and export of tables from and to 50+ different geospatial data formats (so far!), including the most common ones such as Shapefiles, GeoJSON, GeoPackage, KML, GML, WKT, WKB, etc.
Check for yourself by running:
SELECT * FROM st_drivers();
short_name | long_name | can_create | can_copy | can_open | help_url |
---|---|---|---|---|---|
ESRI Shapefile | ESRI Shapefile | true | false | true | https://gdal.org/drivers/vector/shapefile.html |
MapInfo File | MapInfo File | true | false | true | https://gdal.org/drivers/vector/mitab.html |
UK .NTF | UK .NTF | false | false | true | https://gdal.org/drivers/vector/ntf.html |
LVBAG | Kadaster LV BAG Extract 2.0 | false | false | true | https://gdal.org/drivers/vector/lvbag.html |
S57 | IHO S-57 (ENC) | true | false | true | https://gdal.org/drivers/vector/s57.html |
DGN | Microstation DGN | true | false | true | https://gdal.org/drivers/vector/dgn.html |
OGR_VRT | VRT - Virtual Datasource | false | false | true | https://gdal.org/drivers/vector/vrt.html |
Memory | Memory | true | false | true | |
CSV | Comma Separated Value (.csv) | true | false | true | https://gdal.org/drivers/vector/csv.html |
GML | Geography Markup Language (GML) | true | false | true | https://gdal.org/drivers/vector/gml.html |
GPX | GPX | true | false | true | https://gdal.org/drivers/vector/gpx.html |
KML | Keyhole Markup Language (KML) | true | false | true | https://gdal.org/drivers/vector/kml.html |
GeoJSON | GeoJSON | true | false | true | https://gdal.org/drivers/vector/geojson.html |
GeoJSONSeq | GeoJSON Sequence | true | false | true | https://gdal.org/drivers/vector/geojsonseq.html |
ESRIJSON | ESRIJSON | false | false | true | https://gdal.org/drivers/vector/esrijson.html |
TopoJSON | TopoJSON | false | false | true | https://gdal.org/drivers/vector/topojson.html |
OGR_GMT | GMT ASCII Vectors (.gmt) | true | false | true | https://gdal.org/drivers/vector/gmt.html |
GPKG | GeoPackage | true | true | true | https://gdal.org/drivers/vector/gpkg.html |
SQLite | SQLite / Spatialite | true | false | true | https://gdal.org/drivers/vector/sqlite.html |
WAsP | WAsP .map format | true | false | true | https://gdal.org/drivers/vector/wasp.html |
OpenFileGDB | ESRI FileGDB | true | false | true | https://gdal.org/drivers/vector/openfilegdb.html |
DXF | AutoCAD DXF | true | false | true | https://gdal.org/drivers/vector/dxf.html |
CAD | AutoCAD Driver | false | false | true | https://gdal.org/drivers/vector/cad.html |
FlatGeobuf | FlatGeobuf | true | false | true | https://gdal.org/drivers/vector/flatgeobuf.html |
Geoconcept | Geoconcept | true | false | true | |
GeoRSS | GeoRSS | true | false | true | https://gdal.org/drivers/vector/georss.html |
VFK | Czech Cadastral Exchange Data Format | false | false | true | https://gdal.org/drivers/vector/vfk.html |
PGDUMP | PostgreSQL SQL dump | true | false | false | https://gdal.org/drivers/vector/pgdump.html |
OSM | OpenStreetMap XML and PBF | false | false | true | https://gdal.org/drivers/vector/osm.html |
GPSBabel | GPSBabel | true | false | true | https://gdal.org/drivers/vector/gpsbabel.html |
WFS | OGC WFS (Web Feature Service) | false | false | true | https://gdal.org/drivers/vector/wfs.html |
OAPIF | OGC API - Features | false | false | true | https://gdal.org/drivers/vector/oapif.html |
EDIGEO | French EDIGEO exchange format | false | false | true | https://gdal.org/drivers/vector/edigeo.html |
SVG | Scalable Vector Graphics | false | false | true | https://gdal.org/drivers/vector/svg.html |
ODS | Open Document/ LibreOffice / OpenOffice Spreadsheet | true | false | true | https://gdal.org/drivers/vector/ods.html |
XLSX | MS Office Open XML spreadsheet | true | false | true | https://gdal.org/drivers/vector/xlsx.html |
Elasticsearch | Elastic Search | true | false | true | https://gdal.org/drivers/vector/elasticsearch.html |
Carto | Carto | true | false | true | https://gdal.org/drivers/vector/carto.html |
AmigoCloud | AmigoCloud | true | false | true | https://gdal.org/drivers/vector/amigocloud.html |
SXF | Storage and eXchange Format | false | false | true | https://gdal.org/drivers/vector/sxf.html |
Selafin | Selafin | true | false | true | https://gdal.org/drivers/vector/selafin.html |
JML | OpenJUMP JML | true | false | true | https://gdal.org/drivers/vector/jml.html |
PLSCENES | Planet Labs Scenes API | false | false | true | https://gdal.org/drivers/vector/plscenes.html |
CSW | OGC CSW (Catalog Service for the Web) | false | false | true | https://gdal.org/drivers/vector/csw.html |
VDV | VDV-451/VDV-452/INTREST Data Format | true | false | true | https://gdal.org/drivers/vector/vdv.html |
MVT | Mapbox Vector Tiles | true | false | true | https://gdal.org/drivers/vector/mvt.html |
NGW | NextGIS Web | true | true | true | https://gdal.org/drivers/vector/ngw.html |
MapML | MapML | true | false | true | https://gdal.org/drivers/vector/mapml.html |
TIGER | U.S. Census TIGER/Line | false | false | true | https://gdal.org/drivers/vector/tiger.html |
AVCBin | Arc/Info Binary Coverage | false | false | true | https://gdal.org/drivers/vector/avcbin.html |
AVCE00 | Arc/Info E00 (ASCII) Coverage | false | false | true | https://gdal.org/drivers/vector/avce00.html |
Initially we have prioritized providing a breadth of capabilities by wrapping existing libraries. We’re planning to implement more of the core functions and algorithms natively in the future to enable faster performance and more efficient memory management.
As an initial step in this direction, we provide a set of non-standard specialized columnar DuckDB native geometry types such as POINT_2D
, BOX_2D
, etc. that should provide better compression and faster execution in exchange for some flexibility, though support for these is still very much experimental.
Example Usage
The following demonstrates how you can use the spatial extension to read and export multiple geospatial data formats, transform geometries between different coordinate reference systems and work with spatial property and predicate functions. While this example may be slightly contrived, we want to showcase the power of the currently available features. You can find the datasets used in this example in the spatial extension repository.
Let’s start by installing and loading the spatial extension and the parquet extension. Now we can import the NYC taxi ride data provided in parquet format, as well as the accompanying taxi zone data from a shapefile, using the ST_Read
table function provided by the spatial extension. These taxi zones break NYC into polygons that represent regions, for example the Newark Airport. We then create a table for the rides and a table for the zones. Note that ST_Read
produces a table with a wkb_geometry
column that contains the geometry data encoded as a WKB (Well-Known Binary) blob, which we then convert to the GEOMETRY
type using the ST_GeomFromWKB
function.
This may all seem a bit much if you are not familiar with the geospatial ecosystem, but rest assured this is all you really need to get started. In short:
- Shapefile (.shp, .shx, .dbf) is a common format for storing geometry vector data and auxiliary metadata such as indexes and attributes.
- WKB (Well Known Binary), while not really a file format in itself, is a common binary encoding of vector geometry data, used in e.g. GeoParquet. Comes in multiple flavors, but we’re only concerned with “standard” WKB for now.
- GEOMETRY is a DuckDB type that represents a Simple Features geometry object, which is based on a set of standards modeling vector geometry data as points, linestrings, polygons or collections of such. This is the core data type used by the spatial extension, and what most of the provided functions take and return.
INSTALL spatial;
INSTALL parquet;
LOAD spatial;
LOAD parquet;
CREATE TABLE rides AS SELECT *
FROM './spatial/test/data/nyc_taxi/yellow_tripdata_2010-01-limit1mil.parquet';
-- Load the NYC taxi zone data from a shapefile using the gdal-based ST_Read function
CREATE TABLE zones AS SELECT zone, LocationId, borough, ST_GeomFromWKB(wkb_geometry) AS geom
FROM ST_Read('./spatial/test/data/nyc_taxi/taxi_zones/taxi_zones.shx');
Let’s compare the trip distance to the linear distance between the pickup and dropoff points to figure out how efficient the taxi drivers are (or how dirty the data is, since some diffs seem to be negative). We transform the coordinates from “WGS84” (given by the identifier EPSG:4326), also commonly known as simply latitude/longitude to the “NAD83 / New York Long Island ftUS” (identified as ESRI:102718) coordinate reference system which is a projection with minimal distortion around New York. We then calculate the distance using the ST_Distance
function. In this case we get the distance in feet, since we've converted the coordinates to NAD83, but we can easily convert it to miles (5280 ft/mile), which is the unit used in the rides dataset, so we can compare them correctly.
Wait, what’s all this about coordinate reference systems and projections?
The earth is not flat, but sometimes it is useful to pretend it is for the sake of simplicity by “projecting” the coordinates onto a flat surface. The “parameters” of a projection - e.g. where the “origin” is located, what unit coordinates are in, or how the earth’s shape is approximated - are encapsulated by a “Spatial Reference System” or “Coordinate Reference System” (CRS) which is usually referenced by a shorthand identifier composed of an authority and a code, e.g. “EPSG:4326” or “ESRI:102718”. Projections are always lossy, so it's important to use a CRS that is well suited for the “area of interest” your data is in. The spatial extension uses the PROJ library to handle coordinate reference systems and projections.
Trips with a distance shorter than the aerial distance are likely to be erroneous, so we use this query to filter out some bad data. The query below takes advantage of DuckDB’s ability to refer to column aliases defined within the same select statement. This is a small example of how DuckDB’s rich SQL dialect can simplify geospatial analysis.
CREATE TABLE cleaned_rides AS SELECT
ST_Point(pickup_latitude, pickup_longitude) AS pickup_point,
ST_Point(dropoff_latitude, dropoff_longitude) AS dropoff_point,
dropoff_datetime::TIMESTAMP - pickup_datetime::TIMESTAMP AS time,
trip_distance,
ST_Distance(
ST_Transform(pickup_point, 'EPSG:4326', 'ESRI:102718'),
ST_Transform(dropoff_point, 'EPSG:4326', 'ESRI:102718')) / 5280
AS aerial_distance,
trip_distance - aerial_distance AS diff
FROM rides
WHERE diff > 0
ORDER BY diff DESC;
SELECT * FROM rides LIMIT 10;
vendor_id | pickup_datetime | dropoff_datetime | passenger_count | trip_distance | pickup_longitude | pickup_latitude | rate_code | store_and_fwd_flag | dropoff_longitude | dropoff_latitude | payment_type | fare_amount | surcharge | mta_tax | tip_amount | tolls_amount | total_amount |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
VTS | 2010-01-01 00:00:17 | 2010-01-01 00:00:17 | 3 | 0.0 | -73.87105699999998 | 40.773522 | 1 | -73.871048 | 40.773545 | CAS | 45.0 | 0.0 | 0.5 | 0.0 | 0.0 | 45.5 | |
VTS | 2010-01-01 00:00:20 | 2010-01-01 00:00:20 | 1 | 0.05 | -73.97512999999998 | 40.789973 | 1 | -73.97498799999998 | 40.790598 | CAS | 2.5 | 0.5 | 0.5 | 0.0 | 0.0 | 3.5 | |
CMT | 2010-01-01 00:00:23 | 2010-01-01 00:00:25 | 1 | 0.0 | -73.999431 | 40.71216 | 1 | 0 | -73.99915799999998 | 40.712421 | No | 2.5 | 0.5 | 0.5 | 0.0 | 0.0 | 3.5 |
CMT | 2010-01-01 00:00:33 | 2010-01-01 00:00:55 | 1 | 0.0 | -73.97721699999998 | 40.749633 | 1 | 0 | -73.97732899999998 | 40.749629 | Cas | 2.5 | 0.5 | 0.5 | 0.0 | 0.0 | 3.5 |
VTS | 2010-01-01 00:01:00 | 2010-01-01 00:01:00 | 1 | 0.0 | -73.942313 | 40.784332 | 1 | -73.942313 | 40.784332 | Cre | 10.0 | 0.0 | 0.5 | 2.0 | 0.0 | 12.5 | |
VTS | 2010-01-01 00:01:06 | 2010-01-01 00:01:06 | 2 | 0.38 | -73.97463 | 40.756687 | 1 | -73.979872 | 40.759143 | CAS | 3.7 | 0.5 | 0.5 | 0.0 | 0.0 | 4.7 | |
VTS | 2010-01-01 00:01:07 | 2010-01-01 00:01:07 | 2 | 0.23 | -73.987358 | 40.718475 | 1 | -73.98518 | 40.720468 | CAS | 2.9 | 0.5 | 0.5 | 0.0 | 0.0 | 3.9 | |
CMT | 2010-01-01 00:00:02 | 2010-01-01 00:01:08 | 1 | 0.1 | -73.992807 | 40.741418 | 1 | 0 | -73.995799 | 40.742596 | No | 2.9 | 0.5 | 0.5 | 0.0 | 0.0 | 3.9 |
VTS | 2010-01-01 00:01:23 | 2010-01-01 00:01:23 | 1 | 0.6099999999999999 | -73.98003799999998 | 40.74306 | 1 | -73.974862 | 40.750387 | CAS | 3.7 | 0.5 | 0.5 | 0.0 | 0.0 | 4.7 | |
VTS | 2010-01-01 00:01:34 | 2010-01-01 00:01:34 | 1 | 0.02 | -73.954122 | 40.801173 | 1 | -73.95431499999998 | 40.800897 | CAS | 45.0 | 0.0 | 0.5 | 0.0 | 0.0 | 45.5 |
SELECT * FROM zones LIMIT 10;
zone | LocationID | borough | geom |
---|---|---|---|
Newark Airport | 1 | EWR | POLYGON (…) |
Jamaica Bay | 2 | Queens | MULTIPOLYGON (…) |
Allerton/Pelham Gardens | 3 | Bronx | POLYGON (…) |
Alphabet City | 4 | Manhattan | POLYGON (…) |
Arden Heights | 5 | Staten Island | POLYGON (…) |
Arrochar/Fort Wadsworth | 6 | Staten Island | POLYGON (…) |
Astoria | 7 | Queens | POLYGON (…) |
Astoria Park | 8 | Queens | POLYGON (…) |
Auburndale | 9 | Queens | POLYGON (…) |
Baisley Park | 10 | Queens | POLYGON (…) |
It should be noted that this is not entirely accurate since the
ST_Distance
function we use does not take into account the curvature of the earth. However, we’ll accept it as a good enough approximation for our purposes. Spherical and geodesic distance calculations are on the roadmap!
Now let’s join the taxi rides with the taxi zones to get the start and end zone for each ride. We use the ST_Within
function as our join condition to check if a pickup or dropoff point is within a taxi zone polygon. Again we need to transform the coordinates from WGS84 to NAD83, since the taxi zone data also uses that projection. Spatial joins like these are the bread and butter of geospatial data processing, but we don't currently have any optimizations in place (such as spatial indexes) to speed up these queries, which is why we only use a subset of the data for the following step.
-- Since we don't have spatial indexes yet, use a smaller dataset for the join.
DELETE FROM cleaned_rides WHERE rowid > 5000;
CREATE TABLE joined AS
SELECT
pickup_point,
dropoff_point,
start_zone.zone AS start_zone,
end_zone.zone AS end_zone,
trip_distance,
time,
FROM cleaned_rides
JOIN zones AS start_zone
ON ST_Within(ST_Transform(pickup_point, 'EPSG:4326', 'ESRI:102718'), start_zone.geom)
JOIN zones AS end_zone
ON ST_Within(ST_Transform(dropoff_point, 'EPSG:4326', 'ESRI:102718'), end_zone.geom);
SELECT * FROM joined USING SAMPLE 10 ROWS;
pickup_point | dropoff_point | start_zone | end_zone | trip_distance | time |
---|---|---|---|---|---|
POINT (40.722223 -73.98385299999998) | POINT (40.715507 -73.992438) | East Village | Lower East Side | 10.3 | 00:19:16 |
POINT (40.648687 -73.783522) | POINT (40.649567 -74.005812) | JFK Airport | Sunset Park West | 23.57 | 00:28:00 |
POINT (40.761603 -73.96661299999998) | POINT (40.760232 -73.96344499999998) | Upper East Side South | Sutton Place/Turtle Bay North | 17.6 | 00:27:05 |
POINT (40.697212 -73.937495) | POINT (40.652377 -73.93983299999998) | Stuyvesant Heights | East Flatbush/Farragut | 13.55 | 00:24:00 |
POINT (40.721462 -73.993583) | POINT (40.774205 -73.90441699999998) | Lower East Side | Steinway | 28.75 | 01:03:00 |
POINT (40.716955 -74.004328) | POINT (40.754688 -73.991612) | TriBeCa/Civic Center | Garment District | 18.4 | 00:46:12 |
POINT (40.740052 -73.994918) | POINT (40.75439 -73.98587499999998) | Flatiron | Garment District | 24.2 | 00:35:25 |
POINT (40.763017 -73.95949199999998) | POINT (40.763615 -73.959182) | Lenox Hill East | Lenox Hill West | 18.4 | 00:33:46 |
POINT (40.865663 -73.927458) | POINT (40.86537 -73.927352) | Washington Heights North | Washington Heights North | 10.47 | 00:27:00 |
POINT (40.738408 -73.980345) | POINT (40.696038 -73.955493) | Gramercy | Bedford | 16.4 | 00:21:47 |
We can export the joined table to a GeoJSONSeq
file using the GDAL copy function, passing in a GDAL layer creation option. Since GeoJSON
only supports a single GEOMETRY
per record, we use the ST_MakeLine
function to combine the pickup and dropoff points into a single line geometry. The default coordinate reference system for GeoJSON
is WGS84, but the coordinate pairs are expected to be in longitude/latitude, so we need to flip the geometry using the ST_FlipCoordinates
function.
COPY (
SELECT
ST_AsWKB(ST_FlipCoordinates(ST_MakeLine(pickup_point, dropoff_point)))
as wkb_geometry,
start_zone,
end_zone,
time::VARCHAR as trip_time
FROM joined)
TO 'joined.geojsonseq'
WITH (FORMAT GDAL, DRIVER 'GeoJSONSeq', LAYER_CREATION_OPTIONS 'WRITE_BBOX=YES');
head -n 10 joined.geojsonseq
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:52:00" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.789923, 40.643515 ], [ -73.97608, 40.680395 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:35:00" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.776445, 40.645422 ], [ -73.98427, 40.670782 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:45:42" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.776878, 40.645065 ], [ -73.992153, 40.662571 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:36:00" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.788028, 40.641508 ], [ -73.97584, 40.670927 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:47:58" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.781855, 40.644749 ], [ -73.980129, 40.663663 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:32:10" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.787494, 40.641559 ], [ -73.974694, 40.673479 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:36:59" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.790138, 40.643342 ], [ -73.982721, 40.662379 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:32:00" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.786952, 40.641248 ], [ -73.97421, 40.676237 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:33:21" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.783892, 40.648514 ], [ -73.979283, 40.669721 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:35:45" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.776643, 40.645272 ], [ -73.978873, 40.66723 ] ] } }
And there we have it! We pulled tabular data from parquet, combined it with geospatial data in a shapefile, cleaned and analyzed that combined data, and output it to a human-readable geospatial format. The full set of currently supported functions and their implementation status can be found over at the docs in this table.
What’s next?
While it’s probably going to take a while for us to catch up to the full set of functions provided by e.g. PostGIS, we believe that DuckDB’s vectorized execution model and columnar storage format will enable a whole new class of optimizations for geospatial processing that we’ve just begun exploring. Improving the performance of spatial joins and predicates is therefore high on our list of priorities.
There are also some limitations with our GEOMETRY
type that we would eventually like to tackle, such as the fact that we don't yet support additional Z and M dimensions, or the full range of geometry sub-types mandated by the OGC standard, like curves or polyhedral surfaces.
We’re also interested in supporting spherical and ellipsoidal calculations in the near future, perhaps in the form of a dedicated GEOGRAPHY
type.
WASM builds are also just around the corner!
Please take a look at the GitHub repository for the full roadmap and to see what we’re currently working on. If you would like to help build this capability, please reach out on GitHub!
Conclusion
The DuckDB Spatial extension is another step towards making DuckDB a Swiss Army knife for data engineering and analytics. This extension provides a flexible and familiar GEOMETRY
type, reprojectable between thousands of coordinate reference systems, coupled with the capability to export and import geospatial data between more than 50 different data sources. All embedded into a single extension with minimal runtime dependencies. This enables DuckDB to fit seamlessly into your existing GIS workflows regardless of which geospatial data formats or projections you’re working with.
We are excited to hear what you make of the DuckDB spatial extension. It’s still early days but we hope to have a lot more to share in the future as we continue making progress! If you have any questions, suggestions, ideas or issues, please don’t hesitate to reach out to us on Discord or GitHub!
DuckCon #3 in San Francisco
We are excited to hold our first “DuckCon” DuckDB user group meeting outside of Europe in San Francisco. The meeting will take place in the afternoon of June 29th. The meeting will be held at the San Francisco Museum of Modern Art (SFMOMA), in the Phyllis Wattis Theater.
In DuckCon #3, we (again) have a talk from DuckDB creators Hannes Mühleisen and Mark Raasveldt about the current state of DuckDB and future plans. Side note: Hannes will also speak at the Data + AI Summit earlier that day.
We are also excited to have talks by Lloyd Tabb, the creator of Google’s Malloy project and Josh Wills, the creator of the dbt-duckdb package.
In addition, we will have several lightning talks from DuckDB users.
Timetable
Time | Title | Presenter |
---|---|---|
4 PM | Welcome to DuckCon! – Setting Up | |
4:10 PM | Introductions | Hannes Mühleisen |
4:15 PM | State of the Duck | Mark Raasveldt |
4:40 PM | ‘Data is Rectangular’ and Other Common Misconceptions | Lloyd Tabb |
5:05 PM | DuckDBT: Not a database or a dbt adapter but a secret third thing | Josh Wills |
5:30 PM | Watershed: Get to zero carbon, fast | Jessica Zhu |
5:35 PM | Building a SQL editor around “fast” | Hamilton Ulmer |
5:40 PM | KalDB: Log analytics with DuckDB | Suman Karumuri |
5:45 PM | Pandas on DuckDB with Ponder | Aditya Parameswaran |
5:50 PM | Bringing AI to DuckDB with Lance columnar format for multi-modal AI | Chang She |
5:55 PM | Closing thoughts | |
6:00 PM | End of DuckCon | |
From 6 PM | MotherDuck Party, see below |
Lightning Talks
We will have a 20-minute slot for five-minute lightning talks, where our users are invited to show and tell anything DuckDB-related. Please submit your lightning talk proposal in the pre-registration form below.
Registration Process
Attendance is free. While supplies last, you can still get a ticket here. You will need to show this ticket at the entrance to attend. Please contact [email protected] if you have any questions.
MotherDuck Party at 111 Minna
Following DuckCon, MotherDuck will host a party celebrating ducks at 111 Minna (located very close to SFMOMA). DuckCon attendees are cordially invited to attend to eat, drink, and play games. MotherDuck’s Chief Duck Herder will also demo the latest work bringing DuckDB to the cloud. Please note that you will have to separately register for the MotherDuck event.
Introducing DuckDB for Swift
TL;DR: DuckDB now has a native Swift API. DuckDB on mobile here we go!
Today we’re excited to announce the DuckDB API for Swift. It enables developers on Swift platforms to harness the full power of DuckDB using a native Swift interface with support for great Swift features such as strong typing and concurrency. The API is available not only on Apple platforms, but on Linux too, opening up new opportunities for the growing Swift on Server ecosystem.
What’s included
DuckDB is designed to be fast, reliable and easy to use, and it’s this philosophy that also guided the creation of our new Swift API.
This initial release supports many of the great features of DuckDB right out of the box, including:
- Queries via DuckDB’s enhanced SQL dialect: In addition to basic SQL, DuckDB supports arbitrary and nested correlated subqueries, window functions, collations, complex types (Swift arrays and structs), and more.
- Import and export of JSON, CSV, and Parquet files: Beyond its built-in and super-efficient native file format, DuckDB supports reading in, and exporting out to, JSON, CSV, and Parquet files.
- Strongly typed result sets: DuckDB’s strongly typed result sets are a natural fit for Swift. It’s simple to cast DuckDB columns to their native Swift equivalents, ready for presentation using SwiftUI or as part of an existing TabularData workflow.
- Swift concurrency support: by virtue of their
Sendable
conformance, many of DuckDB’s core underlying types can be safely passed across concurrency contexts, easing the process of designing parallel processing workflows and ensuring responsive UIs.
Usage
To demonstrate just how well DuckDB works together with Swift, we’ve created an example project that uses raw data from NASA’s Exoplanet Archive loaded directly into DuckDB.
You’ll see how to:
- Instantiate a DuckDB in-memory Database and Connection
- Populate a DuckDB table with the contents of a remote CSV
- Query a DuckDB database and prepare the results for presentation
Finally, we’ll present our analysis with the help of Apple’s TabularData Framework and Swift Charts.
Instantiating DuckDB
DuckDB supports both file-based and in-memory databases. In this example, as we don’t intend to persist the results of our Exoplanet analysis to disk, we’ll opt for an in-memory Database.
let database = try Database(store: .inMemory)
However, we can't issue queries just yet. Much like other RDBMSs, queries must be issued through a database connection. DuckDB supports multiple connections per database. This can be useful to support parallel processing, for example. In our project, we'll need just the one connection that we'll eventually access asynchronously.
let connection = try database.connect()
Finally, we’ll create an app-specific type that we’ll use to house our database and connection and through which we’ll eventually define our app-specific queries.
import DuckDB
final class ExoplanetStore {
let database: Database
let connection: Connection
init(database: Database, connection: Connection) {
self.database = database
self.connection = connection
}
}
Populating DuckDB with a remote CSV file
One problem with our current ExoplanetStore
type is that it doesn’t yet contain any data to query. To fix that, we’ll load it with the data of every Exoplanet discovered to date from NASA’s Exoplanet Archive.
There are hundreds of configuration options for this incredible resource, but today we want each exoplanet’s name and its discovery year packaged as a CSV. Checking the docs gives us the following endpoint:
https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query=select+pl_name+,+disc_year+from+pscomppars&format=csv
Once we have our CSV downloaded locally, we can use the following SQL command to load it as a new table within our DuckDB in-memory database. DuckDB’s read_csv_auto
command automatically infers our table schema and the data is immediately available for analysis.
CREATE TABLE exoplanets AS (
SELECT * FROM read_csv_auto('downloaded_exoplanets.csv')
);
Let’s package this up as a new asynchronous factory method on our ExoplanetStore
type:
import DuckDB
import Foundation
final class ExoplanetStore {
// Factory method to create and prepare a new ExoplanetStore
static func create() async throws -> ExoplanetStore {
// Create our database and connection as described above
let database = try Database(store: .inMemory)
let connection = try database.connect()
// Download the CSV from the exoplanet archive
let (csvFileURL, _) = try await URLSession.shared.download(
from: URL(string: "https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query=select+pl_name+,+disc_year+from+pscomppars&format=csv")!)
// Issue our first query to DuckDB
try connection.execute("""
CREATE TABLE exoplanets AS (
SELECT * FROM read_csv_auto('\(csvFileURL.path)')
);
""")
// Create our pre-populated ExoplanetStore instance
return ExoplanetStore(
database: database,
connection: connection
)
}
// Let's make the initializer we defined previously
// private. This prevents anyone accidentally instantiating
// the store without having pre-loaded our Exoplanet CSV
// into the database
private init(database: Database, connection: Connection) {
...
}
}
Querying the Database
Now that the database is populated with data, it’s ready to be analyzed. Let’s create a query which we can use to plot a chart of the number of exoplanets discovered by year.
SELECT disc_year, COUNT(disc_year) AS Count
FROM exoplanets
GROUP BY disc_year
ORDER BY disc_year
Issuing the query to DuckDB from within Swift is simple. We’ll again make use of an async function from which to issue our query. This means the callee won’t be blocked while the query is executing. We’ll then cast the result columns to Swift native types using DuckDB’s ResultSet
cast(to:)
family of methods, before finally wrapping them up in a DataFrame
from the TabularData framework ready for presentation in the UI.
...
import TabularData
extension ExoplanetStore {
// Retrieves the number of exoplanets discovered by year
func groupedByDiscoveryYear() async throws -> DataFrame {
// Issue the query we described above
let result = try connection.query("""
SELECT disc_year, COUNT(disc_year) AS Count
FROM exoplanets
GROUP BY disc_year
ORDER BY disc_year
""")
// Cast our DuckDB columns to their native Swift
// equivalent types
let discoveryYearColumn = result[0].cast(to: Int.self)
let countColumn = result[1].cast(to: Int.self)
// Use our DuckDB columns to instantiate TabularData
// columns and populate a TabularData DataFrame
return DataFrame(columns: [
TabularData.Column(discoveryYearColumn)
.eraseToAnyColumn(),
TabularData.Column(countColumn)
.eraseToAnyColumn(),
])
}
}
Visualizing the Results
In just a few lines of code, our database has been created, populated and analyzed – all that’s left to do now is present the results.
And I have a feeling that we’re just getting started…
For the complete example project – including the SwiftUI views and Chart definitions used to visualize these results – clone the DuckDB Swift repo and open up the runnable app project located in Examples/SwiftUI/ExoplanetExplorer.xcodeproj
.
We encourage you to modify the code, explore the Exoplanet Archive and DuckDB, and make some discoveries of your own – interplanetary or otherwise!
Conclusion
In this article we’ve introduced the brand new Swift API for DuckDB and demonstrated how quickly you can get up and running analyzing data.
With DuckDB’s incredible performance and analysis capabilities and Swift’s vibrant eco-system and platform support, there’s never been a better time to begin exploring analytical datasets in Swift.
We can't wait to see what you do with it. Feel free to reach out on our Discord if you have any questions!
The Swift API for DuckDB is packaged using Swift Package Manager and lives in a new top-level repository available at https://github.com/duckdb/duckdb-swift.
The Return of the H2O.ai Database-like Ops Benchmark
TL;DR: We’ve resurrected the H2O.ai database-like ops benchmark with up to date libraries and plan to keep re-running it.
The H2O.ai Database-like Ops Benchmark is a well-known benchmark in the data analytics and R community. The benchmark measures the groupby and join performance of various analytical tools like data.table, polars, dplyr, clickhouse, duckdb and more. Since July 2nd 2021, the benchmark has been dormant, with no result updates or maintenance. Many of the analytical systems measured in the benchmark have since undergone substantial improvements, leaving many of the maintainers curious as to where their analytical tool ranks on the benchmark.
DuckDB has decided to give the H2O.ai benchmark new life and maintain it for the foreseeable future. One reason the DuckDB project has decided to maintain the benchmark is that DuckDB has had 10 new minor releases since the most recent published results on July 2nd, 2021. After managing to run parts of the benchmark on an r3-8xlarge AWS box, DuckDB ranked as a top performer on the benchmark. Additionally, the DuckDB project wants to demonstrate its commitment to performance by consistently comparing DuckDB with other analytical systems. While DuckDB delivers differentiated ease of use, raw performance and scalability are critically important for solving tough problems fast. Plus, just like many of our fellow data folks, we have a need for speed. Therefore, the decision was made to fork the benchmark, modernize underlying dependencies and run the benchmark on the latest versions of the included systems. You can find the repository here.
The results of the new benchmark are very interesting, but first a quick summary of the benchmark and what updates took place.
The H2O.ai Database-like Ops Benchmark
There are 5 basic grouping tests and 5 advanced grouping tests. The 10 grouping queries all focus on a combination of the following
- Low cardinality (a few big groups)
- High cardinality (lots of very small groups)
- Grouping integer types
- Grouping string types
Each query is run only twice with both results being reported. This way we can see the performance of a cold run and any effects data caching may have. The idea is to avoid reporting any potential “best” results on a hot system. Data analysts only need to run a query once to get their answer. No one drives to the store a second time to get another litre of milk faster.
The time reported is the sum of the time it takes to run all 5 queries twice.
More information about the specific queries can be found below.
The Data and Queries
The queries have not changed since the benchmark went dormant. The data is generated in a rather simple manner. Inspecting the datagen files you can see that the columns are generated with small, medium, and large groups of char and int values. Similar generation logic applies to the join data generation.
Query | SQL | Objective |
---|---|---|
groupby #1 | SELECT id1, sum(v1) AS v1 FROM tbl GROUP BY id1 |
Sum over large cardinality groups, grouped by varchar |
groupby #2 | SELECT id1, id2, sum(v1) AS v1 FROM tbl GROUP BY id1, id2 |
Sum over medium cardinality groups, grouped by varchars |
groupby #3 | SELECT id3, sum(v1) AS v1, mean(v3) AS v3 FROM tbl GROUP BY id3 |
Sum and mean over many small cardinality groups, grouped by varchar |
groupby #4 | SELECT id4, mean(v1) AS v1, mean(v2) AS v2, mean(v3) AS v3 FROM tbl GROUP BY id4 |
Mean over many large cardinality groups, grouped by integer |
groupby #5 | SELECT id6, sum(v1) AS v1, sum(v2) AS v2, sum(v3) AS v3 FROM tbl GROUP BY id6 |
Sum over many small groups, grouped by integer |
advanced groupby #1 | SELECT id4, id5, quantile_cont(v3, 0.5) AS median_v3, stddev(v3) AS sd_v3 FROM tbl GROUP BY id4, id5 |
quantile_cont over medium cardinality group, grouped by integers |
advanced groupby #2 | SELECT id3, max(v1)-min(v2) AS range_v1_v2 FROM tbl GROUP BY id3 |
Range selection over small cardinality groups, grouped by integer |
advanced groupby #3 | SELECT id6, v3 AS largest2_v3 FROM (SELECT id6, v3, row_number() OVER (PARTITION BY id6 ORDER BY v3 DESC) AS order_v3 FROM x WHERE v3 IS NOT NULL) sub_query WHERE order_v3 <= 2 |
Advanced group by query |
advanced groupby #4 | SELECT id2, id4, pow(corr(v1, v2), 2) AS r2 FROM tbl GROUP BY id2, id4 |
Arithmetic over medium sized groups, grouped by varchar, integer. |
advanced groupby #5 | SELECT id1, id2, id3, id4, id5, id6, sum(v3) AS v3, count(*) AS count FROM tbl GROUP BY id1, id2, id3, id4, id5, id6 |
Many many small groups, the number of groups is the cardinality of the dataset |
join #1 | SELECT x.*, small.id4 AS small_id4, v2 FROM x JOIN small USING (id1) |
Joining a large table (x) with a small-sized table on integer type |
join #2 | SELECT x.*, medium.id1 AS medium_id1, medium.id4 AS medium_id4, medium.id5 AS medium_id5, v2 FROM x JOIN medium USING (id2) |
Joining a large table (x) with a medium-sized table on integer type |
join #3 | SELECT x.*, medium.id1 AS medium_id1, medium.id4 AS medium_id4, medium.id5 AS medium_id5, v2 FROM x LEFT JOIN medium USING (id2) |
Left join a large table (x) with a medium-sized table on integer type |
join #4 | SELECT x.*, medium.id1 AS medium_id1, medium.id2 AS medium_id2, medium.id4 AS medium_id4, v2 FROM x JOIN medium USING (id5) |
Join a large table (x) with a medium table on varchar type |
join #5 | SELECT x.*, big.id1 AS big_id1, big.id2 AS big_id2, big.id4 AS big_id4, big.id5 AS big_id5, big.id6 AS big_id6, v2 FROM x JOIN big USING (id3) |
Join a large table (x) with a large table on integer type. |
You can find more information about the queries in the Efficiency of Data Processing slides.
Modifications to the Benchmark & Hardware
No modifications have been made to the queries or the data generation. Some scripts required minor modifications so that the current version of the library could be run. The hardware used is slightly different as the exact AWS offering the benchmark previously used is no longer available. Base libraries have been updated as well. GPU libraries were not tested.
The AWS instance used is an m4.10xlarge:
- CPU model: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
- CPU cores: 40
- RAM model: Unknown
- Memory: 160GB
- NO GPU specifications
- R upgraded, 4.0.0 -> 4.2.2
- Python upgraded 3.[6|7] -> 3.10
Changes made to install scripts of other systems
Pandas, Polars, Dask, and Clickhouse required changes to their setup/install scripts. The changes were relatively minor consisting mostly of syntax updates and data ingestion updates. Data ingestion did not affect the reporting timing results.
Results
You can also look at the results here. DuckDB’s timings have improved significantly since v0.2.7 (released over two years ago). A major contributor to our increased performance is parallel grouped aggregation, merged in March 2022, and parallel result set materialization. In addition, DuckDB now supports enum types, which makes DuckDB group by
aggregation even faster. Improvements to the out-of-core hash join were merged as well, further improving the performance of our joins.
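As a hedged aside, here is a minimal, self-contained sketch of an ENUM-typed column; the type, values and table below are invented for illustration and are not part of the benchmark's data generator.
-- Grouping on an ENUM column compares small integer codes instead of strings
CREATE TYPE id_enum AS ENUM ('id001', 'id002', 'id003');
CREATE TABLE enum_demo (id1 id_enum, v1 INTEGER);
INSERT INTO enum_demo VALUES ('id001', 1), ('id002', 2), ('id001', 3);
SELECT id1, sum(v1) AS v1 FROM enum_demo GROUP BY id1;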
Questions about certain results?
Some solutions may report internal errors for some queries. Feel free to investigate the errors by using the _utils/repro.sh script and file a GitHub issue to resolve any confusion. In addition, there are many areas in the code where certain query results are automatically nullified. If you believe that is the case for a query for your system or if you have any other questions, you can create a GitHub issue to discuss.
Maintenance plan
DuckDB will continue to maintain this benchmark for the foreseeable future. The process for re-running the benchmarks with updated library versions must still be decided.
Do you have any other questions? Would you like to have your system added to the benchmark? Please feel free to read the ReadMe.md in the repository, and if you still have questions, you can reach out to me at [email protected] or on our Discord!
Shredding Deeply Nested JSON, One Vector at a Time
TL;DR: We’ve recently improved DuckDB’s JSON extension so JSON files can be directly queried as if they were tables.
DuckDB has a JSON extension that can be installed and loaded through SQL:
INSTALL 'json';
LOAD 'json';
The JSON extension supports various functions to create, read, and manipulate JSON strings. These functions are similar to the JSON functionality provided by other databases such as PostgreSQL and MySQL. DuckDB uses yyjson internally to parse JSON, a high-performance JSON library written in ANSI C. Many thanks to the yyjson authors and contributors!
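As a small illustration (not taken from the post itself), two of these scalar functions in action:
-- Extract a value by JSONPath and inspect the structure of a JSON document
SELECT json_extract('{"duck": 42}', '$.duck') AS value;
SELECT json_structure('{"duck": [1, 2, 3]}') AS structure;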
Besides these functions, DuckDB is now able to read JSON directly! This is done by automatically detecting the types and column names, then converting the values within the JSON to DuckDB’s vectors. The automated schema detection dramatically simplifies working with JSON data and subsequent queries on DuckDB’s vectors are significantly faster!
Reading JSON Automatically with DuckDB
Since the 0.7.0 update, DuckDB has added JSON table functions.
To demonstrate these, we will read todos.json
, a fake TODO list containing 200 fake TODO items (only the first two items are shown):
[
{
"userId": 1,
"id": 1,
"title": "delectus aut autem",
"completed": false
},
{
"userId": 1,
"id": 2,
"title": "quis ut nam facilis et officia qui",
"completed": false
},
]
Each TODO item is an entry in the JSON array, but in DuckDB, we’d like a table where each entry is a row. This is now (since DuckDB’s 0.7.0 release in February 2023) as easy as:
SELECT * FROM 'todos.json';
userId | id | title | completed |
---|---|---|---|
1 | 1 | delectus aut autem | false |
1 | 2 | quis ut nam facilis et officia qui | false |
1 | 3 | fugiat veniam minus | false |
1 | 4 | et porro tempora | true |
1 | 5 | laboriosam mollitia et enim quasi adipisci quia provident illum | false |
(Note: Only 5 rows shown)
Now, finding out which user completed the most TODO items is as simple as:
SELECT userId, sum(completed::int) total_completed
FROM 'todos.json'
GROUP BY userId
ORDER BY total_completed DESC
LIMIT 1;
userId | total_completed |
---|---|
5 | 12 |
Under the hood, DuckDB recognizes the .json
file extension in 'todos.json'
, and calls read_json_auto('todos.json')
instead.
This function is similar to our read_csv_auto
function, which automatically infers column names and types for CSV files.
Like our other table functions, read_json_auto
supports reading multiple files by passing a list, e.g., read_json_auto(['file1.json', 'file2.json'])
, but also globbing, e.g., read_json_auto('file*.json')
.
DuckDB will read multiple files in parallel.
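Written out as full queries (the file names here are placeholders), the list and glob forms look like this:
SELECT * FROM read_json_auto(['file1.json', 'file2.json']);
SELECT * FROM read_json_auto('file*.json');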
Newline Delimited JSON
Not all JSON adheres to the format used in todos.json
, which is an array of ‘records’.
Newline-delimited JSON, or NDJSON, stores each row on a new line.
DuckDB also supports reading (and writing!) this format.
First, let’s write our TODO list as NDJSON:
COPY (SELECT * FROM 'todos.json') to 'todos2.json';
Again, DuckDB recognizes the .json
suffix in the output file and automatically infers that we mean to use (FORMAT JSON)
.
The created file looks like this (only the first two records are shown):
{"userId":1,"id":1,"title":"delectus aut autem","completed":false}
{"userId":1,"id":2,"title":"quis ut nam facilis et officia qui","completed":false}
DuckDB can read this file in precisely the same way as the original one:
SELECT * FROM 'todos2.json';
If your JSON file is newline-delimited, DuckDB can parallelize reading.
This is specified with nd
or the lines
parameter:
SELECT * FROM read_ndjson_auto('todos2.json');
SELECT * FROM read_json_auto('todos2.json', lines='true');
You can also set lines='auto'
to auto-detect whether the JSON file is newline-delimited.
Other JSON Formats
If using the read_json
function directly, the format of the JSON can be specified using the json_format
parameter.
This parameter defaults to 'auto'
, which tells DuckDB to infer what kind of JSON we are dealing with.
The first json_format
is 'array_of_records'
, while the second is 'records'
.
This can be specified like so:
SELECT * FROM read_json('todos.json', auto_detect=true, json_format='array_of_records');
SELECT * FROM read_json('todos2.json', auto_detect=true, json_format='records');
Other supported formats are 'values'
and 'array_of_values'
, which are similar to 'records'
and 'array_of_records'
.
However, with these formats, each ‘record’ is not required to be a JSON object but can also be a JSON array, string, or anything supported in JSON.
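A hedged sketch of the 'values' format: assume a hypothetical file values.json where each line is a bare JSON value rather than an object.
-- values.json (hypothetical) could contain, one value per line:
--   [1, 2, 3]
--   "hello"
--   {"userId": 1}
SELECT * FROM read_json('values.json', auto_detect=true, json_format='values');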
Manual Schemas
What you may also have noticed is the auto_detect
parameter.
This parameter tells DuckDB to infer the schema, i.e., determine the names and types of the returned columns.
These can manually be specified like so:
SELECT * FROM read_json('todos.json',
columns={userId: 'INT', id: 'INT', title: 'VARCHAR', completed: 'BOOLEAN'},
json_format='array_of_records');
You don’t have to specify all fields, just the ones you’re interested in:
SELECT * FROM read_json('todos.json',
columns={userId: 'INT', completed: 'BOOLEAN'},
json_format='array_of_records');
Now that we know how to use the new DuckDB JSON table functions, let's dive into some analytics!
GitHub Archive Examples
GH Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.
Every hour, a GZIP compressed, newline-delimited JSON file containing all public events on GitHub is uploaded.
I’ve downloaded a whole day (2023-02-08) of activity using wget
and stored the 24 files in a directory called gharchive_gz
.
wget https://data.gharchive.org/2023-02-08-0.json.gz
wget https://data.gharchive.org/2023-02-08-1.json.gz
...
wget https://data.gharchive.org/2023-02-08-23.json.gz
Keep in mind that the data is compressed:
$ du -sh gharchive_gz
2.3G gharchive_gz
$ gunzip -dc gharchive_gz/* | wc -c
18396198934
One day of GitHub activity amounts to more than 18GB of JSON, which compresses to 2.3GB with GZIP.
To get a feel of what the data looks like, we run the following query:
SELECT json_group_structure(json)
FROM (
SELECT *
FROM read_ndjson_objects('gharchive_gz/*.json.gz')
LIMIT 2048
);
Here, we use our read_ndjson_objects
function, which reads the JSON objects in the file as raw JSON, i.e., as strings.
The query reads the first 2048 records of JSON from the JSON files gharchive_gz
directory and describes the structure.
You can also directly query the JSON files from GH Archive using DuckDB’s httpfs
extension, but we will be querying the files multiple times, so it is better to download them in this case.
I’ve formatted the result using an online JSON formatter & validator:
{
"id":"VARCHAR",
"type":"VARCHAR",
"actor":{
"id":"UBIGINT",
"login":"VARCHAR",
"display_login":"VARCHAR",
"gravatar_id":"VARCHAR",
"url":"VARCHAR",
"avatar_url":"VARCHAR"
},
"repo":{
"id":"UBIGINT",
"name":"VARCHAR",
"url":"VARCHAR"
},
"payload":{"..."},
"public":"BOOLEAN",
"created_at":"VARCHAR",
"org":{
"id":"UBIGINT",
"login":"VARCHAR",
"gravatar_id":"VARCHAR",
"url":"VARCHAR",
"avatar_url":"VARCHAR"
}
}
I’ve left "payload"
out because it consists of deeply nested JSON, and its formatted structure takes up more than 1000 lines!
So, how many records are we dealing with exactly? Let’s count it using DuckDB:
SELECT count(*) count FROM 'gharchive_gz/*.json.gz';
count |
---|
4434953 |
That’s around 4.4M daily events, which amounts to almost 200K events per hour. This query takes around 7.3 seconds on my laptop, a 2020 MacBook Pro with an M1 chip and 16GB of memory. This is the time it takes to decompress the GZIP compression and parse every JSON record.
To see how much time is spent decompressing GZIP in the query, I’ve also created a gharchive
directory containing the same data but uncompressed.
Running the same query on the uncompressed data takes around 5.4s, almost 2 seconds faster.
So we got faster, but we also read more than 18GB of data from storage, as opposed to 2.3GB when it was compressed.
So, this comparison really depends on the speed of your storage.
I prefer to keep the data compressed.
As a side note, the speed of this query really shows how fast yyjson is!
So, what kind of events are in the GitHub data?
SELECT type, count(*) count
FROM 'gharchive_gz/*.json.gz'
GROUP BY type
ORDER BY count DESC;
type | count |
---|---|
PushEvent | 2359096 |
CreateEvent | 624062 |
PullRequestEvent | 366090 |
IssueCommentEvent | 238660 |
WatchEvent | 231486 |
DeleteEvent | 154383 |
PullRequestReviewEvent | 131107 |
IssuesEvent | 88917 |
PullRequestReviewCommentEvent | 79540 |
ForkEvent | 64233 |
CommitCommentEvent | 36823 |
ReleaseEvent | 23004 |
MemberEvent | 14872 |
PublicEvent | 14500 |
GollumEvent | 8180 |
This query takes around 7.4s, not much more than the count(*)
query.
So as we can see, data analysis is very fast once everything has been decompressed and parsed.
The most common event type is the PushEvent
, taking up more than half of all events, unsurprisingly, which is people pushing their committed code to GitHub.
The least common event type is the GollumEvent
, taking up less than 1% of all events, which is a creation or update of a wiki page.
If we want to analyze the same data multiple times, decompressing and parsing every time is redundant. Instead, we can create a DuckDB table like so:
CREATE TABLE events AS
SELECT * EXCLUDE (payload)
FROM 'gharchive_gz/*.json.gz';
This takes around 9s if you’re using an in-memory database.
If you’re using an on-disk database, this takes around 13s and results in a database size of 444MB.
When using an on-disk database, DuckDB ensures the table is persistent and performs all kinds of compression.
Note that we have temporarily ignored the payload
field using the convenient EXCLUDE
clause.
To get a feel of what we read, we can ask DuckDB to describe the table:
DESCRIBE SELECT * FROM events;
This gives us the following:
cid | name | type | notnull | dflt_value | pk |
---|---|---|---|---|---|
0 | id | BIGINT | false | false | |
1 | type | VARCHAR | false | false | |
2 | actor | STRUCT(id UBIGINT, login VARCHAR, display_login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR) | false | false | |
3 | repo | STRUCT(id UBIGINT, name VARCHAR, url VARCHAR) | false | false | |
4 | public | BOOLEAN | false | false | |
5 | created_at | TIMESTAMP | false | false | |
6 | org | STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR) | false | false |
As we can see, the "actor"
, "repo"
and "org"
fields, which are JSON objects, have been converted to DuckDB structs.
The "id"
column was a string in the original JSON but has been converted to a BIGINT
by DuckDB’s automatic type detection.
DuckDB can also detect a few different DATE
/TIMESTAMP
formats within JSON strings, as well as TIME
and UUID
.
Now that we’ve created the table, we can analyze it like any other DuckDB table!
Let’s see how much activity there was in the duckdb/duckdb
GitHub repository on this specific day:
SELECT type, count(*) count
FROM events
WHERE repo.name = 'duckdb/duckdb'
GROUP BY type
ORDER BY count DESC;
type | count |
---|---|
PullRequestEvent | 35 |
IssueCommentEvent | 30 |
WatchEvent | 29 |
PushEvent | 15 |
PullRequestReviewEvent | 14 |
IssuesEvent | 9 |
PullRequestReviewCommentEvent | 7 |
ForkEvent | 3 |
That’s a lot of pull request activity! Note that this doesn’t mean that 35 pull requests were opened on this day, activity within a pull request is also counted. If we search through the pull requests for that day, we see that there are only 15. This is more activity than normal because most of the DuckDB developers were busy fixing bugs for the 0.7.0 release.
Now, let’s see who was the most active:
SELECT actor.login, count(*) count
FROM events
WHERE repo.name = 'duckdb/duckdb'
AND type = 'PullRequestEvent'
GROUP BY actor.login
ORDER BY count desc
LIMIT 5;
login | count |
---|---|
Mytherin | 19 |
Mause | 4 |
carlopi | 3 |
Tmonster | 2 |
lnkuiper | 2 |
As expected, Mark (Mytherin, co-founder of DuckDB Labs) was the most active! My activity (lnkuiper, software engineer at DuckDB Labs) also shows up.
Handling inconsistent JSON schemas
So far, we have ignored the "payload"
of the events.
We’ve ignored it because the contents of this field are different based on the type of event.
We can see how they differ with the following query:
SELECT json_group_structure(payload) structure
FROM (SELECT *
FROM read_json(
'gharchive_gz/*.json.gz',
columns={
id: 'BIGINT',
type: 'VARCHAR',
actor: 'STRUCT(id UBIGINT,
login VARCHAR,
display_login VARCHAR,
gravatar_id VARCHAR,
url VARCHAR,
avatar_url VARCHAR)',
repo: 'STRUCT(id UBIGINT, name VARCHAR, url VARCHAR)',
payload: 'JSON',
public: 'BOOLEAN',
created_at: 'TIMESTAMP',
org: 'STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)'
},
lines='true'
)
WHERE type = 'WatchEvent'
LIMIT 2048
);
structure |
---|
{"action":"VARCHAR"} |
The "payload"
field is simple for events of type WatchEvent
.
However, if we change the type to PullRequestEvent
, we get a JSON structure of more than 500 lines when formatted with a JSON formatter.
We don’t want to wade through all of those fields, and automatic schema detection would pick up every one of them.
Instead, we can manually supply the structure of the fields we’re interested in.
DuckDB will skip reading the other fields.
Another approach is to store the "payload"
field as DuckDB’s JSON data type and parse it at query time (see the example later in this post!).
I’ve stripped down the JSON structure for the "payload"
of events with the type PullRequestEvent
to the things I’m actually interested in:
{
"action":"VARCHAR",
"number":"UBIGINT",
"pull_request":{
"url":"VARCHAR",
"id":"UBIGINT",
"title":"VARCHAR",
"user":{
"login":"VARCHAR",
"id":"UBIGINT",
},
"body":"VARCHAR",
"created_at":"TIMESTAMP",
"updated_at":"TIMESTAMP",
"assignee":{
"login":"VARCHAR",
"id":"UBIGINT",
},
"assignees":[
{
"login":"VARCHAR",
"id":"UBIGINT",
}
],
}
}
This is technically not valid JSON because there are trailing commas. However, we try to allow trailing commas wherever possible in DuckDB, including JSON!
We can now plug this into the columns
parameter of read_json
, but we need to convert it to a DuckDB type first.
I’m lazy, so I prefer to let DuckDB do this for me:
SELECT typeof(json_transform('{}', '{
"action":"VARCHAR",
"number":"UBIGINT",
"pull_request":{
"url":"VARCHAR",
"id":"UBIGINT",
"title":"VARCHAR",
"user":{
"login":"VARCHAR",
"id":"UBIGINT",
},
"body":"VARCHAR",
"created_at":"TIMESTAMP",
"updated_at":"TIMESTAMP",
"assignee":{
"login":"VARCHAR",
"id":"UBIGINT",
},
"assignees":[
{
"login":"VARCHAR",
"id":"UBIGINT",
}
],
}
}'));
This gives us back a DuckDB type that we can plug straight into our function!
Note that because we are not auto-detecting the schema, we have to supply timestampformat
to be able to parse the timestamps correctly.
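If you want to double-check that the format string matches the data, the same specifiers work with strptime (the timestamp value below is made up for illustration):
SELECT strptime('2023-02-01T12:34:56Z', '%Y-%m-%dT%H:%M:%SZ') AS parsed;  -- parses into a regular TIMESTAMP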
The key "user"
must be surrounded by quotes because it is a reserved keyword in SQL:
CREATE TABLE pr_events AS
SELECT *
FROM read_json(
'gharchive_gz/*.json.gz',
columns={
id: 'BIGINT',
type: 'VARCHAR',
actor: 'STRUCT(id UBIGINT,
login VARCHAR,
display_login VARCHAR,
gravatar_id VARCHAR,
url VARCHAR,
avatar_url VARCHAR)',
repo: 'STRUCT(id UBIGINT, name VARCHAR, url VARCHAR)',
payload: 'STRUCT(
action VARCHAR,
number UBIGINT,
pull_request STRUCT(
url VARCHAR,
id UBIGINT,
title VARCHAR,
"user" STRUCT(
login VARCHAR,
id UBIGINT
),
body VARCHAR,
created_at TIMESTAMP,
updated_at TIMESTAMP,
assignee STRUCT(login VARCHAR, id UBIGINT),
assignees STRUCT(login VARCHAR, id UBIGINT)[]
)
)',
public: 'BOOLEAN',
created_at: 'TIMESTAMP',
org: 'STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)'
},
json_format='records',
lines='true',
timestampformat='%Y-%m-%dT%H:%M:%SZ'
)
WHERE type = 'PullRequestEvent';
This query completes in around 36s with an on-disk database (resulting size is 478MB) and 9s with an in-memory database. If you don’t care about preserving insertion order, you can speed the query up with this setting:
SET preserve_insertion_order=false;
With this setting, the query completes in around 27s with an on-disk database and 8.5s with an in-memory database. The difference between the on-disk and in-memory case is quite substantial here because DuckDB has to compress and persist much more data.
Now we can analyze pull request events! Let’s see what the maximum number of assignees is:
SELECT max(length(payload.pull_request.assignees)) max_assignees
FROM pr_events;
max_assignees |
---|
10 |
That’s a lot of people reviewing a single pull request!
We can check who was assigned the most:
WITH assignees AS (
SELECT payload.pull_request.assignee.login assignee
FROM pr_events
UNION ALL
SELECT unnest(payload.pull_request.assignees).login assignee
FROM pr_events
)
SELECT assignee, count(*) count
FROM assignees
WHERE assignee NOT NULL
GROUP BY assignee
ORDER BY count DESC
LIMIT 5;
assignee | count |
---|---|
poad | 494 |
vinayakkulkarni | 268 |
tmtmtmtm | 198 |
fisker | 98 |
icemac | 84 |
That’s a lot of assignments! I suspect there are duplicates in here, though.
Storing as JSON to parse at query time
Specifying the JSON schema of the "payload"
field was helpful because it allowed us to directly analyze what is there, and subsequent queries are much faster.
Still, it can also be quite cumbersome if the schema is complex.
If you don’t want to specify the schema of a field, you can set the type as 'JSON'
:
CREATE TABLE pr_events AS
SELECT *
FROM read_json(
'gharchive_gz/*.json.gz',
columns={
id: 'BIGINT',
type: 'VARCHAR',
actor: 'STRUCT(id UBIGINT,
login VARCHAR,
display_login VARCHAR,
gravatar_id VARCHAR,
url VARCHAR,
avatar_url VARCHAR)',
repo: 'STRUCT(id UBIGINT, name VARCHAR, url VARCHAR)',
payload: 'JSON',
public: 'BOOLEAN',
created_at: 'TIMESTAMP',
org: 'STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)'
},
json_format='records',
lines='true',
timestampformat='%Y-%m-%dT%H:%M:%SZ'
)
WHERE type = 'PullRequestEvent';
This will load the "payload"
field as a JSON string, and we can use DuckDB’s JSON functions to analyze it when querying.
For example:
SELECT payload->>'action' AS action, count(*) count
FROM pr_events
GROUP BY action
ORDER BY count DESC;
The ->>
arrow is short-hand for our json_extract_string
function.
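Written out with the function itself, the query above is roughly equivalent to:
SELECT json_extract_string(payload, '$.action') AS action, count(*) count
FROM pr_events
GROUP BY action
ORDER BY count DESC;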
Creating the entire "payload"
field as a column with type JSON
is not the most efficient way to get just the "action"
field, but this example is just to show the flexibility of read_json
.
The query results in the following table:
action | count |
---|---|
opened | 189096 |
closed | 174914 |
reopened | 2080 |
As we can see, only a few pull requests have been reopened.
Conclusion
DuckDB tries to be an easy-to-use tool that can read all kinds of data formats.
In the 0.7.0 release, we have added support for reading JSON.
JSON comes in many formats and all kinds of schemas.
DuckDB’s rich support for nested types (LIST
, STRUCT
) allows it to fully ‘shred’ the JSON to a columnar format for more efficient analysis.
We are excited to hear what you think about our new JSON functionality. If you have any questions or suggestions, please reach out to us on Discord or GitHub!
JupySQL Plotting with DuckDB
TLDR: JupySQL provides a seamless SQL experience in Jupyter and uses DuckDB to visualize larger than memory datasets in matplotlib.
Announcing DuckDB 0.7.0
The DuckDB team is happy to announce the latest DuckDB version (0.7.0) has been released. This release of DuckDB is named “Labradorius” after the Labrador Duck (Camptorhynchus labradorius) that was native to North America.
To install the new version, please visit the installation guide. The full release notes can be found here.
DuckCon 2023 - 2nd edition
The DuckDB team is excited to invite you all for our second DuckCon user group meeting. It will take place the day before FOSDEM in Brussels on Feb 3rd, 2023, at the Hilton Hotel.
In this edition, we will have the DuckDB creators Hannes Mühleisen and Mark Raasveldt talking about the current state of DuckDB and future plans. We will also have one talk about building out-of-tree extensions from DuckDB contributors Pedro Holanda and Sam Ansmink, and one invited talk from MotherDuck’s Founding Engineers, Boaz Leskes and Yves Le Maout. In addition, we will have a session with lightning talks from our users. We will close the event with a networking session with food and drinks, where users are invited to interact directly with the DuckDB team.
Announcing DuckDB 0.6.0
The DuckDB team is happy to announce the latest DuckDB version (0.6.0) has been released. This release of DuckDB is named “Oxyura” after the White-headed duck (Oxyura leucocephala) which is an endangered species native to Eurasia.
To install the new version, please visit the installation guide. Note that the release is still being rolled out, so not all artifacts may be published yet. The full release notes can be found here.
Lightweight Compression in DuckDB
TLDR: DuckDB supports efficient lightweight compression that is automatically used to keep data size down without incurring high costs for compression and decompression.
When working with large amounts of data, compression is critical for reducing storage size and egress costs. Compression algorithms typically reduce data set size by 75-95%, depending on how compressible the data is. Compression not only reduces the storage footprint of a data set, but also often improves performance as less data has to be read from disk or over a network connection.
Modern Data Stack in a Box with DuckDB
TLDR: A fast, free, and open-source Modern Data Stack (MDS) can now be fully deployed on your laptop or to a single machine using the combination of DuckDB, Meltano, dbt, and Apache Superset.
This post is a collaboration with Jacob Matson and cross-posted on dataduel.co.
Summary
There is a large volume of literature (1, 2, 3) about scaling data pipelines. “Use Kafka! Build a lake house! Don’t build a lake house, use Snowflake! Don’t use Snowflake, use XYZ!” However, with advances in hardware and the rapid maturation of data software, there is a simpler approach. This article will light up the path to highly performant single node analytics with an MDS-in-a-box open source stack: Meltano, DuckDB, dbt, & Apache Superset on Windows using Windows Subsystem for Linux (WSL). There are many options within the MDS, so if you are using another stack to build an MDS-in-a-box, please share it with the community on the DuckDB Twitter, GitHub, or Discord, or the dbt slack! Or just stop by for a friendly debate about our choice of tools!
Querying Postgres Tables Directly From DuckDB
TLDR: DuckDB can now directly query tables stored in PostgreSQL and speed up complex analytical queries without duplicating data.
Persistent Storage of Adaptive Radix Trees in DuckDB
TLDR: DuckDB uses Adaptive Radix Tree (ART) Indexes to enforce constraints and to speed up query filters. Up to this point, indexes were not persisted, causing issues like loss of indexing information and high reload times for tables with data constraints. We now persist ART Indexes to disk, drastically diminishing database loading times (up to orders of magnitude), and we no longer lose track of existing indexes. This blog post contains a deep dive into the implementation of ART storage, benchmarks, and future work. Finally, to better understand how our indexes are used, I’m asking you to answer the following survey. It will guide us when defining our future roadmap.
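For a bit of context, here is a rough sketch of the two ways these indexes typically come into existence (the table and index names below are made up):
-- a PRIMARY KEY (or UNIQUE) constraint is enforced with an ART index
CREATE TABLE people (id INTEGER PRIMARY KEY, name VARCHAR);
-- an explicit index on a column used in query filters is also an ART index
CREATE INDEX people_name_idx ON people (name);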
Range Joins in DuckDB
TL;DR: DuckDB has fully parallelised range joins that can efficiently join millions of range predicates.
Range intersection joins are an important operation in areas such as temporal analytics, and occur when two inequality conditions are present in a join predicate. Database implementations often rely on slow O(N^2) algorithms that compare every pair of rows for these operations. Instead, DuckDB leverages its fast sorting logic to implement two highly optimized parallel join operators for these kinds of range predicates, resulting in 20-30x faster queries. With these operators, DuckDB can be used effectively in more time-series-oriented use cases.
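As a rough sketch of the kind of predicate these operators accelerate, consider an interval join over two made-up tables:
SELECT r.sensor_id, s.shift_name
FROM readings r, shifts s
-- two inequality conditions in one join predicate: a range intersection join
WHERE r.reading_time >= s.shift_start
  AND r.reading_time <  s.shift_end;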
Friendlier SQL with DuckDB
An elegant user experience is a key design goal of DuckDB. This goal guides much of DuckDB’s architecture: it is simple to install, seamless to integrate with other data structures like Pandas, Arrow, and R Dataframes, and requires no dependencies. Parallelization occurs automatically, and if a computation exceeds available memory, data is gracefully buffered out to disk. And of course, DuckDB’s processing speed makes it easier to get more work accomplished.
However, SQL is not famous for being user-friendly. DuckDB aims to change that! DuckDB includes both a Relational API for dataframe-style computation, and a highly Postgres-compatible version of SQL. If you prefer dataframe-style computation, we would love your feedback on our roadmap. If you are a SQL fan, read on to see how DuckDB is bringing together both innovation and pragmatism to make it easier to write SQL in DuckDB than anywhere else. Please reach out on GitHub or Discord and let us know what other features would simplify your SQL workflows. Join us as we teach an old dog new tricks!
Parallel Grouped Aggregation in DuckDB
TL;DR: DuckDB has a fully parallelized aggregate hash table that can efficiently aggregate over millions of groups.
Grouped aggregation is a core data analysis operation. It is particularly important for large-scale data analysis (“OLAP”) because it is useful for computing statistical summaries of huge tables. DuckDB contains a highly optimized parallel aggregation capability for fast and scalable summarization.
Jump straight to the benchmarks?
DuckDB Time Zones: Supporting Calendar Extensions
TLDR: The DuckDB ICU extension now provides time zone support.
Time zone support is a common request for temporal analytics, but the rules are complex and somewhat arbitrary. The most well supported library for locale-specific operations is the International Components for Unicode (ICU). DuckDB already provided collated string comparisons using ICU via an extension (to avoid dependencies), and we have now connected the existing ICU calendar and time zone functions to the main code via the new TIMESTAMP WITH TIME ZONE (or TIMESTAMPTZ for short) data type. The ICU extension is pre-bundled in DuckDB’s Python client and can be optionally installed in the remaining clients.
In this post, we will describe how time works in DuckDB and what time zone functionality has been added.
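As a small taste of the new type (the time zone and timestamp below are arbitrary examples; outside the Python client the extension may need to be installed first):
INSTALL icu;  -- not needed where ICU is pre-bundled
LOAD icu;
SET TimeZone = 'America/New_York';
SELECT '2022-01-01 12:00:00+00'::TIMESTAMPTZ AS ts;  -- displayed in the session time zone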
DuckDB quacks Arrow: A zero-copy data integration between Apache Arrow and DuckDB
TLDR: The zero-copy integration between DuckDB and Apache Arrow allows for rapid analysis of larger than memory datasets in Python and R using either SQL or relational APIs.
This post is a collaboration with and cross-posted on the Arrow blog.
DuckDB - The Lord of Enums:
The Fellowship of the Categorical and Factors.
String types are one of the most commonly used types. However, string columns often have a limited number of distinct values; for example, a country column will never have more than a few hundred unique entries. Storing such a column as plain strings wastes storage and compromises query performance. A better solution is to dictionary encode these columns. In dictionary encoding, the data is split into two parts: the category and the values. The category stores the actual strings, and the values store references into the category.
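A rough sketch of explicit dictionary encoding using DuckDB's ENUM type (the type, table, and values below are made up):
-- the ENUM stores each distinct string once; columns of this type store compact references
CREATE TYPE country AS ENUM ('NL', 'US', 'DE');
CREATE TABLE customers (name VARCHAR, country_code country);
INSERT INTO customers VALUES ('Alice', 'US'), ('Bob', 'NL');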
Fast Moving Holistic Aggregates
TLDR: DuckDB, a free and Open-Source analytical data management system, has a windowing API that can compute complex moving aggregates like interquartile ranges and median absolute deviation much faster than the conventional approaches.
In a previous post, we described the DuckDB windowing architecture and mentioned the support for some advanced moving aggregates. In this post, we will compare the performance of various possible implementations of these moving aggregates and explain how DuckDB’s performant implementations work.
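To give a flavor of such a query, here is a sketch of a moving interquartile range over a hypothetical prices table:
SELECT t,
       quantile_cont(price, 0.75) OVER w
     - quantile_cont(price, 0.25) OVER w AS moving_iqr  -- IQR over the last 30 rows
FROM prices
WINDOW w AS (ORDER BY t ROWS BETWEEN 29 PRECEDING AND CURRENT ROW);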
DuckDB-Wasm: Efficient Analytical SQL in the Browser
TLDR: DuckDB-Wasm is an in-process analytical SQL database for the browser. It is powered by WebAssembly, speaks Arrow fluently, reads Parquet, CSV and JSON files backed by Filesystem APIs or HTTP requests and has been tested with Chrome, Firefox, Safari and Node.js. You can try it in your browser at shell.duckdb.org or on Observable.
Windowing in DuckDB
TLDR: DuckDB, a free and Open-Source analytical data management system, has a state-of-the-art windowing engine that can compute complex moving aggregates like inter-quartile ranges as well as simpler moving averages.
Window functions (those using the OVER clause) are important tools for analysing data series, but they can be slow if not implemented carefully. In this post, we will take a look at how DuckDB implements windowing. We will also see how DuckDB can leverage its aggregate function architecture to compute useful moving aggregates such as moving inter-quartile ranges (IQRs).
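For reference, a simple moving aggregate over a hypothetical measurements table looks like this:
SELECT day,
       avg(temperature) OVER (
           ORDER BY day
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW  -- a 7-day moving average
       ) AS weekly_avg
FROM measurements;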
Fastest table sort in the West - Redesigning DuckDB’s sort
TLDR: DuckDB, a free and Open-Source analytical data management system, has a new highly efficient parallel sorting implementation that can sort much more data than fits in main memory.
Database systems use sorting for many purposes, the most obvious purpose being when a user adds an ORDER BY clause to their query. Sorting is also used within operators, such as window functions. DuckDB recently improved its sorting implementation, which is now able to sort data in parallel and sort more data than fits in memory. In this post, we will take a look at how DuckDB sorts, and how this compares to other data management systems.
Querying Parquet with Precision using DuckDB
TLDR: DuckDB, a free and open source analytical data management system, can run SQL queries directly on Parquet files and automatically take advantage of the advanced features of the Parquet format.
Apache Parquet is the most common “Big Data” storage format for analytics. In Parquet files, data is stored in a columnar-compressed binary format. Each Parquet file stores a single table. The table is partitioned into row groups, which each contain a subset of the rows of the table. Within a row group, the table data is stored in a columnar fashion.
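For example, DuckDB can run a query directly on such a file, reading only the columns the query needs (the file name below is hypothetical):
SELECT passenger_count, avg(fare_amount) AS avg_fare
FROM 'yellow_tripdata_2023-01.parquet'
GROUP BY passenger_count;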
Efficient SQL on Pandas with DuckDB
TLDR: DuckDB, a free and open source analytical data management system, can efficiently run SQL queries directly on Pandas DataFrames.
Recently, an article was published advocating for using SQL for Data Analysis. Here at team DuckDB, we are huge fans of SQL. It is a versatile and flexible language that allows the user to efficiently perform a wide variety of data transformations, without having to care about how the data is physically represented or how to do these data transformations in the most optimal way.
Testing out DuckDB's Full Text Search Extension
TLDR: DuckDB now has full-text search functionality, similar to the FTS5 extension in SQLite. The main difference is that our FTS extension is fully formulated in SQL. We tested it out on TREC disks 4 and 5.
Searching through textual data stored in a database can be cumbersome, as SQL does not provide a good way of formulating questions such as “Give me all the documents about Mallard Ducks”: string patterns with LIKE
will only get you so far. Despite SQL’s shortcomings here, storing textual data in a database is commonplace. Consider the table products (id INT, name VARCHAR, description VARCHAR
) - it would be useful to search through the name
and description
columns for a website that sells these products.
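As a rough sketch of what this can look like with the FTS extension, following its create_fts_index / match_bm25 pattern (the data and search terms are made up):
INSTALL fts;
LOAD fts;
-- build a full-text index over the name and description columns, keyed by id
PRAGMA create_fts_index('products', 'id', 'name', 'description');
SELECT id, name,
       fts_main_products.match_bm25(id, 'mallard duck') AS score
FROM products
WHERE score IS NOT NULL  -- reusing the column alias, another friendly DuckDB feature
ORDER BY score DESC;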