2023-11-03 Tom Ebergen

Updates to the H2O.ai db-benchmark!

TL;DR: the H2O.ai db-benchmark has been updated with new results. In addition, the AWS EC2 instance used for benchmarking has been changed to a c6id.metal for improved repeatability and fairness across libraries. DuckDB is the fastest library for both join and group by queries at almost every data size.

Skip directly to the results

The Benchmark Has Been Updated!

In April, DuckDB Labs published a blog post reporting updated H2O.ai db-benchmark results. Since then, the results haven’t been updated. The original plan was to update the results with every DuckDB release. DuckDB 0.9.1 was recently released, and DuckDB Labs has updated the benchmark. While updating the benchmark, however, we noticed that our initial setup did not lend itself to being fair to all solutions. The machine used had network storage and could suffer from noisy neighbors. To avoid these issues, the whole benchmark was re-run on a c6id.metal machine.

New Benchmark Environment: c6id.metal Instance

Initially, updating the benchmark produced strange results. Even using the same library versions from the prior update, some solutions regressed and others improved. We believe this variance came from the AWS EC2 instance we chose: an m4.10xlarge. The m4.10xlarge has 40 virtual CPUs and EBS storage. EBS storage is highly available network block storage for EC2 instances. When running compute-heavy benchmarks, a machine like the m4.10xlarge can suffer from the following issues:

  • Network storage is an issue for benchmarking solutions that interact with storage frequently. For the 500MB and 5GB workloads, network storage was not an issue on the m4.10xlarge since all solutions could execute the queries in memory. For the 50GB workload, however, network storage was an issue for the solutions that could not execute queries in memory. While the m4.10xlarge has dedicated EBS bandwidth, any read/write from storage still happens over the network, which is usually slower than physically mounted storage. Solutions that frequently read and write to storage for the 50GB queries end up doing this over the network, and this network time becomes a significant portion of the query's execution time. If the network has variable performance, the query performance is then also variable.

  • Noisy neighbors are a common issue when benchmarking on virtual CPUs. The previous machine most likely shared its compute hardware with other (neighboring) AWS EC2 instances. If these neighbors are also running compute-heavy workloads, the physical CPU caches are repeatedly invalidated/flushed by the neighboring instance and the benchmark instance. When the CPU cache is shared between two workloads on two instances, both workloads require extra reads from memory for data that would already be in the CPU cache on a non-virtual machine.

In order to be fair to all solutions, we decided to change the instance type to a metal instance with local storage. Metal instance types negate any noisy neighbor problems because the hardware is physical and not shared with any other AWS users/instances. Network storage problems are also fixed because solutions can read and write data to the local instance storage, which is physically mounted on the hardware.

Another benefit of the c6id.metal box is that it stresses parallel performance. There are 128 cores on the c6id.metal. Performance differences between solutions that can effectively use every core and solutions that cannot are clearly visible.

See the updated settings section for how settings were changed for each solution when run on the new machine.

Updating the Benchmark

Moving forward we will update the benchmark when PRs with new performance numbers are provided. The PR should include a description of the changes to a solution script or a version update and new entries in the time.csv and logs.csv files. These entries will be verified using a different c6id.metal instance, and if there is limited variance, the PR will be merged and the results will be updated!

Updated Settings

  1. ClickHouse
    • Storage: Any data that gets spilled to disk also needs to be on the NVMe drive. This has been changed in the new format_and_mount.sh script and the clickhouse/clickhouse-mount-config.xml file.
  2. Julia (juliadf & juliads)
    • Threads: The threads were hardcoded for juliadf/juliads to 20/40 threads. Now the maximum number of threads is used. No option was given to spill to disk, so this was not changed/researched.
  3. DuckDB
    • Storage: The DuckDB database file was specified to run on the NVMe mount.
  4. Spark
    • Storage: There is an option to spill to disk. I was unsure of how to modify the storage location so that it was on the NVMe drive. Open to a PR with storage location changes and improved results!

Many solutions do not spill to disk, so they did not require any modification to use the instance storage. Other solutions use parallel::ncores() or default to a maximum number of cores for parallelism. Solution scripts were run in their current form on github.com/duckdblabs/db-benchmark. Please read the Updating the Benchmark section on how to re-run your solution.

Results

The first results you see are the 50GB group by results. The benchmark runs every query twice per solution, and both runtimes are reported. The “first time” can be considered a cold run, and the “second time” a hot run. DuckDB and DuckDB-latest perform very well across all dataset sizes and variations.

The team at DuckDB Labs has been hard at work improving the performance of the out-of-core hash aggregates and joins. The most notable improvement is the performance of query 5 in the advanced group by queries. The cold run is almost an order of magnitude better than every other solution! DuckDB is also one of only two solutions to finish the 50GB join query. Some solutions are experiencing timeouts on the 50GB datasets. Solutions running the 50GB group by queries are killed after running for 180 minutes, meaning all 10 group by queries need to finish within the 180 minutes. Solutions running the 50GB join queries are killed after running for 360 minutes.

Link to result page

continue reading
2023-10-27 Pedro Holanda

DuckDB's CSV Sniffer: Automatic Detection of Types and Dialects

ducktetive

TL;DR: DuckDB is primarily focused on performance, leveraging the capabilities of modern file formats. At the same time, we also pay attention to flexible, non-performance-driven formats like CSV files. To create a nice and pleasant experience when reading from CSV files, DuckDB implements a CSV sniffer that automatically detects CSV dialect options, column types, and even skips dirty data. The sniffing process allows users to efficiently explore CSV files without needing to provide any input about the file format.

There are many different file formats that users can choose from when storing their data. For example, there are performance-oriented binary formats like Parquet, where data is stored in a columnar format, partitioned into row-groups, and heavily compressed. However, Parquet is known for its rigidity, requiring specialized systems to read and write these files.

On the other side of the spectrum, there are files with the CSV (comma-separated values) format, which I like to refer to as the ‘Woodstock of data’. CSV files offer the advantage of flexibility; they are structured as text files, allowing users to manipulate them with any text editor, and nearly any data system can read and execute queries on them.

However, this flexibility comes at a cost. Reading a CSV file is not a trivial task, as users need a significant amount of prior knowledge about the file. For instance, DuckDB’s CSV reader offers more than 25 configuration options. I’ve found that people tend to think I’m not working hard enough if I don’t introduce at least three new options with each release. Just kidding. These options include specifying the delimiter, quote and escape characters, determining the number of columns in the CSV file, and identifying whether a header is present while also defining column types. This can slow down an interactive data exploration process, and make analyzing new datasets a cumbersome and less enjoyable task.

Part of DuckDB's raison d'être is to be pleasant and easy to use, so we don't want our users to have to fiddle with CSV files and input options manually. Manual input should be reserved only for files with rather unusual choices for their CSV dialect (where a dialect comprises the combination of the delimiter, quote, escape, and newline values used to create that file) or for specifying column types.

Automatically detecting CSV options can be a daunting process. Not only are there many options to investigate, but their combinations can easily lead to a search space explosion. This is especially the case for CSV files that are not well-structured. Some might argue that CSV files have a specification, but the truth of the matter is that the “specification” changes as soon as a single system is capable of reading a flawed file. And, oh boy, I’ve encountered my fair share of semi-broken CSV files that people wanted DuckDB to read in the past few months.

DuckDB implements a multi-hypothesis CSV sniffer that automatically detects dialects, headers, date/time formats, column types, and identifies dirty rows to be skipped. Our ultimate goal is to automatically read anything resembling a CSV file, to never give up and never let you down! All of this is achieved without incurring a substantial initial cost when reading CSV files. In the bleeding edge version, the sniffer runs when reading a CSV file by default. Note that the sniffer will always prioritize any options set by the user (e.g., if the user sets , as the delimiter, the sniffer won’t try any other options and will assume that the user input is correct).

In this blog post, I will explain how the current implementation works, discuss its performance, and provide insights into what comes next!

DuckDB’s Automatic Detection

The process of parsing CSV files is depicted in the figure below. It currently consists of five different phases, which will be detailed in the next sections.

The CSV file used in the overview example is as follows:

Name, Height, Vegetarian, Birthday
"Pedro", 1.73, False, 30-07-92
... imagine 2048 consistent rows ...
"Mark", 1.72, N/A, 20-09-92

sniffing overview

In the first phase, we perform Dialect Detection, where we select the dialect candidates that generate the most per-row columns in the CSV file while maintaining consistency (i.e., not exhibiting significant variations in the number of columns throughout the file). In our example, we can observe that, after this phase, the sniffer successfully detects the necessary options for the delimiter, quotes, escapes, and new line delimiters.

The second phase, referred to as Type Detection, involves identifying the data types for each column in our CSV file. In our example, our sniffer recognizes four column types: VARCHAR, DOUBLE, BOOL, and DATE.

The third step, known as Header Detection, is employed to ascertain whether our file includes a header. If a header is present, we use it to set the column names; otherwise, we generate them automatically. In our example, there is a header, and each column gets its name defined in there.

Now that our columns have names, we move on to the fourth, optional phase: Type Replacement. DuckDB’s CSV reader provides users with the option to specify column types by name. If these types are specified, we replace the detected types with the user’s specifications.

Finally, we progress to our last phase, Type Refinement. In this phase, we analyze additional sections of the file to validate the accuracy of the types determined during the initial type detection phase. If necessary, we refine them. In our example, we can see that the Vegetarian column was initially categorized as BOOL. However, upon further examination, it was found to contain the string N/A, leading to an upgrade of the column type to VARCHAR to accommodate all possible values.

The automatic detection is only executed on a sequential sample of the CSV file. By default, the size of the sample is 20,480 tuples (i.e., 10 DuckDB execution chunks). This can be configured via the sample_size option, and can be set to -1 in case the user wants to sniff the complete file. Since the same data is repeatedly read with various options, and users can scan the entire file, all CSV buffers generated during sniffing are cached and efficiently managed to ensure high performance.
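
For instance, a minimal sketch (the filename here is hypothetical) of forcing the sniffer to consider the whole file rather than the default sample:

-- sniff the entire file instead of the first 20,480 tuples
SELECT * FROM read_csv('big_file.csv', sample_size = -1);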

Of course, running the CSV Sniffer on very large files will have a drastic impact on the overall performance (see our benchmark section below). In these cases, the sample size should be kept at a reasonable level.

In the next subsections, I will describe each phase in detail.

Dialect Detection

In the Dialect Detection, we identify the delimiter, quotes, escapes, and new line delimiters of a CSV file.

Our delimiter search space consists of the following delimiters: ,, |, ;, \t. If the file has a delimiter outside the search space, it must be provided by the user (e.g., delim='?'). Our quote search space is ", ' and \0, where \0 is a string terminator indicating no quote is present; again, users can provide custom characters outside the search space (e.g., quote='?'). The search space of escape values depends on the value of the quote option, but in summary, they are the same as quotes with the addition of \, and again, they can also be provided by the user (escape='?'). Finally, the last detected option is the new line delimiters; they can be \r, \n, \r\n, and a mix of everything (trust me, I’ve seen a real-world CSV file that used a mix).
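
As a rough sketch (the filename and option values are made up), a file with a dialect outside the search space can be read by providing the options directly, which the sniffer will then respect:

SELECT * FROM read_csv('unusual.csv', delim = '?', quote = '"', escape = '\');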

By default, the dialect detection runs on 24 different combinations of dialect configurations. To determine the most promising configuration, we calculate the number of columns each CSV tuple would produce under each of these configurations. The one that results in the most columns with the most consistent rows will be chosen.

The calculation of consistent rows depends on additional user-defined options. For example, the null_padding option will pad missing columns with NULL values. Therefore, rows with missing columns will have the missing columns padded with NULL.

If null_padding is set to true, CSV files with inconsistent rows will still be considered, but a preference will be given to configurations that minimize the occurrence of padded rows. If null_padding is set to false, the dialect detector will skip inconsistent rows at the beginning of the CSV file. As an example, consider the following CSV file.

I like my csv files to have notes to make dialect detection harder
I also like commas like this one : ,
A,B,C
1,2,3
4,5,6

Here the sniffer would detect that, with the delimiter set to ,, the first row has one column, the second has two, but the remaining rows have 3 columns. Hence, if null_padding is set to false, it would still select , as a delimiter candidate by assuming the top rows are dirty notes (believe me, CSV notes are a thing!), resulting in the following table:

A,B,C
1, 2, 3
4, 5, 6

If null_padding is set to true, all lines would be accepted, resulting in the following table:

'I like my csv files to have notes to make dialect detection harder', None, None
'I also like commas like this one : ', None, None
'A', 'B', 'C'
'1', '2', '3'
'4', '5', '6'

If the ignore_errors option is set, then the configuration that yields the most columns with the least inconsistent rows will be picked.
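
As a sketch of how these options would be set on the notes file above (the filename is hypothetical):

-- keep the note lines, padding missing columns with NULL
SELECT * FROM read_csv('notes.csv', null_padding = true);
-- or skip rows that do not match the detected dialect
SELECT * FROM read_csv('notes.csv', ignore_errors = true);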

Type Detection

After deciding the dialect that will be used, we detect the types of each column. Our Type Detection considers the following types: SQLNULL, BOOLEAN, BIGINT, DOUBLE, TIME, DATE, TIMESTAMP, VARCHAR. These types are ordered by specificity, which means we first check if a column is a SQLNULL; if not, whether it's a BOOLEAN, and so on, until it can only be a VARCHAR. DuckDB has more types than the ones used by default. Users can also define which types the sniffer should consider via the auto_type_candidates option.
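
For example, a sketch (the filename is hypothetical) that restricts the sniffer to a smaller set of candidate types:

SELECT * FROM read_csv('my_file.csv', auto_type_candidates = ['BIGINT', 'DOUBLE', 'DATE', 'VARCHAR']);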

At this phase, the type detection algorithm goes over the first chunk of data (i.e., 2048 tuples). The process starts on the second valid row (i.e., not a note) of the file. The first row is stored separately and is not used for type detection; whether it is a header is determined later. The type detection runs a per-column, per-value casting trial process to determine the column types. It starts off with a per-column array containing all candidate types. For each value, it attempts a cast to the most specific type remaining in that column's array; if the cast fails, that type is removed and the next candidate is tried. This process continues until the whole chunk is finished.

At this phase, we also determine the format of DATE and TIMESTAMP columns. The following formats are considered for DATE columns: %m-%d-%Y, %m-%d-%y, %d-%m-%Y, %d-%m-%y, %Y-%m-%d, %y-%m-%d, and the following for TIMESTAMP columns: %Y-%m-%dT%H:%M:%S.%f, %Y-%m-%d %H:%M:%S.%f, %m-%d-%Y %I:%M:%S %p, %m-%d-%y %I:%M:%S %p, %d-%m-%Y %H:%M:%S, %d-%m-%y %H:%M:%S, %Y-%m-%d %H:%M:%S, %y-%m-%d %H:%M:%S. Columns that use formats outside this search space must be configured with the dateformat and timestampformat options.
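
As a sketch (the filename and pattern are made up), a file whose dates look like 30/07/1992 could be read by supplying the format explicitly:

SELECT * FROM read_csv('custom_dates.csv', dateformat = '%d/%m/%Y');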

As an example, let’s consider the following CSV file.

Name, Age
,
Jack Black, 54
Kyle Gass, 63.2

The first row [Name, Age] will be stored separately for the header detection phase. The second row [NULL, NULL] will allow us to cast the first and second columns to SQLNULL. Therefore, their type candidate arrays will be the same: [SQLNULL, BOOLEAN, BIGINT, DOUBLE, TIME, DATE, TIMESTAMP, VARCHAR].

In the third row [Jack Black, 54], things become more interesting. With ‘Jack Black,’ the type candidate array for column 0 will exclude all values with higher specificity, as ‘Jack Black’ can only be converted to a VARCHAR. The second column cannot be converted to either SQLNULL or BOOLEAN, but it will succeed as a BIGINT. Hence, the type candidate for the second column will be [BIGINT, DOUBLE, TIME, DATE, TIMESTAMP, VARCHAR].

In the fourth row, we have [Kyle Gass, 63.2]. For the first column, there’s no problem since it’s also a valid VARCHAR. However, for the second column, a cast to BIGINT will fail, but a cast to DOUBLE will succeed. Hence, the new array of candidate types for the second column will be [DOUBLE, TIME, DATE, TIMESTAMP, VARCHAR].

Header Detection

The Header Detection phase simply obtains the first valid line of the CSV file and attempts to cast it to the candidate types in our columns. If there is a cast mismatch, we consider that row as the header; if not, we treat the first row as actual data and automatically generate a header.

In our previous example, the first row was [Name, Age], and the column candidate type arrays were [VARCHAR] and [DOUBLE, TIME, DATE, TIMESTAMP, VARCHAR]. Name is a string and can be converted to VARCHAR. Age is also a string, and attempting to cast it to DOUBLE will fail. Since the casting fails, the auto-detection algorithm considers the first row as a header, resulting in the first column being named Name and the second as Age.

If a header is not detected, column names will be automatically generated with the pattern column${x}, where x represents the column’s position (0-based index) in the CSV file.
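
If the header decision ever needs to be overridden, a sketch of forcing it (the filename is hypothetical):

-- treat the first row as data; columns will be named column0, column1, ...
SELECT * FROM read_csv('no_header.csv', header = false);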

Type Replacement

Now that the auto-detection algorithm has discovered the header names, if the user specifies column types, the types detected by the sniffer will be replaced with them in the Type Replacement phase. For example, we can replace the Age type with FLOAT by using:

SELECT * FROM read_csv('greatest_band_in_the_world.csv', types = {'Age': 'FLOAT'})

This phase is optional and will only be triggered if there are manually defined types.

Type Refinement

The Type Refinement phase performs the same tasks as type detection; the only difference is the granularity of the data on which the casting operator works, which is adjusted for performance reasons. During type detection, we conduct cast checks on a per-column, per-value basis.

In this phase, we transition to a more efficient vectorized casting algorithm. The validation process remains the same as in type detection, with types from type candidate arrays being eliminated if a cast fails.

How Fast is the Sniffing?

To analyze the impact of running DuckDB’s automatic detection, we execute the sniffer on the NYC taxi dataset. The file consists of 19 columns, 10,906,858 tuples and is 1.72 GB in size.

The cost of sniffing the dialect, column names, and types is approximately 4% of the total cost of loading the data.

Name Time (s)
Sniffing 0.11
Loading 2.43

Varying Sampling Size

Sometimes, CSV files can have dialect options or more refined types that appear only later in the CSV file. In those cases, the sample_size option becomes an important tool for users to ensure that the sniffer examines enough data to make the correct decision. However, increasing the sample_size also leads to an increase in the total runtime of the sniffer because it uses more data to detect all possible dialects and types.

Below, you can see how increasing the default sample size by a multiplier (see the x-axis) affects the sniffer's runtime on the NYC dataset. As expected, the total time spent on sniffing increases linearly with the total sample size.

sample benchmark

Varying Number of Columns

The other main characteristic of a CSV file that will affect the auto-detection is the number of columns the file has. Here, we test the sniffer against a varying number of INTEGER type columns in files with 10,906,858 tuples. The results are depicted in the figure below. We can see that from one column to two, we have a steeper increase in runtime. That’s because, for single columns, we have a simplified dialect detection due to the lack of delimiters. For the other columns, as expected, we have a more linear increase in runtime, depending on the number of columns.

sniffer benchmark

Conclusion & Future Work

If you have unusual CSV files and want to query, clean up, or normalize them, DuckDB is already one of the top solutions available. It is very easy to get started. To read a CSV file with the sniffer, you can simply:

SELECT * FROM 'path/to/csv_file.csv';

DuckDB’s CSV auto-detection algorithm is an important tool to facilitate the exploration of CSV files. With its default options, it has a low impact on the total cost of loading and reading CSV files. Its main goal is to always be capable of reading files, doing a best-effort job even on files that are ill-defined.

We have a list of points related to the sniffer that we would like to improve in the future.

  1. Advanced Header Detection. We currently determine if a CSV has a header by identifying a type mismatch between the first valid row and the remainder of the CSV file. However, this can generate false negatives if, for example, all the columns of a CSV are of a type VARCHAR. We plan on enhancing our Header Detection to perform matches with commonly used names for headers.
  2. Adding Accuracy and Speed Benchmarks. We currently implement many accuracy and regression tests; however, due to the CSV format's inherent flexibility, manually creating test cases is quite daunting. The plan moving forward is to implement a whole accuracy and regression test suite using the Pollock Benchmark.
  3. Improved Sampling. We currently execute the auto-detection algorithm on a sequential sample of data. However, it’s very common that new settings are only introduced later in the file (e.g., quotes might be used only in the last 10% of the file). Hence, being able to execute the sniffer in distinct parts of the file can improve accuracy.
  4. Multi-Table CSV File. Multiple tables can be present in the same CSV file, which is a common scenario when exporting spreadsheets to CSVs. Therefore, we would like to be able to identify and support these.
  5. Null-String Detection. We currently do not have an algorithm in place to identify the representation of null strings.
  6. Decimal Precision Detection. We also don’t automatically detect decimal precision yet. This is something that we aim to tackle in the future.
  7. Parallelization. Despite DuckDB’s CSV Reader being fully parallelized, the sniffer is still limited to a single thread. Parallelizing it in a similar fashion to what is done with the CSV Reader (description coming in a future blog post) would significantly enhance sniffing performance and enable full-file sniffing.
  8. Sniffer as a stand-alone function. Currently, users can utilize the DESCRIBE query to acquire information from the sniffer, but it only returns column names and types. We aim to expose the sniffing algorithm as a stand-alone function that provides the complete results from the sniffer. This will allow users to easily configure files using the exact same options without the need to rerun the sniffer.
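
As a small sketch of the current workaround mentioned in the last point, DESCRIBE can be run directly on a CSV file to see the column names and types detected by the sniffer:

DESCRIBE SELECT * FROM 'path/to/csv_file.csv';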
continue reading
2023-10-06 Mark Raasveldt, Hannes Mühleisen, Gabor Szarnyas

DuckCon #4 in Amsterdam

We are excited to hold the next “DuckCon” DuckDB user group meeting for the first time in the birthplace of DuckDB, Amsterdam, the Netherlands. The meeting will take place on February 2, 2024 (Friday) in the OBA Congress Center’s Theater room, five minutes walking distance from Amsterdam Central Station.

Conveniently, the event is co-located in time and space with the FOSDEM conference that will start the following day in Brussels (February 3-4). Participants could for example choose to attend DuckCon, then take a direct train from Amsterdam Central Station to Brussels.

As is traditional in DuckCons, we will start with a talk from DuckDB’s creators Hannes Mühleisen and Mark Raasveldt about the current state of DuckDB and plans for releasing DuckDB version 1.0. This will be followed by two presentations by DuckDB users.

In addition, we will have several lightning talks from the DuckDB community.

Timetable

Time Title Presenter
2:00 PM First session  
2:00 PM Welcome to DuckCon!  
2:05 PM Introductions Hannes Mühleisen
2:10 PM State of the Duck Mark Raasveldt
2:35 PM Hugging Face ❤️ DuckDB Polina Kazakova (Hugging Face)
3:00 PM 🦆s in the Lakehouse Kamaljit Pati, Subash Roul (Fivetran)
3:25 PM Break  
3:40 PM Second session  
3:40 PM The Duck (DB) Feather in Your Parquet Cap Niger Little-Poole (Prequel)
4:05 PM Lightning talk block  
4:05 PM DuckDB Powering Interactive Notebooks Rik Bauwens (DataCamp)
4:10 PM DuckDB Applications in Information Retrieval Arjen de Vries (Radboud University)
4:15 PM How to Analyse a DDoS Quackly Remco Poortinga - van Wijnen (SURF)
4:20 PM How I Test New dbt-core Features with DuckDB Kshitij Aranke (dbt Labs)
4:25 PM Building Tecton’s Feature Engineering Platform on DuckDB Mike Eastham (Tecton AI)
4:30 PM Answering Questions about Football 60x Faster with DuckDB Ian Jenkins (Liverpool FC)
4:35 PM Closing thoughts  
4:40 PM Drink and snacks sponsored by Rill Data  
6:00 PM End of event  

Registration Process

Attendance is free. While supplies last, you can still get a ticket on Eventbrite. You will need to show this ticket at the entrance to attend. Please contact Gabor Szarnyas at [email protected] if you have any questions.

continue reading
2023-09-26 Mark Raasveldt and Hannes Mühleisen

Announcing DuckDB 0.9.0

Image of the Yellow Billed Duck

The DuckDB team is happy to announce the latest DuckDB release (0.9.0). This release is named Undulata after the Yellow-billed duck native to Africa.

To install the new version, please visit the installation guide. The full release notes can be found here.

continue reading
2023-09-15 Richard Wesley

DuckDB's AsOf Joins: Fuzzy Temporal Lookups

TL;DR: DuckDB supports AsOf Joins – a way to match nearby values. They are especially useful for searching event tables for temporal analytics.

Do you have time series data that you want to join, but the timestamps don’t quite match? Or do you want to look up a value that changes over time using the times in another table? And did you end up writing convoluted (and slow) inequality joins to get your results? Then this post is for you!
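
As a minimal sketch of the idea (the trades and prices tables and their columns are hypothetical), an AsOf join picks, for each trade, the most recent price at or before that trade's timestamp:

SELECT t.symbol, t.ts, p.price
FROM trades t
ASOF JOIN prices p
  ON t.symbol = p.symbol AND t.ts >= p.ts;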

continue reading
2023-08-23 Alex Monahan

Even Friendlier SQL with DuckDB

Looks like a Duck ready to boldly go where databases have not gone before

TL;DR: DuckDB continues to push the boundaries of SQL syntax to both simplify queries and make more advanced analyses possible. Highlights include dynamic column selection, queries that start with the FROM clause, function chaining, and list comprehensions. We boldly go where no SQL engine has gone before!

Who says that SQL should stay frozen in time, chained to a 1999 version of the specification? As a comparison, do folks remember what JavaScript felt like before Promises? Those didn’t launch until 2012! It’s clear that innovation at the programming syntax layer can have a profoundly positive impact on an entire language ecosystem.

We believe there are many valid reasons for innovation in the SQL language, among them opportunities to simplify basic queries and also to make more dynamic analyses possible. Many of these features arose from community suggestions! Please let us know your SQL pain points on Discord or GitHub and join us as we change what it feels like to write SQL!

If you have not had a chance to read the first installment in this series, please take a quick look here.

The future is now

The first few enhancements in this list were included in the “Ideas for the Future” section of the prior post.

Reusable column aliases

When working with incremental calculated expressions in a select statement, traditional SQL dialects force you to either write out the full expression for each column or create a common table expression (CTE) around each step of the calculation. Now, any column alias can be reused by subsequent columns within the same select statement. Not only that, but these aliases can be used in the where and order by clauses as well.

Old way 1: Repeat yourself

SELECT 
    'These are the voyages of the starship Enterprise...' AS intro,
    instr('These are the voyages of the starship Enterprise...', 'starship') AS starship_loc,
    substr('These are the voyages of the starship Enterprise...', instr('These are the voyages of the starship Enterprise...', 'starship') + len('starship') + 1) AS trimmed_intro;

Old way 2: All the CTEs

WITH intro_cte AS (
    SELECT
        'These are the voyages of the starship Enterprise...' AS intro
), starship_loc_cte AS (
    SELECT
        intro,
        instr(intro, 'starship') AS starship_loc
    FROM intro_cte
)
SELECT
    intro,
    starship_loc,
    substr(intro, starship_loc + len('starship') + 1) AS trimmed_intro
FROM starship_loc_cte;

New way

SELECT 
     'These are the voyages of the starship Enterprise...' AS intro,
     instr(intro, 'starship') AS starship_loc,
     substr(intro, starship_loc + len('starship') + 1) AS trimmed_intro;
intro starship_loc trimmed_intro
These are the voyages of the starship Enterprise… 30 Enterprise…

Dynamic column selection

Databases typically prefer strictness in column definitions and flexibility in the number of rows. This can help by enforcing data types and recording column level metadata. However, in data science workflows and elsewhere, it is very common to dynamically generate columns (for example during feature engineering).

No longer do you need to know all of your column names up front! DuckDB can select and even modify columns based on regular expression pattern matching, EXCLUDE or REPLACE modifiers, and even lambda functions (see the section on lambda functions below for details!).

Let’s take a look at some facts gathered about the first season of Star Trek. Using DuckDB’s httpfs extension, we can query a csv dataset directly from GitHub. It has several columns so let’s DESCRIBE it.

INSTALL httpfs;
LOAD httpfs;
CREATE TABLE trek_facts AS
    SELECT * FROM 'https://raw.githubusercontent.com/Alex-Monahan/example_datasets/main/Star_Trek-Season_1.csv';

DESCRIBE trek_facts;
column_name column_type null key default extra
season_num BIGINT YES NULL NULL NULL
episode_num BIGINT YES NULL NULL NULL
aired_date DATE YES NULL NULL NULL
cnt_kirk_hookups BIGINT YES NULL NULL NULL
cnt_downed_redshirts BIGINT YES NULL NULL NULL
bool_aliens_almost_took_over_planet BIGINT YES NULL NULL NULL
bool_aliens_almost_took_over_enterprise BIGINT YES NULL NULL NULL
cnt_vulcan_nerve_pinch BIGINT YES NULL NULL NULL
cnt_warp_speed_orders BIGINT YES NULL NULL NULL
highest_warp_speed_issued BIGINT YES NULL NULL NULL
bool_hand_phasers_fired BIGINT YES NULL NULL NULL
bool_ship_phasers_fired BIGINT YES NULL NULL NULL
bool_ship_photon_torpedos_fired BIGINT YES NULL NULL NULL
cnt_transporter_pax BIGINT YES NULL NULL NULL
cnt_damn_it_jim_quote BIGINT YES NULL NULL NULL
cnt_im_givin_her_all_shes_got_quote BIGINT YES NULL NULL NULL
cnt_highly_illogical_quote BIGINT YES NULL NULL NULL
bool_enterprise_saved_the_day BIGINT YES NULL NULL NULL

COLUMNS() with regular expressions

The COLUMNS expression can accept a string parameter that is a regular expression and will return all column names that match the pattern. How did warp change over the first season? Let’s examine any column name that contains the word warp.

SELECT
    episode_num,
    COLUMNS('.*warp.*')
FROM trek_facts;
episode_num cnt_warp_speed_orders highest_warp_speed_issued
0 1 1
1 0 0
2 1 1
3 1 0
27 1 1
28 0 0
29 2 8

The COLUMNS expression can also be wrapped by other functions to apply those functions to each selected column. Let’s simplify the above query to look at the maximum values across all episodes:

SELECT
    MAX(COLUMNS('.*warp.*'))
FROM trek_facts;
max(trek_facts.cnt_warp_speed_orders) max(trek_facts.highest_warp_speed_issued)
5 8

We can also create a WHERE clause that applies across multiple columns. All columns must match the filter criteria, which is equivalent to combining them with AND. Which episodes had at least 2 warp speed orders and at least a warp speed level of 2?

SELECT
    episode_num,
    COLUMNS('.*warp.*')
FROM trek_facts
WHERE
    COLUMNS('.*warp.*') >= 2;
    -- cnt_warp_speed_orders >= 2 
    -- AND 
    -- highest_warp_speed_issued >= 2
episode_num cnt_warp_speed_orders highest_warp_speed_issued
14 3 7
17 2 7
18 2 8
29 2 8

COLUMNS() with EXCLUDE and REPLACE

Individual columns can also be either excluded or replaced prior to applying calculations on them. For example, since our dataset only includes season 1, we do not need to find the MAX of that column. It would be highly illogical.

SELECT
    MAX(COLUMNS(* EXCLUDE season_num))
FROM trek_facts;
max(trek_facts.episode_num) max(trek_facts.aired_date) max(trek_facts.cnt_kirk_hookups) … max(trek_facts.bool_enterprise_saved_the_day)
29 1967-04-13 2 … 1

The REPLACE syntax is also useful when applied to a dynamic set of columns. In this example, we want to convert the dates into timestamps prior to finding the maximum value in each column. Previously this would have required an entire subquery or CTE to pre-process just that single column!

SELECT
    MAX(COLUMNS(* REPLACE aired_date::timestamp AS aired_date))
FROM trek_facts;
max(trek_facts.season_num) max(trek_facts.episode_num) max(aired_date := CAST(aired_date AS TIMESTAMP)) … max(trek_facts.bool_enterprise_saved_the_day)
1 29 1967-04-13 00:00:00 … 1

COLUMNS() with lambda functions

The most flexible way to query a dynamic set of columns is through a lambda function. This allows for any matching criteria to be applied to the names of the columns, not just regular expressions. See more details about lambda functions below.

For example, if using the LIKE syntax is more comfortable, we can select columns matching a LIKE pattern rather than with a regular expression.

SELECT
    episode_num,
    COLUMNS(col -> col LIKE '%warp%')
FROM trek_facts
WHERE
    COLUMNS(col -> col LIKE '%warp%') >= 2;
episode_num cnt_warp_speed_orders highest_warp_speed_issued
14 3 7
17 2 7
18 2 8
29 2 8

Automatic JSON to nested types conversion

The first installment in the series mentioned JSON dot notation references as future work. However, the team has gone even further! Instead of referring to JSON-typed columns using dot notation, JSON can now be automatically parsed into DuckDB’s native types for significantly faster performance, compression, as well as that friendly dot notation!

First, install and load the httpfs and json extensions if they don’t come bundled with the client you are using. Then query a remote JSON file directly as if it were a table!

INSTALL httpfs;
LOAD httpfs;
INSTALL json;
LOAD json;

SELECT 
     starfleet[10].model AS starship 
FROM 'https://raw.githubusercontent.com/vlad-saling/star-trek-ipsum/master/src/content/content.json';
starship
USS Farragut - NCC-1647 - Ship on which James Kirk served as a phaser station operator. Attacked by the Dikironium Cloud Creature, killing half the crew. ad.

Now for some new SQL capabilities beyond the ideas from the prior post!

FROM first in SELECT statements

When building a query, the first thing you need to know is where your data is coming FROM. Well then why is that the second clause in a SELECT statement?? No longer! DuckDB is building SQL as it should have always been - putting the FROM clause first! This addresses one of the longest standing complaints about SQL, and the DuckDB team implemented it in 2 days.

FROM my_table SELECT my_column;

Not only that, the SELECT statement can be completely removed and DuckDB will assume all columns should be SELECTed. Taking a look at a table is now as simple as:

FROM my_table;
-- SELECT * FROM my_table

Other statements like COPY are simplified as well.

COPY (FROM trek_facts) TO 'phaser_filled_facts.parquet';

This has an additional benefit beyond saving keystrokes and staying in a development flow state: autocomplete will have much more context when you begin to choose columns to query. Give the AI a helping hand!

Note that this syntax is completely optional, so your SELECT * FROM keyboard shortcuts are safe, even if they are obsolete… 🙂

Function chaining

Many SQL blogs advise the use of CTEs instead of subqueries. Among other benefits, they are much more readable. Operations are compartmentalized into discrete chunks and they can be read in order top to bottom instead of forcing the reader to work their way inside out.

DuckDB enables the same interpretability improvement for every scalar function! Use the dot operator to chain functions together, just like in Python. The prior expression in the chain is used as the first argument to the subsequent function.

SELECT 
     ('Make it so')
          .UPPER()
          .string_split(' ')
          .list_aggr('string_agg','.')
          .concat('.') AS im_not_messing_around_number_one;
im_not_messing_around_number_one
MAKE.IT.SO.

Now compare that with the old way…

SELECT 
     concat(
          list_aggr(
               string_split(
                    UPPER('Make it stop'),
               ' '),
          'string_agg','.'),
     '.') AS oof;
oof
MAKE.IT.STOP.

Union by name

DuckDB aims to blend the best of databases and dataframes. This new syntax is inspired by the concat function in Pandas. Rather than vertically stacking tables based on column position, columns are matched by name and stacked accordingly. Simply replace UNION with UNION BY NAME or UNION ALL with UNION ALL BY NAME.

For example, we had to add some new alien species proverbs in The Next Generation:

CREATE TABLE proverbs AS
     SELECT 
          'Revenge is a dish best served cold' AS klingon_proverb 
     UNION ALL BY NAME 
     SELECT 
          'You will be assimilated' AS borg_proverb,
          'If winning is not important, why keep score?' AS klingon_proverb;

FROM proverbs;
klingon_proverb borg_proverb
Revenge is a dish best served cold NULL
If winning is not important, why keep score? You will be assimilated

This approach has additional benefits. As seen above, not only can tables with different column orders be combined, but so can tables with different numbers of columns entirely. This is helpful as schemas migrate, and is particularly useful for DuckDB’s multi-file reading capabilities.
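
For instance, a sketch of reading several files with drifting schemas (the filenames are hypothetical) while lining the columns up by name:

SELECT * FROM read_csv(['season_1.csv', 'season_2.csv'], union_by_name = true);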

Insert by name

Another common situation where column order is strict in SQL is when inserting data into a table. Either the columns must match the order exactly, or all of the column names must be repeated in two locations within the query.

Instead, add the keywords BY NAME after the table name when inserting. Any subset of the columns in the table in any order can be inserted.

INSERT INTO proverbs BY NAME 
     SELECT 'Resistance is futile' AS borg_proverb;

SELECT * FROM proverbs;
klingon_proverb borg_proverb
Revenge is a dish best served cold NULL
If winning is not important, why keep score? You will be assimilated
NULL Resistance is futile

Dynamic PIVOT and UNPIVOT

Historically, databases are not well-suited for pivoting operations. However, DuckDB’s PIVOT and UNPIVOT clauses can create or stack dynamic column names for a truly flexible pivoting capability! In addition to that flexibility, DuckDB also provides both the SQL standard syntax and a friendlier shorthand.

For example, let’s take a look at some procurement forecast data just as the Earth-Romulan war was beginning:

CREATE TABLE purchases (item VARCHAR, year INT, count INT);

INSERT INTO purchases
    VALUES ('phasers', 2155, 1035), ('phasers', 2156, 25039), ('phasers', 2157, 95000),
           ('photon torpedoes', 2155, 255), ('photon torpedoes', 2156, 17899), ('photon torpedoes', 2157, 87492);

FROM purchases;
item year count
phasers 2155 1035
phasers 2156 25039
phasers 2157 95000
photon torpedoes 2155 255
photon torpedoes 2156 17899
photon torpedoes 2157 87492

It is easier to compare our phaser needs to our photon torpedo needs if each year’s data is visually close together. Let’s pivot this into a friendlier format! Each year should receive its own column (but each year shouldn’t need to be specified in the query!), we want to sum up the total count, and we still want to keep a separate group (row) for each item.

CREATE TABLE pivoted_purchases AS
     PIVOT purchases 
          ON year 
          USING SUM(count) 
          GROUP BY item;

FROM pivoted_purchases;
item 2155 2156 2157
phasers 1035 25039 95000
photon torpedoes 255 17899 87492

Looks like photon torpedoes went on sale…

Now imagine the reverse situation. Scotty in engineering has been visually analyzing and manually constructing his purchases forecast. He prefers things pivoted so it’s easier to read. Now you need to fit it back into the database! This war may go on for a bit, so you may need to do this again next year. Let’s write an UNPIVOT query to return to the original format that can handle any year.

The COLUMNS expression will use all columns except item. After stacking, the column containing the column names from pivoted_purchases should be renamed to year, and the values within those columns represent the count. The result is the same dataset as the original.

UNPIVOT pivoted_purchases
     ON COLUMNS(* EXCLUDE item)
     INTO
          NAME year
          VALUE count;
item year count
phasers 2155 1035
phasers 2156 25039
phasers 2157 95000
photon torpedoes 2155 255
photon torpedoes 2156 17899
photon torpedoes 2157 87492

More examples are included as a part of our DuckDB 0.8.0 announcement post, and the PIVOT and UNPIVOT documentation pages highlight more complex queries.

Stay tuned for a future post to cover what is happening behind the scenes!

List lambda functions

List lambdas allow for operations to be applied to each item in a list. These do not need to be pre-defined - they are created on the fly within the query.

In this example, a lambda function is used in combination with the list_transform function to shorten each official ship name.

SELECT 
     (['Enterprise NCC-1701', 'Voyager NCC-74656', 'Discovery NCC-1031'])
          .list_transform(x -> x.string_split(' ')[1]) AS short_name;
short_name
[Enterprise, Voyager, Discovery]

Lambdas can also be used to filter down the items in a list. The lambda returns a list of booleans, which is used by the list_filter function to select specific items. The contains function is using the function chaining described earlier.

SELECT 
     (['Enterprise NCC-1701', 'Voyager NCC-74656', 'Discovery NCC-1031'])
          .list_filter(x -> x.contains('1701')) AS the_original;
the_original
[Enterprise NCC-1701]

List comprehensions

What if there was a simple syntax to both modify and filter a list? DuckDB takes inspiration from Python’s approach to list comprehensions to dramatically simplify the above examples. List comprehensions are syntactic sugar - these queries are rewritten into lambda expressions behind the scenes!

Within brackets, first specify the transformation that is desired, then indicate which list should be iterated over, and finally include the filter criteria.

SELECT 
     [x.string_split(' ')[1] 
     FOR x IN ['Enterprise NCC-1701', 'Voyager NCC-74656', 'Discovery NCC-1031'] 
     IF x.contains('1701')] AS ready_to_boldly_go;
ready_to_boldly_go
[Enterprise]

Exploding struct.*

A struct in DuckDB is a set of key/value pairs. Behind the scenes, a struct is stored with a separate column for each key. As a result, it is computationally easy to explode a struct into separate columns, and now it is also syntactically simple as well! This is another example of allowing SQL to handle dynamic column names.

WITH damage_report AS (
     SELECT {'gold_casualties':5, 'blue_casualties':15, 'red_casualties': 10000} AS casualties
) 
FROM damage_report
SELECT 
     casualties.*;
gold_casualties blue_casualties red_casualties
5 15 10000

Automatic struct creation

DuckDB exposes an easy way to convert any table into a single-column struct. Instead of SELECTing column names, SELECT the table name itself.

WITH officers AS (
     SELECT 'Captain' AS rank, 'Jean-Luc Picard' AS name 
     UNION ALL 
     SELECT 'Lieutenant Commander', 'Data'
) 
FROM officers 
SELECT officers;
officers
{‘rank’: Captain, ‘name’: Jean-Luc Picard}
{‘rank’: Lieutenant Commander, ‘name’: Data}

Union data type

DuckDB utilizes strong typing to provide high performance and enforce data quality. However, DuckDB is also as forgiving as possible using approaches like implicit casting to avoid always having to cast between data types.

Another way DuckDB enables flexibility is the new UNION data type. A UNION data type allows for a single column to contain multiple types of values. This can be thought of as an “opt-in” to SQLite’s flexible data typing rules (the opposite direction of SQLite’s recently announced strict tables).

By default DuckDB will seek the common denominator of data types when combining tables together. The below query results in a VARCHAR column:

SELECT 'The Motion Picture' AS movie UNION ALL 
SELECT 2 UNION ALL 
SELECT 3 UNION ALL 
SELECT 4 UNION ALL 
SELECT 5 UNION ALL 
SELECT 6 UNION ALL 
SELECT 'First Contact';
movie
varchar
The Motion Picture
First Contact
6
5
4
3
2

However, if a UNION type is used, each individual row retains its original data type. A UNION is defined using key-value pairs with the key as a name and the value as the data type. This also allows the specific data types to be pulled out as individual columns:

CREATE TABLE movies (
     movie UNION(num INT, name VARCHAR)
);
INSERT INTO movies 
     VALUES ('The Motion Picture'), (2), (3), (4), (5), (6), ('First Contact');

FROM movies 
SELECT 
     movie,
     union_tag(movie) AS type,
     movie.name,
     movie.num;
movie type name num
union(num integer, name varchar) varchar varchar int32
The Motion Picture name The Motion Picture  
2 num   2
3 num   3
4 num   4
5 num   5
6 num   6
First Contact name First Contact  

Additional friendly features

Several other friendly features are worth mentioning and some are powerful enough to warrant their own blog posts.

DuckDB takes a nod from the describe function in Pandas and implements a SUMMARIZE keyword that will calculate a variety of statistics about each column in a dataset for a quick, high-level overview. Simply prepend SUMMARIZE to any table or SELECT statement.
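
For example, reusing the trek_facts table from earlier, the whole dataset can be profiled in one statement:

SUMMARIZE trek_facts;
-- or summarize the result of any query
SUMMARIZE SELECT episode_num, cnt_warp_speed_orders FROM trek_facts;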

Have a look at the correlated subqueries post to see how to use subqueries that refer to each others’ columns. DuckDB’s advanced optimizer improves correlated subquery performance by orders of magnitude, allowing for queries to be expressed as naturally as possible. What was once an anti-pattern for performance reasons can now be used freely!

DuckDB has added more ways to JOIN tables together that make expressing common calculations much easier. Some like LATERAL, ASOF, SEMI, and ANTI joins are present in other systems, but have high-performance implementations in DuckDB. DuckDB also adds a new POSITIONAL join that combines by the row numbers in each table to match the commonly used Pandas capability of joining on row number indexes. See the JOIN documentation for details, and look out for a blog post describing DuckDB’s state of the art ASOF joins!
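
As a minimal sketch with hypothetical tables tbl_a and tbl_b (which must have the same number of rows), a positional join simply pairs up rows by their position:

SELECT * FROM tbl_a POSITIONAL JOIN tbl_b;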

Summary and future work

DuckDB aims to be the easiest database to use. Fundamental architectural decisions to be in-process, have zero dependencies, and have strong typing contribute to this goal, but the friendliness of its SQL dialect has a strong impact as well. By extending the industry-standard PostgreSQL dialect, DuckDB aims to provide the simplest way to express the data transformations you need. These changes range from altering the ancient clause order of the SELECT statement to begin with FROM, to allowing a fundamentally new way to use functions with chaining, to advanced nested data type calculations like list comprehensions. Each of these features is available in the 0.8.1 release.

Future work for friendlier SQL includes:

  • Lambda functions with more than 1 argument, like list_zip
  • Underscores as digit separators (Ex: 1_000_000 instead of 1000000)
  • Extension user experience, including autoloading
  • Improvements to file globbing
  • Your suggestions!

Please let us know what areas of SQL can be improved! We welcome your feedback on Discord or GitHub.

Live long and prosper! 🖖

continue reading
2023-08-04 Pedro Holanda

DuckDB ADBC - Zero-Copy data transfer via Arrow Database Connectivity

DuckDB-Arrow

TL;DR: DuckDB has added support for Arrow Database Connectivity (ADBC), an API standard that enables efficient data ingestion and retrieval from database systems, similar to the Open Database Connectivity (ODBC) interface. However, unlike ODBC, ADBC specifically caters to the columnar storage model, facilitating fast data transfers between a columnar database and an external application.

Database interface standards allow developers to write application code that is independent of the underlying database management system (DBMS) being used. DuckDB has supported two standards that have gained popularity in the past few decades: the core interface of ODBC and Java Database Connectivity (JDBC). Both interfaces are designed to fully support database connectivity and management, with JDBC catering to the Java environment. With these APIs, developers can query DBMSs agnostically, retrieve query results, run prepared statements, and manage connections.

These interfaces were designed in the early 90s, when row-wise database systems reigned supreme. As a result, they were primarily intended for transferring data in a row-wise format. However, in the mid-2000s, columnar database systems started gaining a lot of traction due to their drastic performance advantages for data analysis (you can watch me give a brief demonstration of this difference at EuroPython). This means that these APIs offer no support for transferring data in a columnar format (or, in the case of ODBC, some support with a lot of added complexity). In practice, when analytical, column-wise systems like DuckDB make use of these APIs, converting the data between these representation formats becomes a major bottleneck.

The figure below depicts how a developer can use these APIs to query a DuckDB database. For example, developers can submit SQL queries via the API, which then uses a DuckDB driver to internally call the proper functions. A query result is then produced in DuckDB’s internal columnar representation, and the driver takes care of transforming it to the JDBC or ODBC row-wise result format. This transformation has significant costs for rearranging and copying the data, quickly becoming a major bottleneck.

DuckDB-JDBC-ODBC

To overcome this transformation cost, ADBC has been proposed, with a generic API to support database operations while using the Apache Arrow memory format to send data in and out of the DBMS. DuckDB now supports the ADBC specification. Due to DuckDB’s zero-copy integration with the Arrow format, using ADBC as an interface is rather efficient, since there is only a small constant cost to transform DuckDB query results to the Arrow format.

The figure below depicts the query execution flow when using ADBC. Note that the main difference from ODBC/JDBC is that the result does not need to be transformed to a row-wise format.

DuckDB-ADBC

Quick Tour

For our quick tour, we will illustrate an example of round-tripping data using DuckDB-ADBC via Python. Please note that DuckDB-ADBC can also be utilized with other programming languages. Specifically, you can find DuckDB-ADBC examples and tests, along with C++ usage examples, in the DuckDB GitHub repository. For convenience, you can also find a ready-to-run version of this tour in a Colab notebook. If you would like to see a more detailed explanation of the DuckDB-ADBC API or view a C++ example, please refer to our documentation page.

Setup

For this example, you must have a dynamic library from the latest bleeding-edge version of DuckDB, pyarrow, and the adbc-driver-manager. The ADBC driver manager is a Python package developed by Voltron Data. The driver manager is compliant with DB-API 2.0. It wraps ADBC, making its usage more straightforward. You can find the documentation of the ADBC Driver Manager here.

Note: While DuckDB is already DB-API compliant in Python, what sets ADBC apart is that you do not need a DuckDB module installed and loaded. Additionally, unlike the DB-API, it does not utilize row-wise as its data transfer format of choice.

pip install pyarrow
pip install adbc-driver-manager

Insert Data

First, we need to include the necessary libraries that will be used in this tour. Mainly, PyArrow and the DBAPI from the ADBC Driver Manager.

import pyarrow
from adbc_driver_manager import dbapi

Next, we can create a connection via ADBC with DuckDB. This connection simply requires the path to DuckDB’s driver and the entrypoint function name. DuckDB’s entrypoint is duckdb_adbc_init. By default, connections are established with an in-memory database. However, if desired, you have the option to specify the path variable and connect to a local duckdb instance, allowing you to store the data on disk. Note that these are the only variables in ADBC that are not DBMS agnostic; instead, they are set by the user, often through a configuration file.

 con = dbapi.connect(driver="path/to/duckdb.lib", entrypoint="duckdb_adbc_init", db_kwargs={"path": "test.db"})

To insert the data, we can simply call the adbc_ingest function with a cursor from our connection. It requires the name of the table we want to perform the ingestion to and the Arrow Python object we want to ingest. This function also has two modes: append, where data is appended to an existing table, and create, where the table does not exist yet and will be created with the input data. By default, it’s set to create, so we don’t need to define it here.

table = pyarrow.table(
     [
          ["Tenacious D", "Backstreet Boys", "Wu Tang Clan"],
          [4, 10, 7]

     ],
     names=["Name", "Albums"],
)

with con.cursor() as cursor:
     cursor.adbc_ingest("Bands", table)

After calling adbc_ingest, the table is created in the DuckDB connection and the data is fully inserted.

Read Data

To read data from DuckDB, one simply needs to use the execute function with a SQL query and then return the cursor’s result to the desired Arrow format, such as a PyArrow Table in this example.

with con.cursor() as cursor:
     cursor.execute("SELECT * FROM Bands")
     cursor.fetch_arrow_table()

Benchmark ADBC vs ODBC

In our benchmark section, we aim to evaluate the differences in data reading from DuckDB via ADBC and ODBC. This benchmark was executed on an Apple M1 Max with 32GB of RAM and involves outputting and inserting the Lineitem table of TPC-H SF 1. You can find the repository with the code used to run this benchmark here.

Name Time (s)
ODBC 28.149
ADBC 0.724

The time difference between ODBC and ADBC is 38x. This significant contrast results from the extra allocations and copies that exist in ODBC.
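
For reference, a rough sketch of the ADBC read path being measured could look like the snippet below. The database file tpch_sf1.db and the lineitem table name are illustrative assumptions, and the driver path is the same placeholder used earlier; this is not the exact benchmark harness.

import time

from adbc_driver_manager import dbapi

# Sketch: assumes a DuckDB database file that already contains TPC-H SF1.
con = dbapi.connect(
    driver="path/to/duckdb.lib",
    entrypoint="duckdb_adbc_init",
    db_kwargs={"path": "tpch_sf1.db"},
)

start = time.perf_counter()
with con.cursor() as cursor:
    cursor.execute("SELECT * FROM lineitem")
    table = cursor.fetch_arrow_table()
print(f"fetched {table.num_rows} rows in {time.perf_counter() - start:.3f}s")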

Conclusions

DuckDB now supports the ADBC standard for database connection. ADBC is particularly efficient when combined with DuckDB, thanks to its use of the Arrow zero-copy integration.

ADBC is particularly interesting because it can drastically decrease the cost of interactions between analytical systems compared to ODBC. For example, if software that already supports ODBC, such as MS Excel, were to implement ADBC, integrations with columnar systems like DuckDB could benefit from this significant difference in performance.

DuckDB-ADBC is currently supported via the C Interface and through the Python ADBC Driver Manager. We will add more extensive tutorials for other languages to our documentation webpage. Please feel free to let us know your preferred language for interacting with DuckDB via ADBC!

As always, we are happy to hear your thoughts! Feel free to drop us an email if you have any suggestions, comments or questions!

Last but not least, if you encounter any problems using ADBC, please open an issue in DuckDB’s issue tracker.

continue reading
2023-07-07Pedro Holanda, Thijs Bruineman and Phillip Cloud

From Waddle to Flying: Quickly expanding DuckDB's functionality with Scalar Python UDFs

DuckDB-Waddle-fly

TL;DR: DuckDB now supports vectorized Scalar Python User Defined Functions (UDFs). By implementing Python UDFs, users can easily expand the functionality of DuckDB while taking advantage of DuckDB’s fast execution model, SQL and data safety.

User Defined Functions (UDFs) enable users to extend the functionality of a Database Management System (DBMS) to perform domain-specific tasks that are not implemented as built-in functions. For instance, users who frequently need to export private data can benefit from an anonymization function that masks the local part of an email while preserving the domain. Ideally, this function would be executed directly in the DBMS. This approach offers several advantages:

1) Performance. The function can be executed using the same execution model as the DBMS (e.g., streaming results, beyond-memory/out-of-core execution), without any unnecessary data transformations.

2) Easy Use. UDFs can be seamlessly integrated into SQL queries, allowing users to leverage the power of SQL to call the functions. This eliminates the need for passing data through a separate database connector and executing external code. The functions can be utilized in various SQL contexts (e.g., subqueries, join conditions).

3) Safety. The sensitive data never leaves the DBMS process.

There are two main reasons users often refrain from implementing UDFs.

1) Security concerns. Since UDFs are custom code created by users and executed within the DBMS process, there is a potential risk of crashing the server. For DuckDB, an embedded database, this concern is mitigated, as each analyst runs their own DuckDB process separately; the impact on server stability is therefore not a significant worry.

2) Difficulty of implementation. High-performance UDFs are typically only supported in low-level languages, and UDFs in higher-level languages like Python incur significant performance costs. Consequently, many users cannot quickly implement their UDFs without investing a significant amount of time in learning a low-level language and understanding the internal details of the DBMS.

DuckDB followed a similar approach. As a DBMS tailored for analytical tasks, performance is a key consideration, leading to the implementation of its core in C++. Consequently, the initial focus of extensibility efforts was centered around C++. However, this duck is not limited to just waddling; it can also fly. So we are delighted to announce the recent addition of Scalar Python UDFs to DuckDB.

DuckDB provides support for two distinct types of Python UDFs, differing in the Python object used for communication between DuckDB’s native data types and the Python process. These communication layers include support for Python built-in types and PyArrow Tables.

The two approaches exhibit two key differences:

1) Zero-Copy. PyArrow Tables leverage our zero-copy integration with Arrow, enabling efficient translation of data types to Python-Land with zero-copy cost.

2) Vectorization. PyArrow Table functions operate on a chunk level, processing chunks of data containing up to 2048 rows. This approach maximizes cache locality and leverages vectorization. On the other hand, the built-in types UDF implementation operates on a per-row basis.
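
To make the chunk-level behavior concrete, here is a minimal sketch (the function and query are made up for illustration) that registers a PyArrow UDF which simply reports the size of each chunk it receives:

import duckdb
import pyarrow.compute as pc
from duckdb.typing import BIGINT

# Sketch: the UDF is called once per chunk (up to ~2048 values), not once per row.
def chunk_probe(x):
    print(f"received a chunk of {len(x)} values")
    return pc.add(x, 0)

con = duckdb.connect()
con.create_function('chunk_probe', chunk_probe, [BIGINT], BIGINT, type='arrow')
con.sql("SELECT count(chunk_probe(i)) FROM range(5000) tbl(i)").fetchall()
# prints a handful of chunk sizes instead of 5,000 per-row calls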

This blog post aims to demonstrate how you can extend DuckDB using Python UDFs, with a particular emphasis on PyArrow-powered UDFs. In our quick-tour section, we will provide examples using the PyArrow UDF types. For those interested in benchmarks, you can jump ahead to the benchmark section below. If you want to see a detailed description of the Python UDF API, please refer to our documentation.

Python UDFs

This section depicts several practical examples of using Python UDFs. Each example uses a different type of Python UDF.

Quick-Tour

To demonstrate the usage of Python UDFs in DuckDB, let’s consider the following example. We have a dictionary called world_cup_titles that maps countries to the number of World Cups they have won. We want to create a Python UDF that takes a country name as input, searches for the corresponding value in the dictionary, and returns the number of World Cups won by that country. If the country is not found in the dictionary, the UDF will return NULL.

Here’s an example implementation:

import duckdb
from duckdb.typing import *

con = duckdb.connect()

# Dictionary that maps countries and world cups they won
world_cup_titles = {
    "Brazil": 5,
    "Germany": 4,
    "Italy": 4,
    "Argentina": 2,
    "Uruguay": 2,
    "France": 2,
    "England": 1,
    "Spain": 1
}

# Function that will be registered as a UDF; it simply does a lookup in the Python dictionary
def world_cups(x):
     return world_cup_titles.get(x)

# We register the function
con.create_function("wc_titles", world_cups, [VARCHAR], INTEGER)

That’s it, the function is then registered and ready to be called through SQL.

# Let's create an example countries table with the countries we are interested in using
con.execute("CREATE TABLE countries(country VARCHAR)")
con.execute("INSERT INTO countries VALUES ('Brazil'), ('Germany'), ('Italy'), ('Argentina'), ('Uruguay'), ('France'), ('England'), ('Spain'), ('Netherlands')")
# We can simply call the function through SQL, and even use the function return to eliminate the countries that never won a world cup
con.sql("SELECT country, wc_titles(country) as world_cups from countries").fetchall()
# [('Brazil', 5), ('Germany', 4), ('Italy', 4), ('Argentina', 2), ('Uruguay', 2), ('France', 2), ('England', 1), ('Spain', 1), ('Netherlands', None)]

Generating Fake Data with Faker (Built-In Type UDF)

Here is an example that demonstrates the usage of the Faker library to create a scalar function in DuckDB that returns randomly generated dates. The function, named random_date, does not require any inputs and outputs a DATE column. Since Faker utilizes built-in Python types, the function returns them directly. One important thing to notice is that a function that is not deterministic based on its input must be marked as having side_effects.

import duckdb

# By importing duckdb.typing we can specify DuckDB Types directly without using strings
from duckdb.typing import *

from faker import Faker

# Our Python UDF generates a random date every time it's called
def random_date():
     fake = Faker()
     return fake.date_between()

We then have to register the Python function in DuckDB using create_function. Since our function doesn’t require any inputs, we can pass an empty list as the argument_type_list. As the function returns a date, we specify DATE from duckdb.typing as the return_type. Note that since our random_date() function returns a built-in Python type (datetime.date), we don’t need to specify the UDF type.

# To exemplify the effect of side-effect, let's first run the function without marking it.
duckdb.create_function('random_date', random_date, [], DATE)

# After registration, we can use the function directly via SQL
# Notice that without side_effect=True, it's not guaranteed that the function will be re-evaluated.
res = duckdb.sql('select random_date() from range (3)').fetchall()
# [(datetime.date(2003, 8, 3),), (datetime.date(2003, 8, 3),), (datetime.date(2003, 8, 3),)]

# Now let's re-add the function with side-effects marked as true.
duckdb.remove_function('random_date')
duckdb.create_function('random_date', random_date, [], DATE, side_effects=True)
res = duckdb.sql('select random_date() from range (3)').fetchall()
# [(datetime.date(2020, 11, 29),), (datetime.date(2009, 5, 18),), (datetime.date(2018, 5, 24),)]

Swap String Case (PyArrow Type UDF)

One issue with using built-in types is that you don’t benefit from zero-copy, vectorization and cache locality. Using PyArrow as a UDF type should be favored to leverage these optimizations.

To demonstrate a PyArrow function, let’s consider a simple example where we want to transform lowercase characters to uppercase and uppercase characters to lowercase. Fortunately, PyArrow already has a function for this in the compute engine, and it’s as simple as calling pc.utf8_swapcase(x).

import duckdb

# By importing duckdb.typing we can specify DuckDB Types directly without using strings
from duckdb.typing import *

import pyarrow as pa
import pyarrow.compute as pc

def swap_case(x):
     # Swap the case of the 'column' using utf8_swapcase and return the result
     return pc.utf8_swapcase(x)

con = duckdb.connect()
# To register the function, we must define its type to be 'arrow'
con.create_function('swap_case', swap_case, [VARCHAR], VARCHAR, type='arrow')

res = con.sql("select swap_case('PEDRO HOLANDA')").fetchall()
# [('pedro holanda',)]

Predicting Taxi Fare costs (Ibis + PyArrow UDF)

Python UDFs offer significant power as they enable users to leverage the extensive Python ecosystem and tools, including libraries like PyTorch and Tensorflow that efficiently implement machine learning operations.

Additionally, the Ibis project offers a DataFrame API with great DuckDB integration and supports both DuckDB’s native Python and PyArrow UDFs.

In this example, we demonstrate the usage of a pre-built PyTorch model to estimate taxi fare costs based on the traveled distance. You can find a complete example in this blog post by the Ibis team.

import torch
import pyarrow as pa
import ibis
import ibis.expr.datatypes as dt

from ibis.expr.operations import udf


# The code to generate the model is not specified in this snippet, please refer to the provided link for more information
model = ...

# Function that uses the model and a traveled distance input tensor to predict values, please refer to the provided link for more information
def predict_linear_regression(model, tensor: torch.Tensor) -> torch.Tensor:
    ...


# Indicate to ibis that this is a scalar user-defined function whose input format is pyarrow
@udf.scalar.pyarrow
def predict_fare(x: dt.float64) -> dt.float32:
    # `x` is a pyarrow.ChunkedArray; the `dt.float64` annotation indicates the element type of the ChunkedArray.

    # Transform the data from PyArrow to the required torch tensor format and dimension.
    tensor = torch.from_numpy(x.to_numpy()[:, None]).float()

    # Call the actual prediction function, which also returns a torch tensor.
    predicted = predict_linear_regression(model, tensor).ravel()
    return pa.array(predicted.numpy())


# Execute a query on the NYC Taxi parquet file to showcase our model's predictions, the actual fare amount, and the distance.
expr = (
    ibis.read_parquet('yellow_tripdata_2016-02.parquet')
    .mutate(
        "fare_amount",
        "trip_distance",
        predicted_fare=lambda t: predict_fare(t.trip_distance),
    )
)
df = expr.execute()

By utilizing Python UDFs in DuckDB with Ibis, you can seamlessly incorporate machine learning models and perform predictions directly within your Ibis code and SQL queries. The example demonstrates how to predict taxi fare costs based on distance using a PyTorch model, showcasing the integration of machine learning capabilities within DuckDB’s SQL environment driven by Ibis.

Benchmarks

In this section, we perform simple benchmark comparisons to demonstrate the performance differences between the two types of Python UDFs, and between Python UDFs and external functions. The benchmarks measure execution time and peak memory consumption. Each benchmark is executed 5 times, and the median value is reported. The benchmarks are conducted on an Apple M1 machine with 16GB of RAM.

Built-In Python Vs PyArrow

To benchmark these UDF types, we create UDFs that take an integer column as input, add one to each value, and return the result. The code used for this benchmark section can be found here.

import pyarrow.compute as pc
import duckdb
import pyarrow as pa

# Built-In UDF
def add_built_in_type(x):
     return x + 1

#Arrow UDF
def add_arrow_type(x):
     return pc.add(x,1)

con = duckdb.connect()

# Registration
con.create_function('add_built_in_type', add_built_in_type, ['BIGINT'], 'BIGINT', type='native')
con.create_function('add_arrow_type', add_arrow_type, ['BIGINT'], 'BIGINT', type='arrow')

# Integer View with 10,000,000 elements.
con.sql("""
     select
          i
          from range(10000000) tbl(i);
""").to_view("numbers")

# Calls for both UDFs
native_res = con.sql("select sum(add_built_in_type(i)) from numbers").fetchall()
arrow_res = con.sql("select sum(add_arrow_type(i)) from numbers").fetchall()

Name Time (s)
Built-In 5.37
PyArrow 0.35

We can observe a performance difference of more than one order of magnitude between the two UDFs. The difference in performance is primarily due to three factors:

1) In Python, object construction and general use are rather slow, due to several factors including automatic memory management, interpretation, and dynamic typing.

2) The PyArrow UDF does not require any data copying.

3) The PyArrow UDF is executed in a vectorized fashion, processing chunks of data instead of individual rows.
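
As a rough way to reproduce these numbers, a minimal timing sketch is shown below. It assumes the connection, the numbers view, and the functions registered in the snippet above, and uses simple wall-clock timing rather than the median-of-5 methodology reported in the table.

import time

# Sketch: reuses `con`, the `numbers` view, and the registered UDFs from above.
def time_query(query):
    start = time.perf_counter()
    con.sql(query).fetchall()
    return time.perf_counter() - start

print("built-in:", time_query("select sum(add_built_in_type(i)) from numbers"))
print("pyarrow: ", time_query("select sum(add_arrow_type(i)) from numbers"))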

Python UDFs Vs External Functions

Here we compare the usage of a Python UDF with an external function. In this case, we have a function that calculates the sum of the lengths of all strings in a column. You can find the code used for this benchmark section here.

import duckdb
import pyarrow as pa

# Function used in UDF
def string_length_arrow(x):
     tuples = len(x)
     values = [len(i.as_py()) if i.as_py() is not None else 0 for i in x]
     array = pa.array(values, type=pa.int32(), size=tuples)
     return array


# Same Function but external to the database
def exec_external(con):
     arrow_table = con.sql("select i from strings tbl(i)").arrow()
     arrow_column = arrow_table['i']
     tuples = len(arrow_column)
     values = [len(i.as_py()) if i.as_py() is not None else 0 for i in arrow_column]
     array = pa.array(values, type=pa.int32(), size=tuples)
     arrow_tbl = pa.Table.from_arrays([array], names=['i'])
     return con.sql("select sum(i) from arrow_tbl").fetchall()


con = duckdb.connect()
con.create_function('strlen_arrow', string_length_arrow, ['VARCHAR'], int, type='arrow')

con.sql("""
     select
          case when i != 0 and i % 42 = 0
          then
               NULL
          else
               repeat(chr((65 + (i % 26))::INTEGER), (4 + (i % 12))) end
          from range(10000000) tbl(i);
""").to_view("strings")

con.sql("select sum(strlen_arrow(i)) from strings tbl(i)").fetchall()

exec_external(con)

Name Time (s) Peak Memory Consumption (MB)
External 5.65 584.032
UDF 5.63 112.848

Here we can see that there is no significant regression in performance when utilizing UDFs. However, you still have the benefits of safer execution and the utilization of SQL. In our example, we can also notice that the external function materializes the entire query, resulting in a 5x higher peak memory consumption compared to the UDF approach.

Conclusions and Further Development

Scalar Python UDFs are now supported in DuckDB, marking a significant milestone in extending the functionality of the database. This enhancement empowers users to perform complex computations using a high-level language. Additionally, Python UDFs can leverage DuckDB’s zero-copy integration with Arrow, eliminating data transfer costs and ensuring efficient query execution.

While the introduction of Python UDFs is a major step forward, our work in this area is ongoing. Our roadmap includes the following focus areas:

  1. Aggregate/Table-Producing UDFs: Currently, users can create Scalar UDFs, but we are actively working on supporting Aggregation Functions (which perform calculations on a set of values and return a single result) and Table-Producing Functions (which return tables without limitations on the number of columns and rows).

  2. Types: Scalar Python UDFs currently support most DuckDB types, with the exception of ENUM types and BIT types. We are working towards expanding the type support to ensure comprehensive functionality.

As always, we are happy to hear your thoughts! Feel free to drop us an email if you have any suggestions, comments or questions.

Last but not least, if you encounter any problems using our Python UDFs, please open an issue in DuckDB’s issue tracker.

continue reading
2023-05-26Mark Raasveldt

Correlated Subqueries in SQL

Subqueries in SQL are a powerful abstraction that allow simple queries to be used as composable building blocks. They allow you to break down complex problems into smaller parts, and subsequently make it easier to write, understand and maintain large and complex queries.

DuckDB uses a state-of-the-art subquery decorrelation optimizer that allows subqueries to be executed very efficiently. As a result, users can freely use subqueries to create expressive queries without having to worry about manually rewriting subqueries into joins. For more information, skip to the Performance section.

Types of Subqueries

SQL subqueries exist in two main forms: subqueries as expressions and subqueries as tables. Subqueries that are used as expressions can be used in the SELECT or WHERE clauses. Subqueries that are used as tables can be used in the FROM clause. In this blog post we will focus on subqueries used as expressions. A future blog post will discuss subqueries as tables.

Subqueries as expressions exist in three forms.

  • Scalar subqueries
  • EXISTS
  • IN/ANY/ALL

All of the subqueries can be either correlated or uncorrelated. An uncorrelated subquery is a query that is independent from the outer query. A correlated subquery is a subquery that contains expressions from the outer query. Correlated subqueries can be seen as parameterized subqueries.

Uncorrelated Scalar Subqueries

Uncorrelated scalar subqueries can only return a single value. That constant value is then substituted and used in the query. As an example of why this is useful - imagine that we want to select all of the shortest flights in our dataset. We could run the following query to obtain the shortest flight distance:

SELECT MIN(distance)
FROM ontime;
min(distance)
31.0

We could manually take this distance and use it in the WHERE clause to obtain all flights on this route.

SELECT uniquecarrier, origincityname, destcityname, flightdate
FROM ontime
WHERE distance=31.0;
uniquecarrier origincityname destcityname flightdate
AS Petersburg, AK Wrangell, AK 2017-01-15
AS Wrangell, AK Petersburg, AK 2017-01-15
AS Petersburg, AK Wrangell, AK 2017-01-16

However - this requires us to hardcode the constant inside the query. By using the first query as a subquery we can compute the minimum distance as part of the query.

SELECT uniquecarrier, origincityname, destcityname, flightdate
FROM ontime
WHERE distance=(
     SELECT MIN(distance)
     FROM ontime
);

Correlated Scalar Subqueries

While uncorrelated subqueries are powerful, they come with a hard restriction: only a single value can be returned. Often, what we want to do is parameterize the query, so that we can return different values per row.

For example, suppose that we want to find all of the shortest flights for each carrier. We can find the shortest flight for a specific carrier using the following parameterized query:

PREPARE min_distance_per_carrier AS
SELECT MIN(distance)
FROM ontime
WHERE uniquecarrier=?;

We can execute this prepared statement to obtain the minimum distance for a specific carrier.

EXECUTE min_distance_per_carrier('UA');
min(distance)
67.0

If we want to use this parameterized query as a subquery, we need to use a correlated subquery. Correlated subqueries allow us to use parameterized queries as scalar subqueries by referencing columns from the outer query. We can obtain the set of shortest flights per carrier using the following query:

SELECT uniquecarrier, origincityname, destcityname, flightdate, distance
FROM ontime AS ontime_outer
WHERE distance=(
     SELECT MIN(distance)
     FROM ontime
     WHERE uniquecarrier=ontime_outer.uniquecarrier
);
uniquecarrier origincityname destcityname flightdate distance
AS Wrangell, AK Petersburg, AK 2017-01-01 31.0
NK Fort Lauderdale, FL Orlando, FL 2017-01-01 177.0
VX Las Vegas, NV Los Angeles, CA 2017-01-01 236.0

Notice how the column from the outer relation (ontime_outer) is used inside the query. This is what turns the subquery into a correlated subquery. The column from the outer relation (ontime_outer.uniquecarrier) is a parameter for the subquery. Logically the subquery is executed once for every row that is present in ontime, where the value for the column at that row is substituted as a parameter.

In order to make it more clear that the correlated subquery is in essence a parameterized query, we can create a scalar macro that contains the query using DuckDB’s macros.

CREATE MACRO min_distance_per_carrier(param) AS (
     SELECT MIN(distance)
     FROM ontime
     WHERE uniquecarrier=param
);

We can then use the macro in our original query as if it is a function.

SELECT uniquecarrier, origincityname, destcityname, flightdate, distance
FROM ontime AS ontime_outer
WHERE distance=min_distance_per_carrier(ontime_outer.uniquecarrier);

This gives us the same result as placing the correlated subquery inside of the query, but is cleaner as we can decompose the query into multiple segments more effectively.

EXISTS

EXISTS can be used to check if a given subquery has any results. This is powerful when used as a correlated subquery. For example, we can use EXISTS if we want to obtain the last flight that has been flown on each route.

We can obtain a list of all flights on a given route past a certain date using the following query:

PREPARE flights_after_date AS
SELECT uniquecarrier, origincityname, destcityname, flightdate, distance
FROM ontime
WHERE origin=? AND dest=? AND flightdate>?;
EXECUTE flights_after_date('LAX', 'JFK', DATE '2017-05-01');
uniquecarrier origincityname destcityname flightdate distance
AA Los Angeles, CA New York, NY 2017-08-01 2475.0
AA Los Angeles, CA New York, NY 2017-08-02 2475.0
AA Los Angeles, CA New York, NY 2017-08-03 2475.0

Now in order to obtain the last flight on a route, we need to find flights for which no later flight exists.

SELECT uniquecarrier, origincityname, destcityname, flightdate, distance
FROM ontime AS ontime_outer
WHERE NOT EXISTS (
     SELECT uniquecarrier, origincityname, destcityname, flightdate, distance
     FROM ontime
     WHERE origin=ontime_outer.origin AND dest=ontime_outer.dest AND flightdate>ontime_outer.flightdate
);
uniquecarrier origincityname destcityname flightdate distance  
AA Daytona Beach, FL Charlotte, NC 2017-02-27 416.0  
EV Abilene, TX Dallas/Fort Worth, TX 2017-02-15 158.0  
EV Dallas/Fort Worth, TX Durango, CO 2017-02-13 674.0  

IN / ANY / ALL

IN can be used to check if a given value exists within the result returned by the subquery. For example, we can obtain a list of all carriers that have performed more than 250 000 flights in the dataset using the following query:

SELECT uniquecarrier
FROM ontime
GROUP BY uniquecarrier
HAVING COUNT(*) > 250000;

We can then use an IN clause to obtain all flights performed by those carriers.

SELECT *
FROM ontime
WHERE uniquecarrier IN (
     SELECT uniquecarrier
     FROM ontime
     GROUP BY uniquecarrier
     HAVING COUNT(*) > 250000
);

A correlated subquery can be useful here if we want to count not the total number of flights performed by each carrier, but the number of flights each carrier performed on the given route. We can select all flights performed by carriers that have performed at least 1000 flights on a given route using the following query.

SELECT *
FROM ontime AS ontime_outer
WHERE uniquecarrier IN (
     SELECT uniquecarrier
     FROM ontime
     WHERE ontime.origin=ontime_outer.origin AND ontime.dest=ontime_outer.dest
     GROUP BY uniquecarrier
     HAVING COUNT(*) > 1000
);

ANY and ALL are generalizations of IN. IN checks if the value is present in the set returned by the subquery. This is equivalent to = ANY(...). The ANY and ALL operators can be used to perform other comparison operators (such as >, <, <>). The above query can be rewritten to ANY in the following form.

SELECT *
FROM ontime AS ontime_outer
WHERE uniquecarrier = ANY (
     SELECT uniquecarrier
     FROM ontime
     WHERE ontime.origin=ontime_outer.origin AND ontime.dest=ontime_outer.dest
     GROUP BY uniquecarrier
     HAVING COUNT(*) > 1000
);

Performance

Whereas scalar subqueries are logically executed once, correlated subqueries are logically executed once per row. As such, it is natural to think that correlated subqueries are very expensive and should be avoided for performance reasons.

While that is true in many SQL systems, it is not the case in DuckDB. In DuckDB, subqueries are always decorrelated. DuckDB uses a state-of-the-art subquery decorrelation algorithm as described in the Unnesting Arbitrary Queries paper. This allows all subqueries to be decorrelated and executed as a single, much more efficient, query.

In DuckDB, correlation does not imply performance degradation.

If we look at the query plan for the correlated scalar subquery using EXPLAIN, we can see that the query has been transformed into a hash aggregate followed by a hash join. This allows the query to be executed very efficiently.

EXPLAIN SELECT uniquecarrier, origincityname, destcityname, flightdate, distance
FROM ontime AS ontime_outer
WHERE distance=(
     SELECT MIN(distance)
     FROM ontime
     WHERE uniquecarrier=ontime_outer.uniquecarrier
);
┌───────────────────────────┐
│         HASH_JOIN         │ 
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │ 
│      uniquecarrier =      │ 
│       uniquecarrier       ├──────────────┐
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         SEQ_SCAN          ││       HASH_GROUP_BY       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ││   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           ontime          ││       uniquecarrier       │
└───────────────────────────┘│       min(distance)       │
                             └─────────────┬─────────────┘
                             ┌─────────────┴─────────────┐
                             │         SEQ_SCAN          │
                             │   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
                             │           ontime          │
                             └───────────────────────────┘
                             

We can see the drastic performance difference that subquery decorrelation has when we compare the run-time of this query in DuckDB with the run-time in Postgres and SQLite. When running the above query on the ontime dataset for 2017 with roughly 4 million rows, we get the following performance results:

DuckDB Postgres SQLite
0.06s >48 Hours >48 Hours

As Postgres and SQLite do not de-correlate the subquery, the query is not just logically, but actually executed once for every row. As a result, the subquery is executed 4 million times in those systems, which takes an immense amount of time.

In this case, it is possible to manually decorrelate the query and generate the following SQL:

SELECT ontime.uniquecarrier, origincityname, destcityname, flightdate, distance
FROM ontime
JOIN (
     SELECT uniquecarrier, MIN(distance) AS min_distance
     FROM ontime
     GROUP BY uniquecarrier
) AS subquery 
ON (ontime.uniquecarrier=subquery.uniquecarrier AND distance=min_distance);

By performing the de-correlation manually, the performance of SQLite and Postgres improves significantly. However, both systems remain over 30x slower than DuckDB.

DuckDB Postgres SQLite
0.06s 1.98s 2.81s

Note that while it is possible to manually decorrelate certain subqueries by rewriting the SQL, it is not always possible to do so. As described in the Unnesting Arbitrary Queries paper, special join types that are not present in SQL are necessary to decorrelate arbitrary queries.

In DuckDB, these special join types will be automatically generated by the system to decorrelate all subqueries. In fact, DuckDB does not have support for executing subqueries that are not decorrelated. All subqueries will be decorrelated before DuckDB executes them.

Conclusion

Subqueries are a very powerful tool that allow you to take arbitrary queries and convert them into ad-hoc functions. When used in combination with DuckDB’s powerful subquery decorrelation, they can be executed extremely efficiently, making previously intractable queries not only possible, but fast.

continue reading
2023-05-17Mark Raasveldt and Hannes Mühleisen

Announcing DuckDB 0.8.0

Image of the Mottled Duck

The DuckDB team is happy to announce the latest DuckDB release (0.8.0). This release is named “Fulvigula” after the Mottled Duck (Anas Fulvigula) native to the Gulf of Mexico.

To install the new version, please visit the installation guide. The full release notes can be found here.

continue reading
2023-05-12Mark Raasveldt and Hannes Mühleisen

10 000 Stars on GitHub

Today, DuckDB reached 10 000 stars on GitHub. We would like to pause for a second to express our gratitude to everyone who contributed to DuckDB and of course all its users. When we started working on DuckDB back in 2018, we would have never dreamt of getting this kind of adoption in such a short time.

From those brave souls who were early adopters of DuckDB back in 2019 to the many today, we are happy you’re part of our community. Thank you for your feedback, feature requests and for your enthusiasm in adopting new features and integrations. Thank you for helping each other on our Discord server or in GitHub discussions. Thank you for spreading the word, too.

We also would like to extend special thanks to the DuckDB foundation supporters, who through their generous donations keep DuckDB independent.

For us, the maintainers of DuckDB, the past few years have also been quite eventful: We spun off from the research group where DuckDB originated to a successful company with close to 20 employees and many excellent partnerships.

We are very much looking forward to what the future will hold for DuckDB. Things are looking bright!

Wilbur the duck approves of all those stars

continue reading
2023-04-28Max Gabrielsson

PostGEESE? Introducing The DuckDB Spatial Extension

TL;DR: DuckDB now has an official Spatial extension to enable geospatial processing.

Geospatial data has become increasingly important and prevalent in modern-day applications and data engineering workflows, with use-cases ranging from location-based services to environmental monitoring.

While there are many great and specialized tools for working with geospatial data, integrating geospatial capabilities directly into DuckDB has multiple advantages. For one, you get to operate, transform and join your geospatial data alongside your regular, unstructured or time-series data using DuckDB’s rich type system and extensions like JSON and ICU. Secondly, spatial queries involving geometric predicates and relations translate surprisingly well to SQL, which is all about expressing relations after all! Not to mention all the other benefits provided by DuckDB such as transactional semantics, high performance multi-threaded vectorized execution and larger-than-memory data processing.

Therefore, we’re very excited to announce that DuckDB now has a Spatial extension packed with features easily installable from the DuckDB CLI and other DuckDB clients. Simply execute INSTALL spatial; LOAD spatial; from within DuckDB and you’re good to go!

No, we’re not calling it GeoDuck either, that’s just gross

What’s in it?

The core of the extension is a GEOMETRY type based on the “Simple Features” geometry model and accompanying functions such as ST_Area, ST_Intersects. It also provides methods for reading and writing geospatial data formats and converting between coordinate reference systems (details later in the post!). While we’re not ready to commit to full compliance with the OGC Simple Feature Access and SQL/MM Standards yet, if you’ve worked with geospatial functionality in other database systems such as PostGIS or SpatiaLite, you should feel right at home.
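
For readers using the Python client, a minimal sketch of loading the extension and calling two of these functions might look as follows (the geometries are made-up examples, and the install step requires network access):

import duckdb

con = duckdb.connect()
con.install_extension("spatial")
con.load_extension("spatial")

# Sketch: compute the area of a 2x2 square and test whether a point lies inside it.
con.sql("""
    SELECT
        ST_Area(ST_GeomFromText('POLYGON((0 0, 0 2, 2 2, 2 0, 0 0))')) AS area,
        ST_Intersects(
            ST_GeomFromText('POINT(1 1)'),
            ST_GeomFromText('POLYGON((0 0, 0 2, 2 2, 2 0, 0 0))')
        ) AS contains_point
""").show()
# expected: area = 4.0, contains_point = true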

Most of the implemented functions are based on the trifecta of foundational geospatial libraries, GEOS, GDAL and PROJ, which provide algorithms, format conversions and coordinate reference system transformations respectively. In particular, we leverage GDAL to provide a set of table and copy functions that enable import and export of tables from and to 50+ different geospatial data formats (so far!), including the most common ones such as Shapefiles, GeoJSON, GeoPackage, KML, GML, WKT, WKB, etc.

Check for yourself by running:

SELECT * FROM st_drivers();
short_name long_name can_create can_copy can_open help_url
ESRI Shapefile ESRI Shapefile true false true https://gdal.org/drivers/vector/shapefile.html
MapInfo File MapInfo File true false true https://gdal.org/drivers/vector/mitab.html
UK .NTF UK .NTF false false true https://gdal.org/drivers/vector/ntf.html
LVBAG Kadaster LV BAG Extract 2.0 false false true https://gdal.org/drivers/vector/lvbag.html
S57 IHO S-57 (ENC) true false true https://gdal.org/drivers/vector/s57.html
DGN Microstation DGN true false true https://gdal.org/drivers/vector/dgn.html
OGR_VRT VRT - Virtual Datasource false false true https://gdal.org/drivers/vector/vrt.html
Memory Memory true false true  
CSV Comma Separated Value (.csv) true false true https://gdal.org/drivers/vector/csv.html
GML Geography Markup Language (GML) true false true https://gdal.org/drivers/vector/gml.html
GPX GPX true false true https://gdal.org/drivers/vector/gpx.html
KML Keyhole Markup Language (KML) true false true https://gdal.org/drivers/vector/kml.html
GeoJSON GeoJSON true false true https://gdal.org/drivers/vector/geojson.html
GeoJSONSeq GeoJSON Sequence true false true https://gdal.org/drivers/vector/geojsonseq.html
ESRIJSON ESRIJSON false false true https://gdal.org/drivers/vector/esrijson.html
TopoJSON TopoJSON false false true https://gdal.org/drivers/vector/topojson.html
OGR_GMT GMT ASCII Vectors (.gmt) true false true https://gdal.org/drivers/vector/gmt.html
GPKG GeoPackage true true true https://gdal.org/drivers/vector/gpkg.html
SQLite SQLite / Spatialite true false true https://gdal.org/drivers/vector/sqlite.html
WAsP WAsP .map format true false true https://gdal.org/drivers/vector/wasp.html
OpenFileGDB ESRI FileGDB true false true https://gdal.org/drivers/vector/openfilegdb.html
DXF AutoCAD DXF true false true https://gdal.org/drivers/vector/dxf.html
CAD AutoCAD Driver false false true https://gdal.org/drivers/vector/cad.html
FlatGeobuf FlatGeobuf true false true https://gdal.org/drivers/vector/flatgeobuf.html
Geoconcept Geoconcept true false true  
GeoRSS GeoRSS true false true https://gdal.org/drivers/vector/georss.html
VFK Czech Cadastral Exchange Data Format false false true https://gdal.org/drivers/vector/vfk.html
PGDUMP PostgreSQL SQL dump true false false https://gdal.org/drivers/vector/pgdump.html
OSM OpenStreetMap XML and PBF false false true https://gdal.org/drivers/vector/osm.html
GPSBabel GPSBabel true false true https://gdal.org/drivers/vector/gpsbabel.html
WFS OGC WFS (Web Feature Service) false false true https://gdal.org/drivers/vector/wfs.html
OAPIF OGC API - Features false false true https://gdal.org/drivers/vector/oapif.html
EDIGEO French EDIGEO exchange format false false true https://gdal.org/drivers/vector/edigeo.html
SVG Scalable Vector Graphics false false true https://gdal.org/drivers/vector/svg.html
ODS Open Document/ LibreOffice / OpenOffice Spreadsheet true false true https://gdal.org/drivers/vector/ods.html
XLSX MS Office Open XML spreadsheet true false true https://gdal.org/drivers/vector/xlsx.html
Elasticsearch Elastic Search true false true https://gdal.org/drivers/vector/elasticsearch.html
Carto Carto true false true https://gdal.org/drivers/vector/carto.html
AmigoCloud AmigoCloud true false true https://gdal.org/drivers/vector/amigocloud.html
SXF Storage and eXchange Format false false true https://gdal.org/drivers/vector/sxf.html
Selafin Selafin true false true https://gdal.org/drivers/vector/selafin.html
JML OpenJUMP JML true false true https://gdal.org/drivers/vector/jml.html
PLSCENES Planet Labs Scenes API false false true https://gdal.org/drivers/vector/plscenes.html
CSW OGC CSW (Catalog Service for the Web) false false true https://gdal.org/drivers/vector/csw.html
VDV VDV-451/VDV-452/INTREST Data Format true false true https://gdal.org/drivers/vector/vdv.html
MVT Mapbox Vector Tiles true false true https://gdal.org/drivers/vector/mvt.html
NGW NextGIS Web true true true https://gdal.org/drivers/vector/ngw.html
MapML MapML true false true https://gdal.org/drivers/vector/mapml.html
TIGER U.S. Census TIGER/Line false false true https://gdal.org/drivers/vector/tiger.html
AVCBin Arc/Info Binary Coverage false false true https://gdal.org/drivers/vector/avcbin.html
AVCE00 Arc/Info E00 (ASCII) Coverage false false true https://gdal.org/drivers/vector/avce00.html

Initially we have prioritized providing a breadth of capabilities by wrapping existing libraries. We’re planning to implement more of the core functions and algorithms natively in the future to enable faster performance and more efficient memory management.

As an initial step in this direction, we provide a set of non-standard, specialized, columnar DuckDB-native geometry types such as POINT_2D, BOX_2D, etc. that should provide better compression and faster execution in exchange for some flexibility, but work on these is still very much experimental.

Example Usage

The following demonstrates how you can use the spatial extension to read and export multiple geospatial data formats, transform geometries between different coordinate reference systems and work with spatial property and predicate functions. While this example may be slightly contrived, we want to showcase the power of the currently available features. You can find the datasets used in this example in the spatial extension repository.

Let’s start by installing and loading the spatial extension and the parquet extension. Now we can import the NYC taxi ride data provided in parquet format, as well as the accompanying taxi zone data from a shapefile, using the ST_Read table function provided by the spatial extension. These taxi zones break NYC into polygons that represent regions, for example the Newark Airport. We then create a table for the rides and a table for the zones. Note that ST_Read produces a table with a wkb_geometry column that contains the geometry data encoded as a WKB (Well-Known Binary) blob, which we then convert to the GEOMETRY type using the ST_GeomFromWKB function.

This may all seem a bit much if you are not familiar with the geospatial ecosystem, but rest assured this is all you really need to get started. In short:

  • Shapefile (.shp, .shx, .dbf) is a common format for storing geometry vector data and auxiliary metadata such as indexes and attributes.
  • WKB (Well Known Binary), while not really a file format in itself, is a common binary encoding of vector geometry data, used in e.g. GeoParquet. Comes in multiple flavors, but we’re only concerned with “standard” WKB for now.
  • GEOMETRY is a DuckDB type that represents a Simple Features geometry object, which is based on a set of standards modeling vector geometry data as points, linestrings, polygons or collections of such. This is the core data type used by the spatial extension, and what most of the provided functions take and return.
INSTALL spatial;
INSTALL parquet;
LOAD spatial;
LOAD parquet;

CREATE TABLE rides AS SELECT * 
FROM './spatial/test/data/nyc_taxi/yellow_tripdata_2010-01-limit1mil.parquet';

-- Load the NYC taxi zone data from a shapefile using the gdal-based ST_Read function
CREATE TABLE zones AS SELECT zone, LocationId, borough, ST_GeomFromWKB(wkb_geometry) AS geom 
FROM ST_Read('./spatial/test/data/nyc_taxi/taxi_zones/taxi_zones.shx');

Let’s compare the trip distance to the linear distance between the pickup and dropoff points to figure out how efficient the taxi drivers are (or how dirty the data is, since some diffs seem to be negative). We transform the coordinates from “WGS84” (given by the identifier EPSG:4326), also commonly known simply as latitude/longitude, to the “NAD83 / New York Long Island ftUS” (identified as ESRI:102718) coordinate reference system, which is a projection with minimal distortion around New York. We then calculate the distance using the ST_Distance function. In this case we get the distance in feet, since we’ve converted the coordinates to NAD83, but we can easily convert it to miles (5280 ft/mile), which is the unit used in the rides dataset, so we can compare them correctly.

Wait, what’s all this about coordinate reference systems and projections?

The earth is not flat, but sometimes it is useful to pretend it is for the sake of simplicity by “projecting” the coordinates onto a flat surface. The “parameters” of a projection - e.g. where the “origin” is located, what unit coordinates are in, or how the earth’s shape is approximated - are encapsulated by a “Spatial Reference System” or “Coordinate Reference System” (CRS), which is usually referenced by a shorthand identifier composed of an authority and a code, e.g. “EPSG:4326” or “ESRI:102718”. Projections are always lossy, so it’s important to use a CRS that is well suited for the “area of interest” your data is in. The spatial extension uses the PROJ library to handle coordinate reference systems and projections.
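
As a small, self-contained sketch of this kind of transformation (assuming the spatial extension is installed, and using a made-up point given in latitude/longitude order, as in the query below):

import duckdb

con = duckdb.connect()
con.install_extension("spatial")
con.load_extension("spatial")

# Sketch: project a single lat/lon point (EPSG:4326) to
# NAD83 / New York Long Island ftUS (ESRI:102718).
con.sql("""
    SELECT ST_Transform(
        ST_Point(40.7128, -74.0060),
        'EPSG:4326', 'ESRI:102718'
    ) AS point_in_feet
""").show()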

Trips with a distance shorter than the aerial distance are likely to be erroneous, so we use this query to filter out some bad data. The query below takes advantage of DuckDB’s ability to refer to column aliases defined within the same select statement. This is a small example of how DuckDB’s rich SQL dialect can simplify geospatial analysis.

CREATE TABLE cleaned_rides AS SELECT 
    ST_Point(pickup_latitude, pickup_longitude) AS pickup_point,
    ST_Point(dropoff_latitude, dropoff_longitude) AS dropoff_point,
    dropoff_datetime::TIMESTAMP - pickup_datetime::TIMESTAMP AS time,
    trip_distance,
    ST_Distance(
        ST_Transform(pickup_point, 'EPSG:4326', 'ESRI:102718'), 
        ST_Transform(dropoff_point, 'EPSG:4326', 'ESRI:102718')) / 5280 
    AS aerial_distance, 
    trip_distance - aerial_distance AS diff 
FROM rides 
WHERE diff > 0
ORDER BY diff DESC;
SELECT * FROM rides LIMIT 10;
vendor_id pickup_datetime dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude rate_code store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount surcharge mta_tax tip_amount tolls_amount total_amount
VTS 2010-01-01 00:00:17 2010-01-01 00:00:17 3 0.0 -73.87105699999998 40.773522 1   -73.871048 40.773545 CAS 45.0 0.0 0.5 0.0 0.0 45.5
VTS 2010-01-01 00:00:20 2010-01-01 00:00:20 1 0.05 -73.97512999999998 40.789973 1   -73.97498799999998 40.790598 CAS 2.5 0.5 0.5 0.0 0.0 3.5
CMT 2010-01-01 00:00:23 2010-01-01 00:00:25 1 0.0 -73.999431 40.71216 1 0 -73.99915799999998 40.712421 No 2.5 0.5 0.5 0.0 0.0 3.5
CMT 2010-01-01 00:00:33 2010-01-01 00:00:55 1 0.0 -73.97721699999998 40.749633 1 0 -73.97732899999998 40.749629 Cas 2.5 0.5 0.5 0.0 0.0 3.5
VTS 2010-01-01 00:01:00 2010-01-01 00:01:00 1 0.0 -73.942313 40.784332 1   -73.942313 40.784332 Cre 10.0 0.0 0.5 2.0 0.0 12.5
VTS 2010-01-01 00:01:06 2010-01-01 00:01:06 2 0.38 -73.97463 40.756687 1   -73.979872 40.759143 CAS 3.7 0.5 0.5 0.0 0.0 4.7
VTS 2010-01-01 00:01:07 2010-01-01 00:01:07 2 0.23 -73.987358 40.718475 1   -73.98518 40.720468 CAS 2.9 0.5 0.5 0.0 0.0 3.9
CMT 2010-01-01 00:00:02 2010-01-01 00:01:08 1 0.1 -73.992807 40.741418 1 0 -73.995799 40.742596 No 2.9 0.5 0.5 0.0 0.0 3.9
VTS 2010-01-01 00:01:23 2010-01-01 00:01:23 1 0.6099999999999999 -73.98003799999998 40.74306 1   -73.974862 40.750387 CAS 3.7 0.5 0.5 0.0 0.0 4.7
VTS 2010-01-01 00:01:34 2010-01-01 00:01:34 1 0.02 -73.954122 40.801173 1   -73.95431499999998 40.800897 CAS 45.0 0.0 0.5 0.0 0.0 45.5
SELECT * FROM zones LIMIT 10;
zone LocationID borough geom
Newark Airport 1 EWR POLYGON (…)
Jamaica Bay 2 Queens MULTIPOLYGON (…)
Allerton/Pelham Gardens 3 Bronx POLYGON (…)
Alphabet City 4 Manhattan POLYGON (…)
Arden Heights 5 Staten Island POLYGON (…)
Arrochar/Fort Wadsworth 6 Staten Island POLYGON (…)
Astoria 7 Queens POLYGON (…)
Astoria Park 8 Queens POLYGON (…)
Auburndale 9 Queens POLYGON (…)
Baisley Park 10 Queens POLYGON (…)

It should be noted that this is not entirely accurate since the ST_Distance function we use does not take into account the curvature of the earth. However, we’ll accept it as a good enough approximation for our purposes. Spherical and geodesic distance calculations are on the roadmap!

Now let’s join the taxi rides with the taxi zones to get the start and end zone for each ride. We use the ST_Within function as our join condition to check if a pickup or dropoff point is within a taxi zone polygon. Again we need to transform the coordinates from WGS84 to the NAD83 since the taxi zone data also use that projection. Spatial joins like these are the bread and butter of geospatial data processing, but we don’t currently have any optimizations in place (such as spatial indexes) to speed up these queries, which is why we only use a subset of the data for the following step.

-- Since we don't have spatial indexes yet, use a smaller dataset for the join.
DELETE FROM cleaned_rides WHERE rowid > 5000;

CREATE TABLE joined AS 
SELECT 
    pickup_point,
    dropoff_point,
    start_zone.zone AS start_zone,
    end_zone.zone AS end_zone, 
    trip_distance,
    time,
FROM cleaned_rides 
JOIN zones AS start_zone 
ON ST_Within(ST_Transform(pickup_point, 'EPSG:4326', 'ESRI:102718'), start_zone.geom) 
JOIN zones AS end_zone 
ON ST_Within(ST_Transform(dropoff_point, 'EPSG:4326', 'ESRI:102718'), end_zone.geom);
SELECT * FROM joined USING SAMPLE 10 ROWS;
pickup_point dropoff_point start_zone end_zone trip_distance time
POINT (40.722223 -73.98385299999998) POINT (40.715507 -73.992438) East Village Lower East Side 10.3 00:19:16
POINT (40.648687 -73.783522) POINT (40.649567 -74.005812) JFK Airport Sunset Park West 23.57 00:28:00
POINT (40.761603 -73.96661299999998) POINT (40.760232 -73.96344499999998) Upper East Side South Sutton Place/Turtle Bay North 17.6 00:27:05
POINT (40.697212 -73.937495) POINT (40.652377 -73.93983299999998) Stuyvesant Heights East Flatbush/Farragut 13.55 00:24:00
POINT (40.721462 -73.993583) POINT (40.774205 -73.90441699999998) Lower East Side Steinway 28.75 01:03:00
POINT (40.716955 -74.004328) POINT (40.754688 -73.991612) TriBeCa/Civic Center Garment District 18.4 00:46:12
POINT (40.740052 -73.994918) POINT (40.75439 -73.98587499999998) Flatiron Garment District 24.2 00:35:25
POINT (40.763017 -73.95949199999998) POINT (40.763615 -73.959182) Lenox Hill East Lenox Hill West 18.4 00:33:46
POINT (40.865663 -73.927458) POINT (40.86537 -73.927352) Washington Heights North Washington Heights North 10.47 00:27:00
POINT (40.738408 -73.980345) POINT (40.696038 -73.955493) Gramercy Bedford 16.4 00:21:47

We can export the joined table to a GeoJSONSeq file using the GDAL copy function, passing in a GDAL layer creation option. Since GeoJSON only supports a single GEOMETRY per record, we use the ST_MakeLine function to combine the pickup and dropoff points into a single line geometry. The default coordinate reference system for GeoJSON is WGS84, but the coordinate pairs are expected to be in longitude/latitude, so we need to flip the geometry using the ST_FlipCoordinates function.

COPY (
    SELECT 
        ST_AsWKB(ST_FlipCoordinates(ST_MakeLine(pickup_point, dropoff_point))) 
        AS wkb_geometry,
        start_zone,
        end_zone,
        time::VARCHAR AS trip_time 
    FROM joined) 
TO 'joined.geojsonseq' 
WITH (FORMAT GDAL, DRIVER 'GeoJSONSeq', LAYER_CREATION_OPTIONS 'WRITE_BBOX=YES');
head -n 10 joined.geojsonseq
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:52:00" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.789923, 40.643515 ], [ -73.97608, 40.680395 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:35:00" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.776445, 40.645422 ], [ -73.98427, 40.670782 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:45:42" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.776878, 40.645065 ], [ -73.992153, 40.662571 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:36:00" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.788028, 40.641508 ], [ -73.97584, 40.670927 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:47:58" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.781855, 40.644749 ], [ -73.980129, 40.663663 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:32:10" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.787494, 40.641559 ], [ -73.974694, 40.673479 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:36:59" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.790138, 40.643342 ], [ -73.982721, 40.662379 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:32:00" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.786952, 40.641248 ], [ -73.97421, 40.676237 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:33:21" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.783892, 40.648514 ], [ -73.979283, 40.669721 ] ] } }
{ "type": "Feature", "properties": { "start_zone": "JFK Airport", "end_zone": "Park Slope", "trip_time": "00:35:45" }, "geometry": { "type": "LineString", "coordinates": [ [ -73.776643, 40.645272 ], [ -73.978873, 40.66723 ] ] } }

And there we have it! We pulled tabular data from parquet, combined it with geospatial data in a shapefile, cleaned and analyzed that combined data, and output it to a human-readable geospatial format. The full set of currently supported functions and their implementation status can be found in this table in the docs.

What’s next?

While it’s probably going to take a while for us to catch up to the full set of functions provided by e.g. PostGIS, we believe that DuckDB’s vectorized execution model and columnar storage format will enable a whole new class of optimizations for geospatial processing that we’ve just begun exploring. Improving the performance of spatial joins and predicates is therefore high on our list of priorities.

There are also some limitations with our GEOMETRY type that we would eventually like to tackle, such as the fact that we don’t support additional Z and M dimensions, or don’t support the full range of geometry sub-types that are mandated by the OGC standard, like curves or polyhedral surfaces.

We’re also interested in supporting spherical and ellipsoidal calculations in the near future, perhaps in the form of a dedicated GEOGRAPHY type.

WASM builds are also just around the corner!

Please take a look at the GitHub repository for the full roadmap and to see what we’re currently working on. If you would like to help build this capability, please reach out on GitHub!

Conclusion

The DuckDB Spatial extension is another step towards making DuckDB a swiss army knife for data engineering and analytics. This extension provides a flexible and familiar GEOMETRY type, reprojectable between thousands of coordinate reference systems, coupled with the capability to export and import geospatial data between more than 50 different data sources. All embedded into a single extension with minimal runtime dependencies. This enables DuckDB to fit seamlessly into your existing GIS workflows regardless of which geospatial data formats or projections you’re working with.

We are excited to hear what you make of the DuckDB spatial extension. It’s still early days but we hope to have a lot more to share in the future as we continue making progress! If you have any questions, suggestions, ideas or issues, please don’t hesitate to reach out to us on Discord or GitHub!

continue reading
2023-04-28Hannes Mühleisen

DuckCon #3 in San Francisco

We are excited to hold our first “DuckCon” DuckDB user group meeting outside of Europe in San Francisco. The meeting will take place in the afternoon of June 29th. The meeting will be held at the San Francisco Museum of Modern Art (SFMOMA), in the Phyllis Wattis Theater.

In DuckCon #3, we (again) have a talk from DuckDB creators Hannes Mühleisen and Mark Raasveldt about the current state of DuckDB and future plans. Side note: Hannes will also speak at the Data + AI Summit earlier that day.

We are also excited to have talks by Lloyd Tabb, the creator of Google’s Malloy project and Josh Wills, the creator of the dbt-duckdb package.

In addition, we will have several lightning talks from DuckDB users.

Timetable

Time Title Presenter
4 PM Welcome to DuckCon! – Setting Up  
4:10 PM Introductions Hannes Mühleisen
4:15 PM State of the Duck Mark Raasveldt
4:40 PM ‘Data is Rectangular’ and Other Common Misconceptions Lloyd Tabb
5:05 PM DuckDBT: Not a database or a dbt adapter but a secret third thing Josh Wills
5:30 PM Watershed: Get to zero carbon, fast Jessica Zhu
5:35 PM Building a SQL editor around “fast” Hamilton Ulmer
5:40 PM KalDB: Log analytics with DuckDB Suman Karumuri
5:45 PM Pandas on DuckDB with Ponder Aditya Parameswaran
5:50 PM Bringing AI to DuckDB with Lance columnar format for multi-modal AI Chang She
5:55 PM Closing thoughts  
6:00 PM End of DuckCon  
From 6 PM MotherDuck Party, see below  

Lightning Talks

We will have a 20-minute slot for five-minute lightning talks, where our users are invited to show and tell anything DuckDB-related. Please submit your lightning talk proposal in the pre-registration form below.

Registration Process

Attendance is free. While supplies last, you can still get a ticket here. You will need to show this ticket at the entrance to attend. Please contact [email protected] if you have any questions.

MotherDuck Party at 111 Minna

Following DuckCon, MotherDuck will host a party celebrating ducks at 111 Minna (located very close to SFMOMA). DuckCon attendees are cordially invited to attend to eat, drink, and play games. MotherDuck’s Chief Duck Herder will also demo the latest work bringing DuckDB to the cloud. Please note that you will have to separately register for the MotherDuck event.

continue reading
2023-04-21Tristan Celder

Introducing DuckDB for Swift

TL;DR: DuckDB now has a native Swift API. DuckDB on mobile here we go!

Today we’re excited to announce the DuckDB API for Swift. It enables developers on Swift platforms to harness the full power of DuckDB using a native Swift interface with support for great Swift features such as strong typing and concurrency. The API is available not only on Apple platforms, but on Linux too, opening up new opportunities for the growing Swift on Server ecosystem.

What’s included

DuckDB is designed to be fast, reliable and easy to use, and it’s this philosophy that also guided the creation of our new Swift API.

This initial release supports many of the great features of DuckDB right out of the box, including:

  • Queries via DuckDB’s enhanced SQL dialect: In addition to basic SQL, DuckDB supports arbitrary and nested correlated subqueries, window functions, collations, complex types (Swift arrays and structs), and more.
  • Import and export of JSON, CSV, and Parquet files: Beyond its built-in and super-efficient native file format, DuckDB supports reading in, and exporting out to, JSON, CSV, and Parquet files.
  • Strongly typed result sets: DuckDB’s strongly typed result sets are a natural fit for Swift. It’s simple to cast DuckDB columns to their native Swift equivalents, ready for presentation using SwiftUI or as part of an existing TabularData workflow.
  • Swift concurrency support: by virtue of their Sendable conformance, many of DuckDB’s core underlying types can be safely passed across concurrency contexts, easing the process of designing parallel processing workflows and ensuring responsive UIs.

Usage

To demonstrate just how well DuckDB works together with Swift, we’ve created an example project that uses raw data from NASA’s Exoplanet Archive loaded directly into DuckDB.

You’ll see how to:

  • Instantiate a DuckDB in-memory Database and Connection
  • Populate a DuckDB table with the contents of a remote CSV
  • Query a DuckDB database and prepare the results for presentation

Finally, we’ll present our analysis with the help of Apple’s TabularData Framework and Swift Charts.

Instantiating DuckDB

DuckDB supports both file-based and in-memory databases. In this example, as we don’t intend to persist the results of our Exoplanet analysis to disk, we’ll opt for an in-memory Database.

let database = try Database(store: .inMemory)

However, we can’t issue queries just yet. Much like other RDBMSs, queries must be issued through a database connection. DuckDB supports multiple connections per database. This can be useful to support parallel processing, for example. In our project, we’ll need just the one connection, which we’ll eventually access asynchronously.

let connection = try database.connect()

Finally, we’ll create an app-specific type that we’ll use to house our database and connection and through which we’ll eventually define our app-specific queries.

import DuckDB

final class ExoplanetStore {
  
  let database: Database
  let connection: Connection
  
  init(database: Database, connection: Connection) {
    self.database = database
    self.connection = connection
  }
}

Populating DuckDB with a remote CSV file

One problem with our current ExoplanetStore type is that it doesn’t yet contain any data to query. To fix that, we’ll load it with the data of every Exoplanet discovered to date from NASA’s Exoplanet Archive.

There are hundreds of configuration options for this incredible resource, but today we want each exoplanet’s name and its discovery year packaged as a CSV. Checking the docs gives us the following endpoint:

https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query=select+pl_name+,+disc_year+from+pscomppars&format=csv

Once we have our CSV downloaded locally, we can use the following SQL command to load it as a new table within our DuckDB in-memory database. DuckDB’s read_csv_auto command automatically infers our table schema and the data is immediately available for analysis.

CREATE TABLE exoplanets AS (
	SELECT * FROM read_csv_auto('downloaded_exoplanets.csv')
); 
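If you’d like to double-check the schema that read_csv_auto inferred before building on top of it, you can describe the result of the table function directly. A quick sketch, assuming the CSV was saved locally under the same hypothetical file name:

-- Inspect the inferred column names and types for the downloaded CSV
DESCRIBE SELECT * FROM read_csv_auto('downloaded_exoplanets.csv');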

Let’s package this up as a new asynchronous factory method on our ExoplanetStore type:

import DuckDB
import Foundation

final class ExoplanetStore {

  // Factory method to create and prepare a new ExoplanetStore
  static func create() async throws -> ExoplanetStore {
	
	// Create our database and connection as described above
    let database = try Database(store: .inMemory)
    let connection = try database.connect()

	// Download the CSV from the exoplanet archive
	let (csvFileURL, _) = try await URLSession.shared.download(
		from: URL(string: "https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query=select+pl_name+,+disc_year+from+pscomppars&format=csv")!)	
	
	// Issue our first query to DuckDB
    try connection.execute("""
	CREATE TABLE exoplanets AS (
		SELECT * FROM read_csv_auto('\(csvFileURL.path)')
	);
	""")
	
	// Create our pre-populated ExoplanetStore instance
    return ExoplanetStore(
	  database: database,
  	  connection: connection
	)
  }

  // Let's make the initializer we defined previously 
  // private. This prevents anyone accidentally instantiating
  // the store without having pre-loaded our Exoplanet CSV
  // into the database
  private init(database: Database, connection: Connection) {
	...
  }
}

Querying the Database

Now that the database is populated with data, it’s ready to be analyzed. Let’s create a query which we can use to plot a chart of the number of exoplanets discovered by year.

SELECT disc_year, COUNT(disc_year) AS Count
	FROM exoplanets
	GROUP BY disc_year
	ORDER BY disc_year

Issuing the query to DuckDB from within Swift is simple. We’ll again make use of an async function from which to issue our query. This means the caller won’t be blocked while the query is executing. We’ll then cast the result columns to Swift native types using DuckDB’s ResultSet cast(to:) family of methods, before finally wrapping them up in a DataFrame from the TabularData framework, ready for presentation in the UI.

...

import TabularData

extension ExoplanetStore {

  // Retrieves the number of exoplanets discovered by year
  func groupedByDiscoveryYear() async throws -> DataFrame {

    // Issue the query we described above
    let result = try connection.query("""
      SELECT disc_year, COUNT(disc_year) AS Count
        FROM exoplanets
        GROUP BY disc_year
        ORDER BY disc_year
      """)

    // Cast our DuckDB columns to their native Swift
    // equivalent types
    let discoveryYearColumn = result[0].cast(to: Int.self)
    let countColumn = result[1].cast(to: Int.self)

    // Use our DuckDB columns to instantiate TabularData
    // columns and populate a TabularData DataFrame
    return DataFrame(columns: [
      TabularData.Column(discoveryYearColumn)
        .eraseToAnyColumn(),
      TabularData.Column(countColumn)
        .eraseToAnyColumn(),
    ])
  }
}

Visualizing the Results

In just a few lines of code, our database has been created, populated and analyzed – all that’s left to do now is present the results.

And I have a feeling that we’re just getting started…

For the complete example project – including the SwiftUI views and Chart definitions used to visualize the results – clone the DuckDB Swift repo and open up the runnable app project located in Examples/SwiftUI/ExoplanetExplorer.xcodeproj.

We encourage you to modify the code, explore the Exoplanet Archive and DuckDB, and make some discoveries of your own – interplanetary or otherwise!

Conclusion

In this article we’ve introduced the brand new Swift API for DuckDB and demonstrated how quickly you can get up and running analyzing data.

With DuckDB’s incredible performance and analysis capabilities and Swift’s vibrant ecosystem and platform support, there’s never been a better time to begin exploring analytical datasets in Swift.

We can’t wait to see what you do with it. Feel free to reach out on our Discord if you have any questions!


The Swift API for DuckDB is packaged using Swift Package Manager and lives in a new top-level repository available at https://github.com/duckdb/duckdb-swift.

continue reading
2023-04-14Tom Ebergen

The Return of the H2O.ai Database-like Ops Benchmark

TL;DR: We’ve resurrected the H2O.ai database-like ops benchmark with up to date libraries and plan to keep re-running it.

Skip directly to the results

The H2O.ai Database-like Ops Benchmark is a well-known benchmark in the data analytics and R community. The benchmark measures the groupby and join performance of various analytical tools like data.table, polars, dplyr, clickhouse, duckdb and more. Since July 2nd 2021, the benchmark has been dormant, with no result updates or maintenance. Many of the analytical systems measured in the benchmark have since undergone substantial improvements, leaving many of the maintainers curious as to where their analytical tool ranks on the benchmark.

DuckDB has decided to give the H2O.ai benchmark new life and maintain it for the foreseeable future. One reason the DuckDB project has decided to maintain the benchmark is that DuckDB has had 10 new minor releases since the most recent published results on July 2nd, 2021. After managing to run parts of the benchmark on an r3-8xlarge AWS box, DuckDB ranked as a top performer on the benchmark. Additionally, the DuckDB project wants to demonstrate its commitment to performance by consistently comparing DuckDB with other analytical systems. While DuckDB delivers differentiated ease of use, raw performance and scalability are critically important for solving tough problems fast. Plus, just like many of our fellow data folks, we have a need for speed. Therefore, the decision was made to fork the benchmark, modernize the underlying dependencies, and run the benchmark on the latest versions of the included systems. You can find the repository here.

The results of the new benchmark are very interesting, but first a quick summary of the benchmark and what updates took place.

The H2O.ai Database-like Ops Benchmark

There are 5 basic grouping tests and 5 advanced grouping tests. The 10 grouping queries all focus on a combination of the following:

  • Low cardinality (a few big groups)
  • High cardinality (lots of very small groups)
  • Grouping integer types
  • Grouping string types

Each query is run only twice with both results being reported. This way we can see the performance of a cold run and any effects data caching may have. The idea is to avoid reporting any potential “best” results on a hot system. Data analysts only need to run a query once to get their answer. No one drives to the store a second time to get another litre of milk faster.

The time reported is the sum of the time it takes to run all 5 queries twice.

More information about the specific queries can be found below.

The Data and Queries

The queries have not changed since the benchmark went dormant. The data is generated in a rather simple manner. Inspecting the datagen files you can see that the columns are generated with small, medium, and large groups of char and int values. Similar generation logic applies to the join data generation.

Query | SQL | Objective
groupby #1 | SELECT id1, sum(v1) AS v1 FROM tbl GROUP BY id1 | Sum over large cardinality groups, grouped by varchar
groupby #2 | SELECT id1, id2, sum(v1) AS v1 FROM tbl GROUP BY id1, id2 | Sum over medium cardinality groups, grouped by varchars
groupby #3 | SELECT id3, sum(v1) AS v1, mean(v3) AS v3 FROM tbl GROUP BY id3 | Sum and mean over many small cardinality groups, grouped by varchar
groupby #4 | SELECT id4, mean(v1) AS v1, mean(v2) AS v2, mean(v3) AS v3 FROM tbl GROUP BY id4 | Mean over many large cardinality groups, grouped by integer
groupby #5 | SELECT id6, sum(v1) AS v1, sum(v2) AS v2, sum(v3) AS v3 FROM tbl GROUP BY id6 | Sum over many small groups, grouped by integer
advanced groupby #1 | SELECT id4, id5, quantile_cont(v3, 0.5) AS median_v3, stddev(v3) AS sd_v3 FROM tbl GROUP BY id4, id5 | quantile_cont over medium cardinality group, grouped by integers
advanced groupby #2 | SELECT id3, max(v1)-min(v2) AS range_v1_v2 FROM tbl GROUP BY id3 | Range selection over small cardinality groups, grouped by integer
advanced groupby #3 | SELECT id6, v3 AS largest2_v3 FROM (SELECT id6, v3, row_number() OVER (PARTITION BY id6 ORDER BY v3 DESC) AS order_v3 FROM x WHERE v3 IS NOT NULL) sub_query WHERE order_v3 <= 2 | Advanced group by query
advanced groupby #4 | SELECT id2, id4, pow(corr(v1, v2), 2) AS r2 FROM tbl GROUP BY id2, id4 | Arithmetic over medium sized groups, grouped by varchar, integer
advanced groupby #5 | SELECT id1, id2, id3, id4, id5, id6, sum(v3) AS v3, count(*) AS count FROM tbl GROUP BY id1, id2, id3, id4, id5, id6 | Many many small groups, the number of groups is the cardinality of the dataset
join #1 | SELECT x.*, small.id4 AS small_id4, v2 FROM x JOIN small USING (id1) | Joining a large table (x) with a small-sized table on integer type
join #2 | SELECT x.*, medium.id1 AS medium_id1, medium.id4 AS medium_id4, medium.id5 AS medium_id5, v2 FROM x JOIN medium USING (id2) | Joining a large table (x) with a medium-sized table on integer type
join #3 | SELECT x.*, medium.id1 AS medium_id1, medium.id4 AS medium_id4, medium.id5 AS medium_id5, v2 FROM x LEFT JOIN medium USING (id2) | Left join a large table (x) with a medium-sized table on integer type
join #4 | SELECT x.*, medium.id1 AS medium_id1, medium.id2 AS medium_id2, medium.id4 AS medium_id4, v2 FROM x JOIN medium USING (id5) | Join a large table (x) with a medium table on varchar type
join #5 | SELECT x.*, big.id1 AS big_id1, big.id2 AS big_id2, big.id4 AS big_id4, big.id5 AS big_id5, big.id6 AS big_id6, v2 FROM x JOIN big USING (id3) | Join a large table (x) with a large table on integer type

You can find more information about the queries in the Efficiency of Data Processing slides.

Modifications to the Benchmark & Hardware

No modifications have been made to the queries or the data generation. Some scripts required minor modifications so that the current version of the library could be run. The hardware used is slightly different as the exact AWS offering the benchmark previously used is no longer available. Base libraries have been updated as well. GPU libraries were not tested.

The AWS EC2 instance used is an m4.10xlarge:

  • CPU model: Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz
  • CPU cores: 40
  • RAM model: Unknown
  • Memory: 160GB
  • NO GPU specifications
  • R upgraded, 4.0.0 -> 4.2.2
  • Python upgraded 3.[6|7] -> 3.10

Changes made to install scripts of other systems

Pandas, Polars, Dask, and Clickhouse required changes to their setup/install scripts. The changes were relatively minor, consisting mostly of syntax updates and data ingestion updates. Data ingestion did not affect the reported timing results.

Results

You can also look at the results here. DuckDB’s timings have improved significantly since v0.2.7 (released over two years ago). Major contributors to this improvement are parallel grouped aggregation, merged in March 2022, and parallel result set materialization. In addition, DuckDB now supports enum types, which makes DuckDB’s group by aggregation even faster. Improvements to the out-of-core hash join were merged as well, further improving the performance of our joins.

Questions about certain results?

Some solutions may report internal errors for some queries. Feel free to investigate the errors by using the _utils/repro.sh script and file a GitHub issue to resolve any confusion. In addition, there are many areas in the code where certain query results are automatically nullified. If you believe that is the case for a query for your system, or if you have any other questions, you can create a GitHub issue to discuss.

Maintenance plan

DuckDB will continue to maintain this benchmark for the foreseeable future. The process for re-running the benchmarks with updated library versions still needs to be decided.

Do you have any other questions? Would you like to have your system added to the benchmark? Please feel free to read the ReadMe.md in the repository, and if you still have questions, you can reach out to me at [email protected] or on our Discord!

continue reading
2023-03-03Laurens Kuiper

Shredding Deeply Nested JSON, One Vector at a Time

TL;DR: We’ve recently improved DuckDB’s JSON extension so JSON files can be directly queried as if they were tables.

JSON is not scary anymore! Jason IS scary though, even as a duck.

DuckDB has a JSON extension that can be installed and loaded through SQL:

INSTALL 'json';
LOAD 'json';

The JSON extension supports various functions to create, read, and manipulate JSON strings. These functions are similar to the JSON functionality provided by other databases such as PostgreSQL and MySQL. DuckDB uses yyjson internally to parse JSON, a high-performance JSON library written in ANSI C. Many thanks to the yyjson authors and contributors!
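To give a quick taste of these scalar functions, here is a minimal sketch (the JSON literal is just an illustration):

-- json_extract returns JSON (note the quotes in the result),
-- while json_extract_string returns a plain VARCHAR
SELECT json_extract('{"duck": {"name": "Jason"}}', '$.duck.name') AS as_json;
SELECT json_extract_string('{"duck": {"name": "Jason"}}', '$.duck.name') AS as_varchar;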

Besides these functions, DuckDB is now able to read JSON directly! This is done by automatically detecting the types and column names, then converting the values within the JSON to DuckDB’s vectors. The automated schema detection dramatically simplifies working with JSON data and subsequent queries on DuckDB’s vectors are significantly faster!

Reading JSON Automatically with DuckDB

Since the 0.7.0 update, DuckDB has added JSON table functions. To demonstrate these, we will read todos.json, a fake TODO list containing 200 fake TODO items (only the first two items are shown):

[
  {
    "userId": 1,
    "id": 1,
    "title": "delectus aut autem",
    "completed": false
  },
  {
    "userId": 1,
    "id": 2,
    "title": "quis ut nam facilis et officia qui",
    "completed": false
  },
]

Each TODO item is an entry in the JSON array, but in DuckDB, we’d like a table where each entry is a row. This is now (since DuckDB’s 0.7.0 release in February 2023) as easy as:

SELECT * FROM 'todos.json';
userId id title completed
1 1 delectus aut autem false
1 2 quis ut nam facilis et officia qui false
1 3 fugiat veniam minus false
1 4 et porro tempora true
1 5 laboriosam mollitia et enim quasi adipisci quia provident illum false

(Note: Only 5 rows shown)

Now, finding out which user completed the most TODO items is as simple as:

SELECT userId, sum(completed::int) total_completed
FROM 'todos.json'
GROUP BY userId
ORDER BY total_completed DESC
LIMIT 1;
userId total_completed
5 12

Under the hood, DuckDB recognizes the .json file extension in 'todos.json', and calls read_json_auto('todos.json') instead. This function is similar to our read_csv_auto function, which automatically infers column names and types for CSV files.
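In other words, writing FROM 'todos.json' is equivalent to calling the table function explicitly:

SELECT * FROM read_json_auto('todos.json');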

Like our other table functions, read_json_auto supports reading multiple files by passing a list, e.g., read_json_auto(['file1.json', 'file2.json']), but also globbing, e.g., read_json_auto('file*.json'). DuckDB will read multiple files in parallel.
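For example, spelled out as full queries:

-- Read an explicit list of files
SELECT * FROM read_json_auto(['file1.json', 'file2.json']);
-- Read all files matching a glob pattern
SELECT * FROM read_json_auto('file*.json');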

Newline Delimited JSON

Not all JSON adheres to the format used in todos.json, which is an array of ‘records’. Newline-delimited JSON, or NDJSON, stores each row on a new line. DuckDB also supports reading (and writing!) this format. First, let’s write our TODO list as NDJSON:

COPY (SELECT * FROM 'todos.json') to 'todos2.json';

Again, DuckDB recognizes the .json suffix in the output file and automatically infers that we mean to use (FORMAT JSON). The created file looks like this (only the first two records are shown):

{"userId":1,"id":1,"title":"delectus aut autem","completed":false}
{"userId":1,"id":2,"title":"quis ut nam facilis et officia qui","completed":false}

DuckDB can read this file in precisely the same way as the original one:

SELECT * FROM 'todos2.json';

If your JSON file is newline-delimited, DuckDB can parallelize reading. This is specified either by using one of the read_ndjson functions (note the nd in the name) or by setting the lines parameter:

SELECT * FROM read_ndjson_auto('todos2.json');
SELECT * FROM read_json_auto('todos2.json', lines='true');

You can also set lines='auto' to auto-detect whether the JSON file is newline-delimited.
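For example:

SELECT * FROM read_json_auto('todos2.json', lines='auto');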

Other JSON Formats

If using the read_json function directly, the format of the JSON can be specified using the json_format parameter. This parameter defaults to 'auto', which tells DuckDB to infer what kind of JSON we are dealing with. Our original todos.json uses the 'array_of_records' format, while the newline-delimited todos2.json uses 'records'. This can be specified like so:

SELECT * FROM read_json('todos.json', auto_detect=true, json_format='array_of_records');
SELECT * FROM read_json('todos2.json', auto_detect=true, json_format='records');

Other supported formats are 'values' and 'array_of_values', which are similar to 'records' and 'array_of_records'. However, with these formats, each ‘record’ is not required to be a JSON object but can also be a JSON array, string, or anything supported in JSON.
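For illustration, reading a hypothetical values.json, where each line holds a bare JSON value rather than an object, would look like this:

SELECT * FROM read_json('values.json', auto_detect=true, json_format='values');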

Manual Schemas

What you may also have noticed is the auto_detect parameter. This parameter tells DuckDB to infer the schema, i.e., determine the names and types of the returned columns. These can manually be specified like so:

SELECT * FROM read_json('todos.json',
                        columns={userId: 'INT', id: 'INT', title: 'VARCHAR', completed: 'BOOLEAN'},
                        json_format='array_of_records');

You don’t have to specify all fields, just the ones you’re interested in:

SELECT * FROM read_json('todos.json',
                        columns={userId: 'INT', completed: 'BOOLEAN'},
                        json_format='array_of_records');

Now that we know how to use the new DuckDB JSON table functions, let’s dive into some analytics!

GitHub Archive Examples

GH Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis. Every hour, a GZIP compressed, newline-delimited JSON file containing all public events on GitHub is uploaded. I’ve downloaded a whole day (2023-02-08) of activity using wget and stored the 24 files in a directory called gharchive_gz.

wget https://data.gharchive.org/2023-02-08-0.json.gz
wget https://data.gharchive.org/2023-02-08-1.json.gz
...
wget https://data.gharchive.org/2023-02-08-23.json.gz

Keep in mind that the data is compressed:

$ du -sh gharchive_gz
2.3G  gharchive_gz
$ gunzip -dc gharchive_gz/* | wc -c
 18396198934

One day of GitHub activity amounts to more than 18GB of JSON, which compresses to 2.3GB with GZIP.

To get a feel of what the data looks like, we run the following query:

SELECT json_group_structure(json)
FROM (
  SELECT *
  FROM read_ndjson_objects('gharchive_gz/*.json.gz')
  LIMIT 2048
);

Here, we use our read_ndjson_objects function, which reads the JSON objects in the file as raw JSON, i.e., as strings. The query reads the first 2048 records of JSON from the JSON files in the gharchive_gz directory and describes their structure. You can also directly query the JSON files from GH Archive using DuckDB’s httpfs extension, but since we will be querying the files multiple times, it is better to download them in this case.
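For completeness, a direct remote read would look roughly like this – a sketch that assumes the httpfs extension is installed and uses one of the hourly files mentioned above:

INSTALL httpfs;
LOAD httpfs;
-- Count the events in a single remote hourly file
SELECT count(*)
FROM read_ndjson_objects('https://data.gharchive.org/2023-02-08-0.json.gz');

For the rest of this post, however, we’ll stick with the downloaded files and the json_group_structure query above.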

I’ve formatted the result using an online JSON formatter & validator:

{
   "id":"VARCHAR",
   "type":"VARCHAR",
   "actor":{
      "id":"UBIGINT",
      "login":"VARCHAR",
      "display_login":"VARCHAR",
      "gravatar_id":"VARCHAR",
      "url":"VARCHAR",
      "avatar_url":"VARCHAR"
   },
   "repo":{
      "id":"UBIGINT",
      "name":"VARCHAR",
      "url":"VARCHAR"
   },
   "payload":{"..."},
   "public":"BOOLEAN",
   "created_at":"VARCHAR",
   "org":{
      "id":"UBIGINT",
      "login":"VARCHAR",
      "gravatar_id":"VARCHAR",
      "url":"VARCHAR",
      "avatar_url":"VARCHAR"
   }
}

I’ve left "payload" out because it consists of deeply nested JSON, and its formatted structure takes up more than 1000 lines!

So, how many records are we dealing with exactly? Let’s count it using DuckDB:

SELECT count(*) count FROM 'gharchive_gz/*.json.gz';
count
4434953

That’s around 4.4M daily events, which amounts to almost 200K events per hour. This query takes around 7.3 seconds on my laptop, a 2020 MacBook Pro with an M1 chip and 16GB of memory. This is the time it takes to decompress the GZIP compression and parse every JSON record.

To see how much time is spent decompressing GZIP in the query, I’ve also created a gharchive directory containing the same data but uncompressed. Running the same query on the uncompressed data takes around 5.4s, almost 2 seconds faster. So we got faster, but we also read more than 18GB of data from storage, as opposed to 2.3GB when it was compressed. So, this comparison really depends on the speed of your storage. I prefer to keep the data compressed.

As a side note, the speed of this query really shows how fast yyjson is!

So, what kind of events are in the GitHub data?

SELECT type, count(*) count
FROM 'gharchive_gz/*.json.gz'
GROUP BY type
ORDER BY count DESC;
type count
PushEvent 2359096
CreateEvent 624062
PullRequestEvent 366090
IssueCommentEvent 238660
WatchEvent 231486
DeleteEvent 154383
PullRequestReviewEvent 131107
IssuesEvent 88917
PullRequestReviewCommentEvent 79540
ForkEvent 64233
CommitCommentEvent 36823
ReleaseEvent 23004
MemberEvent 14872
PublicEvent 14500
GollumEvent 8180

This query takes around 7.4s, not much more than the count(*) query. So as we can see, data analysis is very fast once everything has been decompressed and parsed.

Unsurprisingly, the most common event type is the PushEvent, taking up more than half of all events: people pushing their committed code to GitHub. The least common event type is the GollumEvent, the creation or update of a wiki page, which makes up less than 1% of all events.

If we want to analyze the same data multiple times, decompressing and parsing every time is redundant. Instead, we can create a DuckDB table like so:

CREATE TABLE events AS
SELECT * EXCLUDE (payload)
FROM 'gharchive_gz/*.json.gz';

This takes around 9s if you’re using an in-memory database. If you’re using an on-disk database, it takes around 13s and results in a database size of 444MB. When using an on-disk database, DuckDB ensures the table is persistent and performs all kinds of compression. Note that we have temporarily ignored the payload field using the convenient EXCLUDE clause.
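If you’re curious about the on-disk footprint at any point, you can ask DuckDB for it directly. A small sketch, assuming you’re connected to the on-disk database:

-- Report the database size and related storage statistics
PRAGMA database_size;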

To get a feel of what we read, we can ask DuckDB to describe the table:

DESCRIBE SELECT * FROM events;

This gives us the following:

cid name type notnull dflt_value pk
0 id BIGINT false   false
1 type VARCHAR false   false
2 actor STRUCT(id UBIGINT, login VARCHAR, display_login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR) false   false
3 repo STRUCT(id UBIGINT, name VARCHAR, url VARCHAR) false   false
4 public BOOLEAN false   false
5 created_at TIMESTAMP false   false
6 org STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR) false   false

As we can see, the "actor", "repo" and "org" fields, which are JSON objects, have been converted to DuckDB structs. The "id" column was a string in the original JSON but has been converted to a BIGINT by DuckDB’s automatic type detection. DuckDB can also detect a few different DATE/TIMESTAMP formats within JSON strings, as well as TIME and UUID.

Now that we’ve created the table, we can analyze it like any other DuckDB table! Let’s see how much activity there was in the duckdb/duckdb GitHub repository on this specific day:

SELECT type, count(*) count
FROM events
WHERE repo.name = 'duckdb/duckdb'
GROUP BY type
ORDER BY count DESC;
type count
PullRequestEvent 35
IssueCommentEvent 30
WatchEvent 29
PushEvent 15
PullRequestReviewEvent 14
IssuesEvent 9
PullRequestReviewCommentEvent 7
ForkEvent 3

That’s a lot of pull request activity! Note that this doesn’t mean that 35 pull requests were opened on this day, activity within a pull request is also counted. If we search through the pull requests for that day, we see that there are only 15. This is more activity than normal because most of the DuckDB developers were busy fixing bugs for the 0.7.0 release.

Now, let’s see who was the most active:

SELECT actor.login, count(*) count
FROM events
WHERE repo.name = 'duckdb/duckdb'
  AND type = 'PullRequestEvent'
GROUP BY actor.login
ORDER BY count desc
LIMIT 5;
login count
Mytherin 19
Mause 4
carlopi 3
Tmonster 2
lnkuiper 2

As expected, Mark (Mytherin, co-founder of DuckDB Labs) was the most active! My activity (lnkuiper, software engineer at DuckDB Labs) also shows up.

Handling inconsistent JSON schemas

So far, we have ignored the "payload" of the events. We’ve ignored it because the contents of this field are different based on the type of event. We can see how they differ with the following query:

SELECT json_group_structure(payload) structure
FROM (SELECT *
  FROM read_json(
    'gharchive_gz/*.json.gz',
    columns={
      id: 'BIGINT',
      type: 'VARCHAR',
      actor: 'STRUCT(id UBIGINT,
                     login VARCHAR,
                     display_login VARCHAR,
                     gravatar_id VARCHAR,
                     url VARCHAR,
                     avatar_url VARCHAR)',
      repo: 'STRUCT(id UBIGINT, name VARCHAR, url VARCHAR)',
      payload: 'JSON',
      public: 'BOOLEAN',
      created_at: 'TIMESTAMP',
      org: 'STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)'
    },
    lines='true'
  )
  WHERE type = 'WatchEvent'
  LIMIT 2048
);
structure
{"action":"VARCHAR"}

The "payload" field is simple for events of type WatchEvent. However, if we change the type to PullRequestEvent, we get a JSON structure of more than 500 lines when formatted with a JSON formatter. We don’t want to look through all those fields, so we cannot use our automatic schema detection, which will try to get them all. Instead, we can manually supply the structure of the fields we’re interested in. DuckDB will skip reading the other fields. Another approach is to store the "payload" field as DuckDB’s JSON data type and parse it at query time (see the example later in this post!).

I’ve stripped down the JSON structure for the "payload" of events with the type PullRequestEvent to the things I’m actually interested in:

{
   "action":"VARCHAR",
   "number":"UBIGINT",
   "pull_request":{
      "url":"VARCHAR",
      "id":"UBIGINT",
      "title":"VARCHAR",
      "user":{
         "login":"VARCHAR",
         "id":"UBIGINT",
      },
      "body":"VARCHAR",
      "created_at":"TIMESTAMP",
      "updated_at":"TIMESTAMP",
      "assignee":{
         "login":"VARCHAR",
         "id":"UBIGINT",
      },
      "assignees":[
         {
            "login":"VARCHAR",
            "id":"UBIGINT",
         }
      ],
  }
}

This is technically not valid JSON because there are trailing commas. However, we try to allow trailing commas wherever possible in DuckDB, including JSON!

We can now plug this into the columns parameter of read_json, but we need to convert it to a DuckDB type first. I’m lazy, so I prefer to let DuckDB do this for me:

SELECT typeof(json_transform('{}', '{
   "action":"VARCHAR",
   "number":"UBIGINT",
   "pull_request":{
      "url":"VARCHAR",
      "id":"UBIGINT",
      "title":"VARCHAR",
      "user":{
         "login":"VARCHAR",
         "id":"UBIGINT",
      },
      "body":"VARCHAR",
      "created_at":"TIMESTAMP",
      "updated_at":"TIMESTAMP",
      "assignee":{
         "login":"VARCHAR",
         "id":"UBIGINT",
      },
      "assignees":[
         {
            "login":"VARCHAR",
            "id":"UBIGINT",
         }
      ],
  }
}'));

This gives us back a DuckDB type that we can plug into our function! Note that because we are not auto-detecting the schema, we have to supply timestampformat to be able to parse the timestamps correctly. The key "user" must be surrounded by quotes because it is a reserved keyword in SQL:

CREATE TABLE pr_events as
SELECT *
FROM read_json(
  'gharchive_gz/*.json.gz',
  columns={
    id: 'BIGINT',
    type: 'VARCHAR',
    actor: 'STRUCT(id UBIGINT,
                   login VARCHAR,
                   display_login VARCHAR,
                   gravatar_id VARCHAR,
                   url VARCHAR,
                   avatar_url VARCHAR)',
    repo: 'STRUCT(id UBIGINT, name VARCHAR, url VARCHAR)',
    payload: 'STRUCT(
                action VARCHAR,
                number UBIGINT,
                pull_request STRUCT(
                  url VARCHAR,
                  id UBIGINT,
                  title VARCHAR,
                  "user" STRUCT(
                    login VARCHAR,
                    id UBIGINT
                  ),
                  body VARCHAR,
                  created_at TIMESTAMP,
                  updated_at TIMESTAMP,
                  assignee STRUCT(login VARCHAR, id UBIGINT),
                  assignees STRUCT(login VARCHAR, id UBIGINT)[]
                )
              )',
    public: 'BOOLEAN',
    created_at: 'TIMESTAMP',
    org: 'STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)'
  },
  json_format='records',
  lines='true',
  timestampformat='%Y-%m-%dT%H:%M:%SZ'
)
WHERE type = 'PullRequestEvent';

This query completes in around 36s with an on-disk database (resulting size is 478MB) and 9s with an in-memory database. If you don’t care about preserving insertion order, you can speed the query up with this setting:

SET preserve_insertion_order=false;

With this setting, the query completes in around 27s with an on-disk database and 8.5s with an in-memory database. The difference between the on-disk and in-memory case is quite substantial here because DuckDB has to compress and persist much more data.

Now we can analyze pull request events! Let’s see what the maximum number of assignees is:

SELECT max(length(payload.pull_request.assignees)) max_assignees
FROM pr_events;
max_assignees
10

That’s a lot of people reviewing a single pull request!

We can check who was assigned the most:

WITH assignees AS (
  SELECT payload.pull_request.assignee.login assignee
  FROM pr_events
  UNION ALL
  SELECT unnest(payload.pull_request.assignees).login assignee
  FROM pr_events
)
SELECT assignee, count(*) count
FROM assignees
WHERE assignee NOT NULL
GROUP BY assignee
ORDER BY count DESC
LIMIT 5;
assignee count
poad 494
vinayakkulkarni 268
tmtmtmtm 198
fisker 98
icemac 84

That’s a lot of assignments! Although I suspect there are duplicates in here.

Storing as JSON to parse at query time

Specifying the JSON schema of the "payload" field was helpful because it allowed us to directly analyze what is there, and subsequent queries are much faster. Still, it can also be quite cumbersome if the schema is complex. If you don’t want to specify the schema of a field, you can set the type as 'JSON':

CREATE TABLE pr_events AS
SELECT *
FROM read_json(
  'gharchive_gz/*.json.gz',
  columns={
    id: 'BIGINT',
    type: 'VARCHAR',
    actor: 'STRUCT(id UBIGINT,
                   login VARCHAR,
                   display_login VARCHAR,
                   gravatar_id VARCHAR,
                   url VARCHAR,
                   avatar_url VARCHAR)',
    repo: 'STRUCT(id UBIGINT, name VARCHAR, url VARCHAR)',
    payload: 'JSON',
    public: 'BOOLEAN',
    created_at: 'TIMESTAMP',
    org: 'STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)'
  },
  json_format='records',
  lines='true',
  timestampformat='%Y-%m-%dT%H:%M:%SZ'
)
WHERE type = 'PullRequestEvent';

This will load the "payload" field as a JSON string, and we can use DuckDB’s JSON functions to analyze it when querying. For example:

SELECT DISTINCT payload->>'action' AS action, count(*) count
FROM pr_events
GROUP BY action
ORDER BY count DESC;

The ->> arrow is short-hand for our json_extract_string function. Creating the entire "payload" field as a column with type JSON is not the most efficient way to get just the "action" field, but this example is just to show the flexibility of read_json. The query results in the following table:

action count
opened 189096
closed 174914
reopened 2080

As we can see, only a few pull requests have been reopened.
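For reference, the same extraction can be written with the full function name instead of the arrow shorthand:

SELECT json_extract_string(payload, '$.action') AS action, count(*) count
FROM pr_events
GROUP BY action
ORDER BY count DESC;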

Conclusion

DuckDB tries to be an easy-to-use tool that can read all kinds of data formats. In the 0.7.0 release, we have added support for reading JSON. JSON comes in many formats and all kinds of schemas. DuckDB’s rich support for nested types (LIST, STRUCT) allows it to fully ‘shred’ the JSON to a columnar format for more efficient analysis.

We are excited to hear what you think about our new JSON functionality. If you have any questions or suggestions, please reach out to us on Discord or GitHub!

continue reading
2023-02-24Guest post by Eduardo Blancas

JupySQL Plotting with DuckDB

TL;DR: JupySQL provides a seamless SQL experience in Jupyter and uses DuckDB to visualize larger than memory datasets in matplotlib.

continue reading
2023-02-13Mark Raasveldt

Announcing DuckDB 0.7.0

Image of the labrador duck

The DuckDB team is happy to announce the latest DuckDB version (0.7.0) has been released. This release of DuckDB is named “Labradorius” after the Labrador Duck (Camptorhynchus labradorius) that was native to North America.

To install the new version, please visit the installation guide. The full release notes can be found here.

continue reading
2022-11-25Pedro Holanda

DuckCon 2023 - 2nd edition

The DuckDB team is excited to invite you all for our second DuckCon user group meeting. It will take place the day before FOSDEM in Brussels on Feb 3rd, 2023, at the Hilton Hotel.

In this edition, we will have the DuckDB creators Hannes Mühleisen and Mark Raasveldt talking about the current state of DuckDB and future plans. We will also have one talk on building out-of-tree extensions from DuckDB contributors Pedro Holanda and Sam Ansmink, and one invited talk from MotherDuck’s Founding Engineers, Boaz Leskes and Yves Le Maout. In addition, we will have a session with lightning talks from our users. We will close the event with a networking session with food and drinks, where users are invited to interact directly with the DuckDB team.

continue reading
2022-11-14Mark Raasveldt

Announcing DuckDB 0.6.0

Image of white-headed duck

The DuckDB team is happy to announce the latest DuckDB version (0.6.0) has been released. This release of DuckDB is named “Oxyura” after the White-headed duck (Oxyura leucocephala) which is an endangered species native to Eurasia.

To install the new version, please visit the installation guide. Note that the release is still being rolled out, so not all artifacts may be published yet. The full release notes can be found here.

continue reading
2022-10-28Mark Raasveldt

Lightweight Compression in DuckDB

Matroshka Ducks (ducks going from big to small)

TL;DR: DuckDB supports efficient lightweight compression that is automatically used to keep data size down without incurring high costs for compression and decompression.

When working with large amounts of data, compression is critical for reducing storage size and egress costs. Compression algorithms typically reduce data set size by 75-95%, depending on how compressible the data is. Compression not only reduces the storage footprint of a data set, but also often improves performance as less data has to be read from disk or over a network connection.

continue reading
2022-10-12Guest post by Jacob Matson

Modern Data Stack in a Box with DuckDB

Duck on a box

TL;DR: A fast, free, and open-source Modern Data Stack (MDS) can now be fully deployed on your laptop or to a single machine using the combination of DuckDB, Meltano, dbt, and Apache Superset.

This post is a collaboration with Jacob Matson and cross-posted on dataduel.co.

Summary

There is a large volume of literature (1, 2, 3) about scaling data pipelines. “Use Kafka! Build a lake house! Don’t build a lake house, use Snowflake! Don’t use Snowflake, use XYZ!” However, with advances in hardware and the rapid maturation of data software, there is a simpler approach. This article will light up the path to highly performant single node analytics with an MDS-in-a-box open source stack: Meltano, DuckDB, dbt, & Apache Superset on Windows using Windows Subsystem for Linux (WSL). There are many options within the MDS, so if you are using another stack to build an MDS-in-a-box, please share it with the community on the DuckDB Twitter, GitHub, or Discord, or the dbt slack! Or just stop by for a friendly debate about our choice of tools!

continue reading
2022-09-30Hannes Mühleisen

Querying Postgres Tables Directly From DuckDB

TL;DR: DuckDB can now directly query tables stored in PostgreSQL and speed up complex analytical queries without duplicating data.

continue reading
2022-07-27Pedro Holanda

Persistent Storage of Adaptive Radix Trees in DuckDB

DuckDB ART

TL;DR: DuckDB uses Adaptive Radix Tree (ART) Indexes to enforce constraints and to speed up query filters. Up to this point, indexes were not persisted, causing issues like loss of indexing information and high reload times for tables with data constraints. We now persist ART Indexes to disk, drastically diminishing database loading times (up to orders of magnitude), and we no longer lose track of existing indexes. This blog post contains a deep dive into the implementation of ART storage, benchmarks, and future work. Finally, to better understand how our indexes are used, I’m asking you to answer the following survey. It will guide us when defining our future roadmap.

continue reading
2022-05-27Richard Wesley

Range Joins in DuckDB

TL;DR: DuckDB has fully parallelised range joins that can efficiently join millions of range predicates.

Range intersection joins are an important operation in areas such as temporal analytics, and occur when two inequality conditions are present in a join predicate. Database implementations often rely on slow O(N^2) algorithms that compare every pair of rows for these operations. Instead, DuckDB leverages its fast sorting logic to implement two highly optimized parallel join operators for these kinds of range predicates, resulting in 20-30x faster queries. With these operators, DuckDB can be used effectively in more time-series-oriented use cases.

continue reading
2022-05-04Alex Monahan

Friendlier SQL with DuckDB

Chewbacca_the_duck

An elegant user experience is a key design goal of DuckDB. This goal guides much of DuckDB’s architecture: it is simple to install, seamless to integrate with other data structures like Pandas, Arrow, and R Dataframes, and requires no dependencies. Parallelization occurs automatically, and if a computation exceeds available memory, data is gracefully buffered out to disk. And of course, DuckDB’s processing speed makes it easier to get more work accomplished.

However, SQL is not famous for being user-friendly. DuckDB aims to change that! DuckDB includes both a Relational API for dataframe-style computation, and a highly Postgres-compatible version of SQL. If you prefer dataframe-style computation, we would love your feedback on our roadmap. If you are a SQL fan, read on to see how DuckDB is bringing together both innovation and pragmatism to make it easier to write SQL in DuckDB than anywhere else. Please reach out on GitHub or Discord and let us know what other features would simplify your SQL workflows. Join us as we teach an old dog new tricks!

continue reading
2022-03-07Hannes Mühleisen and Mark Raasveldt

Parallel Grouped Aggregation in DuckDB

TL;DR: DuckDB has a fully parallelized aggregate hash table that can efficiently aggregate over millions of groups.

Grouped aggregations are a core data analysis command. It is particularly important for large-scale data analysis (“OLAP”) because it is useful for computing statistical summaries of huge tables. DuckDB contains a highly optimized parallel aggregation capability for fast and scalable summarization.

Jump straight to the benchmarks?

continue reading
2022-01-06Richard Wesley

DuckDB Time Zones: Supporting Calendar Extensions

TL;DR: The DuckDB ICU extension now provides time zone support.

Time zone support is a common request for temporal analytics, but the rules are complex and somewhat arbitrary. The most well supported library for locale-specific operations is the International Components for Unicode (ICU). DuckDB already provided collated string comparisons using ICU via an extension (to avoid dependencies), and we have now connected the existing ICU calendar and time zone functions to the main code via the new TIMESTAMP WITH TIME ZONE (or TIMESTAMPTZ for short) data type. The ICU extension is pre-bundled in DuckDB’s Python client and can be optionally installed in the remaining clients.

In this post, we will describe how time works in DuckDB and what time zone functionality has been added.

continue reading
2021-12-03Pedro Holanda and Jonathan Keane

DuckDB quacks Arrow: A zero-copy data integration between Apache Arrow and DuckDB

TL;DR: The zero-copy integration between DuckDB and Apache Arrow allows for rapid analysis of larger than memory datasets in Python and R using either SQL or relational APIs.

This post is a collaboration with and cross-posted on the Arrow blog.

continue reading
2021-11-26Pedro Holanda

DuckDB - The Lord of Enums:
The Fellowship of the Categorical and Factors.

dict-enc

String types are one of the most commonly used types. However, string columns often have a limited number of distinct values. For example, a country column will never have more than a few hundred unique entries. Storing such a column as plain strings wastes storage and compromises query performance. A better solution is to dictionary encode these columns. In dictionary encoding, the data is split into two parts: the category and the values. The category stores the actual strings, and the values store references to those strings. This encoding is depicted below.
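In DuckDB, this idea surfaces as the ENUM type. A minimal sketch, with a hypothetical table and value set:

-- Define a dictionary-encoded type with a fixed set of categories
CREATE TYPE country AS ENUM ('NL', 'US', 'DE');
-- The column stores compact references to the categories rather than full strings
CREATE TABLE people (name VARCHAR, home_country country);
INSERT INTO people VALUES ('Alice', 'NL'), ('Bob', 'US');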

continue reading
2021-11-12Richard Wesley

Fast Moving Holistic Aggregates

TL;DR: DuckDB, a free and Open-Source analytical data management system, has a windowing API that can compute complex moving aggregates like interquartile ranges and median absolute deviation much faster than the conventional approaches.

In a previous post, we described the DuckDB windowing architecture and mentioned the support for some advanced moving aggregates. In this post, we will compare the performance of various possible implementations of these moving aggregate functions and explain how DuckDB’s performant implementations work.

continue reading
2021-10-29André Kohn and Dominik Moritz

DuckDB-Wasm: Efficient Analytical SQL in the Browser

TL;DR: DuckDB-Wasm is an in-process analytical SQL database for the browser. It is powered by WebAssembly, speaks Arrow fluently, reads Parquet, CSV and JSON files backed by Filesystem APIs or HTTP requests and has been tested with Chrome, Firefox, Safari and Node.js. You can try it in your browser at shell.duckdb.org or on Observable.

continue reading
2021-10-13Richard Wesley

Windowing in DuckDB

TL;DR: DuckDB, a free and Open-Source analytical data management system, has a state-of-the-art windowing engine that can compute complex moving aggregates like inter-quartile ranges as well as simpler moving averages.

Window functions (those using the OVER clause) are important tools for analysing data series, but they can be slow if not implemented carefully. In this post, we will take a look at how DuckDB implements windowing. We will also see how DuckDB can leverage its aggregate function architecture to compute useful moving aggregates such as moving inter-quartile ranges (IQRs).

continue reading
2021-08-27Laurens Kuiper

Fastest table sort in the West - Redesigning DuckDB’s sort

TL;DR: DuckDB, a free and Open-Source analytical data management system, has a new highly efficient parallel sorting implementation that can sort much more data than fits in main memory.

Database systems use sorting for many purposes, the most obvious purpose being when a user adds an ORDER BY clause to their query. Sorting is also used within operators, such as window functions. DuckDB recently improved its sorting implementation, which is now able to sort data in parallel and sort more data than fits in memory. In this post, we will take a look at how DuckDB sorts, and how this compares to other data management systems.

continue reading
2021-06-25Hannes Mühleisen and Mark Raasveldt

Querying Parquet with Precision using DuckDB

TL;DR: DuckDB, a free and open source analytical data management system, can run SQL queries directly on Parquet files and automatically take advantage of the advanced features of the Parquet format.

Apache Parquet is the most common “Big Data” storage format for analytics. In Parquet files, data is stored in a columnar-compressed binary format. Each Parquet file stores a single table. The table is partitioned into row groups, which each contain a subset of the rows of the table. Within a row group, the table data is stored in a columnar fashion.

continue reading
2021-05-14Mark Raasveldt and Hannes Mühleisen

Efficient SQL on Pandas with DuckDB

TL;DR: DuckDB, a free and open source analytical data management system, can efficiently run SQL queries directly on Pandas DataFrames.

Recently, an article was published advocating for using SQL for Data Analysis. Here at team DuckDB, we are huge fans of SQL. It is a versatile and flexible language that allows the user to efficiently perform a wide variety of data transformations, without having to care about how the data is physically represented or how to do these data transformations in the most optimal way.

continue reading
2021-01-25Laurens Kuiper

Testing out DuckDB's Full Text Search Extension

TL;DR: DuckDB now has full-text search functionality, similar to the FTS5 extension in SQLite. The main difference is that our FTS extension is fully formulated in SQL. We tested it out on TREC disks 4 and 5.

Searching through textual data stored in a database can be cumbersome, as SQL does not provide a good way of formulating questions such as “Give me all the documents about Mallard Ducks”: string patterns with LIKE will only get you so far. Despite SQL’s shortcomings here, storing textual data in a database is commonplace. Consider the table products (id INT, name VARCHAR, description VARCHAR) - it would be useful to search through the name and description columns for a website that sells these products.
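To give a flavor of what the extension enables, here is a minimal sketch against that products table, assuming the current fts extension API and a hypothetical search string:

INSTALL fts;
LOAD fts;
-- Build a full-text index over the name and description columns, keyed by id
PRAGMA create_fts_index('products', 'id', 'name', 'description');
-- Rank products by BM25 relevance for the search terms
SELECT id, name, score
FROM (
    SELECT *, fts_main_products.match_bm25(id, 'Mallard Ducks') AS score
    FROM products
) sq
WHERE score IS NOT NULL
ORDER BY score DESC;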

continue reading