DuckDB Tricks – Part 3
TL;DR: In this new installment of the DuckDB Tricks series, we present features for convenient handling of tables and performance optimization tips for Parquet and CSV files.
Overview
We continue our DuckDB Tricks series with a third part, where we showcase friendly SQL features and performance optimizations.
Operation | SQL instructions |
---|---|
Excluding columns from a table | EXCLUDE or COLUMNS(…) and NOT SIMILAR TO |
Renaming columns with pattern matching | COLUMNS(…) AS … |
Loading with globbing | FROM '*.csv' |
Reordering Parquet files | COPY (FROM … ORDER BY …) TO … |
Hive partitioning | hive_partitioning = true |
Dataset
We'll use a subset of the Dutch railway services dataset, which was already featured in a blog post earlier this year.
This time, we'll use the CSV files covering January to October 2024: services-2024-01-to-10.zip.
If you would like to follow the examples, download and decompress the dataset before proceeding.
Excluding Columns from a Table
First, let's look at the data in the CSV files.
We pick the CSV file for August and inspect it with the DESCRIBE
statement.
DESCRIBE FROM 'services-2024-08.csv';
The result is a table with the column names and the column types.
column_name | column_type | null | key | default | extra |
---|---|---|---|---|---|
Service:RDT-ID | BIGINT | YES | NULL | NULL | NULL |
Service:Date | DATE | YES | NULL | NULL | NULL |
Service:Type | VARCHAR | YES | NULL | NULL | NULL |
Service:Company | VARCHAR | YES | NULL | NULL | NULL |
Service:Train number | BIGINT | YES | NULL | NULL | NULL |
… | … | … | … | … | … |
Now, let's use SUMMARIZE
to inspect some statistics about the columns.
SUMMARIZE FROM 'services-2024-08.csv';
With SUMMARIZE
, we get 10 statistics about our data (min
, max
, approx_unique
, etc.).
If we want to remove a few of them from the result, we can use the EXCLUDE modifier.
For example, to exclude min, max and the quantiles q25, q50 and q75, we can issue the following command:
SELECT * EXCLUDE(min, max, q25, q50, q75)
FROM (SUMMARIZE FROM 'services-2024-08.csv');
Alternatively, we can use the COLUMNS
expression with the NOT SIMILAR TO
operator.
This works with a regular expression:
SELECT COLUMNS(c -> c NOT SIMILAR TO 'min|max|q.*')
FROM (SUMMARIZE FROM 'services-2024-08.csv');
In both cases, the resulting table will contain the 5 remaining statistical columns:
column_name | column_type | approx_unique | avg | std | count | null_percentage |
---|---|---|---|---|---|---|
Service:RDT-ID | BIGINT | 259022 | 14200071.03736433 | 59022.836209662266 | 1846574 | 0.00 |
Service:Date | DATE | 32 | NULL | NULL | 1846574 | 0.00 |
Service:Type | VARCHAR | 20 | NULL | NULL | 1846574 | 0.00 |
Service:Company | VARCHAR | 12 | NULL | NULL | 1846574 | 0.00 |
Service:Train number | BIGINT | 17264 | 57781.81688196628 | 186353.76365744913 | 1846574 | 0.00 |
… | … | … | … | … | … | … |
Renaming Columns with Pattern Matching
Upon inspecting the columns, we see that their names contain spaces and colons (:).
These special characters make writing queries a bit tedious, as they necessitate quoting column names with double quotes.
For example, we have to write "Service:Company"
in the following query:
SELECT DISTINCT "Service:Company" AS company,
FROM 'services-2024-08.csv'
ORDER BY company;
Let's see how we can rename the columns using the COLUMNS
expression.
To replace the special characters (each column name contains at most two of them), we can write the following query:
SELECT COLUMNS('(.*?)_*$') AS "\1"
FROM (
SELECT COLUMNS('(\w*)\W*(\w*)\W*(\w*)') AS "\1_\2_\3"
FROM 'services-2024-08.csv'
);
Adding DESCRIBE at the beginning of the query shows the renamed columns.
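Concretely, this is just the renaming query from above with the DESCRIBE keyword in front:

DESCRIBE
SELECT COLUMNS('(.*?)_*$') AS "\1"
FROM (
    SELECT COLUMNS('(\w*)\W*(\w*)\W*(\w*)') AS "\1_\2_\3"
    FROM 'services-2024-08.csv'
);

The result shows the cleaned-up column names: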
column_name | column_type | null | key | default | extra |
---|---|---|---|---|---|
Service_RDT_ID | BIGINT | YES | NULL | NULL | NULL |
Service_Date | DATE | YES | NULL | NULL | NULL |
Service_Type | VARCHAR | YES | NULL | NULL | NULL |
Service_Company | VARCHAR | YES | NULL | NULL | NULL |
Service_Train_number | BIGINT | YES | NULL | NULL | NULL |
… | … | … | … | … | … |
Let's break down the query starting with the first COLUMNS
expression:
SELECT COLUMNS('(\w*)\W*(\w*)\W*(\w*)') AS "\1_\2_\3"
Here, we use a regular expression with (\w*) groups that capture 0…n word characters ([0-9A-Za-z_]).
Meanwhile, the expression \W* matches 0…n non-word characters ([^0-9A-Za-z_]).
In the alias part, we refer to capture group i with \i, so "\1_\2_\3" means that we keep only the word characters and separate the groups with underscores (_).
However, because some column names have a second, space-separated word after the colon while others don't, this SELECT statement produces some column names with a trailing underscore (_), e.g., Service_Date_.
Thus, we need an additional processing step:
SELECT COLUMNS('(.*?)_*$') AS "\1"
Here, we capture the group of characters without the trailing underscore(s) and rename the columns to \1
, which removes the trailing underscores.
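As a concrete illustration, here's a small sketch that applies the same regular expressions to two literal column names using the regexp_replace and regexp_extract scalar functions (these merely mimic what the COLUMNS(…) aliases do to each name; the expected outputs are shown as comments):

SELECT
    -- step 1: both separators are replaced by underscores
    regexp_replace('Service:Train number', '(\w*)\W*(\w*)\W*(\w*)', '\1_\2_\3') AS renamed,   -- Service_Train_number
    -- step 1 on a name with only two word groups leaves a trailing underscore
    regexp_replace('Service:Date', '(\w*)\W*(\w*)\W*(\w*)', '\1_\2_\3') AS renamed_trailing,  -- Service_Date_
    -- step 2: capture everything before the trailing underscore(s)
    regexp_extract('Service_Date_', '(.*?)_*$', 1) AS cleaned;                                -- Service_Date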
To make writing queries even more convenient, we can rely on the case-insensitivity of identifiers to query the column names in lowercase:
SELECT DISTINCT service_company
FROM (
SELECT COLUMNS('(.*?)_*$') AS "\1"
FROM (
SELECT COLUMNS('(\w*)\W*(\w*)\W*(\w*)') AS "\1_\2_\3"
FROM 'services-2024-08.csv'
)
)
ORDER BY service_company;
Service_Company |
---|
Arriva |
Blauwnet |
Breng |
DB |
Eu Sleeper |
… |
The returned column name preserves its original casing even though we used lowercase letters in the query.
Loading with Globbing
Now that we can simplify the column names, let's ingest all 10 months of data into a table:
CREATE OR REPLACE TABLE services AS
SELECT COLUMNS('(.*?)_*$') AS "\1"
FROM (
SELECT COLUMNS('(\w*)\W*(\w*)\W*(\w*)') AS "\1_\2_\3"
FROM 'services-2024-*.csv'
);
In the inner FROM clause, we use the * glob syntax to match all monthly CSV files.
DuckDB automatically detects that all files have the same schema and unions them together.
We now have a table with all the data from January to October, amounting to almost 20 million rows.
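As a quick sanity check, we can count the rows in the freshly created table:

SELECT count(*) AS num_rows
FROM services; -- roughly 20 million rows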
Reordering Parquet Files
Suppose we want to analyze the average delay of the Intercity Direct trains operated by the Nederlandse Spoorwegen (NS), measured at the final destination of the train service.
While we can run this analysis directly on the .csv files, the lack of metadata (such as schema and min-max indexes) limits performance.
Let's measure this in the CLI client by turning on the timer:
.timer on
SELECT avg("Stop:Arrival delay")
FROM 'services-*.csv'
WHERE "Service:Company" = 'NS'
AND "Service:Type" = 'Intercity direct'
AND "Stop:Departure time" IS NULL;
This query takes about 1.8 seconds. Now, if we run the same query on the services table that's already loaded into DuckDB, the query is much faster:
SELECT avg(Stop_Arrival_delay)
FROM services
WHERE Service_Company = 'NS'
AND Service_Type = 'Intercity direct'
AND Stop_Departure_time IS NULL;
The run time is about 35 milliseconds.
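If you'd like a more detailed breakdown than the timer provides, you can also prepend EXPLAIN ANALYZE to either query to profile the individual operators (a sketch using the table-based variant; the profiling output differs between DuckDB versions):

-- profile per-operator timings for the aggregation query
EXPLAIN ANALYZE
SELECT avg(Stop_Arrival_delay)
FROM services
WHERE Service_Company = 'NS'
  AND Service_Type = 'Intercity direct'
  AND Stop_Departure_time IS NULL;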
If we would like to use an external binary file format, we can also export the database to a single Parquet file:
EXPORT DATABASE 'railway' (FORMAT PARQUET);
We can then directly query it as follows:
SELECT avg(Stop_Arrival_delay)
FROM 'railway/services.parquet'
WHERE Service_Company = 'NS'
AND Service_Type = 'Intercity direct'
AND Stop_Departure_time IS NULL;
The runtime for this format is about 90 milliseconds – somewhat slower than DuckDB's own file format but about 20× faster than reading the raw CSV files.
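As a side note, EXPORT DATABASE writes out every table in the database. If we only need the services table, a plain COPY … TO statement produces a single Parquet file just as well (the output path below is only an example):

-- export a single table instead of the whole database
COPY services TO 'services.parquet' (FORMAT PARQUET);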
If we have a priori knowledge of the fields a query filters on, we can reorder the Parquet file to improve query performance.
COPY
(FROM 'railway/services.parquet' ORDER BY Service_Company, Service_Type)
TO 'railway/services.parquet';
If we run the query again, it's noticeably faster, taking only 35 milliseconds. This is thanks to partial reading, which uses the zonemaps (min-max indexes) to limit the amount of data that has to be scanned. Reordering the file allows DuckDB to skip more data, leading to faster query times.
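To see how the reordering tightens the zonemaps, we can inspect the per-row-group statistics with DuckDB's parquet_metadata function (a sketch; the exact set of statistics columns may differ between DuckDB versions). After the ORDER BY, the min/max range of Service_Company within each row group becomes narrow, which is what lets DuckDB skip row groups:

-- per-row-group min/max statistics for the column we filter on
SELECT row_group_id, stats_min_value, stats_max_value
FROM parquet_metadata('railway/services.parquet')
WHERE path_in_schema = 'Service_Company';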
Hive Partitioning
To speed up queries even further, we can use Hive partitioning to create a directory layout on disk that matches the filtering used in the queries.
COPY services
TO 'services-parquet-hive'
(FORMAT PARQUET, PARTITION_BY (Service_Company, Service_Type));
Let's peek into the directory from DuckDB's CLI using the .sh
dot command:
.sh tree services-parquet-hive
services-parquet-hive
├── Service_Company=Arriva
│ ├── Service_Type=Extra%20trein
│ │ └── data_0.parquet
│ ├── Service_Type=Nachttrein
│ │ └── data_0.parquet
│ ├── Service_Type=Snelbus%20ipv%20trein
│ │ └── data_0.parquet
│ ├── Service_Type=Sneltrein
│ │ └── data_0.parquet
│ ├── Service_Type=Stopbus%20ipv%20trein
│ │ └── data_0.parquet
│ ├── Service_Type=Stoptrein
│ │ └── data_0.parquet
│ └── Service_Type=Taxibus%20ipv%20trein
│ └── data_0.parquet
├── Service_Company=Blauwnet
│ ├── Service_Type=Intercity
│ │ └── data_0.parquet
…
We can now run the query on the Hive partitioned data set by passing the hive_partitioning = true
flag:
SELECT avg(Stop_Arrival_delay)
FROM read_parquet(
'services-parquet-hive/**/*.parquet',
hive_partitioning = true
)
WHERE Service_Company = 'NS'
AND Service_Type = 'Intercity direct'
AND Stop_Departure_time IS NULL;
This query now takes about 20 milliseconds as DuckDB can use the directory structure to limit the reads even further. And the neat thing about Hive partitioning is that it even works with CSV files!
COPY services
TO 'services-csv-hive'
(FORMAT CSV, PARTITION_BY (Service_Company, Service_Type));
SELECT avg(Stop_Arrival_delay)
FROM read_csv('services-csv-hive/**/*.csv', hive_partitioning = true)
WHERE Service_Company = 'NS'
AND Service_Type = 'Intercity direct'
AND Stop_Departure_time IS NULL;
While the CSV files lack any sort of metadata, DuckDB can rely on the directory structure to limit the scans to the relevant directories, resulting in execution times of around 150 milliseconds, more than 10× faster than reading all the CSV files without partitioning.
If all these formats and results got your head spinning, no worries. We've got you covered with this summary table:
Format | Query runtime (ms) |
---|---|
DuckDB file format | 35 |
CSV (vanilla) | 1800 |
CSV (Hive-partitioned) | 150 |
Parquet (vanilla) | 90 |
Parquet (reordered) | 35 |
Parquet (Hive-partitioned) | 20 |
Oh, and we forgot to report the result. The average delay of Intercity Direct trains is 3 minutes!
Closing Thoughts
That's it for part three of DuckDB Tricks. If you have a trick that you would like to share, please let the DuckDB team know on our social media sites, or submit it to the DuckDB Snippets site (maintained by our friends at MotherDuck).