DuckDB Tricks – Part 3

Andra Ionescu and Gabor Szarnyas
Published on 2024-11-29

TL;DR: In this new installment of the DuckDB Tricks series, we present features for convenient handling of tables and performance optimization tips for Parquet and CSV files.

Overview

We continue our DuckDB Tricks series with a third part, where we showcase friendly SQL features and performance optimizations.

Operation | SQL instructions
Excluding columns from a table | EXCLUDE or COLUMNS(…) and NOT SIMILAR TO
Renaming columns with pattern matching | COLUMNS(…) AS …
Loading with globbing | FROM '*.csv'
Reordering Parquet files | COPY (FROM … ORDER BY …) TO …
Hive partitioning | hive_partitioning = true

Dataset

We'll use a subset of the Dutch railway services dataset, which was already featured in a blog post earlier this year. This time, we'll use the CSV files between January and October 2024: services-2024-01-to-10.zip. If you would like to follow the examples, download and decompress the data set before proceeding.
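
If you want to double-check which files you ended up with, DuckDB's glob table function lists everything that matches a pattern (an optional sanity check, not needed for the rest of the post):

SELECT *
FROM glob('services-2024-*.csv');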

Excluding Columns from a Table

First, let's look at the data in the CSV files. We pick the CSV file for August and inspect it with the DESCRIBE statement.

DESCRIBE FROM 'services-2024-08.csv';

The result is a table with the column names and the column types.

column_name | column_type | null | key | default | extra
Service:RDT-ID | BIGINT | YES | NULL | NULL | NULL
Service:Date | DATE | YES | NULL | NULL | NULL
Service:Type | VARCHAR | YES | NULL | NULL | NULL
Service:Company | VARCHAR | YES | NULL | NULL | NULL
Service:Train number | BIGINT | YES | NULL | NULL | NULL

Now, let's use SUMMARIZE to inspect some statistics about the columns.

SUMMARIZE FROM 'services-2024-08.csv';

With SUMMARIZE, we get 10 statistics about our data (min, max, approx_unique, etc.). If we want to remove a few of them from the result, we can use the EXCLUDE modifier. For example, to exclude min, max and the quantiles q25, q50 and q75, we can issue the following command:

SELECT * EXCLUDE(min, max, q25, q50, q75) 
FROM (SUMMARIZE FROM 'services-2024-08.csv');

Alternatively, we can use the COLUMNS expression with a lambda and the NOT SIMILAR TO operator, which matches each column name against a regular expression:

SELECT COLUMNS(c -> c NOT SIMILAR TO 'min|max|q.*') 
FROM (SUMMARIZE FROM 'services-2024-08.csv');

In both cases, the resulting table keeps the column_name and column_type columns along with the 5 remaining statistics:

column_name | column_type | approx_unique | avg | std | count | null_percentage
Service:RDT-ID | BIGINT | 259022 | 14200071.03736433 | 59022.836209662266 | 1846574 | 0.00
Service:Date | DATE | 32 | NULL | NULL | 1846574 | 0.00
Service:Type | VARCHAR | 20 | NULL | NULL | 1846574 | 0.00
Service:Company | VARCHAR | 12 | NULL | NULL | 1846574 | 0.00
Service:Train number | BIGINT | 17264 | 57781.81688196628 | 186353.76365744913 | 1846574 | 0.00

Renaming Columns with Pattern Matching

Upon inspecting the columns, we see that their names contain spaces and colons (:). These special characters make writing queries a bit tedious, as they force us to quote the column names with double quotes. For example, we have to write "Service:Company" in the following query:

SELECT DISTINCT "Service:Company" AS company,
FROM 'services-2024-08.csv'
ORDER BY company;

Let's see how we can rename the columns using the COLUMNS expression. To replace the special characters (each column name contains at most two of them), we can write the following query:

SELECT COLUMNS('(.*?)_*$') AS "\1"
FROM (
    SELECT COLUMNS('(\w*)\W*(\w*)\W*(\w*)') AS "\1_\2_\3"
    FROM 'services-2024-08.csv'
);

If we prepend DESCRIBE to the query, we can see the renamed columns:

column_name | column_type | null | key | default | extra
Service_RDT_ID | BIGINT | YES | NULL | NULL | NULL
Service_Date | DATE | YES | NULL | NULL | NULL
Service_Type | VARCHAR | YES | NULL | NULL | NULL
Service_Company | VARCHAR | YES | NULL | NULL | NULL
Service_Train_number | BIGINT | YES | NULL | NULL | NULL

Let's break down the query starting with the first COLUMNS expression:

SELECT COLUMNS('(\w*)\W*(\w*)\W*(\w*)') AS "\1_\2_\3"

Here, we use a regular expression in which each (\w*) group captures 0…n word characters ([0-9A-Za-z_]), while each \W* captures 0…n non-word characters ([^0-9A-Za-z_]). In the alias, \i refers to capture group i, so "\1_\2_\3" keeps only the word characters and joins the groups with underscores (_). However, some column names consist of only two word groups (e.g., Service:Date) while others have three (e.g., Service:Train number), so the third capture group can be empty and this SELECT statement produces column names with a trailing underscore (_), e.g., Service_Date_. Thus, we need an additional processing step:

SELECT COLUMNS('(.*?)_*$') AS "\1"

Here, we capture the group of characters without the trailing underscore(s) and rename the columns to \1, which removes the trailing underscores.
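
To see why this second step is needed, we can inspect the column names produced by the first step alone, again using DESCRIBE (an optional intermediate check):

DESCRIBE
    SELECT COLUMNS('(\w*)\W*(\w*)\W*(\w*)') AS "\1_\2_\3"
    FROM 'services-2024-08.csv';

Column names with only two word groups, such as Service:Date, show up as Service_Date_ here; the outer rename then strips the trailing underscore.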

To make writing queries even more convenient, we can rely on the case-insensitivity of identifiers to query the column names in lowercase:

SELECT DISTINCT service_company
FROM (
    SELECT COLUMNS('(.*?)_*$') AS "\1"
    FROM (
       SELECT COLUMNS('(\w*)\W*(\w*)\W*(\w*)') AS "\1_\2_\3"
       FROM 'services-2024-08.csv'
    )
)
ORDER BY service_company;

Service_Company
Arriva
Blauwnet
Breng
DB
Eu Sleeper
…

The returned column name preserves its original casing even though we used lowercase letters in the query.

Loading with Globbing

Now that we can simplify the column names, let's ingest all 10 months of data into a table:

CREATE OR REPLACE TABLE services AS
    SELECT COLUMNS('(.*?)_*$') AS "\1" 
    FROM (
        SELECT COLUMNS('(\w*)\W*(\w*)\W*(\w*)') AS "\1_\2_\3" 
        FROM 'services-2024-*.csv'
    );

In the inner FROM clause, we use the * glob syntax to match all files. DuckDB automatically detects that all files have the same schema and unions them together. We now have a table with all the data from January to October, amounting to almost 20 million rows.
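
As a quick sanity check that the glob picked up every month, we can count the rows in the new table; the result should be close to the 20 million rows mentioned above:

SELECT count(*) AS num_rows
FROM services;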

Reordering Parquet Files

Suppose we want to analyze the average delay of the Intercity Direct trains operated by the Nederlandse Spoorwegen (NS), measured at the final destination of the train service. While we can run this analysis directly on the .csv files, the lack of metadata (such as schema and min-max indexes) will limit the performance. Let's measure this in the CLI client by turning on the timer:

.timer on
SELECT avg("Stop:Arrival delay")
FROM 'services-*.csv'
WHERE "Service:Company" = 'NS'
  AND "Service:Type" = 'Intercity direct'
  AND "Stop:Departure time" IS NULL;

This query takes about 1.8 seconds. Now, if we run the same query on the services table that's already loaded into DuckDB, the query is much faster:

SELECT avg(Stop_Arrival_delay)
FROM services
WHERE Service_Company = 'NS'
  AND Service_Type = 'Intercity direct'
  AND Stop_Departure_time IS NULL;

The run time is about 35 milliseconds.

If we would like to use an external binary file format, we can also export the database to a single Parquet file:

EXPORT DATABASE 'railway' (FORMAT PARQUET);
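
Since the database contains only the services table, exporting that one table with COPY would produce essentially the same Parquet file (a sketch of the alternative; the examples below keep using the railway/services.parquet file created by EXPORT DATABASE):

-- alternative: write the single table straight to a Parquet file
COPY services
TO 'services.parquet'
(FORMAT PARQUET);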

We can then directly query it as follows:

SELECT avg(Stop_Arrival_delay)
FROM 'railway/services.parquet'
WHERE Service_Company = 'NS'
  AND Service_Type = 'Intercity direct'
  AND Stop_Departure_time IS NULL;

The runtime for this format is about 90 milliseconds – somewhat slower than DuckDB's own file format but about 20× faster than reading the raw CSV files.

If we have a priori knowledge of the fields a query filters on, we can reorder the Parquet file to improve query performance.

COPY
(FROM 'railway/services.parquet' ORDER BY Service_Company, Service_Type)
TO 'railway/services.parquet';

If we run the query again, it's noticeably faster, taking only 35 milliseconds. This is thanks to partial reading, which uses the zonemaps (min-max indexes) to limit the amount of data that has to be scanned. Reordering the file allows DuckDB to skip more data, leading to faster query times.
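
We can inspect these zonemaps ourselves via the per-row-group statistics exposed by the parquet_metadata function (an optional check; we only project a few of the many columns it returns):

SELECT row_group_id, path_in_schema, stats_min_value, stats_max_value
FROM parquet_metadata('railway/services.parquet')
WHERE path_in_schema = 'Service_Company';

After reordering, each row group covers a narrow range of company names, so row groups that cannot contain 'NS' are skipped.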

Hive Partitioning

To speed up queries even further, we can use Hive partitioning to create a directory layout on disk that matches the filtering used in the queries.

COPY services
TO 'services-parquet-hive'
(FORMAT PARQUET, PARTITION_BY (Service_Company, Service_Type));

Let's peek into the directory from DuckDB's CLI using the .sh dot command:

.sh tree services-parquet-hive
services-parquet-hive
├── Service_Company=Arriva
│   ├── Service_Type=Extra%20trein
│   │   └── data_0.parquet
│   ├── Service_Type=Nachttrein
│   │   └── data_0.parquet
│   ├── Service_Type=Snelbus%20ipv%20trein
│   │   └── data_0.parquet
│   ├── Service_Type=Sneltrein
│   │   └── data_0.parquet
│   ├── Service_Type=Stopbus%20ipv%20trein
│   │   └── data_0.parquet
│   ├── Service_Type=Stoptrein
│   │   └── data_0.parquet
│   └── Service_Type=Taxibus%20ipv%20trein
│       └── data_0.parquet
├── Service_Company=Blauwnet
│   ├── Service_Type=Intercity
│   │   └── data_0.parquet
…

We can now run the query on the Hive partitioned data set by passing the hive_partitioning = true flag:

SELECT avg(Stop_Arrival_delay)
FROM read_parquet(
         'services-parquet-hive/**/*.parquet',
         hive_partitioning = true
     )
WHERE Service_Company = 'NS'
  AND Service_Type = 'Intercity direct'
  AND Stop_Departure_time IS NULL;

This query now takes about 20 milliseconds as DuckDB can use the directory structure to limit the reads even further. And the neat thing about Hive partitioning is that it even works with CSV files!

COPY services
TO 'services-csv-hive'
(FORMAT CSV, PARTITION_BY (Service_Company, Service_Type));

SELECT avg(Stop_Arrival_delay)
FROM read_csv('services-csv-hive/**/*.csv', hive_partitioning = true)
WHERE Service_Company = 'NS'
  AND Service_Type = 'Intercity direct'
  AND Stop_Departure_time IS NULL;

While the CSV files lack any sort of metadata, DuckDB can rely on the directory structure to limit the scans to the relevant directories, resulting in execution times around 150 milliseconds, more than 10× faster compared to reading all CSV files.
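
To double-check which filters DuckDB pushes into such a scan, we can prefix the query with EXPLAIN (a sketch; the exact plan output differs between DuckDB versions):

EXPLAIN
SELECT avg(Stop_Arrival_delay)
FROM read_csv('services-csv-hive/**/*.csv', hive_partitioning = true)
WHERE Service_Company = 'NS'
  AND Service_Type = 'Intercity direct'
  AND Stop_Departure_time IS NULL;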

If all these formats and results made your head spin, no worries: we've got you covered with this summary table:

Format | Query runtime (ms)
DuckDB file format | 35
CSV (vanilla) | 1800
CSV (Hive-partitioned) | 150
Parquet (vanilla) | 90
Parquet (reordered) | 35
Parquet (Hive-partitioned) | 20

Oh, and we almost forgot to report the result: the average delay of Intercity Direct trains is 3 minutes!

Closing Thoughts

That's it for part three of DuckDB Tricks. If you have a trick you would like to share, please send it to the DuckDB team on our social media sites, or submit it to the DuckDB Snippets site (maintained by our friends at MotherDuck).