Analyzing Railway Traffic in the Netherlands

Gabor Szarnyas

2024-05-31 · 11 min

TL;DR: We use a real-world railway dataset to demonstrate some of DuckDB's key features, including querying different file formats, connecting to remote endpoints, and using advanced SQL features.

Introduction

The Netherlands, the birthplace of DuckDB, has an area of about 42,000 km² with a population of about 18 million people. The high density of the country is a key factor in its extensive railway network, which consists of 3,223 km of tracks and 397 stations.

Information about this network's stations and services is available in the form of open datasets. These high-quality datasets are maintained by the team behind the Rijden de Treinen (Are the trains running?) application.

In this post, we'll demonstrate some of DuckDB's analytical capabilities on the Dutch railway network dataset. Unlike most of our other blog posts, this one doesn't introduce a new feature or release: instead, it demonstrates several existing features using a single domain. Some of the queries explained in this blog post are shown in simplified form on DuckDB's landing page.

Loading the Data

For our initial queries, we'll use the 2023 railway services dataset. To get this dataset, download the services-2023.csv.gz file (330 MB) and load it into DuckDB.

First, start the DuckDB command line client on a persistent database:

duckdb railway.db

Then, load the services-2023.csv.gz file into the services table.

CREATE TABLE services AS
    FROM 'services-2023.csv.gz';

Despite the seemingly simple query, there is quite a lot going on here. Let's deconstruct the query:

First, there is no need to explicitly define a schema for our services table, nor is it necessary to use a COPY ... FROM statement. DuckDB automatically detects that the 'services-2023.csv.gz' refers to a gzip-compressed CSV file, so it calls the read_csv function, which decompresses the file and infers its schema from its content using the CSV sniffer.
Second, the query makes use of DuckDB's FROM-first syntax, which allows users to omit the SELECT * clause. Hence, the SQL statement FROM 'services-2023.csv.gz'; is a shorthand for SELECT * FROM 'services-2023.csv.gz';.
Third, the query creates a table called services and populates it with the result from the CSV reader. This is achieved using a CREATE TABLE ... AS statement.
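The three points above can be made explicit by writing the statement out in full. The following is a minimal sketch of the longhand form, assuming the services-2023.csv.gz file is in the current working directory. CREATE OR REPLACE is used here only so that the statement can be re-run without an error if the table already exists; it is not part of the original query.

```sql
-- Longhand form of the shorthand load:
-- explicit SELECT * and an explicit call to the CSV reader.
CREATE OR REPLACE TABLE services AS
    SELECT *
    FROM read_csv('services-2023.csv.gz');

-- Inspect the schema that the CSV sniffer inferred.
DESCRIBE services;
```

Both forms should produce the same table. The shorthand simply lets DuckDB pick the reader based on the file name.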

Using DuckDB v0.10.3, loading the dataset takes approximately 5 seconds on an M2 MacBook Pro. To check the amount of data loaded, we can run the following query, which pretty-prints the number of rows in the services table:

SELECT format('{:,}', count(*)) AS num_services
FROM services;

num_services
21,239,393

We can see that more than 21 million train services ran in the Netherlands in 2023.

Finding the Busiest Station per Month

Let's start with a simple question: What were the busiest railway stations in the Netherlands in the first 6 months of 2023?

First, for every month, let's compute the number of services passing through each station. To do so, we extract the month from the service's date using the month function, then perform a group-by aggregation with a count(*):

SELECT
    month("Service:Date") AS month,
    "Stop:Station name" AS station,
    count(*) AS num_services
FROM services
GROUP BY month, station
LIMIT 5;

Note that this query showcases a common redundancy in SQL: we list the names of non-aggregated columns in both the SELECT and the GROUP BY clauses. Using DuckDB's GROUP BY ALL feature, we can eliminate this. At the same time, let's also turn this result into an intermediate table called services_per_month using a CREATE TABLE ... AS statement:

CREATE TABLE services_per_month AS
    SELECT
        month("Service:Date") AS month,
        "Stop:Station name" AS station,
        count(*) AS num_services
    FROM services
    GROUP BY ALL;

To answer the question, we can use the arg_max(arg, val) aggregate function, which returns the value of arg from the row with the maximum value of val. We filter on the month and return the results:

SELECT
    month,
    arg_max(station, num_services) AS station,
    max(num_services) AS num_services
FROM services_per_month
WHERE month <= 6
GROUP BY ALL
ORDER BY month;

month	station	num_services
1	Utrecht Centraal	34760
2	Utrecht Centraal	32300
3	Utrecht Centraal	37386
4	Amsterdam Centraal	33426
5	Utrecht Centraal	35383
6	Utrecht Centraal	35632

Maybe surprisingly, in most months, the busiest railway station is not in Amsterdam but in the country's 4th largest city, Utrecht, thanks to its central geographic location.
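To see how arg_max behaves in isolation, here is a small self-contained sketch on inline data. The station names A and B and their counts are made up for illustration and are not part of the railway dataset.

```sql
-- For each month, return the station with the highest num_services.
SELECT
    month,
    arg_max(station, num_services) AS station,
    max(num_services) AS num_services
FROM (VALUES
    (1, 'A', 10),
    (1, 'B', 30),
    (2, 'A', 25),
    (2, 'B', 5)
) t(month, station, num_services)
GROUP BY month
ORDER BY month;
-- month 1 -> station B with 30 services
-- month 2 -> station A with 25 services
```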

Finding the Top-3 Busiest Stations for Each Summer Month

Let's change the question to: Which are the top-3 busiest stations for each summer month? The arg_max(arg, val) function only finds the top-1 value, so it is not sufficient for finding top-k results.

Using a Window Function (`OVER`)

DuckDB has extensive support for SQL features, including window functions, so we can use the rank() function to find the top-k values. Additionally, we use make_date to reconstruct the date, strftime to turn it into the month's name, and array_agg to collect the station names into a list:

SELECT month, month_name, array_agg(station) AS top3_stations
FROM (
    SELECT
        month,
        strftime(make_date(2023, month, 1), '%B') AS month_name,
        rank() OVER
            (PARTITION BY month ORDER BY num_services DESC) AS rank,
        station,
        num_services
    FROM services_per_month
    WHERE month BETWEEN 6 AND 8
)
WHERE rank <= 3
GROUP BY ALL
ORDER BY month;

This gives the following result:

month	month_name	top3_stations
6	June	[Utrecht Centraal, Amsterdam Centraal, Schiphol Airport]
7	July	[Utrecht Centraal, Amsterdam Centraal, Schiphol Airport]
8	August	[Utrecht Centraal, Amsterdam Centraal, Amsterdam Sloterdijk]

We can see that the top 3 spots are shared between four stations: Utrecht Centraal, Amsterdam Centraal, Schiphol Airport, and Amsterdam Sloterdijk.
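As a variation on the window function approach, the filter on the rank can also be expressed with a QUALIFY clause, which removes the need for a subquery. The sketch below assumes the services_per_month table created earlier and returns one row per station instead of a list per month.

```sql
-- QUALIFY filters on the result of a window function,
-- similarly to how HAVING filters on the result of an aggregate.
SELECT
    month,
    station,
    num_services
FROM services_per_month
WHERE month BETWEEN 6 AND 8
QUALIFY rank() OVER (PARTITION BY month ORDER BY num_services DESC) <= 3
ORDER BY month, num_services DESC;
```

Note that rank() assigns the same rank to ties, so a month could return more than three rows if several stations had the same number of services.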

Using the `max_by(arg, val, n)` Function

Starting with DuckDB version 1.1.0, you can use a variant of the max_by function that accepts a third parameter, n, for the number of rows. The resulting code is more concise and faster than the one using a window function.

SELECT
    month,
    strftime(make_date(2023, month, 1), '%B') AS month_name,
    max_by(station, num_services, 3) AS stations
FROM services_per_month
WHERE month BETWEEN 6 AND 8
GROUP BY ALL
ORDER BY month;
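The three-argument form can be tried on inline data as well. This is a minimal sketch that requires DuckDB 1.1.0 or later; the station names and counts are made up for illustration.

```sql
-- Return a list with the two stations that have the highest num_services.
SELECT max_by(station, num_services, 2) AS top2_stations
FROM (VALUES
    ('A', 10),
    ('B', 30),
    ('C', 20)
) t(station, num_services);
```

The result is a single row with a list value containing B and C, the two stations with the highest counts.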

Directly Querying Parquet Files through HTTPS or S3

DuckDB supports querying remote files, including CSV and Parquet, via the HTTP(S) protocol and the S3 API. For example, we can run the following query:

SELECT "Service:Date", "Stop:Station name"
FROM 'https://blobs.duckdb.org/nl-railway/services-2023.parquet'
LIMIT 3;

It returns the following result:

Service:Date	Stop:Station name
2023-01-01	Rotterdam Centraal
2023-01-01	Delft
2023-01-01	Den Haag HS
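Before writing queries against a remote file, it can help to look at its schema. The following sketch uses DESCRIBE on the same URL. For Parquet files, the schema is stored in the file's metadata, so this should not require downloading the data itself.

```sql
-- List the column names and types of the remote Parquet file.
DESCRIBE
SELECT *
FROM 'https://blobs.duckdb.org/nl-railway/services-2023.parquet';
```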

The query answering Which are the top-3 busiest stations for each summer month? can also be run directly on the remote Parquet file without creating any local tables. To do this, we can define services_per_month as a common table expression in the WITH clause. The rest of the query remains the same:

WITH services_per_month AS (
    SELECT
        month("Service:Date") AS month,
        "Stop:Station name" AS station,
        count(*) AS num_services
    FROM 'https://blobs.duckdb.org/nl-railway/services-2023.parquet'
    GROUP BY ALL
)
SELECT month, month_name, array_agg(station) AS top3_stations
FROM (
    SELECT
        month,
        strftime(make_date(2023, month, 1), '%B') AS month_name,
        rank() OVER
            (PARTITION BY month ORDER BY num_services DESC) AS rank,
        station,
        num_services
    FROM services_per_month
    WHERE month BETWEEN 6 AND 8
)
WHERE rank <= 3
GROUP BY ALL
ORDER BY month;

This query yields the same result as the query above, and completes (depending on the network speed) in about 1–2 seconds. This speed is possible because DuckDB doesn't need to download the whole Parquet file to evaluate the query: while the file size is 309 MB, it only uses about 20 MB of network traffic, approximately 6% of the total file size.

The reduction in network traffic is possible because of partial reading along both the columns and the rows of the data. First, Parquet's columnar layout allows the reader to only access the required columns. Second, the zonemaps available in the Parquet file's metadata allow the filter pushdown optimization (e.g., the reader only fetches row groups with dates in the summer months). Both of these optimizations are implemented via HTTP range requests, saving considerable traffic and time when running queries on remote Parquet files.
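The per-row-group statistics mentioned above can be inspected with the parquet_metadata table function. The sketch below lists the minimum and maximum values stored for the date column in the first few row groups. It assumes that the column appears in the metadata under the path Service:Date.

```sql
-- Show the min/max statistics (zonemaps) of the date column per row group.
SELECT
    row_group_id,
    stats_min_value AS min_date,
    stats_max_value AS max_date
FROM parquet_metadata('https://blobs.duckdb.org/nl-railway/services-2023.parquet')
WHERE path_in_schema = 'Service:Date'
ORDER BY row_group_id
LIMIT 5;
```

If the file is sorted or clustered by date, each row group covers a narrow date range, which is what allows the reader to skip row groups that cannot match a date filter.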

Largest Distance between Train Stations in the Netherlands

Let's answer the following question: Which two train stations in the Netherlands have the largest distance between them when traveling via rail? For this, we'll use two datasets. The first, stations-2022-01.csv, contains information on the railway stations (station name, country, etc.). We can simply load and query this dataset as follows:

CREATE TABLE stations AS
    FROM 'https://blobs.duckdb.org/data/stations-2022-01.csv';

SELECT
    id,
    name_short,
    name_long,
    country,
    printf('%.2f', geo_lat) AS latitude,
    printf('%.2f', geo_lng) AS longitude
FROM stations
LIMIT 5;

id	name_short	name_long	country	latitude	longitude
266	Den Bosch	's-Hertogenbosch	NL	51.69	5.29
269	Dn Bosch O	's-Hertogenbosch Oost	NL	51.70	5.32
227	't Harde	't Harde	NL	52.41	5.89
8	Aachen	Aachen Hbf	D	50.77	6.09
818	Aachen W	Aachen West	D	50.78	6.07

The second dataset, tariff-distances-2022-01.csv, contains the station distances. The distances are defined as the shortest route on the railway network and they are used to calculate the tariffs for tickets. Let's peek into this file:

head -n 9 tariff-distances-2022-01.csv | cut -d, -f1-9

Station,AC,AH,AHP,AHPR,AHZ,AKL,AKM,ALM
AC,XXX,82,83,85,90,71,188,32
AH,82,XXX,1,3,8,77,153,98
AHP,83,1,XXX,2,9,78,152,99
AHPR,85,3,2,XXX,11,80,150,101
AHZ,90,8,9,11,XXX,69,161,106
AKL,71,77,78,80,69,XXX,211,96
AKM,188,153,152,150,161,211,XXX,158
ALM,32,98,99,101,106,96,158,XXX

We can see that the distances are encoded as a matrix with the diagonal entries set to XXX. As explained in the dataset's description, this string indicates that the two stations are the same station. If we load these XXX values as they are, the CSV reader will infer the type VARCHAR for every column instead of a numeric type. While this can be cleaned up later, it's a lot easier to avoid this problem altogether. To do so, we use the read_csv function and set the nullstr parameter to XXX:

CREATE TABLE distances AS
    FROM read_csv(
        'https://blobs.duckdb.org/data/tariff-distances-2022-01.csv',
        nullstr = 'XXX'
    );
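For comparison, here is a sketch of what cleaning up the values later would involve if the columns had been loaded as VARCHAR. Each value would need a conversion such as try_cast, which returns NULL instead of raising an error when the cast fails.

```sql
-- try_cast turns numeric strings into numbers and the XXX marker into NULL.
SELECT
    try_cast('82' AS BIGINT) AS regular_value,
    try_cast('XXX' AS BIGINT) AS diagonal_value;
-- regular_value = 82, diagonal_value = NULL
```

Applying this to almost 400 columns is possible but cumbersome, which is why setting nullstr at load time is the simpler option.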

To make the NULL values visible in the command line output, we set the .nullvalue dot command to NULL:

.nullvalue NULL

Then, using the DESCRIBE statement, we can confirm that DuckDB has correctly inferred the distance columns as BIGINT:

FROM (DESCRIBE distances)
LIMIT 5;

column_name	column_type	null	key	default	extra
Station	VARCHAR	YES	NULL	NULL	NULL
AC	BIGINT	YES	NULL	NULL	NULL
AH	BIGINT	YES	NULL	NULL	NULL
AHP	BIGINT	YES	NULL	NULL	NULL
AHPR	BIGINT	YES	NULL	NULL	NULL

To show the first 9 columns, we can run the following query with the #1, #2, etc. column indexes in the SELECT statement:

SELECT #1, #2, #3, #4, #5, #6, #7, #8, #9
FROM distances
LIMIT 8;

Station	AC	AH	AHP	AHPR	AHZ	AKL	AKM	ALM
AC	NULL	82	83	85	90	71	188	32
AH	82	NULL	1	3	8	77	153	98
AHP	83	1	NULL	2	9	78	152	99
AHPR	85	3	2	NULL	11	80	150	101
AHZ	90	8	9	11	NULL	69	161	106
AKL	71	77	78	80	69	NULL	211	96
AKM	188	153	152	150	161	211	NULL	158
ALM	32	98	99	101	106	96	158	NULL

We can see that the data was loaded correctly but the wide table format is a bit unwieldy for further processing: to query for pairs of stations, we need to first turn it into a long table using the UNPIVOT statement. Naïvely, we would write something like the following:

CREATE TABLE distances_long AS
    UNPIVOT distances
    ON AC, AH, AHP, ...

However, we have almost 400 stations, so spelling out their names would be quite tedious. Fortunately, DuckDB has a trick to help with this: the COLUMNS(*) expression lists all columns and its optional EXCLUDE clause can remove given column names from the list. Therefore, the expression COLUMNS(* EXCLUDE station) lists all column names except station, precisely what we need for the UNPIVOT command:

CREATE TABLE distances_long AS
    UNPIVOT distances
    ON COLUMNS (* EXCLUDE station)
    INTO NAME other_station VALUE distance;

This results in the following table:

SELECT station, other_station, distance
FROM distances_long
LIMIT 3;

Station	other_station	distance
AC	AH	82
AC	AHP	83
AC	AHPR	85
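The same UNPIVOT pattern can be tried on a small stand-in table. The sketch below uses a hypothetical two-station matrix called toy_distances, which is not part of the dataset, to show the shape of the output.

```sql
-- A 2x2 distance matrix with NULLs on the diagonal.
CREATE OR REPLACE TEMP TABLE toy_distances AS
    SELECT *
    FROM (VALUES
        ('AC', CAST(NULL AS BIGINT), CAST(82 AS BIGINT)),
        ('AH', CAST(82 AS BIGINT), CAST(NULL AS BIGINT))
    ) t(station, AC, AH);

-- Turn every column except station into (other_station, distance) rows.
UNPIVOT toy_distances
ON COLUMNS(* EXCLUDE station)
INTO NAME other_station VALUE distance;
```

By default, UNPIVOT omits rows where the value is NULL, so the diagonal entries do not appear in the result. This is also why the first row of distances_long pairs AC with AH and not with AC itself.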

Now we can join the distances_long table with the stations table along both the start and end stations, then filter for stations which are located in the Netherlands. We introduce symmetry breaking (station < other_station) to ensure that the same pair of stations only occurs once in the output. Finally, we select the top-3 results:

SELECT
    s1.name_long AS station1,
    s2.name_long AS station2,
    distances_long.distance
FROM distances_long
JOIN stations s1 ON distances_long.station = s1.code
JOIN stations s2 ON distances_long.other_station = s2.code
WHERE s1.country = 'NL'
  AND s2.country = 'NL'
  AND station < other_station
ORDER BY distance DESC
LIMIT 3;

The results show that there are pairs of train stations that are at least 425 km apart, quite the distance for such a small country!

station1	station2	distance
Eemshaven	Vlissingen	426
Eemshaven	Vlissingen Souburg	425
Bad Nieuweschans	Vlissingen	425

Conclusion

In this post, we demonstrated some of DuckDB's key features, including automatic detection of formats based on filenames, automatic inference of the schema of CSV files, direct Parquet querying, remote querying, window functions, unpivot, and several friendly SQL features (such as FROM-first, GROUP BY ALL, and COLUMNS(*)). The combination of these allows for formulating queries using different file formats (CSV, Parquet), data sources (local, HTTPS, S3), and SQL features. This helps users answer queries quickly and efficiently.

In the next installment, we'll take a look at temporal data using AsOf joins and geospatial data using the DuckDB spatial extension.