Command Line Data Processing: Using DuckDB as a Unix Tool
TL;DR: DuckDB's CLI client is portable to many platforms and architectures. It handles CSV files conveniently and offers users the same rich SQL syntax everywhere. These characteristics make DuckDB an ideal tool to complement traditional Unix tools for data processing in the command line.
In this blog post, we dive into the terminal to compare DuckDB with traditional tools used in Unix shells (Bash, Zsh, etc.). We solve several problems requiring operations such as projection and filtering to demonstrate the differences between using SQL queries in DuckDB versus specialized command line tools. In the process, we will show off some cool features such as DuckDB's powerful CSV reader and the positional join operator. Let's get started!
The Unix Philosophy
To set the stage, let's recall the Unix philosophy. This states that programs should:
- do one thing and do it well,
- work together, and
- handle text streams.
Unix-like systems such as macOS, Linux, and WSL on Windows have embraced this philosophy.
Tools such as grep, sed, and sort are ubiquitous and widely used in shell scripts.
As a purpose-built data processing tool, DuckDB fits the Unix philosophy quite well. First, it was designed to be a fast in-process analytical SQL database system (do one thing and do it well). Second, it has a standalone command line client, which can consume and produce CSV files (work together), and also supports reading and writing text streams (handle text streams). Thanks to these, DuckDB works well in the ecosystem of Unix CLI tools, as shown in several posts.
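For example, on a Unix-like system, DuckDB can sit in the middle of a pipeline by reading from /dev/stdin and writing to /dev/stdout. A minimal sketch (using the pop.csv file introduced below, and keeping only the cities whose name contains dam):
cat pop.csv \
    | duckdb -c "COPY (SELECT city FROM read_csv('/dev/stdin')) TO '/dev/stdout' (HEADER false)" \
    | grep dam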
Portability and Usability
While Unix CLI tools are fast, robust, and available on all major platforms, they often have cumbersome syntax that's difficult to remember.
To make matters worse, these tools often come with slight differences between systems – think of the differences between GNU sed and macOS's sed, or the differences in regex syntax among programs, aptly captured by Donald Knuth's quip: “I define Unix as 30 definitions of regular expressions living under one roof.”
While there are shells specialized for dataframe processing, such as the Nushell project, older Unix shells (e.g., the Bourne shell sh and Bash) are still the most widespread, especially on servers.
At the same time, we have DuckDB, an extremely portable database system which uses the same SQL syntax on all platforms. With version 1.0.0 released recently, DuckDB's syntax – based on the proven and widely used PostgreSQL dialect – is now in a stable state. Another attractive feature of DuckDB is that it offers an interactive shell, which aids quick debugging. Moreover, DuckDB is available in several host languages as well as in the browser via WebAssembly, so if you ever decide to use your SQL scripts outside of the shell, they can be ported to a wide variety of environments without any changes.
Data Processing with Unix Tools and DuckDB
In the following, we give examples for implementing simple data processing tasks using the CLI tools provided in most Unix shells and using DuckDB SQL queries. We use DuckDB v1.0.0 and run it in in-memory mode. This mode makes sense for the problems we are tackling, as we do not create any tables and the operations are not memory-intensive, so there is no data to persist or to spill to disk.
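Should persistence ever be needed, the DuckDB CLI takes a database file as its first argument. The following sketch (the file name cities.duckdb is just an example, and pop.csv is introduced in the next section) would keep the loaded table around between invocations:
duckdb cities.duckdb -c "CREATE TABLE pop AS SELECT * FROM 'pop.csv';"
duckdb cities.duckdb -c "SELECT count(*) FROM pop;"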
Datasets
We use four input files capturing information on cities and airports in the Netherlands.
pop.csv, the population of each of the top-10 most populous cities:
city,province,population
Amsterdam,North Holland,905234
Rotterdam,South Holland,656050
The Hague,South Holland,552995
Utrecht,Utrecht,361924
Eindhoven,North Brabant,238478
Groningen,Groningen,234649
Tilburg,North Brabant,224702
Almere,Flevoland,218096
Breda,North Brabant,184716
Nijmegen,Gelderland,179073
area.csv, the area of each of the top-10 most populous cities:
city,area
Amsterdam,219.32
Rotterdam,324.14
The Hague,98.13
Utrecht,99.21
Eindhoven,88.92
Groningen,197.96
Tilburg,118.13
Almere,248.77
Breda,128.68
Nijmegen,57.63
cities-airports.csv, the IATA codes of civilian airports serving given cities:
city,IATA
Amsterdam,AMS
Haarlemmermeer,AMS
Eindhoven,EIN
Groningen,GRQ
Eelde,GRQ
Maastricht,MST
Beek,MST
Rotterdam,RTM
The Hague,RTM
airport-names.csv, the airport names belonging to given IATA codes:
IATA,airport name
AMS,Amsterdam Airport Schiphol
EIN,Eindhoven Airport
GRQ,Groningen Airport Eelde
MST,Maastricht Aachen Airport
RTM,Rotterdam The Hague Airport
You can download all input files as a single zip file.
Projecting Columns
Projecting columns is a very common data processing step. Let's take the pop.csv file and project the first and last columns, city and population.
Unix Shell: cut
In the Unix shell, we use the cut command and specify the file's delimiter (-d) and the columns to be projected (-f).
cut -d , -f 1,3 pop.csv
This produces the following output:
city,population
Amsterdam,905234
Rotterdam,656050
The Hague,552995
Utrecht,361924
Eindhoven,238478
Groningen,234649
Tilburg,224702
Almere,218096
Breda,184716
Nijmegen,179073
DuckDB: SELECT
In DuckDB, we can use the CSV reader to load the data, then use the SELECT clause with column indexes (#i) to designate the columns to be projected:
SELECT #1, #3 FROM 'pop.csv';
Note that we did not have to define any schema or load the data to a table.
Instead, we simply used 'pop.csv' in the FROM clause as we would do with a regular table.
DuckDB detects that this is a CSV file and invokes the read_csv function, which automatically infers the CSV file's dialect (delimiter, presence of quotes, etc.) as well as the schema of the table.
This allows us to simply project columns using SELECT #1, #3.
We could also use the more readable syntax SELECT city, population.
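If the sniffer ever needs to be overridden, the read_csv function can also be called explicitly with its parameters spelled out. The following sketch just restates what auto-detection already infers for this file:
SELECT city, population
FROM read_csv('pop.csv', delim = ',', header = true);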
To make the output of the solutions using Unix tools and DuckDB equivalent, we wrap the query into a COPY ... TO statement:
COPY (
SELECT #1, #3 FROM 'pop.csv'
) TO '/dev/stdout/';
This query produces the same result as the Unix command's output shown above.
To turn this into a standalone CLI command, we can invoke the DuckDB command line client with the -c ⟨query⟩ argument, which runs the SQL query and exits once it's finished.
Using this technique, the query above can be turned into the following one-liner:
duckdb -c "COPY (SELECT #1, #3 FROM 'pop.csv') TO '/dev/stdout/'"
In the following, we'll omit the code blocks using the standalone duckdb command: all solutions can be executed in the duckdb -c ⟨query⟩ template and yield the same result as the solutions using Unix tools.
Sorting Files
Another common task is to sort files based on given columns.
Let's rank the cities within provinces based on their populations.
To do so, we need to sort the pop.csv file first based on the name of the province using an ascending order, then on the population using a descending order.
We then return the province column first, followed by the city and the population columns.
Unix Shell: sort
In the Unix shell, we rely on the sort tool.
We specify the CSV file's separator with the -t argument and set the keys to sort on using -k arguments.
We first sort on the second column (province) with -k 2,2.
Then, we sort on the third column (population), setting the ordering to be reversed (r) and numeric (n) with -k 3rn.
Note that we need to handle the header of the file separately: we take the first row with head -n 1 and the rest of the rows with tail -n +2, sort the latter, and glue them back together with the header.
Finally, we perform a projection to reorder the columns.
Unfortunately, the cut command cannot reorder the columns, so we use awk instead:
(head -n 1 pop.csv; tail -n +2 pop.csv \
| sort -t , -k 2,2 -k 3rn) \
| awk -F , '{ print $2 "," $1 "," $3}'
The result is the following:
province,city,population
Flevoland,Almere,218096
Gelderland,Nijmegen,179073
Groningen,Groningen,234649
North Brabant,Eindhoven,238478
North Brabant,Tilburg,224702
North Brabant,Breda,184716
North Holland,Amsterdam,905234
South Holland,Rotterdam,656050
South Holland,The Hague,552995
Utrecht,Utrecht,361924
DuckDB: ORDER BY
In DuckDB, we simply load the CSV and specify the column ordering via SELECT province, city, population, then set the sorting criteria on the selected columns (province ASC and population DESC).
The CSV reader automatically detects types, so the sorting is numeric by default. Finally, we surround the query with a COPY statement to print the results to the standard output.
COPY (
SELECT province, city, population
FROM 'pop.csv'
ORDER BY province ASC, population DESC
) TO '/dev/stdout/';
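Strictly speaking, the query above only sorts the rows. If an explicit rank number within each province is desired, a window function can compute it – a sketch (the rank_in_province column name is our own):
COPY (
    SELECT
        province,
        city,
        population,
        rank() OVER (PARTITION BY province ORDER BY population DESC) AS rank_in_province
    FROM 'pop.csv'
    ORDER BY province ASC, population DESC
) TO '/dev/stdout/';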
Intersecting Columns
A common task is to calculate the intersection of two columns, i.e., to find entities that are present in both. Let's find the cities that are both in the top-10 most populous cities and have their own airports.
Unix Shell: comm
The Unix solution for intersection uses the comm tool, intended to compare two sorted files line by line.
We first cut the relevant column from both files.
Due to the sorting requirement, we apply sort on both inputs before performing the intersection.
The intersection is performed using comm -12, where the argument -12 means that we only want to keep lines that appear in both files.
We again rely on head and tail to treat the headers and the rest of the files separately during processing and glue them together at the end.
head -n 1 pop.csv | cut -d , -f 1; \
comm -12 \
<(tail -n +2 pop.csv | cut -d , -f 1 | sort) \
<(tail -n +2 cities-airports.csv | cut -d , -f 1 | sort)
The script produces the following output:
city
Amsterdam
Eindhoven
Groningen
Rotterdam
The Hague
DuckDB: INTERSECT ALL
The DuckDB solution reads the CSV files, projects the city fields, and applies the INTERSECT ALL clause to calculate the intersection:
COPY (
SELECT city FROM 'pop.csv'
INTERSECT ALL
SELECT city FROM 'cities-airports.csv'
) TO '/dev/stdout/';
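The complementary question – which of the top-10 cities are not served by their own airport – follows the same pattern with EXCEPT ALL (a sketch; the Unix analogue would use comm -23):
COPY (
    SELECT city FROM 'pop.csv'
    EXCEPT ALL
    SELECT city FROM 'cities-airports.csv'
) TO '/dev/stdout/';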
Pasting Rows Together
Pasting rows together line-by-line is a recurring task.
In our example, we know that the pop.csv and the area.csv files have an equal number of rows, so we can produce a single file that contains both the population and the area of every city in the dataset.
Unix Shell: paste
In the Unix shell, we use the paste command and remove the duplicate city field using cut:
paste -d , pop.csv area.csv | cut -d , -f 1,2,3,5
The output is the following:
city,province,population,area
Amsterdam,North Holland,905234,219.32
Rotterdam,South Holland,656050,324.14
The Hague,South Holland,552995,98.13
Utrecht,Utrecht,361924,99.21
Eindhoven,North Brabant,238478,88.92
Groningen,Groningen,234649,197.96
Tilburg,North Brabant,224702,118.13
Almere,Flevoland,218096,248.77
Breda,North Brabant,184716,128.68
Nijmegen,Gelderland,179073,57.63
DuckDB: POSITIONAL JOIN
In DuckDB, we can use a POSITIONAL JOIN.
This join type is one of DuckDB's SQL extensions and it provides a concise syntax to combine tables row-by-row based on each row's position in the table.
Joining the two tables together using POSITIONAL JOIN results in two city columns – we use the EXCLUDE clause to remove the duplicate column:
COPY (
SELECT pop.*, area.* EXCLUDE city
FROM 'pop.csv'
POSITIONAL JOIN 'area.csv'
) TO '/dev/stdout/';
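Note that POSITIONAL JOIN relies on both files listing the cities in exactly the same order. If that assumption cannot be trusted, an ordinary key-based join gives the same combined table regardless of row order – a sketch, where the ORDER BY merely restores the original population-based ordering:
COPY (
    SELECT pop.*, area.area
    FROM 'pop.csv' pop
    JOIN 'area.csv' area USING (city)
    ORDER BY pop.population DESC
) TO '/dev/stdout/';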
Filtering
Filtering is another very common operation. For this, we'll use the cities-airports.csv file.
For each airport, this file contains its IATA code and the main cities that it serves:
city,IATA
Amsterdam,AMS
Haarlemmermeer,AMS
Eindhoven,EIN
...
Let's try to formulate two queries:
- Find all cities whose name ends in dam.
- Find all airports whose IATA code is equivalent to the first three letters of a served city's name, but the city's name does not end in dam.
Unix Shell: grep
To answer the first question in the Unix shell, we use grep and the regular expression ^[^,]*dam,:
grep "^[^,]*dam," cities-airports.csv
In this expression, ^ denotes the start of the line, and [^,]* matches a string that does not contain the comma character (the separator).
The expression dam, ensures that the string in the first field ends in dam.
The output is:
Amsterdam,AMS
Rotterdam,RTM
Let's try to answer the second question. For this, we need to match the first three characters in the city field to the IATA field, but we need to do so in a case-insensitive manner.
We also need to use a negative condition to exclude the lines where the city's name ends in dam.
Both of these requirements are difficult to achieve with a single grep or egrep command, as they lack support for two features.
First, they do not support case-insensitive matching using a backreference (grep -i alone is not sufficient to ensure this).
Second, they do not support negative lookbehinds.
Therefore, we use pcregrep and formulate our question as follows:
pcregrep -i '^([a-z]{3}).*?(?<!dam),\1$' cities-airports.csv
Here, we call pcregrep with the case-insensitive flag (-i), which in pcregrep also affects backreferences such as \1.
We capture the first three letters with ([a-z]{3}) (e.g., Ams) and match them to the second field with the backreference ,\1$.
We use a non-greedy .*? to seek to the end of the first field, then apply a negative lookbehind with the (?<!dam) expression to ensure that the field does not end in dam.
The result is a single line:
Eindhoven,EIN
DuckDB: WHERE ... LIKE
Let's answer the questions now in DuckDB.
To answer the first question, we can use LIKE for pattern matching.
The header should not be part of the output, so we disable it with HEADER false.
The complete query looks as follows:
COPY (
FROM 'cities-airports.csv'
WHERE city LIKE '%dam'
) TO '/dev/stdout/' (HEADER false);
For the second question, we use string slicing to extract the first three characters, upper to ensure case-insensitivity, and NOT LIKE for the negative condition:
COPY (
FROM 'cities-airports.csv'
WHERE upper(city[1:3]) = IATA
AND city NOT LIKE '%dam'
) TO '/dev/stdout/' (HEADER false);
These queries return exactly the same results as the solutions using grep and pcregrep.
In both of these queries, we used the FROM-first syntax.
If the SELECT clause is omitted, the query is executed as if SELECT * was used, i.e., it returns all columns.
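For completeness, the second question can also be answered with regular expressions on the DuckDB side, using the regexp_matches function and its 'i' option for case-insensitive matching – a sketch:
COPY (
    FROM 'cities-airports.csv'
    WHERE regexp_matches(city, '^' || IATA, 'i')
      AND NOT regexp_matches(city, 'dam$')
) TO '/dev/stdout/' (HEADER false);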
Joining Files
Joining tables is an essential task in data processing. Our next example is going to use a join to return city name–airport name combinations.
This is achieved by joining the cities-airports.csv and the airport-names.csv files on their IATA code fields.
Unix Shell: join
Unix tools support joining files via the join command, which joins lines of two sorted inputs on a common field.
To make this work, we sort the files based on their IATA fields, then perform the join on the first file's 2nd column (-1 2) and the second file's 1st column (-2 1).
We have to omit the header for the join command to work, so we do just that and construct a new header with an echo command:
echo "IATA,city,airport name"; \
join -t , -1 2 -2 1 \
<(tail -n +2 cities-airports.csv | sort -t , -k 2,2) \
<(tail -n +2 airport-names.csv | sort -t , -k 1,1)
The result is the following:
IATA,city,airport name
AMS,Amsterdam,Amsterdam Airport Schiphol
AMS,Haarlemmermeer,Amsterdam Airport Schiphol
EIN,Eindhoven,Eindhoven Airport
GRQ,Eelde,Groningen Airport Eelde
GRQ,Groningen,Groningen Airport Eelde
MST,Beek,Maastricht Aachen Airport
MST,Maastricht,Maastricht Aachen Airport
RTM,Rotterdam,Rotterdam The Hague Airport
RTM,The Hague,Rotterdam The Hague Airport
DuckDB
In DuckDB, we load the CSV files and connect them using the NATURAL JOIN clause, which joins on column(s) with the same name.
To ensure that the result matches that of the Unix solution, we use the ORDER BY ALL clause, which sorts the result on all columns, starting from the first one and stepping through to the last column for tie-breaking.
COPY (
SELECT "IATA", "city", "airport name"
FROM 'cities-airports.csv'
NATURAL JOIN 'airport-names.csv'
ORDER BY ALL
) TO '/dev/stdout/';
Replacing Strings
You may have noticed that we are using very clean datasets. This is of course very unrealistic, so in an evil twist, let's reduce the data quality a bit:
- Replace the space in the province's name with an underscore, e.g., turning North Holland to North_Holland.
- Add thousand-separating commas, e.g., turning 905234 to 905,234.
- Change the CSV's separator to the semicolon character (;).
And while we're at it, let's also fetch the dataset via HTTPS this time, using the URL https://duckdb.org/data/cli/pop.csv.
Unix Shell: curl and sed
In Unix, remote data sets are typically fetched via curl.
The output of curl is piped into the subsequent processing steps, in this case, a bunch of sed commands.
curl -s https://duckdb.org/data/cli/pop.csv \
| sed 's/\([^,]*,.*\) \(.*,[^,]*\)/\1_\2/g' \
| sed 's/,/;/g' \
| sed 's/\([0-9][0-9][0-9]\)$/,\1/'
This results in the following output:
city;province;population
Amsterdam;North_Holland;905,234
Rotterdam;South_Holland;656,050
The Hague;South_Holland;552,995
Utrecht;Utrecht;361,924
Eindhoven;North_Brabant;238,478
Groningen;Groningen;234,649
Tilburg;North_Brabant;224,702
Almere;Flevoland;218,096
Breda;North_Brabant;184,716
Nijmegen;Gelderland;179,073
DuckDB: httpfs and regexp_replace
In DuckDB, we use the following query:
COPY (
SELECT
city,
replace(province, ' ', '_') AS province,
regexp_replace(population::VARCHAR, '([0-9][0-9][0-9])$', ',\1')
AS population
FROM 'https://duckdb.org/data/cli/pop.csv'
) TO '/dev/stdout/' (DELIMITER ';');
Note that the FROM clause now has an HTTPS URL instead of a simple CSV file name.
The presence of the https:// prefix triggers DuckDB to load the httpfs extension and use it to fetch the CSV file.
We use the replace function to substitute the spaces with underscores, and the regexp_replace function for the replacement using a regular expression.
(We could have also used string formatting functions such as format and printf.)
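For instance, format follows the fmt library's syntax, and its {:,} specifier inserts thousands separators without a regular expression – a sketch for the population column only (assuming the {:,} specifier behaves as in the fmt library):
SELECT city, format('{:,}', population) AS population
FROM 'https://duckdb.org/data/cli/pop.csv';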
To change the separator to a semicolon, we serialize the file using the COPY statement with the DELIMITER ';' option.
Reading JSON
As a final exercise, let's query the number of stars given to the duckdb/duckdb repository on GitHub.
Unix Shell: curl and jq
With Unix tools, we can use curl to get the JSON file from https://api.github.com and pipe its output to jq to query the JSON object.
curl -s https://api.github.com/repos/duckdb/duckdb \
| jq ".stargazers_count"
DuckDB: read_json
In DuckDB, we use the read_json function, invoking it with the remote HTTPS endpoint's URL.
The schema of the JSON file is detected automatically, so we can simply use SELECT to return the required field.
SELECT stargazers_count
FROM read_json('https://api.github.com/repos/duckdb/duckdb');
Output
Both of these commands return the current number of stars of the repository.
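Other fields of the same JSON response can be selected in the same query – for example (field names as returned by the GitHub API):
SELECT stargazers_count, forks_count, open_issues_count
FROM read_json('https://api.github.com/repos/duckdb/duckdb');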
Performance
At this point, you might be wondering about the performance of the DuckDB solutions. After all, our prior examples only operated on inputs of a few lines, so benchmarking them against each other would not reveal any measurable performance differences. So, let's switch to the Dutch railway services dataset that we used in a previous blog post and formulate a different problem.
We'll use the 2023 railway services file (services-2023.csv.gz) and count the number of Intercity services that operated in that year.
In Unix, we can use the gzcat command to decompress the csv.gz file into a pipeline. Then, we can use grep or pcregrep (which is more performant), and top it off with the wc command to count the number of lines (-l).
In DuckDB, the built-in CSV reader also supports compressed CSV files, so we can use that without any extra configuration.
gzcat services-2023.csv.gz | grep '^[^,]*,[^,]*,Intercity,' | wc -l
gzcat services-2023.csv.gz | pcregrep '^[^,]*,[^,]*,Intercity,' | wc -l
duckdb -c "SELECT count(*) FROM 'services-2023.csv.gz' WHERE \"Service:Type\" = 'Intercity';"
We also test the tools on uncompressed input:
gunzip -k services-2023.csv.gz
grep '^[^,]*,[^,]*,Intercity,' services-2023.csv | wc -l
pcregrep '^[^,]*,[^,]*,Intercity,' services-2023.csv | wc -l
duckdb -c "SELECT count(*) FROM 'services-2023.csv' WHERE \"Service:Type\" = 'Intercity';"
To reduce the noise in the measurements, we used the hyperfine benchmarking tool and took the mean execution time of 10 runs.
The experiments were carried out on a MacBook Pro with a 12-core M2 Pro CPU and 32 GB RAM, running macOS Sonoma 14.5.
To reproduce them, run the grep-vs-duckdb-microbenchmark.sh script.
The following table shows the runtimes of the solutions on both compressed and uncompressed inputs:
| Tool | Runtime (compressed) | Runtime (uncompressed) |
|---|---|---|
| grep 2.6.0-FreeBSD | 20.9 s | 20.5 s |
| pcregrep 8.45 | 3.1 s | 2.9 s |
| DuckDB 1.0.0 | 4.2 s | 1.2 s |
The results show that on compressed input, grep was the slowest, while DuckDB was slightly edged out by gzcat + pcregrep, which ran in 3.1 seconds compared to DuckDB's 4.2 seconds.
On uncompressed input, DuckDB can utilize all CPU cores from the get-go (instead of starting with a single-threaded decompression step), allowing it to outperform both grep and pcregrep by a significant margin: 2.5× faster than pcregrep and more than 15× faster than grep.
While this example is quite simple, as queries get more complex, there are more opportunities for optimization and larger intermediate datasets may be produced. While both of these can be tackled within a shell script (by manually implementing optimizations and writing the intermediate datasets to disk), the result will likely be less efficient than what a DBMS can come up with. Shell scripts implementing complex pipelines can also be very brittle and need to be rethought even for small changes, making the performance advantage of using a database even more significant for more complex problems.
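To give a flavor of this, extending the count to a per-type breakdown stays a one-liner in SQL, whereas the grep pipeline would already need an extra aggregation step (e.g., with awk or sort | uniq -c). A sketch reusing the Service:Type column from above:
duckdb -c "SELECT \"Service:Type\", count(*) AS services FROM 'services-2023.csv.gz' GROUP BY ALL ORDER BY services DESC;"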
Summary
In this post, we used DuckDB as a standalone CLI application and explored its abilities to complement or substitute existing command line tools (sort, grep, comm, join, etc.).
While we obviously like DuckDB a lot and prefer to use it in many cases, we also believe Unix tools have their place: on most systems, they are already pre-installed, and a well-chosen toolchain of Unix commands can be fast, efficient, and portable (thanks to POSIX compliance).
Additionally, they can be very concise for certain problems.
However, to reap their benefits, you will need to learn the syntax and quirks of each tool, from the grep variants and awk to more advanced ones such as xargs and parallel.
Meanwhile, DuckDB's SQL is easy to learn (you likely know quite a bit of it already) and DuckDB handles most of the optimization for you.
If you have a favorite CLI use case for DuckDB, let us know on social media or submit it to DuckDB snippets. Happy hacking!