DuckDB Tricks – Part 1
TL;DR: We use a simple example data set to present a few tricks that are useful when using DuckDB.
In this blog post, we present five simple DuckDB operations that we found particularly useful for interactive use cases. The operations are summarized in the following table:
Operation | Snippet |
---|---|
Pretty-printing floats | SELECT (10 / 9)::DECIMAL(15, 3); |
Copying the schema | CREATE TABLE tbl AS FROM example LIMIT 0; |
Shuffling data | FROM example ORDER BY hash(rowid + 42); |
Specifying types when reading CSVs | FROM read_csv('example.csv', types = {'x': 'DECIMAL(15, 3)'}); |
Updating CSV files in-place | COPY (SELECT s FROM 'example.csv') TO 'example.csv'; |
Creating the example data set
We start by creating a data set that we'll use in the rest of the blog post. To this end, we define a table, populate it with some data and export it to a CSV file.
CREATE TABLE example (s STRING, x DOUBLE);
INSERT INTO example VALUES ('foo', 10/9), ('bar', 50/7), ('qux', 9/4);
COPY example TO 'example.csv';
Wait a bit, that’s way too verbose! DuckDB’s syntax has several SQL shorthands including the “friendly SQL” clauses.
Here, we combine the VALUES
clause with the FROM
-first syntax, which makes the SELECT
clause optional.
With these, we can compress the data creation script to ~60% of its original size.
The new formulation omits the schema definition and creates the CSV with a single command:
COPY (FROM VALUES ('foo', 10/9), ('bar', 50/7), ('qux', 9/4) t(s, x))
TO 'example.csv';
Regardless of which script we run, the resulting CSV file will look like this:
s,x
foo,1.1111111111111112
bar,7.142857142857143
qux,2.25
Let’s continue with the code snippets and their explanations.
Pretty-printing floating-point numbers
When printing a floating-point number to the output, the fractional parts can be difficult to read and compare. For example, the following query returns three numbers between 1 and 8 but their printed widths are very different due to their fractional parts.
SELECT x
FROM 'example.csv';
┌────────────────────┐
│ x │
│ double │
├────────────────────┤
│ 1.1111111111111112 │
│ 7.142857142857143 │
│ 2.25 │
└────────────────────┘
By casting a column to a DECIMAL
with a fixed number of digits after the decimal point, we can pretty-print it as follows:
SELECT x::DECIMAL(15, 3) AS x
FROM 'example.csv';
┌───────────────┐
│ x │
│ decimal(15,3) │
├───────────────┤
│ 1.111 │
│ 7.143 │
│ 2.250 │
└───────────────┘
A typical alternative solution is to use the printf
or format
functions, e.g.:
SELECT printf('%.3f', x)
FROM 'example.csv';
However, these approaches require us to specify a formatting string that's easy to forget.
What's worse, the statement above returns string values, which makes subsequent operations (e.g., sorting) more difficult.
Therefore, unless keeping the full precision of the floating-point numbers is a concern, casting to DECIMAL
values should be the preferred solution for most use cases.
Copying the schema of a table
To copy the schema from a table without copying its data, we can use LIMIT 0
.
CREATE TABLE example AS
FROM 'example.csv';
CREATE TABLE tbl AS
FROM example
LIMIT 0;
This will result in an empty table with the same schema as the source table:
DESCRIBE tbl;
┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ s │ VARCHAR │ YES │ │ │ │
│ x │ DOUBLE │ YES │ │ │ │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘
Alternatively, in the CLI client, we can run the .schema
dot command:
.schema
This will return the schema of the table.
CREATE TABLE example(s VARCHAR, x DOUBLE);
After editing the table’s name (e.g., example
to tbl
), this query can be used to create a new table with the same schema.
Shuffling data
Sometimes, we need to introduce some entropy into the ordering of the data by shuffling it.
To shuffle non-deterministically, we can simply sort on a random value provided the random()
function:
FROM 'example.csv' ORDER BY random();
Shuffling deterministically is a bit more tricky. To achieve this, we can order on the hash, of the rowid
pseudocolumn. Note that this column is only available in physical tables, so we first have to load the CSV in a table, then perform the shuffle operation as follows:
CREATE OR REPLACE TABLE example AS FROM 'example.csv';
FROM example ORDER BY hash(rowid + 42);
The result of this shuffle operation is deterministic – if we run the script repeatedly, it will always return the following table:
┌─────────┬────────────────────┐
│ s │ x │
│ varchar │ double │
├─────────┼────────────────────┤
│ bar │ 7.142857142857143 │
│ qux │ 2.25 │
│ foo │ 1.1111111111111112 │
└─────────┴────────────────────┘
Note that the + 42
is only necessary to nudge the first row from its position – as hash(0)
returns 0
, the smallest possible value, using it for ordering leaves the first row in its place.
Specifying types in the CSV loader
DuckDB’s CSV loader auto-detects types from a short list of BOOLEAN
, BIGINT
, DOUBLE
, TIME
, DATE
, TIMESTAMP
and VARCHAR
.
In some cases, it’s desirable to override the detected type of a given column with a type outside of this list.
For example, we may want to treat column x
as a DECIMAL
value from the get-go.
We can do this on a per-column basis with the types
argument of the read_csv
function:
CREATE OR REPLACE TABLE example AS
FROM read_csv('example.csv', types = {'x': 'DECIMAL(15, 3)'});
Then, we can simply query the table to see the result:
FROM example;
┌─────────┬───────────────┐
│ s │ x │
│ varchar │ decimal(15,3) │
├─────────┼───────────────┤
│ foo │ 1.111 │
│ bar │ 7.143 │
│ qux │ 2.250 │
└─────────┴───────────────┘
Updating CSV files in-place
In DuckDB, it is possible to read, process and write CSV files in-place. For example, to project the column s
into the same file, we can simply run:
COPY (SELECT s FROM 'example.csv') TO 'example.csv';
The resulting example.csv
file will have the following content:
s
foo
bar
qux
Note that this trick is not possible in Unix shells without a workaround.
One might be tempted to run the following command on the example.csv
file and expect the same result:
cut -d, -f1 example.csv > example.csv
However, due to the intricacies of Unix pipelines, executing this command leaves us with an empty example.csv
file.
The solution is to use different file names, then perform a rename operation:
cut -d, -f1 example.csv > tmp.csv && mv tmp.csv example.csv
Closing thoughts
That’s it for today. The tricks shown in this post are available on duckdbsnippets.com. If you have a trick that would like to share, please submit it there, or send it to us via social media or Discord. Happy hacking!