DuckDB parallelizes the workload based on row groups, i.e., groups of rows that are stored together at the storage level. A row group in DuckDB’s database format consists of max. 122,880 rows. Parallelism starts at the level of row groups, therefore, for a query to run on k threads, it needs to scan at least k * 122,880 rows.
Note that in certain cases DuckDB may launch too many threads (e.g., due to HyperThreading), which can lead to slowdowns. In these cases, it’s worth manually limiting the number of threads using
SET threads = X.
A key strength of DuckDB is support for larger-than-memory workloads, i.e., it is able to process data sets that are larger than the available system memory (also known as out-of-core processing). It can also run queries where the intermediate results cannot fit into memory. This section explains the prerequisites, scope, and known limitations of larger-than-memory processing in DuckDB.
Larger-than-memory workloads are supported by spilling to disk. If DuckDB is connected to a persistent database file, DuckDB will create a temporary directory named
<database_file_name>.tmp when the available memory is no longer sufficient to continue processing.
If DuckDB is running in in-memory mode, it cannot use disk to offload data if it does not fit into main memory.
To enable offloading in the absence of a persistent database file, use the
SET temp_directory statement:
SET temp_directory = '/path/to/temp_dir.tmp/'
Some operators cannot output a single row until the last row of their input has been seen. These are called blocking operators as they require their entire input to be buffered, and are the most memory-instensive operators in relational database systems. The main blocking operators are the following:
OVER ... (PARTITION BY ... ORDER BY ...),
DuckDB supports larger-than-memory processing for all of these operators.
DuckDB strives to always complete workloads even if they are larger-than-memory. That said, there are some limitations at the moment:
- If multiple blocking operators appear in the same query, DuckDB may still throw an out-of-memory exception due to the complex interplay of these operators.
- Some aggregate functions, such as
string_agg(), do not support offloading to disk.
- Aggregate functions that use sorting are holistic, i.e., they need all inputs before the aggregation can start. As DuckDB cannot yet offload some complex intermediate aggregate states to disk, these functions can cause an out-of-memory exception when run on large data sets.
PIVOToperation internally uses the
list()function, therefore it is subject to the same limitation.
If your queries are not performing as well as expected, it’s worth studying their query plans:
EXPLAINto print the physical query plan without running the query.
EXPLAIN ANALYZEto run and profile the query. This will show the CPU time that each step in the query takes. Note that due to multi-threading, adding up the individual times will be larger than the total query processing time.
Query plans can point to the root of performance issues. A few general directions:
- Avoid nested loop joins in favor of hash joins.
- A scan that does not include a filter pushdown for a filter condition that is later applied performs unnecessary IO. Try rewriting the query to apply a pushdown.
- Bad join orders where the cardinality of an operator explodes to billions of tuples should be avoided at all costs.
Prepared statements can improve performance when running the same query many times, but with different parameters. When a statement is prepared, it completes several of the initial portions of the query execution process (parsing, planning, etc.) and caches their output. When it is executed, those steps can be skipped, improving performance. This is beneficial mostly for repeatedly running small queries (with a runtime of < 100ms) with different sets of parameters.
Note that it is not a primary design goal for DuckDB to quickly execute many small queries concurrently. Rather, it is optimized for running larger, less frequent queries.
DuckDB uses synchronous IO when reading remote files. This means that each DuckDB thread can make at most one HTTP request at a time. If a query must make many small requests over the network, increasing DuckDB’s
threads setting to larger than the total number of CPU cores (approx. 2-5 times CPU cores) can improve parallelism and performance.
DuckDB will perform best when reusing the same database connection many times. Disconnecting and reconnecting on every query will incur some overhead, which can reduce performance when running many small queries. DuckDB also caches some data and metadata in memory, and that cache is lost when the last open connection is closed. Frequently, a single connection will work best, but a connection pool may also be used.
Using multiple connections can parallelize some operations, although it is typically not necessary. DuckDB does attempt to parallelize as much as possible within each individual query, but it is not possible to parallelize in all cases. Making multiple connections can process more operations concurrently. This can be more helpful if DuckDB is not CPU limited, but instead bottlenecked by another resource like network transfer speed.
When importing or exporting data sets that are much larger than the available memory, out of memory errors may occur. In these cases, it’s worth setting the
preserve_insertion_order configuration option to
SET preserve_insertion_order = false;
This allows the systems to re-order any results that do not contain
ORDER BY clauses, potentially reducing memory usage.