DuckDB is an in-process SQL OLAP database management system
Why DuckDB?
Simple
- In-process, serverless
- C++11, no dependencies, single file build
- APIs for Python/R/Java/…
All the benefits of a database, none of the hassle.
Installation
Choose the environment in which you want to use DuckDB:
- Python
- R
- Java
- node.js
- C++
- CLI
- ODBC
Latest release: DuckDB 0.4.0
Python: pip install duckdb==0.4.0
R: install.packages("duckdb")
Java (Maven):
<dependency>
<groupId>org.duckdb</groupId>
<artifactId>duckdb_jdbc</artifactId>
<version>0.4.0</version>
</dependency>
node.js: npm install duckdb
When to use DuckDB
- Processing and storing tabular datasets, e.g. from CSV or Parquet files
- Interactive data analysis, e.g. joining and aggregating multiple large tables
- Concurrent large changes to multiple large tables, e.g. appending rows or adding, removing, or updating columns
- Large result set transfer to client
When not to use DuckDB
- High-volume transactional use cases (e.g. tracking orders in a webshop)
- Large client/server installations for centralized enterprise data warehousing
- Writing to a single database from multiple concurrent processes
Blog
Range Joins in DuckDB
TL;DR: DuckDB has fully parallelised range joins that can efficiently join millions of range predicates. Range intersection joins are an important operation in areas such as temporal analytics, and occur when two inequality conditions are present in a join predicate. Database implementations often rely on slow O(N^2) algorithms that compare […]
Friendlier SQL with DuckDB
An elegant user experience is a key design goal of DuckDB. This goal guides much of DuckDB’s architecture: it is simple to install, seamless to integrate with other data structures like Pandas, Arrow, and R Dataframes, and requires no dependencies. Parallelization occurs automatically, and if a computation exceeds available memory, […]
Parallel Grouped Aggregation in DuckDB
TL;DR: DuckDB has a fully parallelized aggregate hash table that can efficiently aggregate over millions of groups. Grouped aggregations are a core data analysis command. It is particularly important for large-scale data analysis (“OLAP”) because it is useful for computing statistical summaries of huge tables. DuckDB contains a highly optimized […]