DuckDB is an in-process
SQL OLAP database management system

Why DuckDB?

Simple

  • In-process, serverless
  • C++11, no dependencies, single file build
  • APIs for Python/R/Java/…
more

Feature-rich

  • Transactions, persistence
  • Extensive SQL support
  • Direct Parquet & CSV querying
more

Fast

  • Vectorized engine
  • Optimized for analytics
  • Parallel query processing
more

Free

  • Free & Open Source
  • Permissive MIT License
more

All the benefits of a database, none of the hassle.

When to use DuckDB

  • Processing and storing tabular datasets, e.g. from CSV or Parquet files
  • Interactive data analysis, e.g. Joining & aggregate multiple large tables
  • Concurrent large changes, to multiple large tables, e.g. appending rows, adding/removing/updating columns
  • Large result set transfer to client

When to not use DuckDB

  • High-volume transactional use cases (e.g. tracking orders in a webshop)
  • Large client/server installations for centralized enterprise data warehousing
  • Writing to a single database from multiple concurrent processes

Blog

Archive
2022-05-27

Range Joins in DuckDB

TL;DR: DuckDB has fully parallelised range joins that can efficiently join millions of range predicates. Range intersection joins are an important operation in areas such as temporal analytics, and occur when two inequality conditions are present in a join predicate. Database implementations often rely on slow O(N^2) algorithms that compare […]

continue reading
2022-05-04

Friendlier SQL with DuckDB

An elegant user experience is a key design goal of DuckDB. This goal guides much of DuckDB’s architecture: it is simple to install, seamless to integrate with other data structures like Pandas, Arrow, and R Dataframes, and requires no dependencies. Parallelization occurs automatically, and if a computation exceeds available memory, […]

continue reading
2022-03-07

Parallel Grouped Aggregation in DuckDB

TL;DR: DuckDB has a fully parallelized aggregate hash table that can efficiently aggregate over millions of groups. Grouped aggregations are a core data analysis command. It is particularly important for large-scale data analysis (“OLAP”) because it is useful for computing statistical summaries of huge tables. DuckDB contains a highly optimized […]

continue reading