2022-11-25Pedro Holanda

DuckCon 2023 - 2nd edition

The DuckDB team is excited to invite you all for our second DuckCon user group meeting. It will take place the day before FOSDEM in Brussels on Feb 3rd, 2023, at the Hilton Hotel.

In this edition, we will have the DuckDB creators Hannes Mühleisen, and Mark Raasveldt, talking about the current state of DuckDB and future plans. We will also have one talk regarding building out-of-tree extensions from DuckDB contributors Pedro Holanda, and Sam Ansmink and one invited talk from MotherDuck’s Founding Engineers, Boaz Leskes and Yves Le Maout. In addition, we will have a session with lightning talks from our users. We will close the event with a networking session with food and drinks where users are invited to interact directly with the DuckDB team.

continue reading
2022-11-14Mark Raasveldt

Announcing DuckDB 0.6.0

Image of white-headed duck

The DuckDB team is happy to announce the latest DuckDB version (0.6.0) has been released. This release of DuckDB is named “Oxyura” after the White-headed duck (Oxyura leucocephala) which is an endangered species native to Eurasia.

To install the new version, please visit the installation guide. Note that the release is still being rolled out, so not all artifacts may be published yet. The full release notes can be found here.

continue reading
2022-10-28Mark Raasveldt

Lightweight Compression in DuckDB

Matroshka Ducks (ducks going from big to small)

TLDR: DuckDB supports efficient lightweight compression that is automatically used to keep data size down without incurring high costs for compression and decompression.

When working with large amounts of data, compression is critical for reducing storage size and egress costs. Compression algorithms typically reduce data set size by 75-95%, depending on how compressible the data is. Compression not only reduces the storage footprint of a data set, but also often improves performance as less data has to be read from disk or over a network connection.

continue reading
2022-10-12Guest post by Jacob Matson

Modern Data Stack in a Box with DuckDB

Duck on a box

TLDR: A fast, free, and open-source Modern Data Stack (MDS) can now be fully deployed on your laptop or to a single machine using the combination of DuckDB, Meltano, dbt, and Apache Superset.

This post is a collaboration with Jacob Matson and cross-posted on dataduel.co.

Summary

There is a large volume of literature (1, 2, 3) about scaling data pipelines. “Use Kafka! Build a lake house! Don’t build a lake house, use Snowflake! Don’t use Snowflake, use XYZ!” However, with advances in hardware and the rapid maturation of data software, there is a simpler approach. This article will light up the path to highly performant single node analytics with an MDS-in-a-box open source stack: Meltano, DuckDB, dbt, & Apache Superset on Windows using Windows Subsystem for Linux (WSL). There are many options within the MDS, so if you are using another stack to build an MDS-in-a-box, please share it with the community on the DuckDB Twitter, GitHub, or Discord, or the dbt slack! Or just stop by for a friendly debate about our choice of tools!

continue reading
2022-09-30Hannes Mühleisen

Querying Postgres Tables Directly From DuckDB

TLDR: DuckDB can now directly query queries stored in PostgreSQL and speed up complex analytical queries without duplicating data.

continue reading
2022-07-27Pedro Holanda

Persistent Storage of Adaptive Radix Trees in DuckDB

DuckDB ART

TLDR: DuckDB uses Adaptive Radix Tree (ART) Indexes to enforce constraints and to speed up query filters. Up to this point, indexes were not persisted, causing issues like loss of indexing information and high reload times for tables with data constraints. We now persist ART Indexes to disk, drastically diminishing database loading times (up to orders of magnitude), and we no longer lose track of existing indexes. This blog post contains a deep dive into the implementation of ART storage, benchmarks, and future work. Finally, to better understand how our indexes are used, I’m asking you to answer the following survey. It will guide us when defining our future roadmap.

continue reading
2022-05-27Richard Wesley

Range Joins in DuckDB

TL;DR: DuckDB has fully parallelised range joins that can efficiently join millions of range predicates.

Range intersection joins are an important operation in areas such as temporal analytics, and occur when two inequality conditions are present in a join predicate. Database implementations often rely on slow O(N^2) algorithms that compare every pair of rows for these operations. Instead, DuckDB leverages its fast sorting logic to implement two highly optimized parallel join operators for these kinds of range predicates, resulting in 20-30x faster queries. With these operators, DuckDB can be used effectively in more time-series-oriented use cases.

continue reading
2022-05-04Alex Monahan

Friendlier SQL with DuckDB

Chewbacca_the_duck

An elegant user experience is a key design goal of DuckDB. This goal guides much of DuckDB’s architecture: it is simple to install, seamless to integrate with other data structures like Pandas, Arrow, and R Dataframes, and requires no dependencies. Parallelization occurs automatically, and if a computation exceeds available memory, data is gracefully buffered out to disk. And of course, DuckDB’s processing speed makes it easier to get more work accomplished.

However, SQL is not famous for being user-friendly. DuckDB aims to change that! DuckDB includes both a Relational API for dataframe-style computation, and a highly Postgres-compatible version of SQL. If you prefer dataframe-style computation, we would love your feedback on our roadmap. If you are a SQL fan, read on to see how DuckDB is bringing together both innovation and pragmatism to make it easier to write SQL in DuckDB than anywhere else. Please reach out on GitHub or Discord and let us know what other features would simplify your SQL workflows. Join us as we teach an old dog new tricks!

continue reading
2022-03-07Hannes Mühleisen and Mark Raasveldt

Parallel Grouped Aggregation in DuckDB

TL;DR: DuckDB has a fully parallelized aggregate hash table that can efficiently aggregate over millions of groups.

Grouped aggregations are a core data analysis command. It is particularly important for large-scale data analysis (“OLAP”) because it is useful for computing statistical summaries of huge tables. DuckDB contains a highly optimized parallel aggregation capability for fast and scalable summarization.

Jump straight to the benchmarks?

continue reading
2022-01-06Richard Wesley

DuckDB Time Zones: Supporting Calendar Extensions

TLDR: The DuckDB ICU extension now provides time zone support.

Time zone support is a common request for temporal analytics, but the rules are complex and somewhat arbitrary. The most well supported library for locale-specific operations is the International Components for Unicode (ICU). DuckDB already provided collated string comparisons using ICU via an extension (to avoid dependencies), and we have now connected the existing ICU calendar and time zone functions to the main code via the new TIMESTAMP WITH TIME ZONE (or TIMESTAMPTZ for short) data type. The ICU extension is pre-bundled in DuckDB’s Python and R clients and can be optionally installed in the remaining clients.

In this post, we will describe how time works in DuckDB and what time zone functionality has been added.

continue reading
2021-12-03Pedro Holanda and Jonathan Keane

DuckDB quacks Arrow: A zero-copy data integration between Apache Arrow and DuckDB

TLDR: The zero-copy integration between DuckDB and Apache Arrow allows for rapid analysis of larger than memory datasets in Python and R using either SQL or relational APIs.

This post is a collaboration with and cross-posted on the Arrow blog.

continue reading
2021-11-26Pedro Holanda

DuckDB - The Lord of Enums:
The Fellowship of the Categorical and Factors.

dict-enc

String types are one of the most commonly used types. However, often string columns have a limited number of distinct values. For example, a country column will never have more than a few hundred unique entries. Storing a data type as a plain string causes a waste of storage and compromises query performance. A better solution is to dictionary encode these columns. In dictionary encoding, the data is split into two parts: the category and the values. The category stores the actual strings, and the values stores a reference to the strings. This encoding is depicted below.

continue reading
2021-11-12Richard Wesley

Fast Moving Holistic Aggregates

TLDR: DuckDB, a free and Open-Source analytical data management system, has a windowing API that can compute complex moving aggregates like interquartile ranges and median absolute deviation much faster than the conventional approaches.

In a previous post, we described the DuckDB windowing architecture and mentioned the support for some advanced moving aggregates. In this post, we will compare the performance various possible moving implementations of these functions and explain how DuckDB’s performant implementations work.

continue reading
2021-10-29André Kohn and Dominik Moritz

DuckDB-Wasm: Efficient Analytical SQL in the Browser

TLDR: DuckDB-Wasm is an in-process analytical SQL database for the browser. It is powered by WebAssembly, speaks Arrow fluently, reads Parquet, CSV and JSON files backed by Filesystem APIs or HTTP requests and has been tested with Chrome, Firefox, Safari and Node.js. You can try it in your browser at shell.duckdb.org or on Observable.

continue reading
2021-10-13Richard Wesley

Windowing in DuckDB

TLDR: DuckDB, a free and Open-Source analytical data management system, has a state-of-the-art windowing engine that can compute complex moving aggregates like inter-quartile ranges as well as simpler moving averages.

Window functions (those using the OVER clause) are important tools for analysing data series, but they can be slow if not implemented carefully. In this post, we will take a look at how DuckDB implements windowing. We will also see how DuckDB can leverage its aggregate function architecture to compute useful moving aggregates such as moving inter-quartile ranges (IQRs).

continue reading
2021-08-27Laurens Kuiper

Fastest table sort in the West - Redesigning DuckDB’s sort

TLDR: DuckDB, a free and Open-Source analytical data management system, has a new highly efficient parallel sorting implementation that can sort much more data than fits in main memory.

Database systems use sorting for many purposes, the most obvious purpose being when a user adds an ORDER BY clause to their query. Sorting is also used within operators, such as window functions. DuckDB recently improved its sorting implementation, which is now able to sort data in parallel and sort more data than fits in memory. In this post, we will take a look at how DuckDB sorts, and how this compares to other data management systems.

continue reading
2021-06-25Hannes Mühleisen and Mark Raasveldt

Querying Parquet with Precision using DuckDB

TLDR: DuckDB, a free and open source analytical data management system, can run SQL queries directly on Parquet files and automatically take advantage of the advanced features of the Parquet format.

Apache Parquet is the most common “Big Data” storage format for analytics. In Parquet files, data is stored in a columnar-compressed binary format. Each Parquet file stores a single table. The table is partitioned into row groups, which each contain a subset of the rows of the table. Within a row group, the table data is stored in a columnar fashion.

continue reading
2021-05-14Mark Raasveldt and Hannes Mühleisen

Efficient SQL on Pandas with DuckDB

TLDR: DuckDB, a free and open source analytical data management system, can efficiently run SQL queries directly on Pandas DataFrames.

Recently, an article was published advocating for using SQL for Data Analysis. Here at team DuckDB, we are huge fans of SQL. It is a versatile and flexible language that allows the user to efficiently perform a wide variety of data transformations, without having to care about how the data is physically represented or how to do these data transformations in the most optimal way.

continue reading
2021-01-25Laurens Kuiper

Testing out DuckDB's Full Text Search Extension

TLDR: DuckDB now has full-text search functionality, similar to the FTS5 extension in SQLite. The main difference is that our FTS extension is fully formulated in SQL. We tested it out on TREC disks 4 and 5.

Searching through textual data stored in a database can be cumbersome, as SQL does not provide a good way of formulating questions such as “Give me all the documents about Mallard Ducks”: string patterns with LIKE will only get you so far. Despite SQL’s shortcomings here, storing textual data in a database is commonplace. Consider the table products (id INT, name VARCHAR, description VARCHAR) - it would be useful to search through the name and description columns for a website that sells these products.

continue reading