TLDR: The DuckDB ICU extension now provides time zone support.
Time zone support is a common request for temporal analytics, but the rules are complex and somewhat arbitrary.
The most well supported library for locale-specific operations is the International Components for Unicode (ICU).
DuckDB already provided collated string comparisons using ICU via an extension (to avoid dependencies),
and we have now connected the existing ICU calendar and time zone functions to the main code
via the new
TIMESTAMP WITH TIME ZONE (or
TIMESTAMPTZ for short) data type. The ICU extension is pre-bundled in DuckDB’s Python and R clients and can be optionally installed in the remaining clients.
In this post, we will describe how time works in DuckDB and what time zone functionality has been added.continue reading
TLDR: The zero-copy integration between DuckDB and Apache Arrow allows for rapid analysis of larger than memory datasets in Python and R using either SQL or relational APIs.
This post is a collaboration with and cross-posted on the Arrow blog.continue reading
String types are one of the most commonly used types. However, often string columns have a limited number of distinct values. For example, a country column will never have more than a few hundred unique entries. Storing a data type as a plain string causes a waste of storage and compromises query performance. A better solution is to dictionary encode these columns. In dictionary encoding, the data is split into two parts: the category and the values. The category stores the actual strings, and the values stores a reference to the strings. This encoding is depicted below.continue reading
TLDR: DuckDB, a free and Open-Source analytical data management system, has a windowing API that can compute complex moving aggregates like interquartile ranges and median absolute deviation much faster than the conventional approaches.
In a previous post, we described the DuckDB windowing architecture and mentioned the support for some advanced moving aggregates. In this post, we will compare the performance various possible moving implementations of these functions and explain how DuckDB’s performant implementations work.continue reading
TLDR: DuckDB-Wasm is an in-process analytical SQL database for the browser. It is powered by WebAssembly, speaks Arrow fluently, reads Parquet, CSV and JSON files backed by Filesystem APIs or HTTP requests and has been tested with Chrome, Firefox, Safari and Node.js. You can try it in your browser at shell.duckdb.org or on Observable.continue reading
TLDR: DuckDB, a free and Open-Source analytical data management system, has a state-of-the-art windowing engine that can compute complex moving aggregates like inter-quartile ranges as well as simpler moving averages.
Window functions (those using the
OVER clause) are important tools for analysing data series,
but they can be slow if not implemented carefully.
In this post, we will take a look at how DuckDB implements windowing.
We will also see how DuckDB can leverage its aggregate function architecture
to compute useful moving aggregates such as moving inter-quartile ranges (IQRs).
TLDR: DuckDB, a free and Open-Source analytical data management system, has a new highly efficient parallel sorting implementation that can sort much more data than fits in main memory.
Database systems use sorting for many purposes, the most obvious purpose being when a user adds an
ORDER BY clause to their query.
Sorting is also used within operators, such as window functions.
DuckDB recently improved its sorting implementation, which is now able to sort data in parallel and sort more data than fits in memory.
In this post, we will take a look at how DuckDB sorts, and how this compares to other data management systems.
TLDR: DuckDB, a free and open source analytical data management system, can run SQL queries directly on Parquet files and automatically take advantage of the advanced features of the Parquet format.
Apache Parquet is the most common “Big Data” storage format for analytics. In Parquet files, data is stored in a columnar-compressed binary format. Each Parquet file stores a single table. The table is partitioned into row groups, which each contain a subset of the rows of the table. Within a row group, the table data is stored in a columnar fashion.continue reading
TLDR: DuckDB, a free and open source analytical data management system, can efficiently run SQL queries directly on Pandas DataFrames.
Recently, an article was published advocating for using SQL for Data Analysis. Here at team DuckDB, we are huge fans of SQL. It is a versatile and flexible language that allows the user to efficiently perform a wide variety of data transformations, without having to care about how the data is physically represented or how to do these data transformations in the most optimal way.continue reading
TLDR: DuckDB now has full-text search functionality, similar to the FTS5 extension in SQLite. The main difference is that our FTS extension is fully formulated in SQL. We tested it out on TREC disks 4 and 5.
Searching through textual data stored in a database can be cumbersome, as SQL does not provide a good way of formulating questions such as “Give me all the documents about Mallard Ducks”: string patterns with
LIKE will only get you so far. Despite SQL’s shortcomings here, storing textual data in a database is commonplace. Consider the table
products (id INT, name VARCHAR, description VARCHAR) - it would be useful to search through the
description columns for a website that sells these products.