DuckDB with Vaex
Version current
Version:

Vaex is a high performance DataFrame library in Python. Vaex is a hybrid DataFrame, as it supports both Numpy’s and Apache Arrow’s data structures. Vaex DataFrames can export data as Apache Arrow Table, which can be directly used by DuckDB. Since DuckDB can output results as an Apache Arrow Table which can be easily turned into a Vaex DataFrame, one can easily alternate between DuckDB and Vaex.

The following example shows how one can use both DuckDB and Vaex DataFrame for a simple exploratory work.

Installation

pip install duckdb
pip install vaex

Vaex DataFrame to DuckDB

A Vaex DataFrame can be exported as an Arrow Table via the to_arrow_table() method. This operation does not take extra memory if the data being exported is already in memory or memory-mapped. The exported Arrow Table can be queried directly via DuckDB.

Let’s use the well known Titanic dataset that also ships with Vaex, to do some operations like filling missing values and creating new columns. Then we will export the DataFrame to an Arrow Table:

import duckdb
import vaex

df = vaex.datasets.titanic()

df['age'] = df.age.fillna(df.age.mean())
df['fare'] = df.age.fillna(df.fare.mean())
df['family_size'] = (df.sibsp + df.parch + 1)
df['fare_per_family_member'] = df.fare / df.family_size
df['name_title'] = df['name'].str.replace('.* ([A-Z][a-z]+)\..*', "\\1", regex=True)

arrow_table  = df.to_arrow_table()

Now we can directly query the Arrow Table using DuckDB, the output of which can be another Arrow Table, which can be used for subsequent DuckDB queries, or it can be converted to a Vaex DataFrame:

query_result_arrow_table = duckdb.query('''

    SELECT

        pclass,
        MEAN(age) as age,
        MEAN(family_size) as family_size,
        MEAN(fare_per_family_member) as fare_per_family_member,
        COUNT(DISTINCT(name_title)) as distinct_titles,
        LIST(DISTINCT(name_title))

    FROM arrow_table

    GROUP BY pclass
    ORDER BY pclass
''').arrow()

DuckDB to Vaex DataFrame

The output of a DuckDB query can be an Arrow Table, which can be easily converted to a Vaex DataFrame via the vaex.from_arrow_table() method. One can also pass data around via Pandas DataFrames, but Arrow is faster.

We can use the query result from above and convert it to a vaex DataFrame:

df_from_duckdb = vaex.from_arrow_table(query_result_arrow_table)

One can then continue to use Vaex, and also export the data or part of it to an Arrow Table to be used with DuckDB as needed.

To learn more about Vaex, feel free to explore their Documentation.

Search Shortcut cmd + k | ctrl + k