DuckDB with DataFusion
Version 0.9.1

DataFusion is a DataFrame and SQL library built in Rust with bindings for Python. It uses Apache Arrow’s columnar format as its memory model. DataFusion can output results as Apache Arrow, and DuckDB can read those results directly. DuckDB can also rapidly output results to Apache Arrow, which can be easily converted to a DataFusion DataFrame. Due to the interoperability of Apache Arrow, workflows can alternate between DuckDB and DataFusion with ease!

This example workflow is also available as a Google Colab notebook.


pip install --quiet duckdb datafusion pyarrow

DataFusion to DuckDB

To convert from DataFusion to DuckDB, first save DataFusion results into Arrow batches using the collect function, and then create an Arrow table using PyArrow’s Table.from_batches function. Then include that Arrow Table in the FROM clause of a DuckDB query.

As a note, Pandas is not required as a first step prior to using DataFusion, but was helpful for generating example data to reuse in the second example below.

Import the libraries, create an example Pandas DataFrame, then convert to DataFusion.

import duckdb
import pyarrow as pa
import pandas as pd
import datafusion as df
from datafusion import functions as f

pandas_df = pd.DataFrame(
        "A": [1, 2, 3, 4, 5],
        "fruits": ["banana", "banana", "apple", "apple", "banana"],
        "B": [5, 4, 3, 2, 1],
        "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],

arrow_table = table = pa.Table.from_pandas(pandas_df)
arrow_batches = table.to_batches()

ctx = SessionContext()
datafusion_df = ctx.create_dataframe([arrow_batches])

Calculate a new DataFusion DataFrame and output it to a variable as an Apache Arrow table.

arrow_batches = (
datafusion_to_arrow = (

Then query the Apache Arrow table using DuckDB, and output the results as another Apache Arrow table for use in a subsequent DuckDB or DataFusion operation.

output = duckdb.query("""
    first(sum_A_by_fruits) AS sum_A
  FROM datafusion_to_arrow

DuckDB to DataFusion

DuckDB can output results as Apache Arrow tables, which can be imported into DataFusion with the DataFusion DataFrame constructor. The same approach could be used with Pandas DataFrames, but Arrow is a faster way to pass data between DuckDB and DataFusion.

This example reuses the original Pandas DataFrame created above as a starting point. As a note, Pandas is not required as a first step, but was only used to generate example data.

After the import statements and example DataFrame creation above, query the Pandas DataFrame using DuckDB and output the results as an Arrow table.

duckdb_to_arrow = duckdb.query("""
    'fruits' AS literal_string_fruits,
    SUM(B) FILTER (cars = 'beetle') OVER () AS B,
    SUM(A) FILTER (B > 2) OVER (PARTITION BY cars) AS sum_A_by_cars,
    SUM(A) OVER (PARTITION BY fruits) AS sum_A_by_fruits
  FROM df

Load the Apache Arrow table into DataFusion using the DataFusion DataFrame constructor.

datafusion_df_2 = ctx.create_dataframe([duckdb_to_arrow.to_batches()])

Complete a calculation using DataFusion, then output the results as another Apache Arrow table for use in a subsequent DuckDB or DataFusion operation.

output_2 = (

To learn more about DataFusion, feel free to explore their GitHub repository!

Search Shortcut cmd + k | ctrl + k