DuckDB with Polars
Version current
Version:

Polars is a DataFrames library built in Rust with bindings for Python and Node.js. It uses Apache Arrow’s columnar format as its memory model. Polars can output results as Apache Arrow (which is often a zero-copy operation), and DuckDB can read those results directly. DuckDB can also rapidly output results to Apache Arrow, which can be easily converted to a Polars DataFrame. Due to the interoperability of Apache Arrow, workflows can alternate between DuckDB and Polars with ease!

This example workflow is also available as a Google Collab notebook.

Installation

pip install duckdb
pip install -U 'polars[pyarrow]'

Polars to DuckDB

To convert from Polars to DuckDB, first save Polars results into an Arrow table using the to_arrow function. Then include that Arrow Table in the FROM clause of a DuckDB query. This example was based on the Polars Readme.

As a note, Pandas is not required as a first step prior to using Polars, but was helpful for generating example data to reuse in the second example below.

Import the libraries, create an example Pandas DataFrame, then convert to Polars.

import duckdb
import polars as pl
import pandas as pd

df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 5],
        "fruits": ["banana", "banana", "apple", "apple", "banana"],
        "B": [5, 4, 3, 2, 1],
        "cars": ["beetle", "audi", "beetle", "beetle", "beetle"],
    }
)

polars_df = pl.DataFrame(df)

Calculate a new Polars DataFrame and output it to a variable as an Apache Arrow table.

polars_to_arrow = (
    polars_df
    .sort("fruits")
    .select(
        [
            "fruits",
            "cars",
            pl.lit("fruits").alias("literal_string_fruits"),
            pl.col("B").filter(pl.col("cars") == "beetle").sum(),
            pl.col("A").filter(pl.col("B") > 2).sum().over("cars").alias("sum_A_by_cars"),     # groups by "cars"
            pl.col("A").sum().over("fruits").alias("sum_A_by_fruits"),                         # groups by "fruits"
            pl.col("A").reverse().over("fruits").alias("rev_A_by_fruits"),                     # groups by "fruits
            pl.col("A").sort_by("B").over("fruits").alias("sort_A_by_B_by_fruits"),            # groups by "fruits"
        ]
    )
    .to_arrow()
)

Then query the Apache Arrow table using DuckDB, and output the results as another Apache Arrow table for use in a subsequent DuckDB or Polars operation.

output = duckdb.query("""
  SELECT 
    fruits,
    first(sum_A_by_fruits) as sum_A
  FROM polars_to_arrow
  GROUP BY ALL
  ORDER BY ALL
""").arrow()

DuckDB to Polars

DuckDB can output results as Apache Arrow tables, which can be imported into Polars with the Polars DataFrame constructor. The same approach could be used with Pandas DataFrames, but Arrow is a faster way to pass data between DuckDB and Polars.

This example reuses the original Pandas DataFrame created above as a starting point. As a note, Pandas is not required as a first step, but was only used to generate example data.

After the import statements and example DataFrame creation above, query the Pandas DataFrame using DuckDB and output the results as an Arrow table.

duckdb_to_arrow = duckdb.query("""
  SELECT
    fruits,
    cars,
    'fruits' as literal_string_fruits,
    SUM(B) FILTER (cars = 'beetle') OVER () as B,
    SUM(A) FILTER (B > 2) OVER (PARTITION BY cars) as sum_A_by_cars,
    SUM(A) OVER (PARTITION BY fruits) as sum_A_by_fruits
  FROM df
  ORDER BY
    fruits,
    df.B
""").arrow()

Load the Apache Arrow table into Polars using the Polars DataFrame constructor.

polars_df_2 = pl.DataFrame(duckdb_to_arrow)

Complete a calculation using Polars, then output the results as another Apache Arrow table for use in a subsequent DuckDB or Polars operation.

output = (
    polars_df_2
    .groupby('fruits')
    .agg(
        pl.col('sum_A_by_fruits')
        .first()
        .sort_by('fruits')
        )
).to_arrow()

To learn more about Polars, feel free to explore their Python API Reference!

Search Shortcut cmd + k | ctrl + k