Spark API

The DuckDB Spark API implements the PySpark API, allowing you to use the familiar Spark API to interact with DuckDB. All statements are translated to DuckDB’s internal plans using our relational API and executed using DuckDB’s query engine.

Warning The DuckDB Spark API is currently experimental and features are still missing. We are very interested in feedback. Please report any functionality that you are missing, either through Discord or on GitHub.

Example

from duckdb.experimental.spark.sql import SparkSession
from duckdb.experimental.spark.sql.functions import lit, col
import pandas as pd

spark = SparkSession.builder.getOrCreate()

pandas_df = pd.DataFrame({
    'age': [34, 45, 23, 56],
    'name': ['Joan', 'Peter', 'John', 'Bob']
})

df = spark.createDataFrame(pandas_df)
df = df.withColumn(
    'location', lit('Seattle')
)
res = df.select(
    col('age'),
    col('location')
).collect()

print(res)
[
    Row(age=34, location='Seattle'),
    Row(age=45, location='Seattle'),
    Row(age=23, location='Seattle'),
    Row(age=56, location='Seattle')
]
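As noted above, these Spark calls are translated into plans built with DuckDB's relational API rather than Spark jobs. The snippet below is a hand-written sketch of a roughly equivalent query expressed through the relational API directly; it is shown for comparison only and is not the exact plan the Spark API generates.

import duckdb
import pandas as pd

pandas_df = pd.DataFrame({
    'age': [34, 45, 23, 56],
    'name': ['Joan', 'Peter', 'John', 'Bob']
})

# Roughly equivalent relational-API query: project the age column plus a
# constant 'Seattle' literal, then materialize the result.
rel = duckdb.from_df(pandas_df).project("age, 'Seattle' AS location")
print(rel.fetchall())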

Contribution Guidelines

Contributions to the experimental Spark API are welcome. When making a contribution, please follow these guidelines:

  • Instead of using temporary files, use our pytest testing framework (a minimal test sketch follows this list).
  • When adding new functions, ensure that method signatures comply with those in the PySpark API.
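As a rough illustration of the first guideline, a new test might look like the sketch below. The file layout, class name, and assertion style are assumptions for illustration; follow the existing tests in the DuckDB repository for the actual conventions.

# Hypothetical pytest sketch; names and structure are illustrative only.
import pandas as pd

from duckdb.experimental.spark.sql import SparkSession
from duckdb.experimental.spark.sql.functions import lit


class TestWithColumn(object):
    def test_with_column_adds_literal(self):
        spark = SparkSession.builder.getOrCreate()
        df = spark.createDataFrame(pd.DataFrame({'age': [34, 45]}))
        rows = df.withColumn('location', lit('Seattle')).collect()
        # Every row should carry the constant column added by withColumn.
        assert [row.location for row in rows] == ['Seattle', 'Seattle']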