Jupyter Notebooks
Version 0.4.0

DuckDB in Jupyter Notebooks

DuckDB’s Python client can be used directly in Jupyter notebooks with no additional configuration if desired. However, additional libraries can be used to simplify SQL query development. This guide will describe how to utilize those additional libraries. See other guides in the Python section for how to use DuckDB and Python together.

As a small note, for maximum performance converting large output datasets to Pandas Dataframes, using DuckDB directly may be desirable. However, the difference is typically quite small.

This example workflow is also available as a Google Collab notebook.

Library Installation

Four additional libraries improve the DuckDB experience in Jupyter notebooks.

  1. Pandas
    • Clean table visualizations and compatibility with other analysis
  2. ipython-sql
    • Convert a Jupyter code cell into a SQL cell
  3. SQLAlchemy
    • Used by ipython-sql to connect to databases
  4. duckdb_engine (DuckDB SQLAlchemy driver)
    • Used by SQLAlchemy to connect to DuckDB
# Run these pip install commands from the command line if Jupyter Notebook is not yet installed.
# Otherwise, see Google Collab link above for an in-notebook example
pip install duckdb

# Install Jupyter Notebook
pip install notebook

# Install supporting libraries
pip install pandas       # conda install pandas
pip install ipython-sql 
pip install SQLAlchemy
pip install duckdb-engine

Library Import and Configuration

Next, open a Jupyter Notebook and import the relevant libraries.

import duckdb
import pandas as pd
import sqlalchemy
# No need to import duckdb_engine
#  SQLAlchemy will auto-detect the driver needed based on your connection string!

# Import ipython-sql Jupyter extension to create SQL cells
%load_ext sql

Set configrations on ipython-sql to directly output data to Pandas and to simplify the output that is printed to the notebook.

%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

Connect ipython-sql to DuckDB using a SQLAlchemy-style connection string. You may either connect to an in memory DuckDB, or a file backed db.

%sql duckdb:///:memory:
# %sql duckdb:///path/to/file.db

Querying DuckDB

Single line SQL queries can be run using %sql at the start of a line. Query results will be displayed as a Pandas DF.

%sql SELECT 'Off and flying!' as a_duckdb_column

An entire Jupyter cell can be used as a SQL cell by placing %%sql at the start of the cell. Query results will be displayed as a Pandas DF.

%%sql
SELECT
    schema_name,
    function_name
FROM duckdb_functions()
ORDER BY ALL DESC
LIMIT 5

To return query results into a Pandas dataframe for future usage, use << as an assignment operator. This can be used with both the %sql and %%sql Jupyter magics.

%sql my_df << SELECT 'Off and flying!' as a_duckdb_column

Querying Pandas Dataframes

DuckDB is able to find and query any dataframe stored as a variable in the Jupyter notebook.

input_df = pd.DataFrame.from_dict({"i":[1, 2, 3],
                                  "j":["one", "two", "three"]})

The dataframe being queried can be specified just like any other table in the FROM clause.

%sql output_df << SELECT sum(i) as total_i FROM input_df

Summary

You now have the ability to alternate between SQL and Pandas in a simple and highly performant way! Dataframes can be read as tables in SQL, and SQL results can be output into Dataframes. Happy analyzing!

Search Shortcut cmd + k | ctrl + k