Jupyter Notebooks
Version 0.9.2

DuckDB’s Python client can be used directly in Jupyter notebooks with no additional configuration if desired. However, additional libraries can be used to simplify SQL query development. This guide will describe how to utilize those additional libraries. See other guides in the Python section for how to use DuckDB and Python together.

In this example, we use the JupySQL package.

This example workflow is also available as a Google Colab notebook.

Library Installation

Four additional libraries improve the DuckDB experience in Jupyter notebooks.

  1. jupysql
    • Convert a Jupyter code cell into a SQL cell
  2. Pandas
    • Clean table visualizations and compatibility with other analysis
  3. matplotlib
    • Plotting with Python
  4. duckdb-engine (DuckDB SQLAlchemy driver)
    • Used by SQLAlchemy to connect to DuckDB (optional)
# Run these pip install commands from the command line if Jupyter Notebook is not yet installed.
# Otherwise, see Google Collab link above for an in-notebook example
pip install duckdb

# Install Jupyter Notebook (Note: you can also install JupyterLab: pip install jupyterlab) 
pip install notebook

# Install supporting libraries
pip install jupysql
pip install pandas
pip install matplotlib
pip install duckdb-engine

Library Import and Configuration

Open a Jupyter Notebook and import the relevant libraries.

Connecting to DuckDB Natively

To connect to DuckDB, run:

import duckdb
import pandas as pd

%load_ext sql
conn = duckdb.connect()
%sql conn --alias duckdb

Connecting to DuckDB via SQLAlchemy Using duckdb_engine

Alternatively, you can connect to DuckDB via SQLAlchemy using duckdb_engine. See the performance and feature differences.

import duckdb
import pandas as pd
# No need to import duckdb_engine
#  jupysql will auto-detect the driver needed based on the connection string!

# Import jupysql Jupyter extension to create SQL cells
%load_ext sql

Set configurations on jupysql to directly output data to Pandas and to simplify the output that is printed to the notebook.

%config SqlMagic.autopandas = True
%config SqlMagic.feedback = False
%config SqlMagic.displaycon = False

Connect jupysql to DuckDB using a SQLAlchemy-style connection string. Either connect to a new in-memory DuckDB, the default connection or a file backed db.

%sql duckdb:///:default:
# %sql duckdb:///:memory:
# %sql duckdb:///path/to/file.db

The %sql command and duckdb.sql share the same default connection if you provide duckdb:///:default: as the SQLAlchemy connection string.

Querying DuckDB

Single line SQL queries can be run using %sql at the start of a line. Query results will be displayed as a Pandas DF.

%sql SELECT 'Off and flying!' AS a_duckdb_column

An entire Jupyter cell can be used as a SQL cell by placing %%sql at the start of the cell. Query results will be displayed as a Pandas DF.

%%sql
SELECT
    schema_name,
    function_name
FROM duckdb_functions()
ORDER BY ALL DESC
LIMIT 5

To store the query results in a Python variable, use << as an assignment operator. This can be used with both the %sql and %%sql Jupyter magics.

%sql res << SELECT 'Off and flying!' AS a_duckdb_column

If the %config SqlMagic.autopandas = True option is set, the variable is a Pandas dataframe, otherwise, it is a ResultSet that can be converted to Pandas with the DataFrame() function.

Querying Pandas Dataframes

DuckDB is able to find and query any dataframe stored as a variable in the Jupyter notebook.

input_df = pd.DataFrame.from_dict({"i":[1, 2, 3],
                                  "j":["one", "two", "three"]})

The dataframe being queried can be specified just like any other table in the FROM clause.

%sql output_df << SELECT sum(i) AS total_i FROM input_df

Visualizing DuckDB Data

The most common way to plot datasets in Python is to load them using Pandas and then use matplotlib or seaborn for plotting. This approach requires loading all data into memory which is highly inefficient. The plotting module in JupySQL runs computations in the SQL engine. This delegates memory management to the engine and ensures that intermediate computations do not keep eating up memory, efficiently plotting massive datasets.

Install and Load DuckDB httpfs extension

DuckDB’s httpfs extension allows parquet and CSV files to be queried remotely over http. These examples query a parquet file that contains historical taxi data from NYC. Using the parquet format allows DuckDB to only pull the rows and columns into memory that are needed rather than downloading the entire file. DuckDB can be used to process local parquet files as well, which may be desirable if querying the entire parquet file, or running multiple queries that require large subsets of the file.

%%sql
INSTALL httpfs;
LOAD httpfs;

Boxplot & Histogram

To create a boxplot, call %sqlplot boxplot, passing the name of the table and the column to plot. In this case, the name of the table is the URL of the remotely stored parquet file.

%sqlplot boxplot --table https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet --column trip_distance

Boxplot of the trip_distance column

Now, create a query that filters by the 90th percentile. Note the use of the --save, and --no-execute functions. This tells JupySQL to store the query, but skips execution. It will be referenced in the next plotting call.

%%sql --save short-trips --no-execute
SELECT *
FROM "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-01.parquet"
WHERE trip_distance < 6.3

To create a histogram, call %sqlplot histogram and pass the name of the table, the column to plot, and the number of bins. This uses --with short-trips so JupySQL uses the query defined previously and therefore only plots a subset of the data.

%sqlplot histogram --table short-trips --column trip_distance --bins 10 --with short-trips

Histogram of the trip_distance column

Summary

You now have the ability to alternate between SQL and Pandas in a simple and highly performant way! You can plot massive datasets directly through the engine (avoiding both the download of the entire file and loading all of it into Pandas in memory). Dataframes can be read as tables in SQL, and SQL results can be output into Dataframes. Happy analyzing!

Search Shortcut cmd + k | ctrl + k