JupySQL Plotting with DuckDB

Guest post by Eduardo Blancas2023-02-24

TL;DR: JupySQL provides a seamless SQL experience in Jupyter and uses DuckDB to visualize larger than memory datasets in matplotlib.

Introduction

Data visualization is essential for every data practitioner since it allows us to find patterns that otherwise would be hard to see. The typical approach for plotting tabular datasets involves Pandas and Matplotlib. However, this technique quickly falls short as our data grows, given that Pandas introduces a significant memory overhead, making it challenging even to plot a medium-sized dataset.

In this blog post, we'll use JupySQL and DuckDB to efficiently plot larger-than-memory datasets in our laptops. JupySQL is a fork of ipython-sql, which adds SQL cells to Jupyter, that is being actively maintained and enhanced by the team at Ploomber.

Combining JupySQL with DuckDB enables a powerful and user friendly local SQL processing experience, especially when combined with JupySQL's new plotting capabilities. There is no need to get beefy (and expensive!) EC2 machines or configure complex distributed frameworks! Get started with JupySQL and DuckDB with our Jupyter Notebook guide, or go directly to an example collab notebook!

We want JupySQL to offer the best SQL experience in Jupyter, so if you have any feedback, please open an issue on GitHub!

The problem

One significant limitation when using pandas and matplotlib for data visualization is that we need to load all our data into memory, making it difficult to plot larger-than-memory datasets. Furthermore, given the overhead that pandas introduces, we might be unable to visualize some smaller datasets that we might think "fit" into memory.

Let's load a sample .parquet dataset using pandas to show the memory overhead:

from urllib.request import urlretrieve

_ = urlretrieve("https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet", 
                "yellow_tripdata_2022-01.parquet")

The downloaded .parquet file takes 36MB of disk space:

ls -lh *.parquet
-rw-r--r--  1 eduardo  staff    36M Jan 18 14:45 yellow_tripdata_2022-01.parquet

Now let's load the .parquet as a data frame and see how much memory it takes:

import pandas as pd

df = pd.read_parquet("yellow_tripdata_2022-01.parquet")

df_mb = df.memory_usage().sum() / (1024 ** 2)
print(f"Data frame takes {df_mb:.0f} MB")
Data frame takes 357 MB

As you can see, we're using almost 10x as much memory as the file size. Given this overhead, we must be much more conservative about what larger-than-memory means, as "medium" files might not fit into memory once loaded. But this is just the beginning of our memory problems.

When plotting data, we often need to preprocess it before it's suitable for visualization. However, if we're not careful, these preprocessing steps will copy our data, dramatically increasing memory. Let's show a practical example.

Our sample dataset contains an observation for each NYC yellow cab trip in January 2022. Let's create a boxplot for the trip's distance:

import matplotlib.pyplot as plt

plt.boxplot(df.trip_distance)
_ = plt.title("Trip distance")

8-0

Wow! It looks like some new yorkers really like taxi rides! Let's put the taxi fans aside to improve the visualization and compute the 99th percentile to use it as the cutoff value:

cutoff = df.trip_distance.quantile(q=0.99)
cutoff
19.7

Now, we need to filter out observations larger than the cutoff value; but before we do it, let's create a utility function to capture memory usage:

import psutil

def memory_used():
    """Returns memory used in MB"""
    mem = psutil.Process().memory_full_info().uss / (1024 ** 2)
    print(f"Memory used: {mem:.0f} MB")

memory_used()
Memory used: 941 MB

Let's now filter out the observations:

df_ = df[df.trip_distance < cutoff]

Plot the histogram:

plt.boxplot(df_.trip_distance)
_ = plt.title("Trip distance (top 1% observations removed)")

16-0

We now see more reasonable numbers with the top 1% outliers removed. There are a few trips over 10 miles (perhaps some uptown new yorkers going to Brooklyn for some delicious pizza?)

How much memory are we using now?

memory_used()
Memory used: 1321 MB

380MB more! Loading a 36MB parquet file turned into >700MB in memory after loading and applying one preprocessing step!

So, in reality, when we use pandas, what fits in memory is much smaller than we think, and even with a laptop equipped with 16GB of RAM, we'll be extremely limited in terms of what size of a dataset we process. Of course, we could save a lot of memory by exclusively loading the column we plotted and deleting unneeded data copies; however, let's face it, this never happens in practice. When exploring data, we rarely know ahead of time which columns we'll need; furthermore, our time is better spent analyzing and visualizing the data than manually deleting data copies.

When facing this challenge, we might consider using a distributed framework; however, this adds so much complexity to the process, and it only partially solves the problem since we'd need to write code to compute the statistics in a distributed fashion. Alternatively, we might consider getting a larger machine, a relatively straightforward (but expensive!) approach if we can access cloud resources. However, this still requires us to move our data, set up a new environment, etc. Fortunately for us, there's DuckDB!

DuckDB: A highly scalable backend for statistical visualizations

When using functions such as hist (histogram) or boxplot, matplotlib performs two steps:

  1. Compute summary statistics
  2. Plot data

For example, boxplot calls another function called boxplot_stats that returns the statistics required to draw the plot. To create a boxplot, we need several summary statistics, such as the 25th percentile, 50th percentile, and 75th percentile. The following diagram shows a boxplot along with the labels for each part:

The bottleneck in the pandas + matplotlib approach is the boxplot_stats function since it requires a numpy.array or pandas.Series as an input, forcing us to load all our data into memory. However, we can implement a new version of boxplot_stats that pushes the data aggregation step to another analytical engine.

We chose DuckDB because it's extremely powerful and easy to use. There is no need to spin up a server or manage complex configurations: install it with pip install, point it to your data files, and that's it; you can start aggregating millions and millions of data points in no time!

You can see the full implementation here; essentially, we translated matplotlib's boxplot_stats from Python into SQL. For example, the following query will compute the three percentiles we need: 25th, 50th (median), and 75th:

%load_ext sql
%sql duckdb://
%%sql
-- We calculate the percentiles all at once and 
-- then convert from list format into separate columns
-- (Improving performance by reducing duplicate work)
WITH stats AS (
  SELECT
    percentile_disc([0.25, 0.50, 0.75]) WITHIN GROUP 
      (ORDER BY "trip_distance") AS percentiles
  FROM "yellow_tripdata_2022-01.parquet"
)
SELECT
  percentiles[1] AS q1,
  percentiles[2] AS median,
  percentiles[3] AS q3
FROM stats;
q1 median q3
1.04 1.74 3.13

Once we compute all the statistics, we call the bxp function, which draws the boxplot from the input statistics.

This process is already implemented in JupySQL, and you can create a boxplot with the %sqlplot boxplot command. Let's see how. But first, let's check how much memory we're using, so we can compare it to the pandas version:

memory_used()
Memory used: 1351 MB

Let's create the boxplot:

%sqlplot boxplot --table yellow_tripdata_2022-01.parquet --column trip_distance

26-1

Again, we see all these outliers. Let's compute the cutoff value:

%%sql
SELECT percentile_disc(0.99) WITHIN GROUP (ORDER BY trip_distance)
FROM 'yellow_tripdata_2022-01.parquet'
quantile_disc(0.99 ORDER BY trip_distance)
19.7

Let's define a query that filters out the top 1% of observations. The --save option allows us to store this SQL expression and we choose not to execute it.

%%sql --save no-outliers --no-execute
SELECT *
FROM 'yellow_tripdata_2022-01.parquet'
WHERE trip_distance < 19.7;

We can now use no-outliers in our %sqlplot boxplot command:

%sqlplot boxplot --table no-outliers --column trip_distance --with no-outliers

32-1

memory_used()
Memory used: 1375 MB

Memory usage remained pretty much the same (23MB difference, mostly due to the newly imported modules). Since we're relying on DuckDB for the data aggregation step, the SQL engine takes care of loading, aggregating, and freeing up memory as soon as we're done; this is much more efficient than loading all our data at the same time and keeping unwanted data copies!

Using DuckDB to compute histogram statistics

We can extend our recipe to other statistical visualizations, such as histograms.

A histogram allows us to visualize the distribution of a dataset, enabling us to find patterns such as modality, outliers, range of values, etc. Like with the boxplot, when using pandas + matplotlib, creating a histogram involves loading all our data at once into memory; then, matplotlib aggregates and plots it.

In our case, we'll push the aggregation to DuckDB, which will compute the bin positions (X-axis) and heights (Y-axis), then we'll pass this to maptlotlib's bar function to create the histogram.

The implementation involves two steps.

First, given the number of bins chosen by the user (N_BINS), we compute the BIN_SIZE:

%%sql
SELECT (max(trip_distance) - min(trip_distance)) / N_BINS
FROM 'yellow_tripdata_2022-01.parquet';

Then, using the BIN_SIZE, we find the number of observations that fall into each bin:

%%sql
SELECT
    floor("trip_distance" / BIN_SIZE) * BIN_SIZE,
    count(*) AS count
FROM "yellow_tripdata_2022-01.parquet"
GROUP BY 1
ORDER BY 1;

The intuition for the second query is as follows: given that we have N_BINS, the floor("trip_distance" / BIN_SIZE) portion will assign each observation to their corresponding bin (1, 2, …, N_BINS), then, we multiply by the bin size to get the value in the X axis, while the count represents the value in the Y axis. Once we have that, we call the bar plotting function.

All these steps are implemented in the %sqplot histogram command:

%sqlplot histogram --table no-outliers --column trip_distance --with no-outliers

37-1

Final thoughts

This blog post demonstrated a powerful approach for plotting large datasets powered using JupySQL and DuckDB. If you need to visualize large datasets, DuckDB offers unmatched simplicity and flexibility!

At Ploomber, we're working on building a full-fledged SQL client for Jupyter! Exciting features like automated dataset profiling, autocompletion, and more are coming! So keep an eye on updates! If there are features you think we should add to offer the best SQL experience in Jupyter, please open an issue!

JupySQL is an actively maintained fork of ipython-sql, and it keeps full compatibility with it. If you want to learn more, check out the GitHub repository and the documentation.

Try it out

To try it yourself, check out this collab notebook, or here's a snippet you can paste into Jupyter:

from urllib.request import urlretrieve

urlretrieve("https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2022-01.parquet", 
            "yellow_tripdata_2022-01.parquet")

%pip install jupysql duckdb-engine --quiet
%load_ext sql
%sql duckdb://
%sqlplot boxplot --table yellow_tripdata_2022-01.parquet --column trip_distance

Note: the commands that begin with % or %% will only work on Jupyter/IPython. If you want to try this in a regular Python session, check out the Python API.