Search Shortcut cmd + k | ctrl + k
dplyr

R dplyr pipeline syntax support for DuckDB - transpiles dplyr verbs to SQL

Maintainer(s): mrchypark

Installing and Loading

INSTALL dplyr FROM community;
LOAD dplyr;

Example

-- Transform R dplyr pipeline syntax directly in DuckDB
-- First, create a sample table
CREATE TABLE iris AS SELECT * FROM (VALUES
  (5.1, 3.5, 1.4, 0.2, 'setosa'),
  (4.9, 3.0, 1.4, 0.2, 'setosa'),
  (7.0, 3.2, 4.7, 1.4, 'versicolor'),
  (6.4, 3.2, 4.5, 1.5, 'versicolor'),
  (6.3, 3.3, 6.0, 2.5, 'virginica'),
  (5.8, 2.7, 5.1, 1.9, 'virginica')
) AS t(sepal_length, sepal_width, petal_length, petal_width, species);

-- Run a pipeline directly as a statement (parser extension)
iris %>%
  filter(sepal_length > 5) %>%
  select(species, sepal_length, petal_length) %>%
  arrange(desc(sepal_length));
┌────────────┬──────────────┬──────────────┐
  species    sepal_length  petal_length 
├────────────┼──────────────┼──────────────┤
 versicolor           7.0           4.7 
 versicolor           6.4           4.5 
 virginica            6.3           6.0 
 virginica            5.8           5.1 
 setosa               5.1           1.4 
└────────────┴──────────────┴──────────────┘

-- Group and summarize with dplyr syntax
iris %>%
  group_by(species) %>%
  summarise(avg_sepal = mean(sepal_length), count = n());
  ┌───────────┬───────┐
   avg_sepal  count 
    double    int64 
  ├───────────┼───────┤
        6.05      2 
         5.0      2 
         6.7      2 
  └───────────┴───────┘

-- Create derived columns with mutate
iris %>%
  mutate(sepal_ratio = sepal_length / sepal_width) %>%
  select(species, sepal_ratio) %>%
  arrange(sepal_ratio);

-- Embed a pipeline inside normal SQL using (| ... |)
SELECT species, COUNT(*) AS n
FROM (| iris %>% filter(sepal_length > 5) %>% select(species) |)
GROUP BY species
ORDER BY n DESC;

About dplyr

dplyr is a DuckDB extension that brings R's popular dplyr package syntax to DuckDB. It transpiles dplyr pipeline syntax (using the %>% pipe operator) directly to SQL, allowing R users to write familiar data manipulation code in DuckDB.

Supported dplyr Verbs:

  • select() - Choose columns by name
  • filter() - Filter rows based on conditions
  • mutate() - Create new columns or modify existing ones
  • arrange() - Sort rows (supports desc() for descending order)
  • group_by() - Group data by one or more columns
  • summarise() / summarize() - Aggregate grouped data
  • rename() - Rename columns

Aggregation Functions:

  • n() - Count rows
  • mean() / avg() - Average
  • sum() - Sum
  • min() / max() - Minimum / Maximum
  • median() / mode() - DuckDB-specific aggregates
  • Other aggregate function names are passed through to DuckDB (uppercased).

Key Features:

  • Native R dplyr syntax support with %>% pipe operator
  • Transpiles to optimized SQL at runtime
  • Works with any DuckDB table (pipelines must start with a table name)
  • Error messages include helpful diagnostics
  • Embed pipelines in normal SQL using (| ... |) (e.g., SELECT * FROM (| iris %>% filter(x > 0) |);)

Usage:

-- Using the dplyr() table function
SELECT * FROM dplyr('table_name %>% filter(x > 5) %>% select(a, b)');

-- Or write a pipeline directly as a statement (parser extension)
table_name %>% filter(x > 5) %>% select(a, b);

For more information, visit the GitHub repository.

Added Functions

function_name function_type description comment examples
dplyr table NULL NULL