R dplyr pipeline syntax support for DuckDB - transpiles dplyr verbs to SQL
Maintainer(s):
mrchypark
Installing and Loading
INSTALL dplyr FROM community;
LOAD dplyr;
Example
-- Transform R dplyr pipeline syntax directly in DuckDB
-- First, create a sample table
CREATE TABLE iris AS SELECT * FROM (VALUES
(5.1, 3.5, 1.4, 0.2, 'setosa'),
(4.9, 3.0, 1.4, 0.2, 'setosa'),
(7.0, 3.2, 4.7, 1.4, 'versicolor'),
(6.4, 3.2, 4.5, 1.5, 'versicolor'),
(6.3, 3.3, 6.0, 2.5, 'virginica'),
(5.8, 2.7, 5.1, 1.9, 'virginica')
) AS t(sepal_length, sepal_width, petal_length, petal_width, species);
-- Run a pipeline directly as a statement (parser extension)
iris %>%
filter(sepal_length > 5) %>%
select(species, sepal_length, petal_length) %>%
arrange(desc(sepal_length));
┌────────────┬──────────────┬──────────────┐
│ species │ sepal_length │ petal_length │
├────────────┼──────────────┼──────────────┤
│ versicolor │ 7.0 │ 4.7 │
│ versicolor │ 6.4 │ 4.5 │
│ virginica │ 6.3 │ 6.0 │
│ virginica │ 5.8 │ 5.1 │
│ setosa │ 5.1 │ 1.4 │
└────────────┴──────────────┴──────────────┘
-- Group and summarize with dplyr syntax
iris %>%
group_by(species) %>%
summarise(avg_sepal = mean(sepal_length), count = n());
┌───────────┬───────┐
│ avg_sepal │ count │
│ double │ int64 │
├───────────┼───────┤
│ 6.05 │ 2 │
│ 5.0 │ 2 │
│ 6.7 │ 2 │
└───────────┴───────┘
-- Create derived columns with mutate
iris %>%
mutate(sepal_ratio = sepal_length / sepal_width) %>%
select(species, sepal_ratio) %>%
arrange(sepal_ratio);
-- Embed a pipeline inside normal SQL using (| ... |)
SELECT species, COUNT(*) AS n
FROM (| iris %>% filter(sepal_length > 5) %>% select(species) |)
GROUP BY species
ORDER BY n DESC;
About dplyr
dplyr is a DuckDB extension that brings R's popular dplyr package syntax to DuckDB.
It transpiles dplyr pipeline syntax (using the %>% pipe operator) directly to SQL,
allowing R users to write familiar data manipulation code in DuckDB.
Supported dplyr Verbs:
select()- Choose columns by namefilter()- Filter rows based on conditionsmutate()- Create new columns or modify existing onesarrange()- Sort rows (supportsdesc()for descending order)group_by()- Group data by one or more columnssummarise()/summarize()- Aggregate grouped datarename()- Rename columns
Aggregation Functions:
n()- Count rowsmean()/avg()- Averagesum()- Summin()/max()- Minimum / Maximummedian()/mode()- DuckDB-specific aggregates- Other aggregate function names are passed through to DuckDB (uppercased).
Key Features:
- Native R dplyr syntax support with
%>%pipe operator - Transpiles to optimized SQL at runtime
- Works with any DuckDB table (pipelines must start with a table name)
- Error messages include helpful diagnostics
- Embed pipelines in normal SQL using
(| ... |)(e.g.,SELECT * FROM (| iris %>% filter(x > 0) |);)
Usage:
-- Using the dplyr() table function
SELECT * FROM dplyr('table_name %>% filter(x > 5) %>% select(a, b)');
-- Or write a pipeline directly as a statement (parser extension)
table_name %>% filter(x > 5) %>% select(a, b);
For more information, visit the GitHub repository.
Added Functions
| function_name | function_type | description | comment | examples |
|---|---|---|---|---|
| dplyr | table | NULL | NULL |