Read PLINK 2 genomics file formats and run common genetic analyses directly in SQL
Installing and Loading
INSTALL plinking_duck FROM community;
LOAD plinking_duck;
Example
-- Read variant metadata from a .pvar file
SELECT CHROM, POS, ID, REF, ALT
FROM read_pvar('cohort.pvar')
WHERE CHROM = '22';
-- Read genotypes in tidy format (one row per variant x sample)
SELECT chrom, pos, iid, genotype
FROM read_pfile('cohort', orient := 'genotype')
LIMIT 10;
-- Compute allele frequencies
SELECT * FROM plink_freq('cohort.pgen')
WHERE ALT_FREQ > 0.01;
-- Run a GWAS association test
SELECT * FROM plink_glm('cohort')
WHERE p_value < 5e-8;
About plinking_duck
PlinkingDuck brings PLINK 2 genotype, variant, and sample data into DuckDB, letting you query genomics datasets with standard SQL instead of format-specific command-line tools.
File readers:
read_pvar(path)— variant metadata (.pvar/.bim)read_psam(path)— sample metadata (.psam/.fam)read_pgen(path)— binary genotypes (.pgen)read_pfile(prefix)— unified fileset reader with orient modes (variant/genotype/sample), sample subsetting, region and variant filtering
Analysis functions:
plink_freq— per-variant allele frequencies via fast genotype countingplink_hardy— Hardy-Weinberg equilibrium exact testplink_missing— per-variant or per-sample missingness ratesplink_ld— pairwise linkage disequilibrium (r², D, D')plink_score— polygenic risk scoring with mean imputationplink_glm— per-variant GWAS regression (linear, logistic, Firth)plink_pca— principal component analysis via randomized SVD
Genotype output modes:
genotypes='struct'— STRUCT with named fields per samplegenotypes='counts'— fast genotype counting (no decompression)genotypes='stats'— counts + AF, MAF, missingness, heterozygosity
Flexible inputs:
- Unified
variantsparameter: indices, rsids, CPRA strings/structs, ranges - Parquet/CSV/table companions for variant and sample metadata
af_range/ac_range/genotype_rangefilter pushdown
All functions support projection pushdown (skip genotype decompression for metadata-only queries), parallel scanning, sample subsetting, and region filtering. Legacy PLINK 1 formats (.bim/.fam) are auto-detected.
Built on pgenlib for efficient access to the compressed .pgen binary format without full decompression.
For full documentation, see plinking-duck.readthedocs.io.
Added Functions
| function_name | function_type | description | comment | examples |
|---|---|---|---|---|
| plink_freq | table | NULL | NULL | |
| plink_glm | table | NULL | NULL | |
| plink_hardy | table | NULL | NULL | |
| plink_ld | table | NULL | NULL | |
| plink_missing | table | NULL | NULL | |
| plink_pca | table | NULL | NULL | |
| plink_score | table | NULL | NULL | |
| read_pfile | table | NULL | NULL | |
| read_pgen | table | NULL | NULL | |
| read_psam | table | NULL | NULL | |
| read_pvar | table | NULL | NULL |
Overloaded Functions
This extension does not add any function overloads.
Added Types
This extension does not add any types.
Added Settings
| name | description | input_type | scope | aliases |
|---|---|---|---|---|
| plinking_max_matrix_elements | Maximum genotype matrix elements for orient := 'sample' pre-read (variants x samples). Default 16 billion (~16 GB of int8). | BIGINT | GLOBAL | [] |
| plinking_max_threads | Maximum threads for parallel scan operations. 0 = default (hardcoded cap of 16), >0 = cap at this value. | BIGINT | GLOBAL | [] |
| plinking_use_parquet_companions | Auto-discover .pvar.parquet and .psam.parquet companion files. When true, parquet companions are preferred over text formats. | BOOLEAN | GLOBAL | [] |