Read HTS (VCF/BCF/BAM/CRAM/FASTA/FASTQ/GTF/GFF) files in DuckDB via htslib
Installing and Loading
INSTALL duckhts FROM community;
LOAD duckhts;
Example
-- Load the extension
LOAD duckhts;
-- Read a VCF/BCF file (tidy FORMAT columns)
SELECT CHROM, POS, REF, ALT, FORMAT_GT
FROM read_bcf('test/data/formatcols.vcf.gz', tidy_format := true)
LIMIT 5;
-- Read a BAM/SAM file
SELECT QNAME, RNAME, POS, READ_GROUP_ID, SAMPLE_ID
FROM read_bam('test/data/rg.sam.gz')
LIMIT 5;
About duckhts
DuckHTS provides table functions for common high-throughput sequencing (HTS) formats using htslib. Query VCF/BCF/BAM/CRAM/FASTA/FASTQ/GTF/GFF and tabix-indexed files directly in SQL.
Functions include:
- read_bcf(path, [region, tidy_format])
- read_bam(path, [region, reference, standard_tags, auxiliary_tags])
- read_fasta(path)
- read_fastq(path, [mate_path, interleaved])
- read_gff(path, [region, attributes_map])
- read_gtf(path, [region, attributes_map])
- read_tabix(path, [region, header, header_names, auto_detect, column_types])
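As a rough sketch of how the optional region argument might be used: the queries below assume that region is passed as a named parameter (like tidy_format in the example above) and accepts an htslib-style region string such as chr1:10000-20000. Both of these are assumptions, not confirmed by this page. The file paths are hypothetical, as is the presence of an index next to each file.

```sql
-- Hypothetical paths; region queries typically rely on an index
-- (.tbi/.csi for VCF/BCF, .bai/.csi for BAM) stored alongside the file.
LOAD duckhts;

-- Variants overlapping an assumed htslib-style region string
SELECT *
FROM read_bcf('variants.vcf.gz', region := 'chr1:10000-20000')
LIMIT 5;

-- Alignments overlapping the same region
SELECT *
FROM read_bam('alignments.bam', region := 'chr1:10000-20000')
LIMIT 5;
```

SELECT * is used here because the exact output column names are not listed on this page.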
Paired FASTQ is supported either via mate_path (the mate file) or via interleaved := true (both mates in one file). CRAM is supported with an explicit reference file, passed through the reference parameter of read_bam. For GTF/GFF, attributes_map := true returns the attributes as a parsed MAP. For BAM, standard_tags adds optional SAMtags columns and auxiliary_tags adds an auxiliary tag map. For tabix files, header and header_names control column naming, and column types are either inferred with auto_detect or given explicitly with column_types.
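The options described above could be exercised along these lines. This is a minimal sketch: every file path is hypothetical, and it assumes that standard_tags, auxiliary_tags, header, and auto_detect accept boolean values in the same way as tidy_format and interleaved. Check the extension's own documentation for the exact argument types.

```sql
-- All file paths below are hypothetical placeholders.
LOAD duckhts;

-- Paired FASTQ from two files: the second file is given as mate_path
SELECT *
FROM read_fastq('reads_R1.fastq.gz', mate_path := 'reads_R2.fastq.gz')
LIMIT 5;

-- Paired FASTQ from a single interleaved file
SELECT *
FROM read_fastq('reads_interleaved.fastq.gz', interleaved := true)
LIMIT 5;

-- CRAM read through read_bam with an explicit reference FASTA
SELECT *
FROM read_bam('sample.cram', reference := 'reference.fa')
LIMIT 5;

-- GTF with attributes returned as a parsed MAP
SELECT *
FROM read_gtf('genes.gtf.gz', attributes_map := true)
LIMIT 5;

-- BAM with SAMtags columns and the auxiliary tag map
-- (boolean values are an assumption)
SELECT *
FROM read_bam('alignments.bam', standard_tags := true, auxiliary_tags := true)
LIMIT 5;

-- Tabix-indexed text file with a header row and inferred column types
-- (boolean values are an assumption)
SELECT *
FROM read_tabix('regions.bed.gz', header := true, auto_detect := true)
LIMIT 5;
```

Each query selects all columns because the column names produced by these options are not documented on this page.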
MSVC builds (windows_amd64/windows_arm64) are not supported. MinGW/RTools is supported on Windows.
Added Functions
| function_name | function_type | description | comment | examples |
|---|---|---|---|---|
| read_bam | table | NULL | NULL | |
| read_bcf | table | NULL | NULL | |
| read_fasta | table | NULL | NULL | |
| read_fastq | table | NULL | NULL | |
| read_gff | table | NULL | NULL | |
| read_gtf | table | NULL | NULL | |
| read_tabix | table | NULL | NULL | |