Search Shortcut cmd + k | ctrl + k
duckhts

DuckDB extension for reading HTS file formats via htslib

Maintainer(s): sounkou-bioinfo

Installing and Loading

INSTALL duckhts FROM community;
LOAD duckhts;

Example

-- Load the extension
LOAD duckhts;

-- Read a VCF/BCF file (tidy FORMAT columns)
SELECT CHROM, POS, REF, ALT, SAMPLE_ID
FROM read_bcf('test/data/formatcols.vcf.gz', tidy_format := true)
LIMIT 5;

-- Read a BAM/SAM file
SELECT QNAME, RNAME, POS, READ_GROUP_ID, SAMPLE_ID
FROM read_bam('test/data/rg.sam.gz')
LIMIT 5;

About duckhts

DuckHTS provides DuckDB table functions for common high-throughput sequencing (HTS) formats using htslib. Query VCF, BCF, BAM, CRAM, FASTA, FASTQ, GTF, GFF, and tabix-indexed files directly in SQL.

The extension also includes sequence utility UDFs, SAM flag predicate helpers, and HTS metadata helpers.

Functions included in this extension:

Readers

  • read_bcf(path, region := NULL, index_path := NULL, tidy_format := FALSE, additional_csq_column_types := NULL): Read VCF and BCF variant data with typed INFO, FORMAT, typed CSQ/ANN/BCSQ subfields, optional tidy sample output, and optional bcftools-style CSQ type overrides.
  • read_bam(path, standard_tags := FALSE, auxiliary_tags := FALSE, region := NULL, index_path := NULL, reference := NULL, sequence_encoding := 'string', quality_representation := 'string'): Read SAM, BAM, and CRAM alignments with optional typed SAMtags and auxiliary tag maps. Use sequence_encoding := 'nt16' to return SEQ as UTINYINT[] and quality_representation := 'phred' to return QUAL as UTINYINT[] instead of VARCHAR.
  • read_fasta(path, region := NULL, index_path := NULL, sequence_encoding := 'string'): Read FASTA records or indexed FASTA regions as sequence rows. Use sequence_encoding := 'nt16' to return SEQUENCE as UTINYINT[] (htslib nt16 4-bit codes) instead of VARCHAR.
  • read_bed(path, region := NULL, index_path := NULL): Read BED3-BED12 interval files with canonical typed columns and optional tabix-backed region filtering.
  • fasta_nuc(path, bed_path := NULL, bin_width := NULL, region := NULL, index_path := NULL, bed_index_path := NULL, include_seq := FALSE): Compute bedtools nuc-style nucleotide composition for supplied BED intervals or generated fixed-width bins over a FASTA reference.
  • read_fastq(path, interleaved := FALSE, mate_path := NULL, sequence_encoding := 'string', quality_representation := 'string', input_quality_encoding := 'phred33'): Read single-end, paired-end, or interleaved FASTQ files with optional legacy quality decoding. By default, FASTQ qualities are interpreted as modern Phred+33 input. Use sequence_encoding := 'nt16' to return SEQUENCE as UTINYINT[] and quality_representation := 'phred' to return QUALITY as UTINYINT[] instead of VARCHAR. input_quality_encoding accepts 'phred33', 'auto', 'phred64', or 'solexa64'.
  • read_gff(path, header_names := NULL, header := FALSE, column_types := NULL, auto_detect := FALSE, attributes_map := FALSE, region := NULL, index_path := NULL): Read GFF annotations with optional parsed attribute maps and indexed region filtering.
  • read_gtf(path, header_names := NULL, header := FALSE, column_types := NULL, auto_detect := FALSE, attributes_map := FALSE, region := NULL, index_path := NULL): Read GTF annotations with optional parsed attribute maps and indexed region filtering.
  • read_tabix(path, header_names := NULL, header := FALSE, column_types := NULL, auto_detect := FALSE, region := NULL, index_path := NULL): Read generic tabix-indexed text data with optional header handling and type inference.
  • fasta_index(path, index_path := NULL): Build a FASTA index (.fai) and return a single row with columns success (BOOLEAN) and index_path (VARCHAR).

Metadata

  • detect_quality_encoding(path, max_records := 10000): Inspect a FASTQ file's observed quality ASCII range and report compatible legacy encodings with a heuristic guessed encoding.
  • read_hts_header(path, format := NULL, mode := NULL): Inspect HTS headers in parsed, raw, or combined form across supported formats.
  • read_hts_index(path, format := NULL, index_path := NULL): Inspect high-level HTS index metadata such as sequence names and mapped counts.
  • read_hts_index_spans(path, format := NULL, index_path := NULL): Expand index metadata into span and chunk rows suitable for low-level index inspection.
  • read_hts_index_raw(path, format := NULL, index_path := NULL): Return the raw on-disk HTS index blob together with basic identifying metadata.

Compression

  • bgzip(path, output_path := NULL, threads := 4, level := -1, keep := TRUE, overwrite := FALSE): Compress a plain file to BGZF and return the created output path and byte counts.
  • bgunzip(path, output_path := NULL, threads := 4, keep := TRUE, overwrite := FALSE): Decompress a BGZF-compressed file and return the created output path and byte counts.

Indexing

  • bam_index(path, index_path := NULL, min_shift := 0, threads := 4): Build a BAM or CRAM index and report the written index path and format.
  • bcf_index(path, index_path := NULL, min_shift := NULL, threads := 4): Build a TBI or CSI index for a VCF or BCF file and report the written index path and format.
  • tabix_index(path, preset := 'vcf', index_path := NULL, min_shift := 0, threads := 4, seq_col := NULL, start_col := NULL, end_col := NULL, comment_char := NULL, skip_lines := NULL): Build a tabix index for a BGZF-compressed text file using a preset or explicit coordinate columns.

Variants

  • bcftools_liftover(chrom, pos, ref, alt, chain_path, dst_fasta_ref, src_fasta_ref, max_snp_gap, max_indel_inc, lift_mt, end_pos, no_left_align): Row-oriented liftover kernel intended to mirror bcftools +liftover semantics as closely as possible while returning one STRUCT per input row with fields: src_chrom, src_pos, src_ref, src_alt, dest_chrom, dest_pos, dest_end, dest_ref, dest_alt, mapped, reverse_complemented, swap, reject_reason, and note. Set no_left_align := true to skip post-liftover left-alignment of lifted indels (mirrors –no-left-align in bcftools +liftover).
  • duckdb_liftover(table_name, chrom_col, pos_col, ref_col := NULL, alt_col := NULL, chain_path := NULL, dst_fasta_ref := NULL, src_fasta_ref := NULL, max_snp_gap := 1, max_indel_inc := 250, lift_mt := false, end_pos_col := NULL, no_left_align := false): DuckDB-specific wrapper over bcftools_liftover that takes either a table name or a derived-table expression plus column-name strings for chrom/pos/ref/alt and returns the lifted table. The no_left_align parameter mirrors –no-left-align in bcftools +liftover.
  • bcftools_score(bcf_path, summary_path, use := NULL, columns := 'PLINK', columns_file := NULL, q_score_thr := NULL, use_variant_id := FALSE, counts := FALSE, samples := NULL, force_samples := FALSE, regions := NULL, regions_file := NULL, regions_overlap := 1, targets := NULL, targets_file := NULL, targets_overlap := 0, apply_filters := NULL, include := NULL, exclude := NULL): Compute polygenic scores from one genotype BCF/VCF and one summary-statistics file with bcftools +score-compatible GT/DS/HDS/AP/GP/AS dosage semantics, sample subsetting, and region/target/FILTER-string controls.
  • bcftools_munge_row(chrom, pos, a1, a2, id, p, z, or, beta, n, n_cas, n_con, info, frq, se, lp, ac, neff, neffdiv2, het_i2, het_p, het_lp, dire, fasta_ref, iffy_tag := 'IFFY', mismatch_tag := 'REF_MISMATCH', ns := NULL, nc := NULL, ne := NULL): Normalize one summary-statistics row into GWAS-VCF-style fields (chrom/pos/ref/alt/effect metrics), resolving REF/ALT orientation against a FASTA reference and applying swap-aware sign/frequency/count transforms. The output flag alleles_swapped means REF/ALT orientation was swapped to match the FASTA reference.
  • duckdb_munge(table_name, preset := '', column_map := map([''], ['']), column_map_file := '', fasta_ref := NULL, iffy_tag := 'IFFY', mismatch_tag := 'REF_MISMATCH', ns := NULL, nc := NULL, ne := NULL): DuckDB macro wrapper over bcftools_munge_row that maps source columns (via preset or explicit map) and returns normalized GWAS-VCF-style rows with lean outputs and explicit alleles_swapped semantics. Output columns: chrom, pos, id, ref, alt, alleles_swapped, filter, ns, ez, nc, es, se, lp, af, ac, ne (16 columns). For METAL meta-analysis output with SI/I2/CQ/ED columns, use duckdb_munge_metal.
  • duckdb_munge_metal(table_name, preset := '', column_map := map([''], ['']), column_map_file := '', fasta_ref := NULL, iffy_tag := 'IFFY', mismatch_tag := 'REF_MISMATCH', ns := NULL, nc := NULL, ne := NULL): Extended munge macro with METAL meta-analysis output columns. Same as duckdb_munge but additionally emits: si (imputation info, from INFO input), i2 (Cochran's I² heterogeneity, from HET_I2), cq (Cochran's Q -log10 p, from HET_LP or -log10(HET_P)), and ed (effect direction string, from DIRE; +/- flipped on allele swap). The R wrapper rduckhts_munge() auto-dispatches to this macro when metal keys (INFO, HET_I2, HET_P, HET_LP, DIRE) are present in the resolved column map.

Sequence UDFs

  • seq_revcomp(sequence): Compute the reverse complement of a DNA sequence using A, C, G, T, and N bases.
  • seq_canonical(sequence): Return the lexicographically smaller of a sequence and its reverse complement.
  • seq_hash_2bit(sequence): Encode a short DNA sequence as a 2-bit unsigned integer hash.
  • seq_encode_4bit(sequence): Encode an IUPAC DNA sequence as a list of 4-bit base codes, preserving ambiguity symbols including N.
  • seq_decode_4bit(codes): Decode a list of 4-bit IUPAC DNA base codes back into a sequence string.
  • seq_gc_content(sequence): Compute GC fraction for a DNA sequence as a value between 0 and 1.
  • seq_kmers(sequence, k, canonical := FALSE): Expand a sequence into positional k-mers with optional canonicalization.

SAM Flag UDFs

  • sam_flag_bits(flag): Decode a SAM flag into a struct of boolean bit fields using explicit SAM-oriented names such as is_paired, is_proper_pair, is_next_segment_unmapped, and is_supplementary.
  • sam_flag_has(flag, mask): Test whether any bits from the provided SAM flag mask are set in a flag value.
  • is_forward_aligned(flag): Test whether a mapped segment is aligned to the forward strand. Returns NULL for unmapped segments because SAM flag 0x10 does not define genomic strand when 0x4 is set.
  • is_paired(flag): Test whether the SAM flag indicates that the template has multiple segments in sequencing (0x1).
  • is_proper_pair(flag): Test whether the SAM flag indicates that each segment is properly aligned according to the aligner (0x2).
  • is_unmapped(flag): Test whether the read itself is unmapped according to the SAM flag.
  • is_next_segment_unmapped(flag): Test whether the next segment in the template is flagged as unmapped (0x8).
  • is_reverse_complemented(flag): Test whether SEQ is stored reverse complemented (0x10); for mapped reads this corresponds to reverse-strand alignment.
  • is_next_segment_reverse_complemented(flag): Test whether SEQ of the next segment in the template is stored reverse complemented (0x20).
  • is_first_segment(flag): Test whether the read is marked as the first segment in the template.
  • is_last_segment(flag): Test whether the read is marked as the last segment in the template.
  • is_secondary(flag): Test whether the alignment is marked as secondary.
  • is_qc_fail(flag): Test whether the read failed vendor or pipeline quality checks.
  • is_duplicate(flag): Test whether the alignment is flagged as a duplicate.
  • is_supplementary(flag): Test whether the alignment is marked as supplementary.

CIGAR Utils

  • cigar_has_soft_clip(cigar): Test whether a CIGAR string contains any soft-clipped segment (S).
  • cigar_has_hard_clip(cigar): Test whether a CIGAR string contains any hard-clipped segment (H).
  • cigar_left_soft_clip(cigar): Return the left-end soft-clipped length from a CIGAR string, or zero if the alignment does not start with S.
  • cigar_right_soft_clip(cigar): Return the right-end soft-clipped length from a CIGAR string, or zero if the alignment does not end with S.
  • cigar_query_length(cigar): Return the query-consuming length from a CIGAR string, counting M, I, S, =, and X.
  • cigar_aligned_query_length(cigar): Return the aligned query length from a CIGAR string, counting M, =, and X but excluding clips and insertions.
  • cigar_reference_length(cigar): Return the reference-consuming length from a CIGAR string, counting M, D, N, =, and X.
  • cigar_has_op(cigar, op): Test whether a CIGAR string contains at least one instance of the requested operator.

Operational notes:

  • Paired FASTQ is supported via mate_path or interleaved := true.
  • CRAM reads are supported with an explicit reference file.
  • GTF/GFF attributes can be returned as a parsed MAP with attributes_map := true via read_gff or read_gtf.
  • Optional SAM tag columns and an auxiliary tag map are available through standard_tags and auxiliary_tags.
  • Tabix readers support header, header_names, type inference with auto_detect, and explicit column_types.
  • MSVC builds (windows_amd64/windows_arm64) are not supported; use MinGW/RTools on Windows.

Added Functions

function_name function_type description comment examples
bam_index table NULL NULL  
bcf_index table NULL NULL  
bcftools_liftover scalar NULL NULL  
bcftools_munge_row scalar NULL NULL  
bcftools_score table NULL NULL  
bgunzip table NULL NULL  
bgzip table NULL NULL  
cigar_aligned_query_length scalar NULL NULL  
cigar_has_hard_clip scalar NULL NULL  
cigar_has_op scalar NULL NULL  
cigar_has_soft_clip scalar NULL NULL  
cigar_left_soft_clip scalar NULL NULL  
cigar_query_length scalar NULL NULL  
cigar_reference_length scalar NULL NULL  
cigar_right_soft_clip scalar NULL NULL  
detect_quality_encoding table NULL NULL  
duckdb_liftover table_macro NULL NULL  
duckdb_munge table_macro NULL NULL  
duckdb_munge_metal table_macro NULL NULL  
duckdb_munge_preset_map macro NULL NULL  
duckdb_munge_resolved_map macro NULL NULL  
duckhts_quote_ident macro NULL NULL  
fasta_index table NULL NULL  
fasta_nuc table NULL NULL  
is_duplicate scalar NULL NULL  
is_first_segment scalar NULL NULL  
is_forward_aligned scalar NULL NULL  
is_last_segment scalar NULL NULL  
is_next_segment_reverse_complemented scalar NULL NULL  
is_next_segment_unmapped scalar NULL NULL  
is_paired scalar NULL NULL  
is_proper_pair scalar NULL NULL  
is_qc_fail scalar NULL NULL  
is_reverse_complemented scalar NULL NULL  
is_secondary scalar NULL NULL  
is_supplementary scalar NULL NULL  
is_unmapped scalar NULL NULL  
read_bam table NULL NULL  
read_bcf table NULL NULL  
read_bed table NULL NULL  
read_fasta table NULL NULL  
read_fastq table NULL NULL  
read_gff table NULL NULL  
read_gtf table NULL NULL  
read_hts_header table NULL NULL  
read_hts_index table NULL NULL  
read_hts_index_raw table_macro NULL NULL  
read_hts_index_spans table_macro NULL NULL  
read_tabix table NULL NULL  
sam_flag_bits scalar NULL NULL  
sam_flag_has scalar NULL NULL  
seq_canonical scalar NULL NULL  
seq_decode_4bit scalar NULL NULL  
seq_encode_4bit scalar NULL NULL  
seq_gc_content scalar NULL NULL  
seq_hash_2bit scalar NULL NULL  
seq_kmers table NULL NULL  
seq_revcomp scalar NULL NULL  
tabix_index table NULL NULL  

Overloaded Functions

Added Types

Added Settings