Search Shortcut cmd + k | ctrl + k
miint

Bioinformatics extension for microbiome research - read/write FASTQ, SAM/BAM, BIOM, GFF, Newick trees, query NCBI, and more.

Maintainer(s): wasade

Installing and Loading

INSTALL miint FROM community;
LOAD miint;

Example

-- Extract 16S rRNA gene sequences from E. coli K-12 MG1655
-- Fetches genome and annotations directly from NCBI 
WITH ecoli_16s_positions AS (
    SELECT position, stop_position, strand,
           attributes['gene'] AS gene,
           attributes['product'] AS product
    FROM read_ncbi_annotation('NC_000913.3')
    WHERE type = 'rRNA'
      AND attributes['product'] LIKE '%16S%'
    ORDER BY position
)
SELECT concat_ws(':', read_id, position, stop_position, strand) AS read_id,
       sequence1[position:stop_position] AS sequence
FROM read_ncbi_fasta('NC_000913.3'), ecoli_16s_positions;

About miint

MIINT (MIcrobiome INTelligence) brings SQL to bioinformatics data analysis.

Read bioinformatics formats as tables:

  • read_alignments: SAM/BAM alignment files
  • read_fastx: FASTA/FASTQ sequence files (single/paired-end)
  • read_biom: BIOM abundance matrices
  • read_gff: GFF3 annotation files
  • read_newick: Phylogenetic trees
  • read_jplace: Phylogenetic placements

Query NCBI directly:

  • read_ncbi: Fetch sequence metadata by accession
  • read_ncbi_fasta: Fetch sequences from NCBI
  • read_ncbi_annotation: Fetch feature annotations from NCBI

Alignment with SQL:

  • align_minimap2: Align sequences using minimap2
  • align_bowtie2: Align sequences using bowtie2

Write to standard formats:

  • COPY ... FORMAT FASTQ/FASTA: Export sequences
  • COPY ... FORMAT SAM/BAM: Export alignments
  • COPY ... FORMAT BIOM: Export abundance tables
  • COPY ... FORMAT NEWICK: Export phylogenetic trees

Analysis functions:

  • SAM flag checking (alignment_is_primary, alignment_is_unmapped, etc.)
  • Sequence identity calculation (alignment_seq_identity)
  • Interval compression (compress_intervals)
  • Woltka OGU classification (woltka_ogu, woltka_ogu_per_sample)

See the documentation for complete details.

Added Functions

function_name function_type description comment examples
read_alignments table Read SAM/BAM alignment files as a table. Supports headerless SAM files with reference_lengths parameter, glob patterns, and parallel processing. Also available as read_sam (alias) [SELECT * FROM read_alignments('alignments.bam');]
read_fastx table Read FASTA/FASTQ sequence files as a table. Supports single-end and paired-end reads, glob patterns, gzip compression, and quality score offset conversion. NULL [SELECT * FROM read_fastx('reads.fastq.gz');]
read_sequences_sff table Read SFF (Standard Flowgram Format) files from 454/Roche sequencing. Returns read_fastx-compatible schema with optional quality/adapter trimming. NULL [SELECT * FROM read_sequences_sff('reads.sff');]
read_sequences_sam table Read sequences from SAM/BAM alignment files in read_fastx-compatible schema. Extracts sequence and quality data from alignments. NULL [SELECT * FROM read_sequences_sam('alignments.bam');]
read_biom table Read BIOM (Biological Observation Matrix) format files. Returns sparse COO format with sample_id, feature_id, and value columns. HDF5-based BIOM v2.1 format [SELECT * FROM read_biom('ogu_table.biom');]
read_gff table_macro Read GFF3 annotation files. Parses attributes column into a SQL MAP for easy querying. Macro [SELECT type, attributes['gene'] AS gene FROM read_gff('annotations.gff');]
read_ncbi table Fetch GenBank metadata from NCBI by accession number. Returns organism, length, taxonomy_id, and other metadata. Requires httpfs extension [SELECT * FROM read_ncbi('NC_000913.3');]
read_ncbi_fasta table Fetch FASTA sequences from NCBI by accession number. Returns data in read_fastx-compatible schema. Requires httpfs extension [SELECT * FROM read_ncbi_fasta('NC_000913.3');]
read_ncbi_annotation table Fetch feature annotations from NCBI by accession number. Returns data in read_gff-compatible schema. Requires httpfs extension [SELECT * FROM read_ncbi_annotation('NC_000913.3') WHERE type = 'gene';]
read_jplace table_macro Read jplace phylogenetic placement files. Returns best placement per fragment with edge_num, likelihood, and like_weight_ratio. Macro - requires json extension [SELECT fragment, edge_num, like_weight_ratio FROM read_jplace('placements.jplace');]
read_mzml table Read mzML mass spectrometry files. Returns one row per spectrum with metadata, m/z arrays, and intensity arrays. NULL [SELECT * FROM read_mzml('sample.mzML');]
read_mzxml table Read mzXML mass spectrometry files. Returns same schema as read_mzml for cross-format compatibility. NULL [SELECT * FROM read_mzxml('sample.mzXML');]
read_mzml_chromatograms table Read chromatogram data from mzML files. Returns chromatogram metadata with time and intensity arrays. NULL [SELECT * FROM read_mzml_chromatograms('sample.mzML');]
read_newick table Read Newick phylogenetic tree files. Returns one row per node with parent-child relationships and branch lengths. NULL [SELECT name, branch_length FROM read_newick('tree.nwk') WHERE is_tip;]
massql table Execute MassQL queries against mass spectrometry data. Supports conditions (MS1MZ, MS2PROD, MS2PREC, MS2NL), aggregations, and X/Y-variable pattern matching. NULL [SELECT * FROM massql('QUERY scaninfo(MS2DATA) WHERE MS2PROD=167.09', 'sample.mzML');]
align_minimap2 table Align query sequences to subject sequences using minimap2. Supports pre-built indexes for 10-30x performance improvement. NULL [SELECT * FROM align_minimap2('queries', subject_table='subjects');]
align_minimap2_sharded table Align query sequences against multiple pre-built minimap2 index shards in parallel. Reads are routed to specific shards via a mapping table. NULL [SELECT * FROM align_minimap2_sharded('queries', shard_directory='indexes/', read_to_shard='read_to_shard');]
save_minimap2_index table Build and save a minimap2 index to disk for reuse with align_minimap2. NULL [SELECT * FROM save_minimap2_index('subjects', 'refs.mmi', preset='sr');]
align_bowtie2 table Align query sequences to subject sequences using Bowtie2. Optimized for short reads. Requires bowtie2 in PATH [SELECT * FROM align_bowtie2('queries', 'subjects', preset='very-sensitive');]
align_bowtie2_sharded table Align query sequences against multiple pre-built Bowtie2 index shards in parallel. Reads are routed to specific shards via a mapping table. Requires bowtie2 in PATH [SELECT * FROM align_bowtie2_sharded('queries', shard_directory='indexes/', read_to_shard='read_to_shard');]
rype_classify table Classify sequences against a RYpe minimizer index. Returns bucket assignments with confidence scores. Supports paired-end and negative index filtering. NULL [SELECT * FROM rype_classify('index.ryxdi', 'seqs');]
rype_extract_minimizer_set table Extract deduplicated forward and reverse complement minimizer hash sets from sequences. NULL [SELECT * FROM rype_extract_minimizer_set('seqs', 32, 10);]
rype_extract_strand_minimizers table Extract minimizer hashes with positional information for both forward and reverse complement strands. NULL [SELECT * FROM rype_extract_strand_minimizers('seqs', 32, 10);]
rype_log_ratio table Compute log-ratio classification between two RYpe indices (numerator vs denominator). Returns per-read log-ratio scores. NULL [SELECT * FROM rype_log_ratio('numerator.ryxdi', 'denominator.ryxdi', 'seqs');]
alignment_is_paired scalar Test if read is paired (SAM flag 0x1). Also available as is_paired [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_paired(flags);]
alignment_is_proper_pair scalar Test if read is properly paired (SAM flag 0x2). Also available as is_proper_pair [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_proper_pair(flags);]
alignment_is_unmapped scalar Test if read is unmapped (SAM flag 0x4). Also available as is_unmapped [SELECT * FROM read_alignments('a.bam') WHERE NOT alignment_is_unmapped(flags);]
alignment_is_mate_unmapped scalar Test if mate is unmapped (SAM flag 0x8). Also available as is_munmap [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_mate_unmapped(flags);]
alignment_is_reverse scalar Test if read is on reverse strand (SAM flag 0x10). Also available as is_reverse [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_reverse(flags);]
alignment_is_mate_reverse scalar Test if mate is on reverse strand (SAM flag 0x20). Also available as is_mreverse [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_mate_reverse(flags);]
alignment_is_read1 scalar Test if read is first in pair (SAM flag 0x40). Also available as is_read1 [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_read1(flags);]
alignment_is_read2 scalar Test if read is second in pair (SAM flag 0x80). Also available as is_read2 [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_read2(flags);]
alignment_is_secondary scalar Test if alignment is secondary (SAM flag 0x100). Also available as is_secondary [SELECT * FROM read_alignments('a.bam') WHERE NOT alignment_is_secondary(flags);]
alignment_is_primary scalar Test if alignment is primary (neither secondary nor supplementary). NULL [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_primary(flags);]
alignment_is_qc_failed scalar Test if read failed QC (SAM flag 0x200). Also available as is_qcfail [SELECT * FROM read_alignments('a.bam') WHERE NOT alignment_is_qc_failed(flags);]
alignment_is_duplicate scalar Test if read is PCR/optical duplicate (SAM flag 0x400). Also available as is_dup [SELECT * FROM read_alignments('a.bam') WHERE NOT alignment_is_duplicate(flags);]
alignment_is_supplementary scalar Test if alignment is supplementary (SAM flag 0x800). Also available as is_supplementary [SELECT * FROM read_alignments('a.bam') WHERE NOT alignment_is_supplementary(flags);]
alignment_seq_identity scalar Calculate sequence identity between read and reference. Supports gap_compressed (default), gap_excluded, and blast methods. NULL [SELECT alignment_seq_identity(cigar, tag_nm, tag_md) FROM read_alignments('a.bam');]
alignment_query_length scalar Calculate total query length from a CIGAR string. Optionally includes hard-clipped bases. NULL [SELECT alignment_query_length(cigar) FROM read_alignments('a.bam');]
alignment_query_coverage scalar Calculate proportion of query bases covered by alignment. Supports 'aligned' (default) and 'mapped' types. NULL [SELECT alignment_query_coverage(cigar) FROM read_alignments('a.bam');]
align_pairwise_score scalar Compute gap-affine pairwise alignment score using WFA2. Returns integer score (0 = identical). Default penalties: mismatch=4, gap_open=6, gap_extend=2 [SELECT align_pairwise_score('ACGT', 'ACAT');]
align_pairwise_cigar scalar Compute pairwise alignment score and extended CIGAR string using WFA2. Returns STRUCT(score, cigar). Default penalties: mismatch=4, gap_open=6, gap_extend=2 [SELECT (align_pairwise_cigar('ACGT', 'ACAT')).cigar;]
align_pairwise_full scalar Compute pairwise alignment with score, CIGAR, and aligned sequences using WFA2. Returns STRUCT(score, cigar, query_aligned, subject_aligned). Default penalties: mismatch=4, gap_open=6, gap_extend=2 [SELECT align_pairwise_full('ACGT', 'AGT');]
sequence_dna_reverse_complement scalar Calculate reverse complement of DNA sequences. Supports IUPAC ambiguity codes and preserves case. NULL [SELECT sequence_dna_reverse_complement('ATCG');]
sequence_rna_reverse_complement scalar Calculate reverse complement of RNA sequences. Supports IUPAC ambiguity codes and preserves case. NULL [SELECT sequence_rna_reverse_complement('AUCG');]
sequence_dna_as_regexp scalar Convert DNA sequences with IUPAC codes to regular expression patterns for pattern matching. NULL [SELECT sequence_dna_as_regexp('ATNGG');]
sequence_rna_as_regexp scalar Convert RNA sequences with IUPAC codes to regular expression patterns for pattern matching. NULL [SELECT sequence_rna_as_regexp('AUNGG');]
formula scalar Compute monoisotopic mass in Daltons from a chemical formula string. NULL [SELECT formula('C6H12O6');]
massql_to_sql scalar Convert a MassQL query string to its generated SQL for debugging and inspection. NULL [SELECT massql_to_sql('QUERY scaninfo(MS2DATA) WHERE MS2PROD=167.09', 'sample.mzML');]
compress_intervals aggregate Aggregate function that merges overlapping genomic intervals into a minimal set of non-overlapping intervals. NULL [SELECT reference, compress_intervals(position, stop_position) FROM read_alignments('a.bam') GROUP BY reference;]
massdefect macro Compute the fractional mass (mass defect) of an m/z value. Macro [SELECT massdefect(100.0352);]
mz_within macro Check if observed m/z is within absolute tolerance (Da) of target m/z. Macro [SELECT mz_within(200.05, 200.0, 0.1);]
mz_within_ppm macro Check if observed m/z is within relative tolerance (ppm) of target m/z. Macro [SELECT mz_within_ppm(200.05, 200.0, 10);]
mz_massdefect_within macro Check if an m/z value's mass defect falls within a specified range. Macro [SELECT mz_massdefect_within(100.0352, 0.01, 0.05);]
mzml_peaks table_macro Unnest mz_array and intensity_array into per-peak rows with normalized intensities. Macro - do not quote parameters [SELECT * FROM mzml_peaks(spectra);]
mzml_scaninfo table_macro Re-aggregate peak-level data to one row per scan with summary statistics. Macro - do not quote parameters [SELECT * FROM mzml_scaninfo(peaks);]
mzml_scansum table_macro Compute total intensity per spectrum. Macro - do not quote parameters [SELECT * FROM mzml_scansum(peaks);]
mzml_scannum table_macro Return distinct spectrum indices. Macro - do not quote parameters [SELECT * FROM mzml_scannum(peaks);]
mzml_scanmz table_macro Return distinct precursor m/z values. Macro - do not quote parameters [SELECT * FROM mzml_scanmz(peaks);]
mzml_scanmaxint table_macro Return maximum intensity per spectrum. Macro - do not quote parameters [SELECT * FROM mzml_scanmaxint(peaks);]
mzml_ms1_peaks table_macro Filter mzml_peaks output to MS level 1 only. Macro - do not quote parameters [SELECT * FROM mzml_ms1_peaks(spectra);]
mzml_ms2_peaks table_macro Filter mzml_peaks output to MS level 2 only. Macro - do not quote parameters [SELECT * FROM mzml_ms2_peaks(spectra);]
mzml_ms1_parent_peaks table_macro Return MS1 peaks from parent scans of filtered MS2 peaks. Macro - do not quote parameters [SELECT * FROM mzml_ms1_parent_peaks(spectra, ms2_filtered);]
mzml_ms2_child_peaks table_macro Return MS2 peaks from child scans of filtered MS1 peaks. Macro - do not quote parameters [SELECT * FROM mzml_ms2_child_peaks(spectra, ms1_filtered);]
mzml_ms1_where_ms2prod table_macro MS1 peaks from parent scans of MS2 spectra containing a product ion at target m/z. Macro - do not quote parameters [SELECT * FROM mzml_ms1_where_ms2prod(spectra, 167.09);]
mzml_ms2_where_ms1mz table_macro MS2 peaks from child scans of MS1 spectra containing a peak at target m/z. Macro - do not quote parameters [SELECT * FROM mzml_ms2_where_ms1mz(spectra, 349.18);]
mzml_ms1_where_ms2prec table_macro MS1 peaks from parent scans of MS2 spectra with matching precursor m/z. Macro - do not quote parameters [SELECT * FROM mzml_ms1_where_ms2prec(spectra, 349.18);]
mzml_ms2_where_ms2prod_and_ms1mz table_macro MS2 peaks from scans with a product ion and whose parent MS1 has a peak at target m/z. Macro - do not quote parameters [SELECT * FROM mzml_ms2_where_ms2prod_and_ms1mz(spectra, 167.09, 349.18);]
mzml_filter_mz table_macro Filter peaks near a target m/z within tolerance. Macro - do not quote parameters [SELECT * FROM mzml_filter_mz(peaks, 167.09, 0.1);]
mzml_filter_nl table_macro Filter peaks with neutral loss near a target value within tolerance. Macro - do not quote parameters [SELECT * FROM mzml_filter_nl(peaks, 18.01, 0.1);]
mzml_excluded_ms2prod table_macro MS2 peaks from scans that do NOT contain a product ion at target m/z. Macro - do not quote parameters [SELECT * FROM mzml_excluded_ms2prod(spectra, 167.09);]
mzml_excluded_ms1mz table_macro MS1 peaks from scans that do NOT contain a peak at target m/z. Macro - do not quote parameters [SELECT * FROM mzml_excluded_ms1mz(spectra, 349.18);]
mzml_excluded_ms2prec table_macro MS2 peaks from scans whose precursor is NOT near target m/z. Macro - do not quote parameters [SELECT * FROM mzml_excluded_ms2prec(spectra, 349.18);]
mzml_x_offset_pair table_macro Find spectra with peaks at variable X and X+delta (pattern matching). Macro - do not quote parameters [SELECT * FROM mzml_x_offset_pair(spectra, 14.0);]
mzml_x_offset_triplet table_macro Find spectra with peaks at X, X+delta2, and X+delta3. Macro - do not quote parameters [SELECT * FROM mzml_x_offset_triplet(spectra, 14.0, 28.0);]
mzml_x_offset_ntuple table_macro Find spectra with peaks at X plus a list of N offsets. Macro - do not quote parameters [SELECT * FROM mzml_x_offset_ntuple(spectra, [14.0, 28.0, 42.0]);]
mzml_x_offset_pair_range table_macro Offset pair matching with X constrained to a min/max range. Macro - do not quote parameters [SELECT * FROM mzml_x_offset_pair_range(spectra, 14.0, 100.0, 500.0);]
mzml_x_prec_prod table_macro Find MS2 spectra where precursor minus delta matches a product ion. Macro - do not quote parameters [SELECT * FROM mzml_x_prec_prod(spectra, 18.01, 0.1, 5.0);]
mzml_x_prec_massdefect table_macro Find MS2 spectra with precursor mass defect in a specified range. Macro - do not quote parameters [SELECT * FROM mzml_x_prec_massdefect(spectra, 0.01, 0.05);]
mzml_x_ms1_ms2_prec table_macro Find MS2 spectra whose precursor matches an MS1 peak. Macro - do not quote parameters [SELECT * FROM mzml_x_ms1_ms2_prec(spectra);]
mzml_or_cardinality table_macro Match spectra containing N-of-M target m/z values (cardinality constraint). Macro - do not quote parameters [SELECT * FROM mzml_or_cardinality(spectra, [167.09, 181.10, 195.12], 0.1, 2, 3);]
mzml_peak_pair table_macro Find peak pairs at X and 2*(X - formula_mass) for formula-based pattern matching. Macro - do not quote parameters [SELECT * FROM mzml_peak_pair(spectra, 'Fe');]
mzml_isotope_pattern table_macro Match spectra with isotope patterns at specified offsets and intensity ratios. Macro - do not quote parameters [SELECT * FROM mzml_isotope_pattern(spectra, [0.0, 1.003], [1.0, 0.5], [20.0, 20.0], 0.01);]
mzml_i_norm macro Max-normalize an intensity array by base peak intensity. Macro [SELECT mzml_i_norm(intensity_array, base_peak_intensity) FROM read_mzml('s.mzML');]
mzml_i_tic_norm macro TIC-normalize an intensity array by total ion current. Macro [SELECT mzml_i_tic_norm(intensity_array, total_ion_current) FROM read_mzml('s.mzML');]
parse_gff_attributes macro Parse a GFF3 attribute string (semicolon-delimited key=value pairs) into a DuckDB MAP. Macro - helper for read_gff [SELECT parse_gff_attributes('ID=gene1;Name=foo');]
woltka_ogu_per_sample table_macro Compute Woltka OGU counts over alignment data for multiple samples. Accounts for multi-mapped reads. Macro - do not quote parameters [SELECT * FROM woltka_ogu_per_sample(alignments, sample_id, read_id);]
woltka_ogu table_macro Compute Woltka OGU counts over alignment data for a single sample. Accounts for multi-mapped reads. Macro - do not quote parameters [SELECT * FROM woltka_ogu(alignments, read_id);]
genome_coverage table_macro Compute genome coverage from alignment data. Compresses intervals per contig, maps to genomes, and returns proportion covered. Macro - do not quote parameters [SELECT * FROM genome_coverage(alignments, genome_lengths, contig_to_genome);]
bowtie2_available scalar Check if the bowtie2 binary is available in PATH. Returns false in non-bowtie2 builds [SELECT bowtie2_available();]
miint_version scalar Return the MIINT extension version string. NULL [SELECT miint_version();]
miint_versions table Return a table of dependency library versions (htslib, minimap2, WFA2-lib, etc.). NULL [SELECT * FROM miint_versions();]
is_unmapped scalar NULL NULL NULL
is_reverse scalar NULL NULL NULL
is_dup scalar NULL NULL NULL
is_qcfail scalar NULL NULL NULL
is_read1 scalar NULL NULL NULL
is_read2 scalar NULL NULL NULL
is_secondary scalar NULL NULL NULL
is_mreverse scalar NULL NULL NULL
is_munmap scalar NULL NULL NULL
read_sam table NULL NULL NULL
is_supplementary scalar NULL NULL NULL
is_paired scalar NULL NULL NULL
is_proper_pair scalar NULL NULL NULL

Overloaded Functions

This extension does not add any function overloads.

Added Types

This extension does not add any types.

Added Settings

This extension does not add any settings.