Search Shortcut cmd + k | ctrl + k
miint

Bioinformatics extension for microbiome research - read/write FASTQ, SAM/BAM, BIOM, GFF, Newick trees, query NCBI, and more.

Maintainer(s): wasade

Installing and Loading

INSTALL miint FROM community;
LOAD miint;

Example

-- Extract 16S rRNA gene sequences from E. coli K-12 MG1655
-- Fetches genome and annotations directly from NCBI 
WITH ecoli_16s_positions AS (
    SELECT position, stop_position, strand,
           attributes['gene'] AS gene,
           attributes['product'] AS product
    FROM read_ncbi_annotation('NC_000913.3')
    WHERE type = 'rRNA'
      AND attributes['product'] LIKE '%16S%'
    ORDER BY position
)
SELECT concat_ws(':', read_id, position, stop_position, strand) AS read_id,
       sequence1[position:stop_position] AS sequence
FROM read_ncbi_fasta('NC_000913.3'), ecoli_16s_positions;

About miint

MIINT (MIcrobiome INTelligence) brings SQL to bioinformatics data analysis.

Read bioinformatics formats as tables:

  • read_alignments: SAM/BAM alignment files
  • read_fastx: FASTA/FASTQ sequence files (single/paired-end)
  • read_biom: BIOM abundance matrices
  • read_gff: GFF3 annotation files
  • read_newick: Phylogenetic trees
  • read_jplace: Phylogenetic placements

Query NCBI directly:

  • read_ncbi: Fetch sequence metadata by accession
  • read_ncbi_fasta: Fetch sequences from NCBI
  • read_ncbi_annotation: Fetch feature annotations from NCBI

Alignment with SQL:

  • align_minimap2: Align sequences using minimap2
  • align_bowtie2: Align sequences using bowtie2

Write to standard formats:

  • COPY ... FORMAT FASTQ/FASTA: Export sequences
  • COPY ... FORMAT SAM/BAM: Export alignments
  • COPY ... FORMAT BIOM: Export abundance tables
  • COPY ... FORMAT NEWICK: Export phylogenetic trees

Analysis functions:

  • SAM flag checking (alignment_is_primary, alignment_is_unmapped, etc.)
  • Sequence identity calculation (alignment_seq_identity)
  • Interval compression (compress_intervals)
  • Woltka OGU classification (woltka_ogu, woltka_ogu_per_sample)

See the documentation for complete details.

Added Functions

function_name function_type description comment examples
read_alignments table Read SAM/BAM alignment files as a table. Supports headerless SAM files with reference_lengths parameter, glob patterns, and parallel processing. Also available as read_sam (alias) [SELECT * FROM read_alignments('alignments.bam');]
read_fastx table Read FASTA/FASTQ sequence files as a table. Supports single-end and paired-end reads, glob patterns, gzip compression, and quality score offset conversion. NULL [SELECT * FROM read_fastx('reads.fastq.gz');]
read_sequences_sff table Read SFF (Standard Flowgram Format) files from 454/Roche sequencing. Returns read_fastx-compatible schema with optional quality/adapter trimming. NULL [SELECT * FROM read_sequences_sff('reads.sff');]
read_sequences_sam table Read sequences from SAM/BAM alignment files in read_fastx-compatible schema. Extracts sequence and quality data from alignments. NULL [SELECT * FROM read_sequences_sam('alignments.bam');]
read_biom table Read BIOM (Biological Observation Matrix) format files. Returns sparse COO format with sample_id, feature_id, and value columns. HDF5-based BIOM v2.1 format [SELECT * FROM read_biom('ogu_table.biom');]
read_gff table_macro Read GFF3 annotation files. Parses attributes column into a SQL MAP for easy querying. Macro [SELECT type, attributes['gene'] AS gene FROM read_gff('annotations.gff');]
read_ncbi table Fetch GenBank metadata from NCBI by accession number. Returns organism, length, taxonomy_id, and other metadata. Requires httpfs extension [SELECT * FROM read_ncbi('NC_000913.3');]
read_ncbi_fasta table Fetch FASTA sequences from NCBI by accession number. Returns data in read_fastx-compatible schema. Requires httpfs extension [SELECT * FROM read_ncbi_fasta('NC_000913.3');]
read_ncbi_annotation table Fetch feature annotations from NCBI by accession number. Returns data in read_gff-compatible schema. Requires httpfs extension [SELECT * FROM read_ncbi_annotation('NC_000913.3') WHERE type = 'gene';]
read_jplace table_macro Read jplace phylogenetic placement files. Returns best placement per fragment with edge_num, likelihood, and like_weight_ratio. Macro - requires json extension [SELECT fragment, edge_num, like_weight_ratio FROM read_jplace('placements.jplace');]
read_mzml table Read mzML mass spectrometry files. Returns one row per spectrum with metadata, m/z arrays, and intensity arrays. NULL [SELECT * FROM read_mzml('sample.mzML');]
read_mzxml table Read mzXML mass spectrometry files. Returns same schema as read_mzml for cross-format compatibility. NULL [SELECT * FROM read_mzxml('sample.mzXML');]
read_mzml_chromatograms table Read chromatogram data from mzML files. Returns chromatogram metadata with time and intensity arrays. NULL [SELECT * FROM read_mzml_chromatograms('sample.mzML');]
read_newick table Read Newick phylogenetic tree files. Returns one row per node with parent-child relationships and branch lengths. NULL [SELECT name, branch_length FROM read_newick('tree.nwk') WHERE is_tip;]
massql table Execute MassQL queries against mass spectrometry data. Supports conditions (MS1MZ, MS2PROD, MS2PREC, MS2NL), aggregations, and X/Y-variable pattern matching. NULL [SELECT * FROM massql('QUERY scaninfo(MS2DATA) WHERE MS2PROD=167.09', 'sample.mzML');]
align_minimap2 table Align query sequences to subject sequences using minimap2. Supports pre-built indexes for 10-30x performance improvement. NULL [SELECT * FROM align_minimap2('queries', subject_table='subjects');]
align_minimap2_sharded table Align query sequences against multiple pre-built minimap2 index shards in parallel. Reads are routed to specific shards via a mapping table. NULL [SELECT * FROM align_minimap2_sharded('queries', shard_directory='indexes/', read_to_shard='read_to_shard');]
save_minimap2_index table Build and save a minimap2 index to disk for reuse with align_minimap2. NULL [SELECT * FROM save_minimap2_index('subjects', 'refs.mmi', preset='sr');]
align_bowtie2 table Align query sequences to subject sequences using Bowtie2. Optimized for short reads. Requires bowtie2 in PATH [SELECT * FROM align_bowtie2('queries', 'subjects', preset='very-sensitive');]
align_bowtie2_sharded table Align query sequences against multiple pre-built Bowtie2 index shards in parallel. Reads are routed to specific shards via a mapping table. Requires bowtie2 in PATH [SELECT * FROM align_bowtie2_sharded('queries', shard_directory='indexes/', read_to_shard='read_to_shard');]
rype_classify table Classify sequences against a RYpe minimizer index. Returns bucket assignments with confidence scores. Supports paired-end and negative index filtering. NULL [SELECT * FROM rype_classify('index.ryxdi', 'seqs');]
rype_extract_minimizer_set table Extract deduplicated forward and reverse complement minimizer hash sets from sequences. NULL [SELECT * FROM rype_extract_minimizer_set('seqs', 32, 10);]
rype_extract_strand_minimizers table Extract minimizer hashes with positional information for both forward and reverse complement strands. NULL [SELECT * FROM rype_extract_strand_minimizers('seqs', 32, 10);]
rype_log_ratio table Compute log-ratio classification between two RYpe indices (numerator vs denominator). Returns per-read log-ratio scores. NULL [SELECT * FROM rype_log_ratio('numerator.ryxdi', 'denominator.ryxdi', 'seqs');]
alignment_is_paired scalar Test if read is paired (SAM flag 0x1). Also available as is_paired [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_paired(flags);]
alignment_is_proper_pair scalar Test if read is properly paired (SAM flag 0x2). Also available as is_proper_pair [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_proper_pair(flags);]
alignment_is_unmapped scalar Test if read is unmapped (SAM flag 0x4). Also available as is_unmapped [SELECT * FROM read_alignments('a.bam') WHERE NOT alignment_is_unmapped(flags);]
alignment_is_mate_unmapped scalar Test if mate is unmapped (SAM flag 0x8). Also available as is_munmap [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_mate_unmapped(flags);]
alignment_is_reverse scalar Test if read is on reverse strand (SAM flag 0x10). Also available as is_reverse [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_reverse(flags);]
alignment_is_mate_reverse scalar Test if mate is on reverse strand (SAM flag 0x20). Also available as is_mreverse [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_mate_reverse(flags);]
alignment_is_read1 scalar Test if read is first in pair (SAM flag 0x40). Also available as is_read1 [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_read1(flags);]
alignment_is_read2 scalar Test if read is second in pair (SAM flag 0x80). Also available as is_read2 [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_read2(flags);]
alignment_is_secondary scalar Test if alignment is secondary (SAM flag 0x100). Also available as is_secondary [SELECT * FROM read_alignments('a.bam') WHERE NOT alignment_is_secondary(flags);]
alignment_is_primary scalar Test if alignment is primary (neither secondary nor supplementary). NULL [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_primary(flags);]
alignment_is_qc_failed scalar Test if read failed QC (SAM flag 0x200). Also available as is_qcfail [SELECT * FROM read_alignments('a.bam') WHERE NOT alignment_is_qc_failed(flags);]
alignment_is_duplicate scalar Test if read is PCR/optical duplicate (SAM flag 0x400). Also available as is_dup [SELECT * FROM read_alignments('a.bam') WHERE NOT alignment_is_duplicate(flags);]
alignment_is_supplementary scalar Test if alignment is supplementary (SAM flag 0x800). Also available as is_supplementary [SELECT * FROM read_alignments('a.bam') WHERE NOT alignment_is_supplementary(flags);]
alignment_seq_identity scalar Calculate sequence identity between read and reference. Supports gap_compressed (default), gap_excluded, and blast methods. NULL [SELECT alignment_seq_identity(cigar, tag_nm, tag_md) FROM read_alignments('a.bam');]
alignment_query_length scalar Calculate total query length from a CIGAR string. Optionally includes hard-clipped bases. NULL [SELECT alignment_query_length(cigar) FROM read_alignments('a.bam');]
alignment_query_coverage scalar Calculate proportion of query bases covered by alignment. Supports 'aligned' (default) and 'mapped' types. NULL [SELECT alignment_query_coverage(cigar) FROM read_alignments('a.bam');]
align_pairwise_score scalar Compute gap-affine pairwise alignment score using WFA2. Returns integer score (0 = identical). Default penalties: mismatch=4, gap_open=6, gap_extend=2 [SELECT align_pairwise_score('ACGT', 'ACAT');]
align_pairwise_cigar scalar Compute pairwise alignment score and extended CIGAR string using WFA2. Returns STRUCT(score, cigar). Default penalties: mismatch=4, gap_open=6, gap_extend=2 [SELECT (align_pairwise_cigar('ACGT', 'ACAT')).cigar;]
align_pairwise_full scalar Compute pairwise alignment with score, CIGAR, and aligned sequences using WFA2. Returns STRUCT(score, cigar, query_aligned, subject_aligned). Default penalties: mismatch=4, gap_open=6, gap_extend=2 [SELECT align_pairwise_full('ACGT', 'AGT');]
sequence_dna_reverse_complement scalar Calculate reverse complement of DNA sequences. Supports IUPAC ambiguity codes and preserves case. NULL [SELECT sequence_dna_reverse_complement('ATCG');]
sequence_rna_reverse_complement scalar Calculate reverse complement of RNA sequences. Supports IUPAC ambiguity codes and preserves case. NULL [SELECT sequence_rna_reverse_complement('AUCG');]
sequence_dna_as_regexp scalar Convert DNA sequences with IUPAC codes to regular expression patterns for pattern matching. NULL [SELECT sequence_dna_as_regexp('ATNGG');]
sequence_rna_as_regexp scalar Convert RNA sequences with IUPAC codes to regular expression patterns for pattern matching. NULL [SELECT sequence_rna_as_regexp('AUNGG');]
formula scalar Compute monoisotopic mass in Daltons from a chemical formula string. NULL [SELECT formula('C6H12O6');]
massql_to_sql scalar Convert a MassQL query string to its generated SQL for debugging and inspection. NULL [SELECT massql_to_sql('QUERY scaninfo(MS2DATA) WHERE MS2PROD=167.09', 'sample.mzML');]
compress_intervals aggregate Aggregate function that merges overlapping genomic intervals into a minimal set of non-overlapping intervals. NULL [SELECT reference, compress_intervals(position, stop_position) FROM read_alignments('a.bam') GROUP BY reference;]
massdefect macro Compute the fractional mass (mass defect) of an m/z value. Macro [SELECT massdefect(100.0352);]
mz_within macro Check if observed m/z is within absolute tolerance (Da) of target m/z. Macro [SELECT mz_within(200.05, 200.0, 0.1);]
mz_within_ppm macro Check if observed m/z is within relative tolerance (ppm) of target m/z. Macro [SELECT mz_within_ppm(200.05, 200.0, 10);]
mz_massdefect_within macro Check if an m/z value's mass defect falls within a specified range. Macro [SELECT mz_massdefect_within(100.0352, 0.01, 0.05);]
mzml_peaks table_macro Unnest mz_array and intensity_array into per-peak rows with normalized intensities. Macro - do not quote parameters [SELECT * FROM mzml_peaks(spectra);]
mzml_scaninfo table_macro Re-aggregate peak-level data to one row per scan with summary statistics. Macro - do not quote parameters [SELECT * FROM mzml_scaninfo(peaks);]
mzml_scansum table_macro Compute total intensity per spectrum. Macro - do not quote parameters [SELECT * FROM mzml_scansum(peaks);]
mzml_scannum table_macro Return distinct spectrum indices. Macro - do not quote parameters [SELECT * FROM mzml_scannum(peaks);]
mzml_scanmz table_macro Return distinct precursor m/z values. Macro - do not quote parameters [SELECT * FROM mzml_scanmz(peaks);]
mzml_scanmaxint table_macro Return maximum intensity per spectrum. Macro - do not quote parameters [SELECT * FROM mzml_scanmaxint(peaks);]
mzml_ms1_peaks table_macro Filter mzml_peaks output to MS level 1 only. Macro - do not quote parameters [SELECT * FROM mzml_ms1_peaks(spectra);]
mzml_ms2_peaks table_macro Filter mzml_peaks output to MS level 2 only. Macro - do not quote parameters [SELECT * FROM mzml_ms2_peaks(spectra);]
mzml_ms1_parent_peaks table_macro Return MS1 peaks from parent scans of filtered MS2 peaks. Macro - do not quote parameters [SELECT * FROM mzml_ms1_parent_peaks(spectra, ms2_filtered);]
mzml_ms2_child_peaks table_macro Return MS2 peaks from child scans of filtered MS1 peaks. Macro - do not quote parameters [SELECT * FROM mzml_ms2_child_peaks(spectra, ms1_filtered);]
mzml_ms1_where_ms2prod table_macro MS1 peaks from parent scans of MS2 spectra containing a product ion at target m/z. Macro - do not quote parameters [SELECT * FROM mzml_ms1_where_ms2prod(spectra, 167.09);]
mzml_ms2_where_ms1mz table_macro MS2 peaks from child scans of MS1 spectra containing a peak at target m/z. Macro - do not quote parameters [SELECT * FROM mzml_ms2_where_ms1mz(spectra, 349.18);]
mzml_ms1_where_ms2prec table_macro MS1 peaks from parent scans of MS2 spectra with matching precursor m/z. Macro - do not quote parameters [SELECT * FROM mzml_ms1_where_ms2prec(spectra, 349.18);]
mzml_ms2_where_ms2prod_and_ms1mz table_macro MS2 peaks from scans with a product ion and whose parent MS1 has a peak at target m/z. Macro - do not quote parameters [SELECT * FROM mzml_ms2_where_ms2prod_and_ms1mz(spectra, 167.09, 349.18);]
mzml_filter_mz table_macro Filter peaks near a target m/z within tolerance. Macro - do not quote parameters [SELECT * FROM mzml_filter_mz(peaks, 167.09, 0.1);]
mzml_filter_nl table_macro Filter peaks with neutral loss near a target value within tolerance. Macro - do not quote parameters [SELECT * FROM mzml_filter_nl(peaks, 18.01, 0.1);]
mzml_excluded_ms2prod table_macro MS2 peaks from scans that do NOT contain a product ion at target m/z. Macro - do not quote parameters [SELECT * FROM mzml_excluded_ms2prod(spectra, 167.09);]
mzml_excluded_ms1mz table_macro MS1 peaks from scans that do NOT contain a peak at target m/z. Macro - do not quote parameters [SELECT * FROM mzml_excluded_ms1mz(spectra, 349.18);]
mzml_excluded_ms2prec table_macro MS2 peaks from scans whose precursor is NOT near target m/z. Macro - do not quote parameters [SELECT * FROM mzml_excluded_ms2prec(spectra, 349.18);]
mzml_x_offset_pair table_macro Find spectra with peaks at variable X and X+delta (pattern matching). Macro - do not quote parameters [SELECT * FROM mzml_x_offset_pair(spectra, 14.0);]
mzml_x_offset_triplet table_macro Find spectra with peaks at X, X+delta2, and X+delta3. Macro - do not quote parameters [SELECT * FROM mzml_x_offset_triplet(spectra, 14.0, 28.0);]
mzml_x_offset_ntuple table_macro Find spectra with peaks at X plus a list of N offsets. Macro - do not quote parameters [SELECT * FROM mzml_x_offset_ntuple(spectra, [14.0, 28.0, 42.0]);]
mzml_x_offset_pair_range table_macro Offset pair matching with X constrained to a min/max range. Macro - do not quote parameters [SELECT * FROM mzml_x_offset_pair_range(spectra, 14.0, 100.0, 500.0);]
mzml_x_prec_prod table_macro Find MS2 spectra where precursor minus delta matches a product ion. Macro - do not quote parameters [SELECT * FROM mzml_x_prec_prod(spectra, 18.01, 0.1, 5.0);]
mzml_x_prec_massdefect table_macro Find MS2 spectra with precursor mass defect in a specified range. Macro - do not quote parameters [SELECT * FROM mzml_x_prec_massdefect(spectra, 0.01, 0.05);]
mzml_x_ms1_ms2_prec table_macro Find MS2 spectra whose precursor matches an MS1 peak. Macro - do not quote parameters [SELECT * FROM mzml_x_ms1_ms2_prec(spectra);]
mzml_or_cardinality table_macro Match spectra containing N-of-M target m/z values (cardinality constraint). Macro - do not quote parameters [SELECT * FROM mzml_or_cardinality(spectra, [167.09, 181.10, 195.12], 0.1, 2, 3);]
mzml_peak_pair table_macro Find peak pairs at X and 2*(X - formula_mass) for formula-based pattern matching. Macro - do not quote parameters [SELECT * FROM mzml_peak_pair(spectra, 'Fe');]
mzml_isotope_pattern table_macro Match spectra with isotope patterns at specified offsets and intensity ratios. Macro - do not quote parameters [SELECT * FROM mzml_isotope_pattern(spectra, [0.0, 1.003], [1.0, 0.5], [20.0, 20.0], 0.01);]
mzml_i_norm macro Max-normalize an intensity array by base peak intensity. Macro [SELECT mzml_i_norm(intensity_array, base_peak_intensity) FROM read_mzml('s.mzML');]
mzml_i_tic_norm macro TIC-normalize an intensity array by total ion current. Macro [SELECT mzml_i_tic_norm(intensity_array, total_ion_current) FROM read_mzml('s.mzML');]
parse_gff_attributes macro Parse a GFF3 attribute string (semicolon-delimited key=value pairs) into a DuckDB MAP. Macro - helper for read_gff [SELECT parse_gff_attributes('ID=gene1;Name=foo');]
woltka_ogu_per_sample table_macro Compute Woltka OGU counts over alignment data for multiple samples. Accounts for multi-mapped reads. Macro - do not quote parameters [SELECT * FROM woltka_ogu_per_sample(alignments, sample_id, read_id);]
woltka_ogu table_macro Compute Woltka OGU counts over alignment data for a single sample. Accounts for multi-mapped reads. Macro - do not quote parameters [SELECT * FROM woltka_ogu(alignments, read_id);]
genome_coverage table_macro Compute genome coverage from alignment data. Compresses intervals per contig, maps to genomes, and returns proportion covered. Macro - do not quote parameters [SELECT * FROM genome_coverage(alignments, genome_lengths, contig_to_genome);]
bowtie2_available scalar Check if the bowtie2 binary is available in PATH. Returns false in non-bowtie2 builds [SELECT bowtie2_available();]
miint_version scalar Return the MIINT extension version string. NULL [SELECT miint_version();]
miint_versions table Return a table of dependency library versions (htslib, minimap2, WFA2-lib, etc.). NULL [SELECT * FROM miint_versions();]
is_unmapped scalar NULL NULL NULL
is_reverse scalar NULL NULL NULL
is_dup scalar NULL NULL NULL
is_qcfail scalar NULL NULL NULL
is_read1 scalar NULL NULL NULL
is_read2 scalar NULL NULL NULL
is_secondary scalar NULL NULL NULL
is_mreverse scalar NULL NULL NULL
is_munmap scalar NULL NULL NULL
read_sam table NULL NULL NULL
is_supplementary scalar NULL NULL NULL
is_paired scalar NULL NULL NULL
is_proper_pair scalar NULL NULL NULL

Overloaded Functions

| function_name | function_type | description | comment | examples | |—————|—————|————-|———|———-|

Added Types

| type_name | type_size | logical_type | type_category | internal | |———–|———-:|————–|—————|———-|

Added Settings

| name | description | input_type | scope | aliases | |——|————-|————|——-|———|