Bioinformatics extension for microbiome research - read/write FASTQ, SAM/BAM, BIOM, GFF, Newick trees, query NCBI, and more.
Maintainer(s):
wasade
Installing and Loading
INSTALL miint FROM community;
LOAD miint;
Example
-- Extract 16S rRNA gene sequences from E. coli K-12 MG1655
-- Fetches genome and annotations directly from NCBI
WITH ecoli_16s_positions AS (
SELECT position, stop_position, strand,
attributes['gene'] AS gene,
attributes['product'] AS product
FROM read_ncbi_annotation('NC_000913.3')
WHERE type = 'rRNA'
AND attributes['product'] LIKE '%16S%'
ORDER BY position
)
SELECT concat_ws(':', read_id, position, stop_position, strand) AS read_id,
sequence1[position:stop_position] AS sequence
FROM read_ncbi_fasta('NC_000913.3'), ecoli_16s_positions;
About miint
MIINT (MIcrobiome INTelligence) brings SQL to bioinformatics data analysis.
Read bioinformatics formats as tables:
read_alignments: SAM/BAM alignment filesread_fastx: FASTA/FASTQ sequence files (single/paired-end)read_biom: BIOM abundance matricesread_gff: GFF3 annotation filesread_newick: Phylogenetic treesread_jplace: Phylogenetic placements
Query NCBI directly:
read_ncbi: Fetch sequence metadata by accessionread_ncbi_fasta: Fetch sequences from NCBIread_ncbi_annotation: Fetch feature annotations from NCBI
Alignment with SQL:
align_minimap2: Align sequences using minimap2align_bowtie2: Align sequences using bowtie2
Write to standard formats:
COPY ... FORMAT FASTQ/FASTA: Export sequencesCOPY ... FORMAT SAM/BAM: Export alignmentsCOPY ... FORMAT BIOM: Export abundance tablesCOPY ... FORMAT NEWICK: Export phylogenetic trees
Analysis functions:
- SAM flag checking (
alignment_is_primary,alignment_is_unmapped, etc.) - Sequence identity calculation (
alignment_seq_identity) - Interval compression (
compress_intervals) - Woltka OGU classification (
woltka_ogu,woltka_ogu_per_sample)
See the documentation for complete details.
Added Functions
| function_name | function_type | description | comment | examples |
|---|---|---|---|---|
| read_alignments | table | Read SAM/BAM alignment files as a table. Supports headerless SAM files with reference_lengths parameter, glob patterns, and parallel processing. | Also available as read_sam (alias) | [SELECT * FROM read_alignments('alignments.bam');] |
| read_fastx | table | Read FASTA/FASTQ sequence files as a table. Supports single-end and paired-end reads, glob patterns, gzip compression, and quality score offset conversion. | NULL | [SELECT * FROM read_fastx('reads.fastq.gz');] |
| read_sequences_sff | table | Read SFF (Standard Flowgram Format) files from 454/Roche sequencing. Returns read_fastx-compatible schema with optional quality/adapter trimming. | NULL | [SELECT * FROM read_sequences_sff('reads.sff');] |
| read_sequences_sam | table | Read sequences from SAM/BAM alignment files in read_fastx-compatible schema. Extracts sequence and quality data from alignments. | NULL | [SELECT * FROM read_sequences_sam('alignments.bam');] |
| read_biom | table | Read BIOM (Biological Observation Matrix) format files. Returns sparse COO format with sample_id, feature_id, and value columns. | HDF5-based BIOM v2.1 format | [SELECT * FROM read_biom('ogu_table.biom');] |
| read_gff | table_macro | Read GFF3 annotation files. Parses attributes column into a SQL MAP for easy querying. | Macro | [SELECT type, attributes['gene'] AS gene FROM read_gff('annotations.gff');] |
| read_ncbi | table | Fetch GenBank metadata from NCBI by accession number. Returns organism, length, taxonomy_id, and other metadata. | Requires httpfs extension | [SELECT * FROM read_ncbi('NC_000913.3');] |
| read_ncbi_fasta | table | Fetch FASTA sequences from NCBI by accession number. Returns data in read_fastx-compatible schema. | Requires httpfs extension | [SELECT * FROM read_ncbi_fasta('NC_000913.3');] |
| read_ncbi_annotation | table | Fetch feature annotations from NCBI by accession number. Returns data in read_gff-compatible schema. | Requires httpfs extension | [SELECT * FROM read_ncbi_annotation('NC_000913.3') WHERE type = 'gene';] |
| read_jplace | table_macro | Read jplace phylogenetic placement files. Returns best placement per fragment with edge_num, likelihood, and like_weight_ratio. | Macro - requires json extension | [SELECT fragment, edge_num, like_weight_ratio FROM read_jplace('placements.jplace');] |
| read_mzml | table | Read mzML mass spectrometry files. Returns one row per spectrum with metadata, m/z arrays, and intensity arrays. | NULL | [SELECT * FROM read_mzml('sample.mzML');] |
| read_mzxml | table | Read mzXML mass spectrometry files. Returns same schema as read_mzml for cross-format compatibility. | NULL | [SELECT * FROM read_mzxml('sample.mzXML');] |
| read_mzml_chromatograms | table | Read chromatogram data from mzML files. Returns chromatogram metadata with time and intensity arrays. | NULL | [SELECT * FROM read_mzml_chromatograms('sample.mzML');] |
| read_newick | table | Read Newick phylogenetic tree files. Returns one row per node with parent-child relationships and branch lengths. | NULL | [SELECT name, branch_length FROM read_newick('tree.nwk') WHERE is_tip;] |
| massql | table | Execute MassQL queries against mass spectrometry data. Supports conditions (MS1MZ, MS2PROD, MS2PREC, MS2NL), aggregations, and X/Y-variable pattern matching. | NULL | [SELECT * FROM massql('QUERY scaninfo(MS2DATA) WHERE MS2PROD=167.09', 'sample.mzML');] |
| align_minimap2 | table | Align query sequences to subject sequences using minimap2. Supports pre-built indexes for 10-30x performance improvement. | NULL | [SELECT * FROM align_minimap2('queries', subject_table='subjects');] |
| align_minimap2_sharded | table | Align query sequences against multiple pre-built minimap2 index shards in parallel. Reads are routed to specific shards via a mapping table. | NULL | [SELECT * FROM align_minimap2_sharded('queries', shard_directory='indexes/', read_to_shard='read_to_shard');] |
| save_minimap2_index | table | Build and save a minimap2 index to disk for reuse with align_minimap2. | NULL | [SELECT * FROM save_minimap2_index('subjects', 'refs.mmi', preset='sr');] |
| align_bowtie2 | table | Align query sequences to subject sequences using Bowtie2. Optimized for short reads. | Requires bowtie2 in PATH | [SELECT * FROM align_bowtie2('queries', 'subjects', preset='very-sensitive');] |
| align_bowtie2_sharded | table | Align query sequences against multiple pre-built Bowtie2 index shards in parallel. Reads are routed to specific shards via a mapping table. | Requires bowtie2 in PATH | [SELECT * FROM align_bowtie2_sharded('queries', shard_directory='indexes/', read_to_shard='read_to_shard');] |
| rype_classify | table | Classify sequences against a RYpe minimizer index. Returns bucket assignments with confidence scores. Supports paired-end and negative index filtering. | NULL | [SELECT * FROM rype_classify('index.ryxdi', 'seqs');] |
| rype_extract_minimizer_set | table | Extract deduplicated forward and reverse complement minimizer hash sets from sequences. | NULL | [SELECT * FROM rype_extract_minimizer_set('seqs', 32, 10);] |
| rype_extract_strand_minimizers | table | Extract minimizer hashes with positional information for both forward and reverse complement strands. | NULL | [SELECT * FROM rype_extract_strand_minimizers('seqs', 32, 10);] |
| rype_log_ratio | table | Compute log-ratio classification between two RYpe indices (numerator vs denominator). Returns per-read log-ratio scores. | NULL | [SELECT * FROM rype_log_ratio('numerator.ryxdi', 'denominator.ryxdi', 'seqs');] |
| alignment_is_paired | scalar | Test if read is paired (SAM flag 0x1). | Also available as is_paired | [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_paired(flags);] |
| alignment_is_proper_pair | scalar | Test if read is properly paired (SAM flag 0x2). | Also available as is_proper_pair | [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_proper_pair(flags);] |
| alignment_is_unmapped | scalar | Test if read is unmapped (SAM flag 0x4). | Also available as is_unmapped | [SELECT * FROM read_alignments('a.bam') WHERE NOT alignment_is_unmapped(flags);] |
| alignment_is_mate_unmapped | scalar | Test if mate is unmapped (SAM flag 0x8). | Also available as is_munmap | [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_mate_unmapped(flags);] |
| alignment_is_reverse | scalar | Test if read is on reverse strand (SAM flag 0x10). | Also available as is_reverse | [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_reverse(flags);] |
| alignment_is_mate_reverse | scalar | Test if mate is on reverse strand (SAM flag 0x20). | Also available as is_mreverse | [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_mate_reverse(flags);] |
| alignment_is_read1 | scalar | Test if read is first in pair (SAM flag 0x40). | Also available as is_read1 | [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_read1(flags);] |
| alignment_is_read2 | scalar | Test if read is second in pair (SAM flag 0x80). | Also available as is_read2 | [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_read2(flags);] |
| alignment_is_secondary | scalar | Test if alignment is secondary (SAM flag 0x100). | Also available as is_secondary | [SELECT * FROM read_alignments('a.bam') WHERE NOT alignment_is_secondary(flags);] |
| alignment_is_primary | scalar | Test if alignment is primary (neither secondary nor supplementary). | NULL | [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_primary(flags);] |
| alignment_is_qc_failed | scalar | Test if read failed QC (SAM flag 0x200). | Also available as is_qcfail | [SELECT * FROM read_alignments('a.bam') WHERE NOT alignment_is_qc_failed(flags);] |
| alignment_is_duplicate | scalar | Test if read is PCR/optical duplicate (SAM flag 0x400). | Also available as is_dup | [SELECT * FROM read_alignments('a.bam') WHERE NOT alignment_is_duplicate(flags);] |
| alignment_is_supplementary | scalar | Test if alignment is supplementary (SAM flag 0x800). | Also available as is_supplementary | [SELECT * FROM read_alignments('a.bam') WHERE NOT alignment_is_supplementary(flags);] |
| alignment_seq_identity | scalar | Calculate sequence identity between read and reference. Supports gap_compressed (default), gap_excluded, and blast methods. | NULL | [SELECT alignment_seq_identity(cigar, tag_nm, tag_md) FROM read_alignments('a.bam');] |
| alignment_query_length | scalar | Calculate total query length from a CIGAR string. Optionally includes hard-clipped bases. | NULL | [SELECT alignment_query_length(cigar) FROM read_alignments('a.bam');] |
| alignment_query_coverage | scalar | Calculate proportion of query bases covered by alignment. Supports 'aligned' (default) and 'mapped' types. | NULL | [SELECT alignment_query_coverage(cigar) FROM read_alignments('a.bam');] |
| align_pairwise_score | scalar | Compute gap-affine pairwise alignment score using WFA2. Returns integer score (0 = identical). | Default penalties: mismatch=4, gap_open=6, gap_extend=2 | [SELECT align_pairwise_score('ACGT', 'ACAT');] |
| align_pairwise_cigar | scalar | Compute pairwise alignment score and extended CIGAR string using WFA2. Returns STRUCT(score, cigar). | Default penalties: mismatch=4, gap_open=6, gap_extend=2 | [SELECT (align_pairwise_cigar('ACGT', 'ACAT')).cigar;] |
| align_pairwise_full | scalar | Compute pairwise alignment with score, CIGAR, and aligned sequences using WFA2. Returns STRUCT(score, cigar, query_aligned, subject_aligned). | Default penalties: mismatch=4, gap_open=6, gap_extend=2 | [SELECT align_pairwise_full('ACGT', 'AGT');] |
| sequence_dna_reverse_complement | scalar | Calculate reverse complement of DNA sequences. Supports IUPAC ambiguity codes and preserves case. | NULL | [SELECT sequence_dna_reverse_complement('ATCG');] |
| sequence_rna_reverse_complement | scalar | Calculate reverse complement of RNA sequences. Supports IUPAC ambiguity codes and preserves case. | NULL | [SELECT sequence_rna_reverse_complement('AUCG');] |
| sequence_dna_as_regexp | scalar | Convert DNA sequences with IUPAC codes to regular expression patterns for pattern matching. | NULL | [SELECT sequence_dna_as_regexp('ATNGG');] |
| sequence_rna_as_regexp | scalar | Convert RNA sequences with IUPAC codes to regular expression patterns for pattern matching. | NULL | [SELECT sequence_rna_as_regexp('AUNGG');] |
| formula | scalar | Compute monoisotopic mass in Daltons from a chemical formula string. | NULL | [SELECT formula('C6H12O6');] |
| massql_to_sql | scalar | Convert a MassQL query string to its generated SQL for debugging and inspection. | NULL | [SELECT massql_to_sql('QUERY scaninfo(MS2DATA) WHERE MS2PROD=167.09', 'sample.mzML');] |
| compress_intervals | aggregate | Aggregate function that merges overlapping genomic intervals into a minimal set of non-overlapping intervals. | NULL | [SELECT reference, compress_intervals(position, stop_position) FROM read_alignments('a.bam') GROUP BY reference;] |
| massdefect | macro | Compute the fractional mass (mass defect) of an m/z value. | Macro | [SELECT massdefect(100.0352);] |
| mz_within | macro | Check if observed m/z is within absolute tolerance (Da) of target m/z. | Macro | [SELECT mz_within(200.05, 200.0, 0.1);] |
| mz_within_ppm | macro | Check if observed m/z is within relative tolerance (ppm) of target m/z. | Macro | [SELECT mz_within_ppm(200.05, 200.0, 10);] |
| mz_massdefect_within | macro | Check if an m/z value's mass defect falls within a specified range. | Macro | [SELECT mz_massdefect_within(100.0352, 0.01, 0.05);] |
| mzml_peaks | table_macro | Unnest mz_array and intensity_array into per-peak rows with normalized intensities. | Macro - do not quote parameters | [SELECT * FROM mzml_peaks(spectra);] |
| mzml_scaninfo | table_macro | Re-aggregate peak-level data to one row per scan with summary statistics. | Macro - do not quote parameters | [SELECT * FROM mzml_scaninfo(peaks);] |
| mzml_scansum | table_macro | Compute total intensity per spectrum. | Macro - do not quote parameters | [SELECT * FROM mzml_scansum(peaks);] |
| mzml_scannum | table_macro | Return distinct spectrum indices. | Macro - do not quote parameters | [SELECT * FROM mzml_scannum(peaks);] |
| mzml_scanmz | table_macro | Return distinct precursor m/z values. | Macro - do not quote parameters | [SELECT * FROM mzml_scanmz(peaks);] |
| mzml_scanmaxint | table_macro | Return maximum intensity per spectrum. | Macro - do not quote parameters | [SELECT * FROM mzml_scanmaxint(peaks);] |
| mzml_ms1_peaks | table_macro | Filter mzml_peaks output to MS level 1 only. | Macro - do not quote parameters | [SELECT * FROM mzml_ms1_peaks(spectra);] |
| mzml_ms2_peaks | table_macro | Filter mzml_peaks output to MS level 2 only. | Macro - do not quote parameters | [SELECT * FROM mzml_ms2_peaks(spectra);] |
| mzml_ms1_parent_peaks | table_macro | Return MS1 peaks from parent scans of filtered MS2 peaks. | Macro - do not quote parameters | [SELECT * FROM mzml_ms1_parent_peaks(spectra, ms2_filtered);] |
| mzml_ms2_child_peaks | table_macro | Return MS2 peaks from child scans of filtered MS1 peaks. | Macro - do not quote parameters | [SELECT * FROM mzml_ms2_child_peaks(spectra, ms1_filtered);] |
| mzml_ms1_where_ms2prod | table_macro | MS1 peaks from parent scans of MS2 spectra containing a product ion at target m/z. | Macro - do not quote parameters | [SELECT * FROM mzml_ms1_where_ms2prod(spectra, 167.09);] |
| mzml_ms2_where_ms1mz | table_macro | MS2 peaks from child scans of MS1 spectra containing a peak at target m/z. | Macro - do not quote parameters | [SELECT * FROM mzml_ms2_where_ms1mz(spectra, 349.18);] |
| mzml_ms1_where_ms2prec | table_macro | MS1 peaks from parent scans of MS2 spectra with matching precursor m/z. | Macro - do not quote parameters | [SELECT * FROM mzml_ms1_where_ms2prec(spectra, 349.18);] |
| mzml_ms2_where_ms2prod_and_ms1mz | table_macro | MS2 peaks from scans with a product ion and whose parent MS1 has a peak at target m/z. | Macro - do not quote parameters | [SELECT * FROM mzml_ms2_where_ms2prod_and_ms1mz(spectra, 167.09, 349.18);] |
| mzml_filter_mz | table_macro | Filter peaks near a target m/z within tolerance. | Macro - do not quote parameters | [SELECT * FROM mzml_filter_mz(peaks, 167.09, 0.1);] |
| mzml_filter_nl | table_macro | Filter peaks with neutral loss near a target value within tolerance. | Macro - do not quote parameters | [SELECT * FROM mzml_filter_nl(peaks, 18.01, 0.1);] |
| mzml_excluded_ms2prod | table_macro | MS2 peaks from scans that do NOT contain a product ion at target m/z. | Macro - do not quote parameters | [SELECT * FROM mzml_excluded_ms2prod(spectra, 167.09);] |
| mzml_excluded_ms1mz | table_macro | MS1 peaks from scans that do NOT contain a peak at target m/z. | Macro - do not quote parameters | [SELECT * FROM mzml_excluded_ms1mz(spectra, 349.18);] |
| mzml_excluded_ms2prec | table_macro | MS2 peaks from scans whose precursor is NOT near target m/z. | Macro - do not quote parameters | [SELECT * FROM mzml_excluded_ms2prec(spectra, 349.18);] |
| mzml_x_offset_pair | table_macro | Find spectra with peaks at variable X and X+delta (pattern matching). | Macro - do not quote parameters | [SELECT * FROM mzml_x_offset_pair(spectra, 14.0);] |
| mzml_x_offset_triplet | table_macro | Find spectra with peaks at X, X+delta2, and X+delta3. | Macro - do not quote parameters | [SELECT * FROM mzml_x_offset_triplet(spectra, 14.0, 28.0);] |
| mzml_x_offset_ntuple | table_macro | Find spectra with peaks at X plus a list of N offsets. | Macro - do not quote parameters | [SELECT * FROM mzml_x_offset_ntuple(spectra, [14.0, 28.0, 42.0]);] |
| mzml_x_offset_pair_range | table_macro | Offset pair matching with X constrained to a min/max range. | Macro - do not quote parameters | [SELECT * FROM mzml_x_offset_pair_range(spectra, 14.0, 100.0, 500.0);] |
| mzml_x_prec_prod | table_macro | Find MS2 spectra where precursor minus delta matches a product ion. | Macro - do not quote parameters | [SELECT * FROM mzml_x_prec_prod(spectra, 18.01, 0.1, 5.0);] |
| mzml_x_prec_massdefect | table_macro | Find MS2 spectra with precursor mass defect in a specified range. | Macro - do not quote parameters | [SELECT * FROM mzml_x_prec_massdefect(spectra, 0.01, 0.05);] |
| mzml_x_ms1_ms2_prec | table_macro | Find MS2 spectra whose precursor matches an MS1 peak. | Macro - do not quote parameters | [SELECT * FROM mzml_x_ms1_ms2_prec(spectra);] |
| mzml_or_cardinality | table_macro | Match spectra containing N-of-M target m/z values (cardinality constraint). | Macro - do not quote parameters | [SELECT * FROM mzml_or_cardinality(spectra, [167.09, 181.10, 195.12], 0.1, 2, 3);] |
| mzml_peak_pair | table_macro | Find peak pairs at X and 2*(X - formula_mass) for formula-based pattern matching. | Macro - do not quote parameters | [SELECT * FROM mzml_peak_pair(spectra, 'Fe');] |
| mzml_isotope_pattern | table_macro | Match spectra with isotope patterns at specified offsets and intensity ratios. | Macro - do not quote parameters | [SELECT * FROM mzml_isotope_pattern(spectra, [0.0, 1.003], [1.0, 0.5], [20.0, 20.0], 0.01);] |
| mzml_i_norm | macro | Max-normalize an intensity array by base peak intensity. | Macro | [SELECT mzml_i_norm(intensity_array, base_peak_intensity) FROM read_mzml('s.mzML');] |
| mzml_i_tic_norm | macro | TIC-normalize an intensity array by total ion current. | Macro | [SELECT mzml_i_tic_norm(intensity_array, total_ion_current) FROM read_mzml('s.mzML');] |
| parse_gff_attributes | macro | Parse a GFF3 attribute string (semicolon-delimited key=value pairs) into a DuckDB MAP. | Macro - helper for read_gff | [SELECT parse_gff_attributes('ID=gene1;Name=foo');] |
| woltka_ogu_per_sample | table_macro | Compute Woltka OGU counts over alignment data for multiple samples. Accounts for multi-mapped reads. | Macro - do not quote parameters | [SELECT * FROM woltka_ogu_per_sample(alignments, sample_id, read_id);] |
| woltka_ogu | table_macro | Compute Woltka OGU counts over alignment data for a single sample. Accounts for multi-mapped reads. | Macro - do not quote parameters | [SELECT * FROM woltka_ogu(alignments, read_id);] |
| genome_coverage | table_macro | Compute genome coverage from alignment data. Compresses intervals per contig, maps to genomes, and returns proportion covered. | Macro - do not quote parameters | [SELECT * FROM genome_coverage(alignments, genome_lengths, contig_to_genome);] |
| bowtie2_available | scalar | Check if the bowtie2 binary is available in PATH. | Returns false in non-bowtie2 builds | [SELECT bowtie2_available();] |
| miint_version | scalar | Return the MIINT extension version string. | NULL | [SELECT miint_version();] |
| miint_versions | table | Return a table of dependency library versions (htslib, minimap2, WFA2-lib, etc.). | NULL | [SELECT * FROM miint_versions();] |
| is_unmapped | scalar | NULL | NULL | NULL |
| is_reverse | scalar | NULL | NULL | NULL |
| is_dup | scalar | NULL | NULL | NULL |
| is_qcfail | scalar | NULL | NULL | NULL |
| is_read1 | scalar | NULL | NULL | NULL |
| is_read2 | scalar | NULL | NULL | NULL |
| is_secondary | scalar | NULL | NULL | NULL |
| is_mreverse | scalar | NULL | NULL | NULL |
| is_munmap | scalar | NULL | NULL | NULL |
| read_sam | table | NULL | NULL | NULL |
| is_supplementary | scalar | NULL | NULL | NULL |
| is_paired | scalar | NULL | NULL | NULL |
| is_proper_pair | scalar | NULL | NULL | NULL |
Overloaded Functions
This extension does not add any function overloads.
Added Types
This extension does not add any types.
Added Settings
This extension does not add any settings.