miint

Search Shortcut cmd + k | ctrl + k

Documentation

miint

Downloads 934this week

GitHub stars 5

Extension repository on GitHub

Extension descriptor (YAML)

Bioinformatics extension for microbiome research - read/write FASTQ, SAM/BAM, BIOM, GFF, Newick trees, query NCBI, and more.

Maintainer(s): wasade

Installing and Loading

INSTALL miint FROM community;
LOAD miint;

Example

-- Extract 16S rRNA gene sequences from E. coli K-12 MG1655
-- Fetches genome and annotations directly from NCBI 
WITH ecoli_16s_positions AS (
    SELECT position, stop_position, strand,
           attributes['gene'] AS gene,
           attributes['product'] AS product
    FROM read_ncbi_annotation('NC_000913.3')
    WHERE type = 'rRNA'
      AND attributes['product'] LIKE '%16S%'
    ORDER BY position
)
SELECT concat_ws(':', read_id, position, stop_position, strand) AS read_id,
       sequence1[position:stop_position] AS sequence
FROM read_ncbi_fasta('NC_000913.3'), ecoli_16s_positions;

About miint

MIINT (MIcrobiome INTelligence) brings SQL to bioinformatics data analysis.

Read bioinformatics formats as tables:

read_alignments: SAM/BAM alignment files
read_fastx: FASTA/FASTQ sequence files (single/paired-end)
read_biom: BIOM abundance matrices
read_gff: GFF3 annotation files
read_newick: Phylogenetic trees
read_jplace: Phylogenetic placements

Query NCBI directly:

read_ncbi: Fetch sequence metadata by accession
read_ncbi_fasta: Fetch sequences from NCBI
read_ncbi_annotation: Fetch feature annotations from NCBI

Alignment with SQL:

align_minimap2: Align sequences using minimap2
align_bowtie2: Align sequences using bowtie2

Write to standard formats:

COPY ... FORMAT FASTQ/FASTA: Export sequences
COPY ... FORMAT SAM/BAM: Export alignments
COPY ... FORMAT BIOM: Export abundance tables
COPY ... FORMAT NEWICK: Export phylogenetic trees

Analysis functions:

SAM flag checking (alignment_is_primary, alignment_is_unmapped, etc.)
Sequence identity calculation (alignment_seq_identity)
Interval compression (compress_intervals)
Woltka OGU classification (woltka_ogu, woltka_ogu_per_sample)

See the documentation for complete details.

Added Functions

function_name	function_type	description	comment	examples
read_alignments	table	Read SAM/BAM alignment files as a table. Supports headerless SAM files with reference_lengths parameter, glob patterns, and parallel processing.	Also available as read_sam (alias)	[SELECT * FROM read_alignments('alignments.bam');]
read_fastx	table	Read FASTA/FASTQ sequence files as a table. Supports single-end and paired-end reads, glob patterns, gzip compression, and quality score offset conversion.	NULL	[SELECT * FROM read_fastx('reads.fastq.gz');]
read_sequences_sff	table	Read SFF (Standard Flowgram Format) files from 454/Roche sequencing. Returns read_fastx-compatible schema with optional quality/adapter trimming.	NULL	[SELECT * FROM read_sequences_sff('reads.sff');]
read_sequences_sam	table	Read sequences from SAM/BAM alignment files in read_fastx-compatible schema. Extracts sequence and quality data from alignments.	NULL	[SELECT * FROM read_sequences_sam('alignments.bam');]
read_biom	table	Read BIOM (Biological Observation Matrix) format files. Returns sparse COO format with sample_id, feature_id, and value columns.	HDF5-based BIOM v2.1 format	[SELECT * FROM read_biom('ogu_table.biom');]
read_gff	table_macro	Read GFF3 annotation files. Parses attributes column into a SQL MAP for easy querying.	Macro	[SELECT type, attributes['gene'] AS gene FROM read_gff('annotations.gff');]
read_ncbi	table	Fetch GenBank metadata from NCBI by accession number. Returns organism, length, taxonomy_id, and other metadata.	Requires httpfs extension	[SELECT * FROM read_ncbi('NC_000913.3');]
read_ncbi_fasta	table	Fetch FASTA sequences from NCBI by accession number. Returns data in read_fastx-compatible schema.	Requires httpfs extension	[SELECT * FROM read_ncbi_fasta('NC_000913.3');]
read_ncbi_annotation	table	Fetch feature annotations from NCBI by accession number. Returns data in read_gff-compatible schema.	Requires httpfs extension	[SELECT * FROM read_ncbi_annotation('NC_000913.3') WHERE type = 'gene';]
read_jplace	table_macro	Read jplace phylogenetic placement files. Returns best placement per fragment with edge_num, likelihood, and like_weight_ratio.	Macro - requires json extension	[SELECT fragment, edge_num, like_weight_ratio FROM read_jplace('placements.jplace');]
read_mzml	table	Read mzML mass spectrometry files. Returns one row per spectrum with metadata, m/z arrays, and intensity arrays.	NULL	[SELECT * FROM read_mzml('sample.mzML');]
read_mzxml	table	Read mzXML mass spectrometry files. Returns same schema as read_mzml for cross-format compatibility.	NULL	[SELECT * FROM read_mzxml('sample.mzXML');]
read_mzml_chromatograms	table	Read chromatogram data from mzML files. Returns chromatogram metadata with time and intensity arrays.	NULL	[SELECT * FROM read_mzml_chromatograms('sample.mzML');]
read_newick	table	Read Newick phylogenetic tree files. Returns one row per node with parent-child relationships and branch lengths.	NULL	[SELECT name, branch_length FROM read_newick('tree.nwk') WHERE is_tip;]
massql	table	Execute MassQL queries against mass spectrometry data. Supports conditions (MS1MZ, MS2PROD, MS2PREC, MS2NL), aggregations, and X/Y-variable pattern matching.	NULL	[SELECT * FROM massql('QUERY scaninfo(MS2DATA) WHERE MS2PROD=167.09', 'sample.mzML');]
align_minimap2	table	Align query sequences to subject sequences using minimap2. Supports pre-built indexes for 10-30x performance improvement.	NULL	[SELECT * FROM align_minimap2('queries', subject_table='subjects');]
align_minimap2_sharded	table	Align query sequences against multiple pre-built minimap2 index shards in parallel. Reads are routed to specific shards via a mapping table.	NULL	[SELECT * FROM align_minimap2_sharded('queries', shard_directory='indexes/', read_to_shard='read_to_shard');]
save_minimap2_index	table	Build and save a minimap2 index to disk for reuse with align_minimap2.	NULL	[SELECT * FROM save_minimap2_index('subjects', 'refs.mmi', preset='sr');]
align_bowtie2	table	Align query sequences to subject sequences using Bowtie2. Optimized for short reads.	Requires bowtie2 in PATH	[SELECT * FROM align_bowtie2('queries', 'subjects', preset='very-sensitive');]
align_bowtie2_sharded	table	Align query sequences against multiple pre-built Bowtie2 index shards in parallel. Reads are routed to specific shards via a mapping table.	Requires bowtie2 in PATH	[SELECT * FROM align_bowtie2_sharded('queries', shard_directory='indexes/', read_to_shard='read_to_shard');]
rype_classify	table	Classify sequences against a RYpe minimizer index. Returns bucket assignments with confidence scores. Supports paired-end and negative index filtering.	NULL	[SELECT * FROM rype_classify('index.ryxdi', 'seqs');]
rype_extract_minimizer_set	table	Extract deduplicated forward and reverse complement minimizer hash sets from sequences.	NULL	[SELECT * FROM rype_extract_minimizer_set('seqs', 32, 10);]
rype_extract_strand_minimizers	table	Extract minimizer hashes with positional information for both forward and reverse complement strands.	NULL	[SELECT * FROM rype_extract_strand_minimizers('seqs', 32, 10);]
rype_log_ratio	table	Compute log-ratio classification between two RYpe indices (numerator vs denominator). Returns per-read log-ratio scores.	NULL	[SELECT * FROM rype_log_ratio('numerator.ryxdi', 'denominator.ryxdi', 'seqs');]
alignment_is_paired	scalar	Test if read is paired (SAM flag 0x1).	Also available as is_paired	[SELECT * FROM read_alignments('a.bam') WHERE alignment_is_paired(flags);]
alignment_is_proper_pair	scalar	Test if read is properly paired (SAM flag 0x2).	Also available as is_proper_pair	[SELECT * FROM read_alignments('a.bam') WHERE alignment_is_proper_pair(flags);]
alignment_is_unmapped	scalar	Test if read is unmapped (SAM flag 0x4).	Also available as is_unmapped	[SELECT * FROM read_alignments('a.bam') WHERE NOT alignment_is_unmapped(flags);]
alignment_is_mate_unmapped	scalar	Test if mate is unmapped (SAM flag 0x8).	Also available as is_munmap	[SELECT * FROM read_alignments('a.bam') WHERE alignment_is_mate_unmapped(flags);]
alignment_is_reverse	scalar	Test if read is on reverse strand (SAM flag 0x10).	Also available as is_reverse	[SELECT * FROM read_alignments('a.bam') WHERE alignment_is_reverse(flags);]
alignment_is_mate_reverse	scalar	Test if mate is on reverse strand (SAM flag 0x20).	Also available as is_mreverse	[SELECT * FROM read_alignments('a.bam') WHERE alignment_is_mate_reverse(flags);]
alignment_is_read1	scalar	Test if read is first in pair (SAM flag 0x40).	Also available as is_read1	[SELECT * FROM read_alignments('a.bam') WHERE alignment_is_read1(flags);]
alignment_is_read2	scalar	Test if read is second in pair (SAM flag 0x80).	Also available as is_read2	[SELECT * FROM read_alignments('a.bam') WHERE alignment_is_read2(flags);]
alignment_is_secondary	scalar	Test if alignment is secondary (SAM flag 0x100).	Also available as is_secondary	[SELECT * FROM read_alignments('a.bam') WHERE NOT alignment_is_secondary(flags);]
alignment_is_primary	scalar	Test if alignment is primary (neither secondary nor supplementary).	NULL	[SELECT * FROM read_alignments('a.bam') WHERE alignment_is_primary(flags);]
alignment_is_qc_failed	scalar	Test if read failed QC (SAM flag 0x200).	Also available as is_qcfail	[SELECT * FROM read_alignments('a.bam') WHERE NOT alignment_is_qc_failed(flags);]
alignment_is_duplicate	scalar	Test if read is PCR/optical duplicate (SAM flag 0x400).	Also available as is_dup	[SELECT * FROM read_alignments('a.bam') WHERE NOT alignment_is_duplicate(flags);]
alignment_is_supplementary	scalar	Test if alignment is supplementary (SAM flag 0x800).	Also available as is_supplementary	[SELECT * FROM read_alignments('a.bam') WHERE NOT alignment_is_supplementary(flags);]
alignment_seq_identity	scalar	Calculate sequence identity between read and reference. Supports gap_compressed (default), gap_excluded, and blast methods.	NULL	[SELECT alignment_seq_identity(cigar, tag_nm, tag_md) FROM read_alignments('a.bam');]
alignment_query_length	scalar	Calculate total query length from a CIGAR string. Optionally includes hard-clipped bases.	NULL	[SELECT alignment_query_length(cigar) FROM read_alignments('a.bam');]
alignment_query_coverage	scalar	Calculate proportion of query bases covered by alignment. Supports 'aligned' (default) and 'mapped' types.	NULL	[SELECT alignment_query_coverage(cigar) FROM read_alignments('a.bam');]
align_pairwise_score	scalar	Compute gap-affine pairwise alignment score using WFA2. Returns integer score (0 = identical).	Default penalties: mismatch=4, gap_open=6, gap_extend=2	[SELECT align_pairwise_score('ACGT', 'ACAT');]
align_pairwise_cigar	scalar	Compute pairwise alignment score and extended CIGAR string using WFA2. Returns STRUCT(score, cigar).	Default penalties: mismatch=4, gap_open=6, gap_extend=2	[SELECT (align_pairwise_cigar('ACGT', 'ACAT')).cigar;]
align_pairwise_full	scalar	Compute pairwise alignment with score, CIGAR, and aligned sequences using WFA2. Returns STRUCT(score, cigar, query_aligned, subject_aligned).	Default penalties: mismatch=4, gap_open=6, gap_extend=2	[SELECT align_pairwise_full('ACGT', 'AGT');]
sequence_dna_reverse_complement	scalar	Calculate reverse complement of DNA sequences. Supports IUPAC ambiguity codes and preserves case.	NULL	[SELECT sequence_dna_reverse_complement('ATCG');]
sequence_rna_reverse_complement	scalar	Calculate reverse complement of RNA sequences. Supports IUPAC ambiguity codes and preserves case.	NULL	[SELECT sequence_rna_reverse_complement('AUCG');]
sequence_dna_as_regexp	scalar	Convert DNA sequences with IUPAC codes to regular expression patterns for pattern matching.	NULL	[SELECT sequence_dna_as_regexp('ATNGG');]
sequence_rna_as_regexp	scalar	Convert RNA sequences with IUPAC codes to regular expression patterns for pattern matching.	NULL	[SELECT sequence_rna_as_regexp('AUNGG');]
formula	scalar	Compute monoisotopic mass in Daltons from a chemical formula string.	NULL	[SELECT formula('C6H12O6');]
massql_to_sql	scalar	Convert a MassQL query string to its generated SQL for debugging and inspection.	NULL	[SELECT massql_to_sql('QUERY scaninfo(MS2DATA) WHERE MS2PROD=167.09', 'sample.mzML');]
compress_intervals	aggregate	Aggregate function that merges overlapping genomic intervals into a minimal set of non-overlapping intervals.	NULL	[SELECT reference, compress_intervals(position, stop_position) FROM read_alignments('a.bam') GROUP BY reference;]
massdefect	macro	Compute the fractional mass (mass defect) of an m/z value.	Macro	[SELECT massdefect(100.0352);]
mz_within	macro	Check if observed m/z is within absolute tolerance (Da) of target m/z.	Macro	[SELECT mz_within(200.05, 200.0, 0.1);]
mz_within_ppm	macro	Check if observed m/z is within relative tolerance (ppm) of target m/z.	Macro	[SELECT mz_within_ppm(200.05, 200.0, 10);]
mz_massdefect_within	macro	Check if an m/z value's mass defect falls within a specified range.	Macro	[SELECT mz_massdefect_within(100.0352, 0.01, 0.05);]
mzml_peaks	table_macro	Unnest mz_array and intensity_array into per-peak rows with normalized intensities.	Macro - do not quote parameters	[SELECT * FROM mzml_peaks(spectra);]
mzml_scaninfo	table_macro	Re-aggregate peak-level data to one row per scan with summary statistics.	Macro - do not quote parameters	[SELECT * FROM mzml_scaninfo(peaks);]
mzml_scansum	table_macro	Compute total intensity per spectrum.	Macro - do not quote parameters	[SELECT * FROM mzml_scansum(peaks);]
mzml_scannum	table_macro	Return distinct spectrum indices.	Macro - do not quote parameters	[SELECT * FROM mzml_scannum(peaks);]
mzml_scanmz	table_macro	Return distinct precursor m/z values.	Macro - do not quote parameters	[SELECT * FROM mzml_scanmz(peaks);]
mzml_scanmaxint	table_macro	Return maximum intensity per spectrum.	Macro - do not quote parameters	[SELECT * FROM mzml_scanmaxint(peaks);]
mzml_ms1_peaks	table_macro	Filter mzml_peaks output to MS level 1 only.	Macro - do not quote parameters	[SELECT * FROM mzml_ms1_peaks(spectra);]
mzml_ms2_peaks	table_macro	Filter mzml_peaks output to MS level 2 only.	Macro - do not quote parameters	[SELECT * FROM mzml_ms2_peaks(spectra);]
mzml_ms1_parent_peaks	table_macro	Return MS1 peaks from parent scans of filtered MS2 peaks.	Macro - do not quote parameters	[SELECT * FROM mzml_ms1_parent_peaks(spectra, ms2_filtered);]
mzml_ms2_child_peaks	table_macro	Return MS2 peaks from child scans of filtered MS1 peaks.	Macro - do not quote parameters	[SELECT * FROM mzml_ms2_child_peaks(spectra, ms1_filtered);]
mzml_ms1_where_ms2prod	table_macro	MS1 peaks from parent scans of MS2 spectra containing a product ion at target m/z.	Macro - do not quote parameters	[SELECT * FROM mzml_ms1_where_ms2prod(spectra, 167.09);]
mzml_ms2_where_ms1mz	table_macro	MS2 peaks from child scans of MS1 spectra containing a peak at target m/z.	Macro - do not quote parameters	[SELECT * FROM mzml_ms2_where_ms1mz(spectra, 349.18);]
mzml_ms1_where_ms2prec	table_macro	MS1 peaks from parent scans of MS2 spectra with matching precursor m/z.	Macro - do not quote parameters	[SELECT * FROM mzml_ms1_where_ms2prec(spectra, 349.18);]
mzml_ms2_where_ms2prod_and_ms1mz	table_macro	MS2 peaks from scans with a product ion and whose parent MS1 has a peak at target m/z.	Macro - do not quote parameters	[SELECT * FROM mzml_ms2_where_ms2prod_and_ms1mz(spectra, 167.09, 349.18);]
mzml_filter_mz	table_macro	Filter peaks near a target m/z within tolerance.	Macro - do not quote parameters	[SELECT * FROM mzml_filter_mz(peaks, 167.09, 0.1);]
mzml_filter_nl	table_macro	Filter peaks with neutral loss near a target value within tolerance.	Macro - do not quote parameters	[SELECT * FROM mzml_filter_nl(peaks, 18.01, 0.1);]
mzml_excluded_ms2prod	table_macro	MS2 peaks from scans that do NOT contain a product ion at target m/z.	Macro - do not quote parameters	[SELECT * FROM mzml_excluded_ms2prod(spectra, 167.09);]
mzml_excluded_ms1mz	table_macro	MS1 peaks from scans that do NOT contain a peak at target m/z.	Macro - do not quote parameters	[SELECT * FROM mzml_excluded_ms1mz(spectra, 349.18);]
mzml_excluded_ms2prec	table_macro	MS2 peaks from scans whose precursor is NOT near target m/z.	Macro - do not quote parameters	[SELECT * FROM mzml_excluded_ms2prec(spectra, 349.18);]
mzml_x_offset_pair	table_macro	Find spectra with peaks at variable X and X+delta (pattern matching).	Macro - do not quote parameters	[SELECT * FROM mzml_x_offset_pair(spectra, 14.0);]
mzml_x_offset_triplet	table_macro	Find spectra with peaks at X, X+delta2, and X+delta3.	Macro - do not quote parameters	[SELECT * FROM mzml_x_offset_triplet(spectra, 14.0, 28.0);]
mzml_x_offset_ntuple	table_macro	Find spectra with peaks at X plus a list of N offsets.	Macro - do not quote parameters	[SELECT * FROM mzml_x_offset_ntuple(spectra, [14.0, 28.0, 42.0]);]
mzml_x_offset_pair_range	table_macro	Offset pair matching with X constrained to a min/max range.	Macro - do not quote parameters	[SELECT * FROM mzml_x_offset_pair_range(spectra, 14.0, 100.0, 500.0);]
mzml_x_prec_prod	table_macro	Find MS2 spectra where precursor minus delta matches a product ion.	Macro - do not quote parameters	[SELECT * FROM mzml_x_prec_prod(spectra, 18.01, 0.1, 5.0);]
mzml_x_prec_massdefect	table_macro	Find MS2 spectra with precursor mass defect in a specified range.	Macro - do not quote parameters	[SELECT * FROM mzml_x_prec_massdefect(spectra, 0.01, 0.05);]
mzml_x_ms1_ms2_prec	table_macro	Find MS2 spectra whose precursor matches an MS1 peak.	Macro - do not quote parameters	[SELECT * FROM mzml_x_ms1_ms2_prec(spectra);]
mzml_or_cardinality	table_macro	Match spectra containing N-of-M target m/z values (cardinality constraint).	Macro - do not quote parameters	[SELECT * FROM mzml_or_cardinality(spectra, [167.09, 181.10, 195.12], 0.1, 2, 3);]
mzml_peak_pair	table	Find peak pairs at X and 2*(X - formula_mass) for formula-based pattern matching.	Macro - do not quote parameters	[SELECT * FROM mzml_peak_pair(spectra, 'Fe');]
mzml_isotope_pattern	table_macro	Match spectra with isotope patterns at specified offsets and intensity ratios.	Macro - do not quote parameters	[SELECT * FROM mzml_isotope_pattern(spectra, [0.0, 1.003], [1.0, 0.5], [20.0, 20.0], 0.01);]
mzml_i_norm	macro	Max-normalize an intensity array by base peak intensity.	Macro	[SELECT mzml_i_norm(intensity_array, base_peak_intensity) FROM read_mzml('s.mzML');]
mzml_i_tic_norm	macro	TIC-normalize an intensity array by total ion current.	Macro	[SELECT mzml_i_tic_norm(intensity_array, total_ion_current) FROM read_mzml('s.mzML');]
parse_gff_attributes	macro	Parse a GFF3 attribute string (semicolon-delimited key=value pairs) into a DuckDB MAP.	Macro - helper for read_gff	[SELECT parse_gff_attributes('ID=gene1;Name=foo');]
woltka_ogu_per_sample	table_macro	Compute Woltka OGU counts over alignment data for multiple samples. Accounts for multi-mapped reads.	Macro - do not quote parameters	[SELECT * FROM woltka_ogu_per_sample(alignments, sample_id, read_id);]
woltka_ogu	table_macro	Compute Woltka OGU counts over alignment data for a single sample. Accounts for multi-mapped reads.	Macro - do not quote parameters	[SELECT * FROM woltka_ogu(alignments, read_id);]
genome_coverage	table_macro	Compute genome coverage from alignment data. Compresses intervals per contig, maps to genomes, and returns proportion covered.	Macro - do not quote parameters	[SELECT * FROM genome_coverage(alignments, genome_lengths, contig_to_genome);]
bowtie2_available	scalar	Check if the bowtie2 binary is available in PATH.	Returns false in non-bowtie2 builds	[SELECT bowtie2_available();]
miint_version	scalar	Return the MIINT extension version string.	NULL	[SELECT miint_version();]
miint_versions	table	Return a table of dependency library versions (htslib, minimap2, WFA2-lib, etc.).	NULL	[SELECT * FROM miint_versions();]
mask_dust	scalar	NULL	NULL	NULL
detect_chimera_uchime_denovo	table	NULL	NULL	NULL
alignment_slice	table	NULL	NULL	NULL
compute_coverage_depth	aggregate	NULL	NULL	NULL
is_unmapped	scalar	NULL	NULL	NULL
read_ena_sequences	table	NULL	NULL	NULL
is_reverse	scalar	NULL	NULL	NULL
is_dup	scalar	NULL	NULL	NULL
search_sequences_vsearch	table	NULL	NULL	NULL
is_qcfail	scalar	NULL	NULL	NULL
is_read1	scalar	NULL	NULL	NULL
is_read2	scalar	NULL	NULL	NULL
is_secondary	scalar	NULL	NULL	NULL
read_ena_attributes	table	NULL	NULL	NULL
read_jplace_newick	table	NULL	NULL	NULL
merge_pairs_vsearch	scalar	NULL	NULL	NULL
cluster_sequences_vsearch	table	NULL	NULL	NULL
detect_chimera_uchime	table	NULL	NULL	NULL
is_mreverse	scalar	NULL	NULL	NULL
is_munmap	scalar	NULL	NULL	NULL
read_sam	table	NULL	NULL	NULL
tree_resolve_placement	table	NULL	NULL	NULL
align_mafft	table	NULL	NULL	NULL
is_supplementary	scalar	NULL	NULL	NULL
read_ena	table	NULL	NULL	NULL
deblur	table	NULL	NULL	NULL
is_paired	scalar	NULL	NULL	NULL
is_proper_pair	scalar	NULL	NULL	NULL

Overloaded Functions

This extension does not add any function overloads.

Added Types

This extension does not add any types.

Added Settings

This extension does not add any settings.