Search Shortcut cmd + k | ctrl + k
miint

Bioinformatics extension for microbiome research - read/write FASTQ, SAM/BAM, BIOM, GFF, Newick trees, query NCBI, and more.

Maintainer(s): wasade

Installing and Loading

INSTALL miint FROM community;
LOAD miint;

Example

-- Extract 16S rRNA gene sequences from E. coli K-12 MG1655
-- Fetches genome and annotations directly from NCBI
WITH ecoli_16s_positions AS (
    SELECT position, stop_position, strand,
           attributes['gene'] AS gene,
           attributes['product'] AS product
    FROM read_ncbi_annotation('NC_000913.3')
    WHERE type = 'rRNA'
      AND attributes['product'] LIKE '%16S%'
    ORDER BY position
)
SELECT concat_ws(':', read_id, position, stop_position, strand) AS read_id,
       sequence1[position:stop_position] AS sequence
FROM read_ncbi_fasta('NC_000913.3'), ecoli_16s_positions;

About miint

MIINT (MIcrobiome INTelligence) brings SQL to bioinformatics data analysis.

Read bioinformatics formats as tables:

  • read_alignments: SAM/BAM alignment files
  • read_fastx: FASTA/FASTQ sequence files (single/paired-end)
  • read_biom: BIOM abundance matrices
  • read_gff: GFF3 annotation files
  • read_newick: Phylogenetic trees
  • read_jplace: Phylogenetic placements

Query NCBI directly:

  • read_ncbi: Fetch sequence metadata by accession
  • read_ncbi_fasta: Fetch sequences from NCBI
  • read_ncbi_annotation: Fetch feature annotations from NCBI

Alignment with SQL:

  • align_minimap2: Align sequences using minimap2
  • align_bowtie2: Align sequences using bowtie2

Write to standard formats:

  • COPY ... FORMAT FASTQ/FASTA: Export sequences
  • COPY ... FORMAT SAM/BAM: Export alignments
  • COPY ... FORMAT BIOM: Export abundance tables
  • COPY ... FORMAT NEWICK: Export phylogenetic trees

Analysis functions:

  • SAM flag checking (alignment_is_primary, alignment_is_unmapped, etc.)
  • Sequence identity calculation (alignment_seq_identity)
  • Interval compression (compress_intervals)
  • Woltka OGU classification (woltka_ogu, woltka_ogu_per_sample)

See the documentation for complete details.

Added Functions

function_name function_type description comment examples
read_alignments table Read SAM/BAM alignment files as a table. Supports headerless SAM files with reference_lengths parameter, glob patterns, and parallel processing. Also available as read_sam (alias) [SELECT * FROM read_alignments('alignments.bam');]
read_fastx table Read FASTA/FASTQ sequence files as a table. Supports single-end and paired-end reads, glob patterns, gzip compression, and quality score offset conversion. NULL [SELECT * FROM read_fastx('reads.fastq.gz');]
read_biom table Read BIOM (Biological Observation Matrix) format files. Returns sparse COO format with sample_id, feature_id, and value columns. HDF5-based BIOM v2.1 format [SELECT * FROM read_biom('ogu_table.biom');]
read_gff table_macro Read GFF3 annotation files. Parses attributes column into a SQL MAP for easy querying. NULL [SELECT type, attributes['gene'] AS gene FROM read_gff('annotations.gff');]
read_ncbi table Fetch GenBank metadata from NCBI by accession number. Returns organism, length, taxonomy_id, and other metadata. Requires httpfs extension [SELECT * FROM read_ncbi('NC_000913.3');]
read_ncbi_fasta table Fetch FASTA sequences from NCBI by accession number. Returns data in read_fastx-compatible schema. Requires httpfs extension [SELECT * FROM read_ncbi_fasta('NC_000913.3');]
read_ncbi_annotation table Fetch feature annotations from NCBI by accession number. Returns data in read_gff-compatible schema. Requires httpfs extension [SELECT * FROM read_ncbi_annotation('NC_000913.3') WHERE type = 'gene';]
read_jplace table_macro Read jplace phylogenetic placement files. Returns best placement per fragment with edge_num, likelihood, and like_weight_ratio. NULL [SELECT fragment, edge_num, like_weight_ratio FROM read_jplace('placements.jplace');]
read_newick table Read Newick phylogenetic tree files. Returns one row per node with parent-child relationships and branch lengths. NULL [SELECT name, branch_length FROM read_newick('tree.nwk') WHERE is_tip;]
align_minimap2 table Align query sequences to subject sequences using minimap2. Supports pre-built indexes for 10-30x performance improvement. NULL [SELECT * FROM align_minimap2('queries', subject_table='subjects');]
save_minimap2_index table Build and save a minimap2 index to disk for reuse with align_minimap2. NULL [SELECT * FROM save_minimap2_index('subjects', 'refs.mmi', preset='sr');]
align_bowtie2 table Align query sequences to subject sequences using Bowtie2. Optimized for short reads. Requires bowtie2 in PATH [SELECT * FROM align_bowtie2('queries', 'subjects', preset='very-sensitive');]
alignment_is_paired scalar Test if read is paired (SAM flag 0x1). NULL [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_paired(flags);]
alignment_is_proper_pair scalar Test if read is properly paired (SAM flag 0x2). NULL [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_proper_pair(flags);]
alignment_is_unmapped scalar Test if read is unmapped (SAM flag 0x4). NULL [SELECT * FROM read_alignments('a.bam') WHERE NOT alignment_is_unmapped(flags);]
alignment_is_mate_unmapped scalar Test if mate is unmapped (SAM flag 0x8). NULL [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_mate_unmapped(flags);]
alignment_is_reverse scalar Test if read is on reverse strand (SAM flag 0x10). NULL [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_reverse(flags);]
alignment_is_mate_reverse scalar Test if mate is on reverse strand (SAM flag 0x20). NULL [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_mate_reverse(flags);]
alignment_is_read1 scalar Test if read is first in pair (SAM flag 0x40). NULL [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_read1(flags);]
alignment_is_read2 scalar Test if read is second in pair (SAM flag 0x80). NULL [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_read2(flags);]
alignment_is_secondary scalar Test if alignment is secondary (SAM flag 0x100). NULL [SELECT * FROM read_alignments('a.bam') WHERE NOT alignment_is_secondary(flags);]
alignment_is_primary scalar Test if alignment is primary (neither secondary nor supplementary). NULL [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_primary(flags);]
alignment_is_qc_failed scalar Test if read failed QC (SAM flag 0x200). NULL [SELECT * FROM read_alignments('a.bam') WHERE NOT alignment_is_qc_failed(flags);]
alignment_is_duplicate scalar Test if read is PCR/optical duplicate (SAM flag 0x400). NULL [SELECT * FROM read_alignments('a.bam') WHERE NOT alignment_is_duplicate(flags);]
alignment_is_supplementary scalar Test if alignment is supplementary (SAM flag 0x800). NULL [SELECT * FROM read_alignments('a.bam') WHERE NOT alignment_is_supplementary(flags);]
alignment_seq_identity scalar Calculate sequence identity between read and reference. Supports gap_compressed (default), gap_excluded, and blast methods. NULL [SELECT alignment_seq_identity(cigar, tag_nm, tag_md) FROM read_alignments('a.bam');]
alignment_query_length scalar Calculate total query length from a CIGAR string. Optionally includes hard-clipped bases. NULL [SELECT alignment_query_length(cigar) FROM read_alignments('a.bam');]
alignment_query_coverage scalar Calculate proportion of query bases covered by alignment. Supports 'aligned' (default) and 'mapped' types. NULL [SELECT alignment_query_coverage(cigar) FROM read_alignments('a.bam');]
sequence_dna_reverse_complement scalar Calculate reverse complement of DNA sequences. Supports IUPAC ambiguity codes and preserves case. NULL [SELECT sequence_dna_reverse_complement('ATCG');]
sequence_rna_reverse_complement scalar Calculate reverse complement of RNA sequences. Supports IUPAC ambiguity codes and preserves case. NULL [SELECT sequence_rna_reverse_complement('AUCG');]
sequence_dna_as_regexp scalar Convert DNA sequences with IUPAC codes to regular expression patterns for pattern matching. NULL [SELECT sequence_dna_as_regexp('ATNGG');]
sequence_rna_as_regexp scalar Convert RNA sequences with IUPAC codes to regular expression patterns for pattern matching. NULL [SELECT sequence_rna_as_regexp('AUNGG');]
compress_intervals aggregate Aggregate function that merges overlapping genomic intervals into a minimal set of non-overlapping intervals. NULL [SELECT reference, compress_intervals(position, stop_position) FROM read_alignments('a.bam') GROUP BY reference;]
woltka_ogu_per_sample table_macro Compute Woltka OGU counts over alignment data for multiple samples. Accounts for multi-mapped reads. Macro - do not quote parameters [SELECT * FROM woltka_ogu_per_sample(alignments, sample_id, read_id);]
woltka_ogu table_macro Compute Woltka OGU counts over alignment data for a single sample. Accounts for multi-mapped reads. Macro - do not quote parameters [SELECT * FROM woltka_ogu(alignments, read_id);]
align_minimap2_sharded table NULL NULL NULL
is_unmapped scalar NULL NULL NULL
bowtie2_available scalar NULL NULL NULL
is_reverse scalar NULL NULL NULL
is_dup scalar NULL NULL NULL
parse_gff_attributes macro NULL NULL NULL
align_bowtie2_sharded table NULL NULL NULL
is_qcfail scalar NULL NULL NULL
is_read1 scalar NULL NULL NULL
is_read2 scalar NULL NULL NULL
read_sequences_sff table NULL NULL NULL
is_secondary scalar NULL NULL NULL
read_sequences_sam table NULL NULL NULL
is_mreverse scalar NULL NULL NULL
is_munmap scalar NULL NULL NULL
read_sam table NULL NULL NULL
is_supplementary scalar NULL NULL NULL
is_paired scalar NULL NULL NULL
is_proper_pair scalar NULL NULL NULL

Added Settings

name description input_type scope aliases
auto_fallback_to_full_download Allows automatically falling back to full file downloads when possible. BOOLEAN GLOBAL []
ca_cert_file Path to a custom certificate file for self-signed certificates. VARCHAR GLOBAL []
enable_curl_server_cert_verification Enable server side certificate verification for CURL backend. BOOLEAN GLOBAL []
enable_global_s3_configuration Automatically fetch AWS credentials from environment variables. BOOLEAN GLOBAL []
enable_server_cert_verification Enable server side certificate verification. BOOLEAN GLOBAL []
force_download Forces upfront download of file BOOLEAN GLOBAL []
hf_max_per_page Debug option to limit number of items returned in list requests UBIGINT GLOBAL []
http_keep_alive Keep alive connections. Setting this to false can help when running into connection failures BOOLEAN GLOBAL []
http_retries HTTP retries on I/O error UBIGINT GLOBAL []
http_retry_backoff Backoff factor for exponentially increasing retry wait time FLOAT GLOBAL []
http_retry_wait_ms Time between retries UBIGINT GLOBAL []
http_timeout HTTP timeout read/write/connection/retry (in seconds) UBIGINT GLOBAL []
httpfs_client_implementation Select which is the HTTPUtil implementation to be used VARCHAR GLOBAL []
merge_http_secret_into_s3_request Merges http secret params into S3 requests BOOLEAN GLOBAL []
s3_access_key_id S3 Access Key ID VARCHAR GLOBAL []
s3_endpoint S3 Endpoint VARCHAR GLOBAL []
s3_kms_key_id S3 KMS Key ID VARCHAR GLOBAL []
s3_region S3 Region VARCHAR GLOBAL []
s3_requester_pays S3 use requester pays mode BOOLEAN GLOBAL []
s3_secret_access_key S3 Access Key VARCHAR GLOBAL []
s3_session_token S3 Session Token VARCHAR GLOBAL []
s3_uploader_max_filesize S3 Uploader max filesize (between 50GB and 5TB) VARCHAR GLOBAL []
s3_uploader_max_parts_per_file S3 Uploader max parts per file (between 1 and 10000) UBIGINT GLOBAL []
s3_uploader_thread_limit S3 Uploader global thread limit UBIGINT GLOBAL []
s3_url_compatibility_mode Disable Globs and Query Parameters on S3 URLs BOOLEAN GLOBAL []
s3_url_style S3 URL style VARCHAR GLOBAL []
s3_use_ssl S3 use SSL BOOLEAN GLOBAL []
unsafe_disable_etag_checks Disable checks on ETag consistency BOOLEAN GLOBAL []