Bioinformatics extension for microbiome research - read/write FASTQ, SAM/BAM, BIOM, GFF, Newick trees, query NCBI, and more.
Maintainer(s):
wasade
Installing and Loading
INSTALL miint FROM community;
LOAD miint;
Example
-- Extract 16S rRNA gene sequences from E. coli K-12 MG1655
-- Fetches genome and annotations directly from NCBI
WITH ecoli_16s_positions AS (
SELECT position, stop_position, strand,
attributes['gene'] AS gene,
attributes['product'] AS product
FROM read_ncbi_annotation('NC_000913.3')
WHERE type = 'rRNA'
AND attributes['product'] LIKE '%16S%'
ORDER BY position
)
SELECT concat_ws(':', read_id, position, stop_position, strand) AS read_id,
sequence1[position:stop_position] AS sequence
FROM read_ncbi_fasta('NC_000913.3'), ecoli_16s_positions;
About miint
MIINT (MIcrobiome INTelligence) brings SQL to bioinformatics data analysis.
Read bioinformatics formats as tables:
read_alignments: SAM/BAM alignment filesread_fastx: FASTA/FASTQ sequence files (single/paired-end)read_biom: BIOM abundance matricesread_gff: GFF3 annotation filesread_newick: Phylogenetic treesread_jplace: Phylogenetic placements
Query NCBI directly:
read_ncbi: Fetch sequence metadata by accessionread_ncbi_fasta: Fetch sequences from NCBIread_ncbi_annotation: Fetch feature annotations from NCBI
Alignment with SQL:
align_minimap2: Align sequences using minimap2align_bowtie2: Align sequences using bowtie2
Write to standard formats:
COPY ... FORMAT FASTQ/FASTA: Export sequencesCOPY ... FORMAT SAM/BAM: Export alignmentsCOPY ... FORMAT BIOM: Export abundance tablesCOPY ... FORMAT NEWICK: Export phylogenetic trees
Analysis functions:
- SAM flag checking (
alignment_is_primary,alignment_is_unmapped, etc.) - Sequence identity calculation (
alignment_seq_identity) - Interval compression (
compress_intervals) - Woltka OGU classification (
woltka_ogu,woltka_ogu_per_sample)
See the documentation for complete details.
Added Functions
| function_name | function_type | description | comment | examples |
|---|---|---|---|---|
| read_alignments | table | Read SAM/BAM alignment files as a table. Supports headerless SAM files with reference_lengths parameter, glob patterns, and parallel processing. | Also available as read_sam (alias) | [SELECT * FROM read_alignments('alignments.bam');] |
| read_fastx | table | Read FASTA/FASTQ sequence files as a table. Supports single-end and paired-end reads, glob patterns, gzip compression, and quality score offset conversion. | NULL | [SELECT * FROM read_fastx('reads.fastq.gz');] |
| read_biom | table | Read BIOM (Biological Observation Matrix) format files. Returns sparse COO format with sample_id, feature_id, and value columns. | HDF5-based BIOM v2.1 format | [SELECT * FROM read_biom('ogu_table.biom');] |
| read_gff | table_macro | Read GFF3 annotation files. Parses attributes column into a SQL MAP for easy querying. | NULL | [SELECT type, attributes['gene'] AS gene FROM read_gff('annotations.gff');] |
| read_ncbi | table | Fetch GenBank metadata from NCBI by accession number. Returns organism, length, taxonomy_id, and other metadata. | Requires httpfs extension | [SELECT * FROM read_ncbi('NC_000913.3');] |
| read_ncbi_fasta | table | Fetch FASTA sequences from NCBI by accession number. Returns data in read_fastx-compatible schema. | Requires httpfs extension | [SELECT * FROM read_ncbi_fasta('NC_000913.3');] |
| read_ncbi_annotation | table | Fetch feature annotations from NCBI by accession number. Returns data in read_gff-compatible schema. | Requires httpfs extension | [SELECT * FROM read_ncbi_annotation('NC_000913.3') WHERE type = 'gene';] |
| read_jplace | table_macro | Read jplace phylogenetic placement files. Returns best placement per fragment with edge_num, likelihood, and like_weight_ratio. | NULL | [SELECT fragment, edge_num, like_weight_ratio FROM read_jplace('placements.jplace');] |
| read_newick | table | Read Newick phylogenetic tree files. Returns one row per node with parent-child relationships and branch lengths. | NULL | [SELECT name, branch_length FROM read_newick('tree.nwk') WHERE is_tip;] |
| align_minimap2 | table | Align query sequences to subject sequences using minimap2. Supports pre-built indexes for 10-30x performance improvement. | NULL | [SELECT * FROM align_minimap2('queries', subject_table='subjects');] |
| save_minimap2_index | table | Build and save a minimap2 index to disk for reuse with align_minimap2. | NULL | [SELECT * FROM save_minimap2_index('subjects', 'refs.mmi', preset='sr');] |
| align_bowtie2 | table | Align query sequences to subject sequences using Bowtie2. Optimized for short reads. | Requires bowtie2 in PATH | [SELECT * FROM align_bowtie2('queries', 'subjects', preset='very-sensitive');] |
| alignment_is_paired | scalar | Test if read is paired (SAM flag 0x1). | NULL | [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_paired(flags);] |
| alignment_is_proper_pair | scalar | Test if read is properly paired (SAM flag 0x2). | NULL | [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_proper_pair(flags);] |
| alignment_is_unmapped | scalar | Test if read is unmapped (SAM flag 0x4). | NULL | [SELECT * FROM read_alignments('a.bam') WHERE NOT alignment_is_unmapped(flags);] |
| alignment_is_mate_unmapped | scalar | Test if mate is unmapped (SAM flag 0x8). | NULL | [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_mate_unmapped(flags);] |
| alignment_is_reverse | scalar | Test if read is on reverse strand (SAM flag 0x10). | NULL | [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_reverse(flags);] |
| alignment_is_mate_reverse | scalar | Test if mate is on reverse strand (SAM flag 0x20). | NULL | [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_mate_reverse(flags);] |
| alignment_is_read1 | scalar | Test if read is first in pair (SAM flag 0x40). | NULL | [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_read1(flags);] |
| alignment_is_read2 | scalar | Test if read is second in pair (SAM flag 0x80). | NULL | [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_read2(flags);] |
| alignment_is_secondary | scalar | Test if alignment is secondary (SAM flag 0x100). | NULL | [SELECT * FROM read_alignments('a.bam') WHERE NOT alignment_is_secondary(flags);] |
| alignment_is_primary | scalar | Test if alignment is primary (neither secondary nor supplementary). | NULL | [SELECT * FROM read_alignments('a.bam') WHERE alignment_is_primary(flags);] |
| alignment_is_qc_failed | scalar | Test if read failed QC (SAM flag 0x200). | NULL | [SELECT * FROM read_alignments('a.bam') WHERE NOT alignment_is_qc_failed(flags);] |
| alignment_is_duplicate | scalar | Test if read is PCR/optical duplicate (SAM flag 0x400). | NULL | [SELECT * FROM read_alignments('a.bam') WHERE NOT alignment_is_duplicate(flags);] |
| alignment_is_supplementary | scalar | Test if alignment is supplementary (SAM flag 0x800). | NULL | [SELECT * FROM read_alignments('a.bam') WHERE NOT alignment_is_supplementary(flags);] |
| alignment_seq_identity | scalar | Calculate sequence identity between read and reference. Supports gap_compressed (default), gap_excluded, and blast methods. | NULL | [SELECT alignment_seq_identity(cigar, tag_nm, tag_md) FROM read_alignments('a.bam');] |
| alignment_query_length | scalar | Calculate total query length from a CIGAR string. Optionally includes hard-clipped bases. | NULL | [SELECT alignment_query_length(cigar) FROM read_alignments('a.bam');] |
| alignment_query_coverage | scalar | Calculate proportion of query bases covered by alignment. Supports 'aligned' (default) and 'mapped' types. | NULL | [SELECT alignment_query_coverage(cigar) FROM read_alignments('a.bam');] |
| sequence_dna_reverse_complement | scalar | Calculate reverse complement of DNA sequences. Supports IUPAC ambiguity codes and preserves case. | NULL | [SELECT sequence_dna_reverse_complement('ATCG');] |
| sequence_rna_reverse_complement | scalar | Calculate reverse complement of RNA sequences. Supports IUPAC ambiguity codes and preserves case. | NULL | [SELECT sequence_rna_reverse_complement('AUCG');] |
| sequence_dna_as_regexp | scalar | Convert DNA sequences with IUPAC codes to regular expression patterns for pattern matching. | NULL | [SELECT sequence_dna_as_regexp('ATNGG');] |
| sequence_rna_as_regexp | scalar | Convert RNA sequences with IUPAC codes to regular expression patterns for pattern matching. | NULL | [SELECT sequence_rna_as_regexp('AUNGG');] |
| compress_intervals | aggregate | Aggregate function that merges overlapping genomic intervals into a minimal set of non-overlapping intervals. | NULL | [SELECT reference, compress_intervals(position, stop_position) FROM read_alignments('a.bam') GROUP BY reference;] |
| woltka_ogu_per_sample | table_macro | Compute Woltka OGU counts over alignment data for multiple samples. Accounts for multi-mapped reads. | Macro - do not quote parameters | [SELECT * FROM woltka_ogu_per_sample(alignments, sample_id, read_id);] |
| woltka_ogu | table_macro | Compute Woltka OGU counts over alignment data for a single sample. Accounts for multi-mapped reads. | Macro - do not quote parameters | [SELECT * FROM woltka_ogu(alignments, read_id);] |
| align_minimap2_sharded | table | NULL | NULL | NULL |
| is_unmapped | scalar | NULL | NULL | NULL |
| bowtie2_available | scalar | NULL | NULL | NULL |
| is_reverse | scalar | NULL | NULL | NULL |
| is_dup | scalar | NULL | NULL | NULL |
| parse_gff_attributes | macro | NULL | NULL | NULL |
| align_bowtie2_sharded | table | NULL | NULL | NULL |
| is_qcfail | scalar | NULL | NULL | NULL |
| is_read1 | scalar | NULL | NULL | NULL |
| is_read2 | scalar | NULL | NULL | NULL |
| read_sequences_sff | table | NULL | NULL | NULL |
| is_secondary | scalar | NULL | NULL | NULL |
| read_sequences_sam | table | NULL | NULL | NULL |
| is_mreverse | scalar | NULL | NULL | NULL |
| is_munmap | scalar | NULL | NULL | NULL |
| read_sam | table | NULL | NULL | NULL |
| is_supplementary | scalar | NULL | NULL | NULL |
| is_paired | scalar | NULL | NULL | NULL |
| is_proper_pair | scalar | NULL | NULL | NULL |
Added Settings
| name | description | input_type | scope | aliases |
|---|---|---|---|---|
| auto_fallback_to_full_download | Allows automatically falling back to full file downloads when possible. | BOOLEAN | GLOBAL | [] |
| ca_cert_file | Path to a custom certificate file for self-signed certificates. | VARCHAR | GLOBAL | [] |
| enable_curl_server_cert_verification | Enable server side certificate verification for CURL backend. | BOOLEAN | GLOBAL | [] |
| enable_global_s3_configuration | Automatically fetch AWS credentials from environment variables. | BOOLEAN | GLOBAL | [] |
| enable_server_cert_verification | Enable server side certificate verification. | BOOLEAN | GLOBAL | [] |
| force_download | Forces upfront download of file | BOOLEAN | GLOBAL | [] |
| hf_max_per_page | Debug option to limit number of items returned in list requests | UBIGINT | GLOBAL | [] |
| http_keep_alive | Keep alive connections. Setting this to false can help when running into connection failures | BOOLEAN | GLOBAL | [] |
| http_retries | HTTP retries on I/O error | UBIGINT | GLOBAL | [] |
| http_retry_backoff | Backoff factor for exponentially increasing retry wait time | FLOAT | GLOBAL | [] |
| http_retry_wait_ms | Time between retries | UBIGINT | GLOBAL | [] |
| http_timeout | HTTP timeout read/write/connection/retry (in seconds) | UBIGINT | GLOBAL | [] |
| httpfs_client_implementation | Select which is the HTTPUtil implementation to be used | VARCHAR | GLOBAL | [] |
| merge_http_secret_into_s3_request | Merges http secret params into S3 requests | BOOLEAN | GLOBAL | [] |
| s3_access_key_id | S3 Access Key ID | VARCHAR | GLOBAL | [] |
| s3_endpoint | S3 Endpoint | VARCHAR | GLOBAL | [] |
| s3_kms_key_id | S3 KMS Key ID | VARCHAR | GLOBAL | [] |
| s3_region | S3 Region | VARCHAR | GLOBAL | [] |
| s3_requester_pays | S3 use requester pays mode | BOOLEAN | GLOBAL | [] |
| s3_secret_access_key | S3 Access Key | VARCHAR | GLOBAL | [] |
| s3_session_token | S3 Session Token | VARCHAR | GLOBAL | [] |
| s3_uploader_max_filesize | S3 Uploader max filesize (between 50GB and 5TB) | VARCHAR | GLOBAL | [] |
| s3_uploader_max_parts_per_file | S3 Uploader max parts per file (between 1 and 10000) | UBIGINT | GLOBAL | [] |
| s3_uploader_thread_limit | S3 Uploader global thread limit | UBIGINT | GLOBAL | [] |
| s3_url_compatibility_mode | Disable Globs and Query Parameters on S3 URLs | BOOLEAN | GLOBAL | [] |
| s3_url_style | S3 URL style | VARCHAR | GLOBAL | [] |
| s3_use_ssl | S3 use SSL | BOOLEAN | GLOBAL | [] |
| unsafe_disable_etag_checks | Disable checks on ETag consistency | BOOLEAN | GLOBAL | [] |