Read, analyze, and write Markdown files with block-level document representation and inline element support
Maintainer(s):
teaguesterling
Installing and Loading
INSTALL markdown FROM community;
LOAD markdown;
Example
-- Load the extension
LOAD markdown;
-- Read Markdown files with glob patterns
SELECT content FROM read_markdown('docs/**/*.md');
-- Parse into block-level elements (duck_block shape)
SELECT element_type, content, level
FROM read_markdown_blocks('README.md')
ORDER BY element_order;
-- Extract code blocks from Markdown text
SELECT cb.language, cb.code
FROM (
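-- E'...' is an escape string literal, so \n becomes a real newline; a plain '...' literal keeps the backslash
SELECT UNNEST(md_extract_code_blocks(E'```python\nprint("Hello")\n```')) AS cb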
);
-- Build rich text with inline elements
SELECT duck_blocks_to_md([
{kind: 'inline', element_type: 'text', content: 'Check out ', level: 1, encoding: 'text', attributes: MAP{}, element_order: 0},
{kind: 'inline', element_type: 'link', content: 'our docs', level: 1, encoding: 'text', attributes: MAP{'href': 'https://duckdb-markdown.readthedocs.io/'}, element_order: 1}
]);
-- Export query results as Markdown table
COPY (SELECT * FROM my_table) TO 'output.md' (FORMAT MARKDOWN);
-- Round-trip: read blocks, transform, write back
COPY (
SELECT kind, element_type, content, level, encoding, attributes
FROM read_markdown_blocks('doc.md')
) TO 'copy.md' (FORMAT MARKDOWN, markdown_mode 'blocks');
About markdown
The Markdown extension adds comprehensive Markdown processing capabilities to DuckDB, enabling structured analysis, transformation, and generation of Markdown documents.
Documentation: https://duckdb-markdown.readthedocs.io/
Key Features:
- File Reading Functions: Read Markdown files with read_markdown(), read_markdown_sections(), and read_markdown_blocks() supporting glob patterns, metadata extraction, and block-level parsing
- Block-Level Representation: Parse documents into duck_block format with kind (block/inline), element_type, content, level, encoding, attributes, and element_order columns
- Inline Element Support: Build rich text content with bold, italic, links, code, math, and more using the unified duck_block structure
- COPY TO Markdown: Export query results as Markdown tables, documents, or block-level representations with full round-trip support
- Content Extraction: Extract code blocks, links, images, and tables from Markdown content using structured LIST return types
- Document Processing: Convert markdown to HTML/text, validate content, extract metadata, and generate document statistics
- Replacement Scan Support: Query Markdown files directly using FROM '*.md' syntax with full glob pattern support
- Native MARKDOWN Type: Custom MARKDOWN type with automatic VARCHAR casting for seamless integration
- Cross-Platform Support: Works on Linux, macOS, WebAssembly, and Windows
- GitHub Flavored Markdown: Uses cmark-gfm for accurate parsing of modern Markdown features
- High Performance: Process thousands of documents efficiently, at more than 4,000 sections per second
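As a minimal sketch of the replacement scan and the native type listed above: the path 'docs/*.md' is a placeholder, and the explicit cast assumes a VARCHAR literal converts to MARKDOWN in the same way the automatic casting does.

```sql
LOAD markdown;

-- Replacement scan: query Markdown files directly by glob pattern
-- ('docs/*.md' is a hypothetical path; substitute your own)
SELECT * FROM 'docs/*.md' LIMIT 5;

-- Cast a VARCHAR literal to the MARKDOWN type
-- (assumption: the explicit cast behaves like the automatic one)
SELECT '# Title'::MARKDOWN AS doc;
```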
Core Functions:
- read_markdown() - Read Markdown files with comprehensive parameter support
- read_markdown_sections() - Parse files into hierarchical sections with filtering options
- read_markdown_blocks() - Parse files into block-level elements (duck_block shape)
- duck_block_to_md() - Convert single block/inline element to Markdown
- duck_blocks_to_md() - Convert list of elements to Markdown document
- duck_blocks_to_sections() - Convert blocks to hierarchical sections
- md_extract_code_blocks() - Extract code blocks with language and metadata
- md_extract_links() - Extract links with text, URL, and title information
- md_extract_images() - Extract images with alt text and metadata
- md_extract_tables_json() - Extract tables as structured JSON
- md_to_html() - Convert markdown content to HTML
- md_to_text() - Convert markdown to plain text for full-text search
- md_stats() - Get document statistics (word count, reading time, etc.)
- md_extract_metadata() - Extract frontmatter metadata as MAP
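Several of the scalar functions above can be combined in one query. The sketch below assumes each function takes a single Markdown string, by analogy with md_extract_code_blocks() in the Example section; the exact signatures and return shapes are not stated here, so check them against the documentation before relying on them.

```sql
LOAD markdown;

-- Assumption: each function accepts one Markdown string argument.
-- The literal spans two lines, so no escape sequences are needed.
SELECT
    md_to_html(doc)       AS html,        -- rendered HTML
    md_to_text(doc)       AS plain_text,  -- plain text, e.g. for full-text search
    md_stats(doc)         AS stats,       -- document statistics
    md_extract_links(doc) AS links        -- list of extracted links
FROM (
    SELECT '# Title

See [the docs](https://duckdb-markdown.readthedocs.io/).' AS doc
);
```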
COPY TO Modes:
- table (default) - Export any query as a formatted Markdown table
- document - Reconstruct Markdown from sections with headings and content
- blocks/duck_block - Round-trip block-level representation with inline element support
Example Use Cases:
- Documentation analysis across entire repositories
- Content quality assessment and auditing
- Large-scale documentation search and indexing
- Code example extraction and analysis
- Document transformation pipelines
- Rich text generation with inline formatting
- Knowledge base processing and content management
Performance:
Real-world benchmark: Processing 287 Markdown files (2,699 sections, 1,137 code blocks, 1,174 links) in 603ms.
The full test suite has 1,108 passing assertions across 20 test files.
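The benchmark figures above can be turned into throughput rates with plain arithmetic. This is only a back-of-the-envelope check using the numbers quoted in the benchmark; it does not need the extension.

```sql
-- 2,699 sections and 287 files processed in 603 ms (0.603 s)
SELECT
    round(2699 / 0.603) AS sections_per_second,  -- about 4,476
    round(287 / 0.603)  AS files_per_second;     -- about 476
```

The sections-per-second figure is consistent with the 4,000+ sections/second rate quoted under Key Features.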
Added Functions
| function_name | function_type | description | comment | examples |
|---|---|---|---|---|
| duck_block_to_md | scalar | NULL | NULL | |
| duck_blocks_to_md | scalar | NULL | NULL | |
| duck_blocks_to_sections | scalar | NULL | NULL | |
| md_extract_code_blocks | scalar | NULL | NULL | |
| md_extract_images | scalar | NULL | NULL | |
| md_extract_links | scalar | NULL | NULL | |
| md_extract_metadata | scalar | NULL | NULL | |
| md_extract_section | scalar | NULL | NULL | |
| md_extract_sections | scalar | NULL | NULL | |
| md_extract_table_rows | scalar | NULL | NULL | |
| md_extract_tables_json | scalar | NULL | NULL | |
| md_section_breadcrumb | scalar | NULL | NULL | |
| md_stats | scalar | NULL | NULL | |
| md_to_html | scalar | NULL | NULL | |
| md_to_text | scalar | NULL | NULL | |
| md_valid | scalar | NULL | NULL | |
| read_markdown | table | NULL | NULL | |
| read_markdown_blocks | table | NULL | NULL | |
| read_markdown_sections | table | NULL | NULL | |
| value_to_md | scalar | NULL | NULL | |
Added Types
| type_name | type_size | logical_type | type_category | internal |
|---|---|---|---|---|
| markdown | 16 | VARCHAR | STRING | true |
| md | 16 | VARCHAR | STRING | true |