Read and analyze Markdown files with comprehensive content extraction and document processing capabilities
Maintainer(s):
teaguesterling
Installing and Loading
INSTALL markdown FROM community;
LOAD markdown;
Example
-- Load the extension
LOAD markdown;
-- Read Markdown files with glob patterns
SELECT content FROM read_markdown('docs/**/*.md');
-- Read documentation sections with hierarchy
SELECT title, level, content
FROM read_markdown_sections('README.md', include_content := true);
-- Extract code blocks from Markdown text
-- (E'...' is an escape string literal, so \n is read as a newline)
SELECT cb.language, cb.code
FROM (
    SELECT UNNEST(md_extract_code_blocks(E'```python\nprint("Hello, World!")\n```')) AS cb
);
-- Analyze documentation repositories
SELECT
    len(md_extract_code_blocks(content)) AS code_examples,
    len(md_extract_links(content)) AS external_links,
    len(md_extract_images(content)) AS images
FROM read_markdown('**/*.md');
-- Use replacement scan syntax for convenience
SELECT * FROM '*.md';
SELECT * FROM 'docs/**/*.md';
About markdown
The Markdown extension adds comprehensive Markdown processing capabilities to DuckDB, enabling structured analysis of Markdown documents and content extraction for documentation analysis, content auditing, and knowledge base processing.
Key Features:
- File Reading Functions: Read Markdown files with read_markdown() and read_markdown_sections(), supporting glob patterns, metadata extraction, and hierarchical section parsing
- Content Extraction: Extract code blocks, links, images, and tables from Markdown content using structured LIST return types
- Document Processing: Convert Markdown to HTML/text, validate content, extract metadata, and generate document statistics
- Replacement Scan Support: Query Markdown files directly using FROM '*.md' syntax with full glob pattern support
- Native MARKDOWN Type: Custom MARKDOWN type with automatic VARCHAR casting for seamless integration
- Cross-Platform Support: Works on Linux, macOS, and WebAssembly (Windows support in development)
- GitHub Flavored Markdown: Uses cmark-gfm for accurate parsing of modern Markdown features
- High Performance: Process thousands of documents efficiently with 4,000+ sections/second processing rate
- Comprehensive Parameter System: Flexible file processing with customizable options for content inclusion, size limits, and metadata extraction
Core Functions:
- read_markdown() - Read Markdown files with comprehensive parameter support
- read_markdown_sections() - Parse files into hierarchical sections with filtering options
- md_extract_code_blocks() - Extract code blocks with language and metadata
- md_extract_links() - Extract links with text, URL, and title information
- md_extract_images() - Extract images with alt text and metadata
- md_extract_tables_json() - Extract tables as structured JSON
- md_to_html() - Convert Markdown content to HTML
- md_to_text() - Convert Markdown to plain text for full-text search
- md_stats() - Get document statistics (word count, reading time, etc.)
- md_extract_metadata() - Extract frontmatter metadata as JSON
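The conversion and statistics functions listed above can be combined in a single query. The following is a minimal sketch, assuming each function accepts the content column as its only argument, in the same way the md_extract_* functions are called in the example section; the exact shape of the returned values is not documented here, so the query simply selects them as-is.

```sql
LOAD markdown;

-- Assumption: md_to_text, md_stats, and md_extract_metadata each take
-- the Markdown content as a single argument.
SELECT
    md_to_text(content)          AS plain_text,   -- plain text, e.g. for full-text search
    md_stats(content)            AS stats,        -- document statistics
    md_extract_metadata(content) AS frontmatter   -- frontmatter metadata as JSON
FROM read_markdown('README.md');
```

Inspecting the result with DESCRIBE is a reasonable first step before relying on specific field names.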
Example Use Cases:
- Documentation analysis across entire repositories
- Content quality assessment and auditing
- Large-scale documentation search and indexing
- Code example extraction and analysis
- Link validation and external reference tracking
- Knowledge base processing and content management
- Technical writing analytics and reporting
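As an illustration of the code example analysis use case, the sketch below counts code blocks per language across a documentation tree. It reuses only the pattern shown in the example section (UNNEST over md_extract_code_blocks and the language field); the 'docs/**/*.md' path is a placeholder.

```sql
LOAD markdown;

-- Count fenced code blocks by language across all Markdown files under docs/
SELECT cb.language, count(*) AS blocks
FROM (
    SELECT UNNEST(md_extract_code_blocks(content)) AS cb
    FROM read_markdown('docs/**/*.md')
)
GROUP BY cb.language
ORDER BY blocks DESC;
```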
Performance Benchmarks:
Real-world performance: 287 Markdown files (2,699 sections, 1,137 code blocks, 1,174 links) processed in 603 ms on typical hardware, which works out to roughly 4,500 sections per second.
The extension is built using cmark-gfm and includes a comprehensive test suite with 218+ passing assertions, ensuring reliable performance and accuracy for production use.
Added Functions
function_name | function_type | description | comment | examples |
---|---|---|---|---|
md_extract_code_blocks | scalar | NULL | NULL | |
md_extract_images | scalar | NULL | NULL | |
md_extract_links | scalar | NULL | NULL | |
md_extract_metadata | scalar | NULL | NULL | |
md_extract_section | scalar | NULL | NULL | |
md_extract_sections | scalar | NULL | NULL | |
md_extract_table_rows | scalar | NULL | NULL | |
md_extract_tables_json | scalar | NULL | NULL | |
md_section_breadcrumb | scalar | NULL | NULL | |
md_stats | scalar | NULL | NULL | |
md_to_html | scalar | NULL | NULL | |
md_to_text | scalar | NULL | NULL | |
md_valid | scalar | NULL | NULL | |
read_markdown | table | NULL | NULL | |
read_markdown_sections | table | NULL | NULL | |
value_to_md | scalar | NULL | NULL | |
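A listing like the one above can be reproduced from a running session with DuckDB's built-in duckdb_functions() table function. This is a sketch that filters by name prefix, which is an assumption about how to pick out this extension's functions; it would miss value_to_md, so that name is matched explicitly.

```sql
LOAD markdown;

-- starts_with is used instead of LIKE 'md_%' because "_" is a
-- single-character wildcard in LIKE and would also match names such as md5.
SELECT DISTINCT function_name, function_type, description, comment, examples
FROM duckdb_functions()
WHERE starts_with(function_name, 'md_')
   OR starts_with(function_name, 'read_markdown')
   OR function_name = 'value_to_md'
ORDER BY function_name;
```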
Added Types
type_name | type_size | logical_type | type_category | internal |
---|---|---|---|---|
markdown | 16 | VARCHAR | STRING | true |
md | 16 | VARCHAR | STRING | true |
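The two type names can be looked up with DuckDB's built-in duckdb_types() table function, and a value can be cast to the type with the usual :: syntax. The cast shown here is a sketch based on the "automatic VARCHAR casting" described in the feature list, not on a documented signature.

```sql
LOAD markdown;

-- List the types registered by the extension
SELECT type_name, type_size, logical_type, type_category, internal
FROM duckdb_types()
WHERE type_name IN ('markdown', 'md');

-- Assumption: a VARCHAR literal can be cast to markdown and back
SELECT ('# Heading'::markdown)::VARCHAR AS round_trip;
```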