markdown

Read, analyze, and write Markdown files with block-level document representation and inline element support

Maintainer(s): teaguesterling

Installing and Loading

INSTALL markdown FROM community;
LOAD markdown;

Example

-- Load the extension
LOAD markdown;

-- Read Markdown files with glob patterns
SELECT content FROM read_markdown('docs/**/*.md');

-- Parse into block-level elements (duck_block shape)
SELECT element_type, content, level
FROM read_markdown_blocks('README.md')
ORDER BY element_order;

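-- Extract code blocks from Markdown text
-- (the E'' prefix makes DuckDB interpret \n as a newline; a plain '...' literal keeps the backslash)
SELECT cb.language, cb.code
FROM (
  SELECT UNNEST(md_extract_code_blocks(E'```python\nprint("Hello")\n```')) AS cb
);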

-- Build rich text with inline elements
SELECT duck_blocks_to_md([
  {kind: 'inline', element_type: 'text', content: 'Check out ', level: 1, encoding: 'text', attributes: MAP{}, element_order: 0},
  {kind: 'inline', element_type: 'link', content: 'our docs', level: 1, encoding: 'text', attributes: MAP{'href': 'https://duckdb-markdown.readthedocs.io/'}, element_order: 1}
]);

-- Export query results as Markdown table
COPY (SELECT * FROM my_table) TO 'output.md' (FORMAT MARKDOWN);

-- Round-trip: read blocks, transform, write back
COPY (
  SELECT kind, element_type, content, level, encoding, attributes
  FROM read_markdown_blocks('doc.md')
) TO 'copy.md' (FORMAT MARKDOWN, markdown_mode 'blocks');

About markdown

The Markdown extension adds comprehensive Markdown processing capabilities to DuckDB, enabling structured analysis, transformation, and generation of Markdown documents.

Documentation: https://duckdb-markdown.readthedocs.io/

Key Features:

  • File Reading Functions: Read Markdown files with read_markdown(), read_markdown_sections(), and read_markdown_blocks() supporting glob patterns, metadata extraction, and block-level parsing
  • Block-Level Representation: Parse documents into duck_block format with kind (block/inline), element_type, content, level, encoding, attributes, and element_order columns
  • Inline Element Support: Build rich text content with bold, italic, links, code, math, and more using the unified duck_block structure
  • COPY TO Markdown: Export query results as Markdown tables, documents, or block-level representations with full round-trip support
  • Content Extraction: Extract code blocks, links, images, and tables from Markdown content using structured LIST return types
  • Document Processing: Convert Markdown to HTML or plain text, validate content, extract metadata, and generate document statistics
  • Replacement Scan Support: Query Markdown files directly using FROM '*.md' syntax with full glob pattern support
  • Native MARKDOWN Type: Custom MARKDOWN type with automatic VARCHAR casting for seamless integration
  • Cross-Platform Support: Works on Linux, macOS, WebAssembly, and Windows
  • GitHub Flavored Markdown: Uses cmark-gfm for accurate parsing of modern Markdown features
  • High Performance: Processes more than 4,000 sections per second in the real-world benchmark reported under Performance
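
As a minimal sketch of the replacement-scan and MARKDOWN-type features listed above: the docs/*.md path is a placeholder, and the explicit cast to MARKDOWN and back is an assumption based on the "automatic VARCHAR casting" bullet, not something taken from the extension's documentation.

```sql
LOAD markdown;

-- Replacement scan: query Markdown files directly by path or glob.
-- 'docs/*.md' is a placeholder path.
SELECT * FROM 'docs/*.md' LIMIT 5;

-- Assumption: a VARCHAR literal can be cast to MARKDOWN and back again.
SELECT CAST(CAST('# Title' AS MARKDOWN) AS VARCHAR) AS round_trip;
```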

Core Functions:

  • read_markdown() - Read Markdown files with comprehensive parameter support
  • read_markdown_sections() - Parse files into hierarchical sections with filtering options
  • read_markdown_blocks() - Parse files into block-level elements (duck_block shape)
  • duck_block_to_md() - Convert single block/inline element to Markdown
  • duck_blocks_to_md() - Convert list of elements to Markdown document
  • duck_blocks_to_sections() - Convert blocks to hierarchical sections
  • md_extract_code_blocks() - Extract code blocks with language and metadata
  • md_extract_links() - Extract links with text, URL, and title information
  • md_extract_images() - Extract images with alt text and metadata
  • md_extract_tables_json() - Extract tables as structured JSON
  • md_to_html() - Convert Markdown content to HTML
  • md_to_text() - Convert Markdown to plain text for full-text search
  • md_stats() - Get document statistics (word count, reading time, etc.)
  • md_extract_metadata() - Extract frontmatter metadata as MAP
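
The extraction functions can be combined with the readers in the same way as md_extract_code_blocks() in the Example. The following is a sketch only: the struct field names text and url are assumptions inferred from the md_extract_links() description above, and the file paths are placeholders.

```sql
LOAD markdown;

-- One row per link found in each file.
-- Assumption: the returned structs expose fields named text and url.
SELECT link.text, link.url
FROM (
  SELECT UNNEST(md_extract_links(content)) AS link
  FROM read_markdown('docs/**/*.md')
);

-- Per-document statistics, returned as a single value per file.
SELECT md_stats(content) AS stats
FROM read_markdown('README.md');
```

If the field names differ in your installed version, inspecting the result of md_extract_links(content) directly shows the actual structure.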

COPY TO Modes:

  • table (default) - Export any query as a formatted Markdown table
  • document - Reconstruct Markdown from sections with headings and content
  • blocks / duck_block - Round-trip block-level representation with inline element support

Example Use Cases:

  • Documentation analysis across entire repositories
  • Content quality assessment and auditing
  • Large-scale documentation search and indexing
  • Code example extraction and analysis
  • Document transformation pipelines
  • Rich text generation with inline formatting
  • Knowledge base processing and content management

Performance:

Real-world benchmark: 287 Markdown files (2,699 sections, 1,137 code blocks, 1,174 links) processed in 603 ms.
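
The throughput implied by these figures can be checked with plain DuckDB arithmetic, which needs no extension. This is only a worked calculation from the numbers quoted above, not an additional measurement.

```sql
-- 2,699 sections and 287 files processed in 603 ms (0.603 s)
SELECT round(2699 / 0.603) AS sections_per_second,  -- about 4,476
       round(287 / 0.603)  AS files_per_second;     -- about 476
```

The result is consistent with the "4,000+ sections/second" rate listed under Key Features.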

The full test suite comprises 1,108 passing assertions across 20 test files.

Added Functions

function_name function_type description comment examples
duck_block_to_md scalar NULL NULL  
duck_blocks_to_md scalar NULL NULL  
duck_blocks_to_sections scalar NULL NULL  
md_extract_code_blocks scalar NULL NULL  
md_extract_images scalar NULL NULL  
md_extract_links scalar NULL NULL  
md_extract_metadata scalar NULL NULL  
md_extract_section scalar NULL NULL  
md_extract_sections scalar NULL NULL  
md_extract_table_rows scalar NULL NULL  
md_extract_tables_json scalar NULL NULL  
md_section_breadcrumb scalar NULL NULL  
md_stats scalar NULL NULL  
md_to_html scalar NULL NULL  
md_to_text scalar NULL NULL  
md_valid scalar NULL NULL  
read_markdown table NULL NULL  
read_markdown_blocks table NULL NULL  
read_markdown_sections table NULL NULL  
value_to_md scalar NULL NULL  

Added Types

type_name type_size logical_type type_category internal
markdown 16 VARCHAR STRING true
md 16 VARCHAR STRING true