markdown

Downloads: 478 this week

GitHub stars: 10

Read, analyze, and write Markdown files with block-level document representation and inline element support

Maintainer(s): teaguesterling

Installing and Loading

INSTALL markdown FROM community;
LOAD markdown;

Example

-- Load the extension
LOAD markdown;

-- Read Markdown files with glob patterns
SELECT content FROM read_markdown('docs/**/*.md');

-- Parse into block-level elements (duck_block shape)
SELECT element_type, content, level
FROM read_markdown_blocks('README.md')
ORDER BY element_order;

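-- Extract code blocks from Markdown text
-- (the E'...' escape string turns \n into real newlines; a plain '...' literal keeps the backslash)
SELECT cb.language, cb.code
FROM (
  SELECT UNNEST(md_extract_code_blocks(E'```python\nprint("Hello")\n```')) as cb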
);

-- Build rich text with inline elements
SELECT duck_blocks_to_md([
  {kind: 'inline', element_type: 'text', content: 'Check out ', level: 1, encoding: 'text', attributes: MAP{}, element_order: 0},
  {kind: 'inline', element_type: 'link', content: 'our docs', level: 1, encoding: 'text', attributes: MAP{'href': 'https://duckdb-markdown.readthedocs.io/'}, element_order: 1}
]);

-- Export query results as Markdown table
COPY (SELECT * FROM my_table) TO 'output.md' (FORMAT MARKDOWN);

-- Round-trip: read blocks, transform, write back
COPY (
  SELECT kind, element_type, content, level, encoding, attributes
  FROM read_markdown_blocks('doc.md')
) TO 'copy.md' (FORMAT MARKDOWN, markdown_mode 'blocks');

About markdown

The Markdown extension adds comprehensive Markdown processing capabilities to DuckDB, enabling structured analysis, transformation, and generation of Markdown documents.

Documentation: https://duckdb-markdown.readthedocs.io/

Key Features:

File Reading Functions: Read Markdown files with read_markdown(), read_markdown_sections(), and read_markdown_blocks() supporting glob patterns, metadata extraction, and block-level parsing
Block-Level Representation: Parse documents into duck_block format with kind (block/inline), element_type, content, level, encoding, attributes, and element_order columns
Inline Element Support: Build rich text content with bold, italic, links, code, math, and more using the unified duck_block structure
COPY TO Markdown: Export query results as Markdown tables, documents, or block-level representations with full round-trip support
Content Extraction: Extract code blocks, links, images, and tables from Markdown content using structured LIST return types
Document Processing: Convert Markdown to HTML or plain text, validate content, extract metadata, and generate document statistics
Replacement Scan Support: Query Markdown files directly using FROM '*.md' syntax with full glob pattern support
Native MARKDOWN Type: Custom MARKDOWN type with automatic VARCHAR casting for seamless integration
Cross-Platform Support: Works on Linux, macOS, WebAssembly, and Windows
GitHub Flavored Markdown: Uses cmark-gfm for accurate parsing of modern Markdown features
High Performance: Processes thousands of documents efficiently, at more than 4,000 sections per second
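
The replacement scan feature listed above can be sketched as follows. This is a minimal sketch, not a confirmed reference: the file paths are placeholders, and the assumption that a replacement scan returns the same columns as read_markdown() is inferred from the feature description rather than stated on this page.

```sql
LOAD markdown;

-- Query a single Markdown file directly (README.md is a placeholder path)
SELECT * FROM 'README.md';

-- Query every Markdown file under a directory with a glob pattern
-- (docs/ is a placeholder; the column set is assumed to match read_markdown())
SELECT count(*) AS file_count FROM 'docs/**/*.md';
```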

Core Functions:

read_markdown() - Read Markdown files with comprehensive parameter support
read_markdown_sections() - Parse files into hierarchical sections with filtering options
read_markdown_blocks() - Parse files into block-level elements (duck_block shape)
duck_block_to_md() - Convert single block/inline element to Markdown
duck_blocks_to_md() - Convert list of elements to Markdown document
duck_blocks_to_sections() - Convert blocks to hierarchical sections
md_extract_code_blocks() - Extract code blocks with language and metadata
md_extract_links() - Extract links with text, URL, and title information
md_extract_images() - Extract images with alt text and metadata
md_extract_tables_json() - Extract tables as structured JSON
md_to_html() - Convert Markdown content to HTML
md_to_text() - Convert Markdown to plain text for full-text search
md_stats() - Get document statistics (word count, reading time, etc.)
md_extract_metadata() - Extract frontmatter metadata as MAP
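
As a rough illustration of how several of these scalar functions might be combined, the sketch below runs them over one inline document. It assumes each function accepts a single string argument, as md_extract_code_blocks() does in the Example section; the exact return shapes are not documented on this page, so the results are selected whole rather than by field name.

```sql
LOAD markdown;

-- A small inline document; the E'...' escape string turns \n into real newlines
WITH doc AS (
  SELECT E'# Title\n\nSome *emphasized* text with a [link](https://duckdb.org).\n' AS body
)
SELECT
  md_to_html(body)       AS html,        -- Markdown rendered as HTML
  md_to_text(body)       AS plain_text,  -- plain text, e.g. for full-text search
  md_stats(body)         AS stats,       -- document statistics (shape assumed, not verified)
  md_extract_links(body) AS links        -- extracted links (shape assumed, not verified)
FROM doc;
```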

COPY TO Modes:

table (default) - Export any query as a formatted Markdown table
document - Reconstruct Markdown from sections with headings and content
blocks / duck_block - Round-trip block-level representation with inline element support
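
The modes above might be selected as in the sketch below. Only markdown_mode 'blocks' appears in the Example section; the option values 'table' and 'document' are inferred from the mode names, and passing every column of read_markdown_sections() to document mode is an assumption. my_table, doc.md, and the output paths are placeholders.

```sql
LOAD markdown;

-- table mode (the default), spelled out explicitly
COPY (SELECT * FROM my_table)
TO 'table.md' (FORMAT MARKDOWN, markdown_mode 'table');

-- document mode: rebuild a document from its sections
-- (assumes the columns produced by read_markdown_sections() are what this mode expects)
COPY (SELECT * FROM read_markdown_sections('doc.md'))
TO 'rebuilt.md' (FORMAT MARKDOWN, markdown_mode 'document');
```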

Example Use Cases:

Documentation analysis across entire repositories
Content quality assessment and auditing
Large-scale documentation search and indexing
Code example extraction and analysis
Document transformation pipelines
Rich text generation with inline formatting
Knowledge base processing and content management

Performance:

Real-world benchmark: 287 Markdown files (2,699 sections, 1,137 code blocks, 1,174 links) processed in 603 ms.
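
The throughput implied by these figures can be checked with a quick calculation. This uses only the numbers quoted above and needs no extension.

```sql
-- Throughput implied by the benchmark: 603 ms = 0.603 s
SELECT
  round(2699 / 0.603) AS sections_per_second,  -- roughly 4,476
  round(287 / 0.603)  AS files_per_second;     -- roughly 476
```

The sections-per-second result is consistent with the "4,000+ sections/second" figure listed under Key Features.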

The full test suite has 1,108 passing assertions across 20 test files.

Added Functions

function_name	function_type	description	comment
duck_block_to_md	scalar	NULL	NULL
duck_blocks_to_md	scalar	NULL	NULL
duck_blocks_to_sections	scalar	NULL	NULL
md_extract_code_blocks	scalar	NULL	NULL
md_extract_images	scalar	NULL	NULL
md_extract_links	scalar	NULL	NULL
md_extract_metadata	scalar	NULL	NULL
md_extract_section	scalar	NULL	NULL
md_extract_sections	scalar	NULL	NULL
md_extract_table_rows	scalar	NULL	NULL
md_extract_tables_json	scalar	NULL	NULL
md_section_breadcrumb	scalar	NULL	NULL
md_stats	scalar	NULL	NULL
md_to_html	scalar	NULL	NULL
md_to_text	scalar	NULL	NULL
md_valid	scalar	NULL	NULL
read_markdown	table	NULL	NULL
read_markdown_blocks	table	NULL	NULL
read_markdown_sections	table	NULL	NULL
value_to_md	scalar	NULL	NULL
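
A listing like the one above can likely be reproduced from DuckDB's catalog. The sketch below uses the built-in duckdb_functions() table function; the name prefixes in the filter are taken from the table above, and the filter may also match unrelated functions that share a prefix.

```sql
LOAD markdown;

-- List the extension's functions from DuckDB's catalog
SELECT DISTINCT function_name, function_type
FROM duckdb_functions()
WHERE regexp_matches(function_name, '^(md_|duck_block|read_markdown|value_to_md)')
ORDER BY function_name;
```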

Added Types

type_name	type_size	logical_type	type_category	internal
markdown	16	VARCHAR	STRING	true
md	16	VARCHAR	STRING	true
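
The sketch below shows one way to inspect these types and cast to and from them. duckdb_types() is a built-in DuckDB table function; the casts assume the conversion to and from VARCHAR behaves as described under "Native MARKDOWN Type".

```sql
LOAD markdown;

-- Inspect the types registered by the extension
SELECT type_name, type_size, logical_type, type_category
FROM duckdb_types()
WHERE type_name IN ('markdown', 'md');

-- Cast a string to the Markdown type and back to VARCHAR
SELECT '# Heading'::markdown              AS as_markdown,
       ('# Heading'::markdown)::VARCHAR   AS as_varchar;
```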
