Search Shortcut cmd + k | ctrl + k
webbed

Comprehensive processing extension for web markup languages (XML and HTML) that enables SQL-native analysis of structured documents with intelligent schema inference, XPath-based data extraction, and powerful HTML table parsing capabilities.

Maintainer(s): teaguesterling

Installing and Loading

INSTALL webbed FROM community;
LOAD webbed;

Example

-- Load the extension
LOAD webbed;

-- Read XML files directly into tables
SELECT * FROM 'data.xml';
SELECT * FROM read_xml('config/*.xml');

-- Parse and extract from XML content using XPath
SELECT xml_extract_text('<book><title>Database Guide</title></book>', '//title');
-- Result: "Database Guide"

-- Parse and extract from HTML content
SELECT html_extract_text('<html><body><h1>Welcome</h1></body></html>', '//h1');
-- Result: "Welcome"

-- Extract HTML tables directly into DuckDB
SELECT * FROM html_extract_tables('<table><tr><th>Name</th><th>Age</th></tr><tr><td>John</td><td>25</td></tr></table>');

-- Extract links and images from HTML pages
SELECT html_extract_links('<a href="https://example.com">Click here</a>');
SELECT html_extract_images('<img src="photo.jpg" alt="Photo" width="800">');

-- Convert between XML and JSON formats
SELECT xml_to_json('<person><name>John</name><age>30</age></person>');
SELECT json_to_xml('{"name":"John","age":"30"}');

About webbed

DuckDB XML is a comprehensive extension that brings powerful XML and HTML processing capabilities to DuckDB, enabling SQL-native analysis of structured documents. The extension provides three core areas of functionality:

XML Processing & Analysis: Parse, validate, and extract data from XML documents using full XPath 1.0 expressions. Functions include xml_extract_text(), xml_extract_elements(), xml_extract_attributes(), xml_valid(), and xml_stats() for comprehensive document analysis. The extension handles namespaces, comments, CDATA sections, and provides utilities like xml_pretty_print() and xml_minify().

HTML Processing & Web Scraping: Advanced HTML parsing capabilities with specialized functions for web data extraction. Extract text content with html_extract_text(), parse HTML tables into structured data with html_extract_tables(), extract links with metadata using html_extract_links(), and extract images with attributes using html_extract_images(). Perfect for web scraping and HTML document analysis workflows.

Smart Schema Inference & File Reading: Automatically flatten XML/HTML documents into relational tables with intelligent type detection for dates, numbers, booleans, and nested structures. Functions like read_xml() and read_html() provide direct file-to-table conversion with configurable options for error handling, maximum file sizes, and schema customization.

Key XML Functions:

  • read_xml(pattern) - Read XML files with automatic schema inference
  • xml_extract_text(xml, xpath) - XPath-based text extraction
  • xml_extract_elements(xml, xpath) - Extract structured elements
  • xml_extract_attributes(xml, xpath) - Extract attributes as structs
  • xml_to_json(xml) / json_to_xml(json) - Format conversions
  • xml_stats(xml) - Document statistics and analysis
  • xml_validate_schema(xml, xsd) - XSD schema validation

Key HTML Functions:

  • read_html(pattern) - Read HTML files into tables
  • html_extract_tables(html) - Extract HTML tables as structured data
  • html_extract_links(html) - Extract all links with metadata
  • html_extract_images(html) - Extract images with attributes
  • html_extract_text(html, xpath) - XPath-based HTML text extraction
  • parse_html(content) - Parse HTML strings into structured format

Built on libxml2 for robust, standards-compliant parsing with comprehensive error handling, memory-safe RAII implementation, and 100% test coverage. The extension supports mixed file systems, configurable schema inference, and efficient processing of large document collections.

Added Functions

function_name function_type description comment examples
html_extract_images scalar NULL NULL  
html_extract_links scalar NULL NULL  
html_extract_table_rows scalar NULL NULL  
html_extract_tables table NULL NULL  
html_extract_tables_json scalar NULL NULL  
html_extract_text scalar NULL NULL  
json_to_xml scalar NULL NULL  
parse_html scalar NULL NULL  
read_html table NULL NULL  
read_html_objects table NULL NULL  
read_xml table NULL NULL  
read_xml_objects table NULL NULL  
to_xml scalar NULL NULL  
xml scalar NULL NULL  
xml_extract_all_text scalar NULL NULL  
xml_extract_attributes scalar NULL NULL  
xml_extract_cdata scalar NULL NULL  
xml_extract_comments scalar NULL NULL  
xml_extract_elements scalar NULL NULL  
xml_extract_elements_string scalar NULL NULL  
xml_extract_text scalar NULL NULL  
xml_libxml2_version scalar NULL NULL  
xml_minify scalar NULL NULL  
xml_namespaces scalar NULL NULL  
xml_pretty_print scalar NULL NULL  
xml_stats scalar NULL NULL  
xml_to_json scalar NULL NULL  
xml_valid scalar NULL NULL  
xml_validate_schema scalar NULL NULL  
xml_well_formed scalar NULL NULL  
xml_wrap_fragment scalar NULL NULL  

Added Types

type_name type_size logical_type type_category internal
HTML 16 VARCHAR STRING true
XML 16 VARCHAR STRING true
XMLFragment 16 VARCHAR STRING true