Comprehensive processing extension for web markup languages (XML and HTML) that enables SQL-native analysis of structured documents with intelligent schema inference, XPath-based data extraction, and powerful HTML table parsing capabilities.
Installing and Loading
INSTALL webbed FROM community;
LOAD webbed;
Example
-- Load the extension
LOAD webbed;
-- Read XML files directly into tables
SELECT * FROM 'data.xml';
SELECT * FROM read_xml('config/*.xml');
-- Parse and extract from XML content using XPath
SELECT xml_extract_text('<book><title>Database Guide</title></book>', '//title');
-- Result: "Database Guide"
-- Parse and extract from HTML content
SELECT html_extract_text('<html><body><h1>Welcome</h1></body></html>', '//h1');
-- Result: "Welcome"
-- Extract HTML tables directly into DuckDB
SELECT * FROM html_extract_tables('<table><tr><th>Name</th><th>Age</th></tr><tr><td>John</td><td>25</td></tr></table>');
-- Extract links and images from HTML pages
SELECT html_extract_links('<a href="https://example.com">Click here</a>');
SELECT html_extract_images('<img src="photo.jpg" alt="Photo" width="800">');
-- Convert between XML and JSON formats
SELECT xml_to_json('<person><name>John</name><age>30</age></person>');
SELECT json_to_xml('{"name":"John","age":"30"}');
About webbed
DuckDB XML is a comprehensive extension that brings powerful XML and HTML processing capabilities to DuckDB, enabling SQL-native analysis of structured documents. The extension provides three core areas of functionality:
XML Processing & Analysis: Parse, validate, and extract data from XML documents using full XPath 1.0 expressions. Functions include xml_extract_text()
, xml_extract_elements()
, xml_extract_attributes()
, xml_valid()
, and xml_stats()
for comprehensive document analysis. The extension handles namespaces, comments, CDATA sections, and provides utilities like xml_pretty_print()
and xml_minify()
.
HTML Processing & Web Scraping: Advanced HTML parsing capabilities with specialized functions for web data extraction. Extract text content with html_extract_text()
, parse HTML tables into structured data with html_extract_tables()
, extract links with metadata using html_extract_links()
, and extract images with attributes using html_extract_images()
. Perfect for web scraping and HTML document analysis workflows.
Smart Schema Inference & File Reading: Automatically flatten XML/HTML documents into relational tables with intelligent type detection for dates, numbers, booleans, and nested structures. Functions like read_xml()
and read_html()
provide direct file-to-table conversion with configurable options for error handling, maximum file sizes, and schema customization.
Key XML Functions:
read_xml(pattern)
- Read XML files with automatic schema inferencexml_extract_text(xml, xpath)
- XPath-based text extractionxml_extract_elements(xml, xpath)
- Extract structured elementsxml_extract_attributes(xml, xpath)
- Extract attributes as structsxml_to_json(xml)
/json_to_xml(json)
- Format conversionsxml_stats(xml)
- Document statistics and analysisxml_validate_schema(xml, xsd)
- XSD schema validation
Key HTML Functions:
read_html(pattern)
- Read HTML files into tableshtml_extract_tables(html)
- Extract HTML tables as structured datahtml_extract_links(html)
- Extract all links with metadatahtml_extract_images(html)
- Extract images with attributeshtml_extract_text(html, xpath)
- XPath-based HTML text extractionparse_html(content)
- Parse HTML strings into structured format
Built on libxml2 for robust, standards-compliant parsing with comprehensive error handling, memory-safe RAII implementation, and 100% test coverage. The extension supports mixed file systems, configurable schema inference, and efficient processing of large document collections.
Added Functions
function_name | function_type | description | comment | examples |
---|---|---|---|---|
html_extract_images | scalar | NULL | NULL | |
html_extract_links | scalar | NULL | NULL | |
html_extract_table_rows | scalar | NULL | NULL | |
html_extract_tables | table | NULL | NULL | |
html_extract_tables_json | scalar | NULL | NULL | |
html_extract_text | scalar | NULL | NULL | |
json_to_xml | scalar | NULL | NULL | |
parse_html | scalar | NULL | NULL | |
read_html | table | NULL | NULL | |
read_html_objects | table | NULL | NULL | |
read_xml | table | NULL | NULL | |
read_xml_objects | table | NULL | NULL | |
to_xml | scalar | NULL | NULL | |
xml | scalar | NULL | NULL | |
xml_extract_all_text | scalar | NULL | NULL | |
xml_extract_attributes | scalar | NULL | NULL | |
xml_extract_cdata | scalar | NULL | NULL | |
xml_extract_comments | scalar | NULL | NULL | |
xml_extract_elements | scalar | NULL | NULL | |
xml_extract_elements_string | scalar | NULL | NULL | |
xml_extract_text | scalar | NULL | NULL | |
xml_libxml2_version | scalar | NULL | NULL | |
xml_minify | scalar | NULL | NULL | |
xml_namespaces | scalar | NULL | NULL | |
xml_pretty_print | scalar | NULL | NULL | |
xml_stats | scalar | NULL | NULL | |
xml_to_json | scalar | NULL | NULL | |
xml_valid | scalar | NULL | NULL | |
xml_validate_schema | scalar | NULL | NULL | |
xml_well_formed | scalar | NULL | NULL | |
xml_wrap_fragment | scalar | NULL | NULL |
Added Types
type_name | type_size | logical_type | type_category | internal |
---|---|---|---|---|
HTML | 16 | VARCHAR | STRING | true |
XML | 16 | VARCHAR | STRING | true |
XMLFragment | 16 | VARCHAR | STRING | true |