Extract readable content from HTML using Mozilla's Readability algorithm
Maintainer(s):
onnimonni
Installing and Loading
INSTALL html_readability FROM community;
LOAD html_readability;
Example
SELECT parse_html('<html><head><title>Hello</title></head><body><article><p>World</p></article></body></html>');
About html_readability
The html_readability extension provides a parse_html() function that extracts readable content from HTML pages.
It uses Mozilla's Readability algorithm, a widely used method for extracting readable text from HTML. This is the same algorithm that powers Reader Mode in Firefox and Safari, providing clutter-free viewing by stripping away navigation, ads, and other non-essential elements.
Usage
-- Parse HTML and get all fields
SELECT parse_html(html_column) FROM pages;
-- Access individual fields
SELECT (parse_html(html)).title FROM pages;
SELECT (parse_html(html)).content FROM pages; -- cleaned HTML
SELECT (parse_html(html)).text FROM pages; -- plain text
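When several fields are needed from the same row, it may be tidier to call parse_html() once and read the fields from the resulting struct. The sketch below assumes a hypothetical table pages(url VARCHAR, html VARCHAR); the url column and the doc alias are illustrative and not part of the extension.

```sql
-- Assumption: pages(url VARCHAR, html VARCHAR) is a hypothetical table.
-- Parse each document once, then read individual struct fields.
WITH parsed AS (
    SELECT
        url,
        parse_html(html) AS doc  -- STRUCT(title, content, text)
    FROM pages
)
SELECT
    url,
    doc.title AS title,
    doc.text  AS plain_text
FROM parsed;
```

Writing the call once in a CTE keeps the query readable; whether it also avoids repeated parsing depends on how the optimizer handles the expression.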
Return Type
Returns STRUCT(title VARCHAR, content VARCHAR, text VARCHAR):
title: The extracted page title
content: Cleaned HTML of the main article content
text: Plain text version of the content
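Because the result is an ordinary DuckDB STRUCT, its fields can be expanded into separate columns with the struct star notation. This is a minimal sketch that reuses the HTML literal from the example above; the doc and t aliases are illustrative.

```sql
-- Expand the struct returned by parse_html() into one column per field
-- (title, content, text) using DuckDB's struct.* notation.
SELECT doc.*
FROM (
    SELECT parse_html(
        '<html><head><title>Hello</title></head><body><article><p>World</p></article></body></html>'
    ) AS doc
) AS t;
```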
Added Functions
| function_name | function_type | description | comment | examples |
|---|---|---|---|---|
| parse_html | scalar | NULL | NULL | |