html_readability

Downloads 574 this week

GitHub stars 0

Extract readable content from HTML using Mozilla's Readability algorithm

Maintainer(s): onnimonni

Installing and Loading

INSTALL html_readability FROM community;
LOAD html_readability;

Example

SELECT parse_html('<html><head><title>Hello</title></head><body><article><p>World</p></article></body></html>');

About html_readability

The html_readability extension provides a parse_html() function that extracts readable content from HTML pages.

It uses Mozilla's Readability algorithm, a widely used method for extracting readable text from HTML. This is the algorithm behind Reader View in Firefox, which provides clutter-free viewing by stripping away navigation, ads, and other non-essential elements.

Usage

-- Parse HTML and get all fields
SELECT parse_html(html_column) FROM pages;

-- Access individual fields
SELECT (parse_html(html)).title FROM pages;
SELECT (parse_html(html)).content FROM pages;  -- cleaned HTML
SELECT (parse_html(html)).text FROM pages;     -- plain text
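
When several fields are needed from the same page, it may be simpler to call parse_html() once and read the fields from the resulting struct. The following is a minimal sketch, assuming the extension is installed from the community repository; the sample contents of the pages table and the doc alias are illustrative and not part of the extension.

```sql
INSTALL html_readability FROM community;
LOAD html_readability;

-- Hypothetical sample data; table and column names follow the usage above.
CREATE TABLE pages AS
SELECT '<html><head><title>Hello</title></head><body><article><p>World</p></article></body></html>' AS html;

-- Parse each row once, then pull individual fields out of the struct.
WITH parsed AS (
    SELECT parse_html(html) AS doc
    FROM pages
)
SELECT
    struct_extract(doc, 'title') AS title,
    struct_extract(doc, 'text')  AS text
FROM parsed;
```

struct_extract() is DuckDB's built-in accessor for struct fields, so it avoids any ambiguity between a struct column name and a table alias.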

Return Type

Returns STRUCT(title VARCHAR, content VARCHAR, text VARCHAR):

title: The extracted page title
content: Cleaned HTML of the main article content
text: Plain text version of the content
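
Because the result is an ordinary STRUCT, DuckDB's general struct functions should apply to it. A minimal sketch, assuming the return type documented above: typeof() reports the type of the result, and unnest() expands the struct into one column per field.

```sql
INSTALL html_readability FROM community;
LOAD html_readability;

-- Inspect the type returned by parse_html().
SELECT typeof(
    parse_html('<html><head><title>Hello</title></head><body><article><p>World</p></article></body></html>')
) AS return_type;

-- Expand the struct into separate title, content, and text columns.
SELECT unnest(
    parse_html('<html><head><title>Hello</title></head><body><article><p>World</p></article></body></html>')
);
```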

Added Functions

function_name	function_type	description	comment	examples
parse_html	scalar	NULL	NULL