html_readability

Extract readable content from HTML using Mozilla's Readability algorithm

Maintainer(s): onnimonni

Installing and Loading

INSTALL html_readability FROM community;
LOAD html_readability;

Example

SELECT parse_html('<html><head><title>Hello</title></head><body><article><p>World</p></article></body></html>');

About html_readability

The html_readability extension provides a parse_html() function that extracts readable content from HTML pages.

It uses Mozilla's Readability algorithm, a widely used approach for extracting the main readable text from an HTML page. It is the algorithm behind Reader View in Firefox, and it produces a clutter-free result by stripping away navigation, ads, and other non-essential elements.

Usage

-- Parse HTML and get all fields
SELECT parse_html(html_column) FROM pages;

-- Access individual fields
SELECT (parse_html(html)).title FROM pages;
SELECT (parse_html(html)).content FROM pages;  -- cleaned HTML
SELECT (parse_html(html)).text FROM pages;     -- plain text
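
When several fields are needed from the same page, the queries above can be combined so that the parse_html() call is written only once. This is a minimal sketch, assuming the same pages table with an html column as above; the CTE name extracted and the alias parsed are illustrative and not part of the extension.

```sql
-- Write the parse_html() call once, then read fields from the struct it returns.
WITH extracted AS (
    SELECT parse_html(html) AS parsed
    FROM pages
)
SELECT
    parsed.title AS title,  -- extracted page title
    parsed.text  AS text    -- plain text of the main content
FROM extracted;
```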

Return Type

Returns STRUCT(title VARCHAR, content VARCHAR, text VARCHAR):

  • title: The extracted page title
  • content: Cleaned HTML of the main article content
  • text: Plain text version of the content
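
Because the result is an ordinary STRUCT, its fields can be stored as regular columns. The following is a sketch only: the articles table name, the r alias, and the text_length column are hypothetical, and pages with an html column is assumed as in the Usage section.

```sql
-- Materialize the three struct fields as plain VARCHAR columns.
CREATE TABLE articles AS
SELECT
    r.title,
    r.content,                       -- cleaned HTML
    r.text,                          -- plain text
    length(r.text) AS text_length    -- character count of the plain text
FROM (
    SELECT parse_html(html) AS r
    FROM pages
) AS parsed_pages;
```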

Added Functions

function_name  function_type  description  comment  examples
parse_html     scalar         NULL         NULL