Search Shortcut cmd + k | ctrl + k
web_archive

Query Common Crawl and Wayback Machine web archive CDX APIs directly from SQL

Maintainer(s): onnimonni

Installing and Loading

INSTALL web_archive FROM community;
LOAD web_archive;

Example

-- Find archived pages from Wayback Machine
SELECT url, timestamp, statuscode
FROM wayback_machine()
WHERE url LIKE 'example.com/%'
  AND statuscode = 200
LIMIT 10;

-- One snapshot per year using DISTINCT ON pushdown
SELECT DISTINCT ON(year) url, timestamp
FROM wayback_machine()
WHERE url = 'github.com/duckdb/duckdb'
  AND statuscode = 200
LIMIT 5;

-- Query Common Crawl index
SELECT url, timestamp, mimetype
FROM common_crawl_index()
WHERE url LIKE '%.example.com/%'
  AND statuscode = 200
LIMIT 10;

About web_archive

This extension provides two table functions to query web archive CDX APIs:

  • wayback_machine(): Query Internet Archive Wayback Machine (1996-present)
  • common_crawl_index(): Query Common Crawl archive (2008-present, monthly snapshots)

Features:

  • DISTINCT ON pushdown: DISTINCT ON(year) becomes &collapse=timestamp:4
  • Filter pushdown: WHERE clauses for statuscode/mimetype pushed to CDX API
  • LIMIT pushdown: Only fetches requested number of records
  • SELECT pushdown: Only fetches columns you need via &fl= parameter
  • Response fetching: Download archived page content via response.body
  • Virtual columns: year and month extracted from timestamp

For full documentation, visit the extension repository.

Added Functions

function_name function_type description comment examples
common_crawl_index table NULL NULL  
wayback_machine table NULL NULL