Search Shortcut cmd + k | ctrl + k
web_archive

Query Common Crawl and Wayback Machine web archive CDX APIs directly from SQL

Maintainer(s): onnimonni

Installing and Loading

INSTALL web_archive FROM community;
LOAD web_archive;

Example

-- Find archived pages from Wayback Machine
SELECT url, timestamp, statuscode
FROM wayback_machine()
WHERE url LIKE 'example.com/%'
  AND statuscode = 200
LIMIT 10;

-- One snapshot per year using DISTINCT ON pushdown
SELECT DISTINCT ON(year) url, timestamp
FROM wayback_machine()
WHERE url = 'github.com/duckdb/duckdb'
  AND statuscode = 200
LIMIT 5;

-- Query Common Crawl index
SELECT url, timestamp, mimetype
FROM common_crawl_index()
WHERE url LIKE '%.example.com/%'
  AND statuscode = 200
LIMIT 10;

About web_archive

This extension provides two table functions to query web archive CDX APIs:

  • wayback_machine(): Query Internet Archive Wayback Machine (1996-present)
  • common_crawl_index(): Query Common Crawl archive (2008-present, monthly snapshots)

Features:

  • DISTINCT ON pushdown: DISTINCT ON(year) becomes &collapse=timestamp:4
  • Filter pushdown: WHERE clauses for statuscode/mimetype pushed to CDX API
  • LIMIT pushdown: Only fetches requested number of records
  • SELECT pushdown: Only fetches columns you need via &fl= parameter
  • Response fetching: Download archived page content via response.body
  • Virtual columns: year and month extracted from timestamp

For full documentation, visit the extension repository.

Added Functions

function_name function_type description comment examples
common_crawl_index table NULL NULL  
wayback_machine table NULL NULL  

Overloaded Functions

This extension does not add any function overloads.

Added Types

This extension does not add any types.

Added Settings

This extension does not add any settings.