web_archive

Search Shortcut cmd + k | ctrl + k

Documentation

web_archive

Downloads 608this week

GitHub stars 16

Extension repository on GitHub

Extension descriptor (YAML)

Query Common Crawl and Wayback Machine web archive CDX APIs directly from SQL

Maintainer(s): onnimonni

Installing and Loading

INSTALL web_archive FROM community;
LOAD web_archive;

Example

-- Find archived pages from Wayback Machine
SELECT url, timestamp, statuscode
FROM wayback_machine()
WHERE url LIKE 'example.com/%'
  AND statuscode = 200
LIMIT 10;

-- One snapshot per year using DISTINCT ON pushdown
SELECT DISTINCT ON(year) url, timestamp
FROM wayback_machine()
WHERE url = 'github.com/duckdb/duckdb'
  AND statuscode = 200
LIMIT 5;

-- Query Common Crawl index
SELECT url, timestamp, mimetype
FROM common_crawl_index()
WHERE url LIKE '%.example.com/%'
  AND statuscode = 200
LIMIT 10;

About web_archive

This extension provides two table functions to query web archive CDX APIs:

wayback_machine(): Query Internet Archive Wayback Machine (1996-present)
common_crawl_index(): Query Common Crawl archive (2008-present, monthly snapshots)

Features: