Query Common Crawl and Wayback Machine web archive CDX APIs directly from SQL
Maintainer(s):
onnimonni
Installing and Loading
INSTALL web_archive FROM community;
LOAD web_archive;
Example
-- Find archived pages from Wayback Machine
SELECT url, timestamp, statuscode
FROM wayback_machine()
WHERE url LIKE 'example.com/%'
AND statuscode = 200
LIMIT 10;
-- One snapshot per year using DISTINCT ON pushdown
SELECT DISTINCT ON(year) url, timestamp
FROM wayback_machine()
WHERE url = 'github.com/duckdb/duckdb'
AND statuscode = 200
LIMIT 5;
-- Query Common Crawl index
SELECT url, timestamp, mimetype
FROM common_crawl_index()
WHERE url LIKE '%.example.com/%'
AND statuscode = 200
LIMIT 10;
About web_archive
This extension provides two table functions to query web archive CDX APIs:
- wayback_machine(): Query Internet Archive Wayback Machine (1996-present)
- common_crawl_index(): Query Common Crawl archive (2008-present, monthly snapshots)
Features:
- DISTINCT ON pushdown:
DISTINCT ON(year)becomes&collapse=timestamp:4 - Filter pushdown: WHERE clauses for statuscode/mimetype pushed to CDX API
- LIMIT pushdown: Only fetches requested number of records
- SELECT pushdown: Only fetches columns you need via
&fl=parameter - Response fetching: Download archived page content via
response.body - Virtual columns:
yearandmonthextracted from timestamp
For full documentation, visit the extension repository.
Added Functions
| function_name | function_type | description | comment | examples |
|---|---|---|---|---|
| common_crawl_index | table | NULL | NULL | |
| wayback_machine | table | NULL | NULL |