Parse WARC (Web ARChive) records for Common Crawl data processing
Maintainer(s):
onnimonni
Installing and Loading
INSTALL warc FROM community;
LOAD warc;
Example
-- Parse a WARC record from a gzip-compressed file
SELECT parse_warc(content) FROM read_blob('record.warc.gz');
┌─────────────────────────────────────────────────────────────────────────────┐
│ parse_warc(content) │
│ struct(warc_version varchar, warc_headers varchar, http_version varchar, │
│ http_status integer, http_headers varchar, http_body blob) │
├─────────────────────────────────────────────────────────────────────────────┤
│ {'warc_version': '1.0', 'warc_headers': '{"WARC-Type": "response", ...}', │
│ 'http_version': 'HTTP/1.1', 'http_status': 200, │
│ 'http_headers': '{"content-type": "text/html", ...}', │
│ 'http_body': <!doctype html>...} │
└─────────────────────────────────────────────────────────────────────────────┘
-- Extract specific fields
SELECT
(parse_warc(content)).http_status,
(parse_warc(content)).http_body
FROM read_blob('record.warc.gz');
About warc
The WARC extension parses WARC (Web ARChive) records, the standard format used by Common Crawl and web archiving tools. It enables efficient processing of web archive data directly in DuckDB.
Function
parse_warc(data)
Parse a WARC record and return a struct with all components.
Parameters:
- data (BLOB or VARCHAR): WARC record data (auto-detects gzip compression)
Returns: STRUCT with fields:
- warc_version (VARCHAR): WARC format version (e.g., "1.0")
- warc_headers (VARCHAR): JSON object of WARC headers
- http_version (VARCHAR): HTTP version (e.g., "HTTP/1.1")
- http_status (INTEGER): HTTP status code (e.g., 200)
- http_headers (VARCHAR): JSON object of HTTP headers (lowercase keys)
- http_body (BLOB): Response body content
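To make the returned fields concrete, the following Python sketch splits a single WARC response record into the same six pieces. It is an illustration of the record layout only, not the extension's implementation: the function name `parse_warc_record` and the helper `_parse_headers` are hypothetical, the gzip check relies on the standard gzip magic bytes, and it assumes a well-formed `response` record with a `Content-Length` WARC header.

```python
import gzip
import json


def _parse_headers(lines, lowercase=False):
    """Turn 'Key: value' lines into a dict (hypothetical helper)."""
    headers = {}
    for line in lines:
        key, _, value = line.partition(":")
        headers[key.lower() if lowercase else key] = value.strip()
    return headers


def parse_warc_record(data: bytes) -> dict:
    """Split one WARC response record into the fields listed above."""
    # Gzip streams start with the magic bytes 0x1f 0x8b.
    if data[:2] == b"\x1f\x8b":
        data = gzip.decompress(data)

    # The WARC header block ends at the first blank line.
    warc_block, _, rest = data.partition(b"\r\n\r\n")
    warc_lines = warc_block.decode("utf-8").split("\r\n")
    warc_version = warc_lines[0].split("/", 1)[1]  # "WARC/1.0" -> "1.0"
    warc_headers = _parse_headers(warc_lines[1:])

    # Content-Length bounds the payload, which is a raw HTTP response.
    payload = rest[: int(warc_headers["Content-Length"])]
    http_block, _, body = payload.partition(b"\r\n\r\n")
    http_lines = http_block.decode("iso-8859-1").split("\r\n")
    http_version, status = http_lines[0].split(" ", 2)[:2]
    http_headers = _parse_headers(http_lines[1:], lowercase=True)

    return {
        "warc_version": warc_version,
        "warc_headers": json.dumps(warc_headers),  # JSON text, like the VARCHAR field
        "http_version": http_version,
        "http_status": int(status),
        "http_headers": json.dumps(http_headers),  # keys lowercased
        "http_body": body,
    }
```

Because the two header fields are JSON text, a caller would decode them (here with `json.loads`) before looking up an individual header such as `content-type`.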
Common Crawl Workflow
The recommended workflow for processing Common Crawl data:
- Query the columnar index (Parquet) to find records of interest
- Fetch only the specific byte ranges you need using HTTP Range requests
- Parse with this extension
-- Example: Parse a downloaded Common Crawl record
-- First download: curl -r"46376769-46377713" "https://data.commoncrawl.org/crawl-data/..." > record.warc.gz
SELECT
(parse_warc(content)).http_status,
decode((parse_warc(content)).http_body) as html
FROM read_blob('record.warc.gz');
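The byte range passed to curl above is inclusive, so it can be derived from a record's offset and length as `offset` through `offset + length - 1`. The Python sketch below shows that arithmetic and a possible download step. The helpers `byte_range` and `fetch_record` are hypothetical, and it assumes the index supplies a file path relative to `https://data.commoncrawl.org/` together with the record's offset and length.

```python
import urllib.request

# Host taken from the curl example above; the path after it is elided in the source.
BASE_URL = "https://data.commoncrawl.org/"


def byte_range(offset: int, length: int) -> str:
    """Return an inclusive HTTP Range value for `length` bytes starting at `offset`."""
    return f"bytes={offset}-{offset + length - 1}"


def fetch_record(warc_filename: str, offset: int, length: int) -> bytes:
    """Download one gzip-compressed WARC record (hypothetical helper, needs network access)."""
    request = urllib.request.Request(
        BASE_URL + warc_filename,
        headers={"Range": byte_range(offset, length)},
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

For the range in the curl example, `byte_range(46376769, 945)` yields `bytes=46376769-46377713`. Writing the bytes returned by `fetch_record` to `record.warc.gz` produces a file that the query above can read with `read_blob`.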
Features
- Auto-detects gzip compression
- Handles binary content (skips body for non-text responses)
- HTTP header keys are lowercased for consistent access
- Works with both BLOB and VARCHAR input types
Added Functions
| function_name | function_type | description | comment | examples |
|---|---|---|---|---|
| parse_warc | scalar | NULL | NULL | |