Read .zim (Kiwix / openZIM) archives directly in DuckDB via libzim, from local files or remote S3/HTTP — offline Wikipedia, WikiMed, Stack Exchange, iFixit, and more, with a zim:// filesystem and full-text search.
Installing and Loading
INSTALL zim FROM community;
LOAD zim;
Example
-- Load the extension
LOAD zim;
-- List every content entry in an archive (one row per entry).
-- A listing scan is lazy: it never decompresses the article bodies.
SELECT path, title, mimetype FROM read_zim('wikipedia.zim') LIMIT 10;
-- A bare FROM on a .zim file is rewritten via a replacement scan.
SELECT count(*) FROM 'wikipedia.zim';
-- Read across many archives at once: a glob or a LIST of paths.
SELECT * FROM read_zim('archives/*.zim');
SELECT * FROM read_zim(['wikimed.zim', 'wikipedia.zim']);
-- Read a remote archive over S3/HTTP (LOAD httpfs first): byte-range requests
-- fetch only the bytes a query touches, not the whole multi-GB file.
SELECT zim_main_entry('https://dumps.wikimedia.org/.../wikipedia_en_simple.zim');
-- Narrow the scan: exact lookup, prefix listing, or a mimetype filter.
SELECT * FROM read_zim('wikipedia.zim', path := 'A/Berlin');
SELECT * FROM read_zim('wikipedia.zim', title_prefix := 'Aspirin');
SELECT * FROM read_zim('wikipedia.zim', mimetype := 'text/html');
-- Bulk content, gated by mimetype: only matching entries are decompressed,
-- everything else keeps its row with NULL content (pushdown, not a filter).
SELECT path, content FROM read_zim('wikipedia.zim', include_content := 'text/html');
-- The zim:// filesystem: any path-reading function can address an entry inside
-- the archive (read_text / read_blob, or webbed's read_html, etc.).
SELECT * FROM read_text('zim://wikipedia.zim/A/Berlin');
SELECT * FROM read_blob('zim://wikipedia.zim/I/logo.png');
SELECT * FROM read_text('zim://wikipedia.zim/A/B*'); -- glob over content paths
-- Full-text search over the archive's Xapian index (native builds).
SELECT path, title, score, snippet
FROM zim_search('wikipedia.zim', 'photosynthesis', max_results := 20);
-- Title autocomplete (works on every build, incl. WebAssembly).
SELECT path, title FROM zim_suggest('wikipedia.zim', 'Photosyn');
-- Federated: search across a whole shelf at once (glob or LIST); the `file`
-- column says which archive each hit came from.
SELECT file, path, title FROM zim_search('library/*.zim', 'insulin', max_results := 5);
-- Single-entry access, metadata, and utilities.
SELECT zim_get_text('wikipedia.zim', 'A/Berlin');
SELECT zim_main_entry('wikipedia.zim');
SELECT * FROM read_zim_metadata('wikipedia.zim');
SELECT zim_counter('wikipedia.zim'); -- mimetype -> count histogram
SELECT zim_info('wikipedia.zim'); -- counts, flags, uuid
SELECT zim_illustration('wikipedia.zim'); -- cover image / favicon (BLOB)
SELECT zim_random('wikipedia.zim'); -- a random entry's path
SELECT zim_check('wikipedia.zim'); -- integrity check
About zim
The zim extension reads .zim files — the archive format produced by
openZIM and served by Kiwix —
directly inside DuckDB. ZIM packages an entire website into a single
compressed, content-addressed file: offline Wikipedia, WikiMed, Wiktionary,
Project Gutenberg, Stack Exchange, iFixit, TED, and hundreds of other
libraries. This extension turns any such archive into a SQL-queryable table —
or a queryable filesystem — without unpacking it.
Reading archives: read_zim() returns one row per content entry —
path, title, mimetype, and the (lazily materialized) content. The
scan supports projection pushdown and lazy content: a query that only
inspects the listing never decompresses article bodies. Point it at a single
file, a glob, or a LIST(VARCHAR) to read many archives at once, and a
bare FROM 'archive.zim' is rewritten automatically via a replacement scan.
Narrow the scan with exact lookup (path / title), prefix listing
(path_prefix / title_prefix), a mimetype filter, or
listing := 'path' | 'title'. include_content accepts a mimetype or list
of mimetypes to decompress content for only the entries you want, and
WHERE path/title/mimetype predicates are pushed down into libzim.
Local or remote: archives are read from local files or, with httpfs
loaded, straight from S3/HTTP via byte-range requests — a query against a
multi-GB remote archive fetches only the bytes it touches, and recently-used
remote handles are kept warm between calls.
The zim:// filesystem: a read-only filesystem registers zim://, so any
path-reading function — DuckDB's own read_text / read_blob, or
webbed's read_html, or
anything else that goes through the file layer — can address an entry inside an
archive (zim://wikipedia.zim/A/Photosynthesis), including globs. Redirects
resolve like symlinks. This composes the whole ecosystem over ZIM contents with
no coupling and no GPL linkage into those extensions.
Full-text search & suggestions: when the archive carries a Xapian index,
zim_search() queries it, returning (path, title, score, snippet, file) ranked
by relevance; zim_suggest() gives title autocomplete and works on every build
(it falls back to a title-prefix listing without Xapian, so it's available on
WebAssembly too, where full-text search is not). Both are federated: the first
argument can be a single path, a glob, or a LIST, so a query runs across many
archives at once.
Metadata, scalars & utilities: read_zim_metadata(), zim_metadata() /
zim_metadata_keys(), zim_counter() (the self-describing mimetype histogram),
and zim_info() (counts, flags, uuid); plus zim_get_content / zim_get_text,
zim_has_entry, zim_redirect_target, zim_mimetype, zim_main_entry,
zim_illustration (cover image), zim_random, and zim_check (integrity).
Every VARCHAR output is binary-safe — non-UTF-8 bytes come back as NULL
rather than being mangled. Built on libzim;
licensed GPL-2.0-or-later, inherited from libzim.
Added Functions
| function_name | function_type | description | comment | examples |
|---|---|---|---|---|
| read_zim | table | NULL | NULL | |
| read_zim_metadata | table | NULL | NULL | |
| zim_check | scalar | NULL | NULL | |
| zim_counter | scalar | NULL | NULL | |
| zim_get_content | scalar | NULL | NULL | |
| zim_get_text | scalar | NULL | NULL | |
| zim_has_entry | scalar | NULL | NULL | |
| zim_illustration | scalar | NULL | NULL | |
| zim_info | scalar | NULL | NULL | |
| zim_main_entry | scalar | NULL | NULL | |
| zim_metadata | scalar | NULL | NULL | |
| zim_metadata_keys | scalar | NULL | NULL | |
| zim_mimetype | scalar | NULL | NULL | |
| zim_random | scalar | NULL | NULL | |
| zim_redirect_target | scalar | NULL | NULL | |
| zim_search | table | NULL | NULL | |
| zim_suggest | table | NULL | NULL |
Overloaded Functions
This extension does not add any function overloads.
Added Types
This extension does not add any types.
Added Settings
This extension does not add any settings.