Search Shortcut cmd + k | ctrl + k
zim

Read .zim (Kiwix / openZIM) archives directly in DuckDB via libzim, from local files or remote S3/HTTP — offline Wikipedia, WikiMed, Stack Exchange, iFixit, and more, with a zim:// filesystem and full-text search.

Maintainer(s): teaguesterling

Installing and Loading

INSTALL zim FROM community;
LOAD zim;

Example

-- Load the extension
LOAD zim;

-- List every content entry in an archive (one row per entry).
-- A listing scan is lazy: it never decompresses the article bodies.
SELECT path, title, mimetype FROM read_zim('wikipedia.zim') LIMIT 10;

-- A bare FROM on a .zim file is rewritten via a replacement scan.
SELECT count(*) FROM 'wikipedia.zim';

-- Read across many archives at once: a glob or a LIST of paths.
SELECT * FROM read_zim('archives/*.zim');
SELECT * FROM read_zim(['wikimed.zim', 'wikipedia.zim']);

-- Read a remote archive over S3/HTTP (LOAD httpfs first): byte-range requests
-- fetch only the bytes a query touches, not the whole multi-GB file.
SELECT zim_main_entry('https://dumps.wikimedia.org/.../wikipedia_en_simple.zim');

-- Narrow the scan: exact lookup, prefix listing, or a mimetype filter.
SELECT * FROM read_zim('wikipedia.zim', path := 'A/Berlin');
SELECT * FROM read_zim('wikipedia.zim', title_prefix := 'Aspirin');
SELECT * FROM read_zim('wikipedia.zim', mimetype := 'text/html');

-- Bulk content, gated by mimetype: only matching entries are decompressed,
-- everything else keeps its row with NULL content (pushdown, not a filter).
SELECT path, content FROM read_zim('wikipedia.zim', include_content := 'text/html');

-- The zim:// filesystem: any path-reading function can address an entry inside
-- the archive (read_text / read_blob, or webbed's read_html, etc.).
SELECT * FROM read_text('zim://wikipedia.zim/A/Berlin');
SELECT * FROM read_blob('zim://wikipedia.zim/I/logo.png');
SELECT * FROM read_text('zim://wikipedia.zim/A/B*');   -- glob over content paths

-- Full-text search over the archive's Xapian index (native builds).
SELECT path, title, score, snippet
FROM zim_search('wikipedia.zim', 'photosynthesis', max_results := 20);

-- Title autocomplete (works on every build, incl. WebAssembly).
SELECT path, title FROM zim_suggest('wikipedia.zim', 'Photosyn');

-- Federated: search across a whole shelf at once (glob or LIST); the `file`
-- column says which archive each hit came from.
SELECT file, path, title FROM zim_search('library/*.zim', 'insulin', max_results := 5);

-- Single-entry access, metadata, and utilities.
SELECT zim_get_text('wikipedia.zim', 'A/Berlin');
SELECT zim_main_entry('wikipedia.zim');
SELECT * FROM read_zim_metadata('wikipedia.zim');
SELECT zim_counter('wikipedia.zim');         -- mimetype -> count histogram
SELECT zim_info('wikipedia.zim');            -- counts, flags, uuid
SELECT zim_illustration('wikipedia.zim');    -- cover image / favicon (BLOB)
SELECT zim_random('wikipedia.zim');          -- a random entry's path
SELECT zim_check('wikipedia.zim');           -- integrity check

About zim

The zim extension reads .zim files — the archive format produced by openZIM and served by Kiwix — directly inside DuckDB. ZIM packages an entire website into a single compressed, content-addressed file: offline Wikipedia, WikiMed, Wiktionary, Project Gutenberg, Stack Exchange, iFixit, TED, and hundreds of other libraries. This extension turns any such archive into a SQL-queryable table — or a queryable filesystem — without unpacking it.

Reading archives: read_zim() returns one row per content entry — path, title, mimetype, and the (lazily materialized) content. The scan supports projection pushdown and lazy content: a query that only inspects the listing never decompresses article bodies. Point it at a single file, a glob, or a LIST(VARCHAR) to read many archives at once, and a bare FROM 'archive.zim' is rewritten automatically via a replacement scan. Narrow the scan with exact lookup (path / title), prefix listing (path_prefix / title_prefix), a mimetype filter, or listing := 'path' | 'title'. include_content accepts a mimetype or list of mimetypes to decompress content for only the entries you want, and WHERE path/title/mimetype predicates are pushed down into libzim.

Local or remote: archives are read from local files or, with httpfs loaded, straight from S3/HTTP via byte-range requests — a query against a multi-GB remote archive fetches only the bytes it touches, and recently-used remote handles are kept warm between calls.

The zim:// filesystem: a read-only filesystem registers zim://, so any path-reading function — DuckDB's own read_text / read_blob, or webbed's read_html, or anything else that goes through the file layer — can address an entry inside an archive (zim://wikipedia.zim/A/Photosynthesis), including globs. Redirects resolve like symlinks. This composes the whole ecosystem over ZIM contents with no coupling and no GPL linkage into those extensions.

Full-text search & suggestions: when the archive carries a Xapian index, zim_search() queries it, returning (path, title, score, snippet, file) ranked by relevance; zim_suggest() gives title autocomplete and works on every build (it falls back to a title-prefix listing without Xapian, so it's available on WebAssembly too, where full-text search is not). Both are federated: the first argument can be a single path, a glob, or a LIST, so a query runs across many archives at once.

Metadata, scalars & utilities: read_zim_metadata(), zim_metadata() / zim_metadata_keys(), zim_counter() (the self-describing mimetype histogram), and zim_info() (counts, flags, uuid); plus zim_get_content / zim_get_text, zim_has_entry, zim_redirect_target, zim_mimetype, zim_main_entry, zim_illustration (cover image), zim_random, and zim_check (integrity).

Every VARCHAR output is binary-safe — non-UTF-8 bytes come back as NULL rather than being mangled. Built on libzim; licensed GPL-2.0-or-later, inherited from libzim.

Added Functions

function_name function_type description comment examples
read_zim table NULL NULL  
read_zim_metadata table NULL NULL  
zim_check scalar NULL NULL  
zim_counter scalar NULL NULL  
zim_get_content scalar NULL NULL  
zim_get_text scalar NULL NULL  
zim_has_entry scalar NULL NULL  
zim_illustration scalar NULL NULL  
zim_info scalar NULL NULL  
zim_main_entry scalar NULL NULL  
zim_metadata scalar NULL NULL  
zim_metadata_keys scalar NULL NULL  
zim_mimetype scalar NULL NULL  
zim_random scalar NULL NULL  
zim_redirect_target scalar NULL NULL  
zim_search table NULL NULL  
zim_suggest table NULL NULL  

Overloaded Functions

This extension does not add any function overloads.

Added Types

This extension does not add any types.

Added Settings

This extension does not add any settings.