Search Shortcut cmd + k | ctrl + k
zim

Read .zim (Kiwix / openZIM) archives directly in DuckDB via libzim — offline Wikipedia, WikiMed, Stack Exchange, iFixit, and more.

Maintainer(s): teaguesterling

Installing and Loading

INSTALL zim FROM community;
LOAD zim;

Example

-- Load the extension
LOAD zim;

-- List every content entry in an archive (one row per entry).
-- A listing scan is lazy: it never decompresses the article bodies.
SELECT path, title, mimetype FROM read_zim('wikipedia.zim') LIMIT 10;

-- A bare FROM on a .zim file is rewritten via a replacement scan.
SELECT count(*) FROM 'wikipedia.zim';

-- Read across many archives at once: a glob or a LIST of paths.
SELECT * FROM read_zim('archives/*.zim');
SELECT * FROM read_zim(['wikimed.zim', 'wikipedia.zim']);

-- Narrow the scan: exact lookup, prefix listing, or a mimetype filter.
SELECT * FROM read_zim('wikipedia.zim', path := 'A/Berlin');
SELECT * FROM read_zim('wikipedia.zim', title_prefix := 'Aspirin');
SELECT * FROM read_zim('wikipedia.zim', mimetype := 'text/html');

-- Iterate by title (front articles) instead of by path.
SELECT * FROM read_zim('wikipedia.zim', listing := 'title');

-- Pull the bytes / text of a single entry by path.
SELECT zim_get_text('wikipedia.zim', 'A/Berlin');
SELECT zim_get_content('wikipedia.zim', 'I/logo.png');   -- BLOB
SELECT zim_has_entry('wikipedia.zim', 'A/Berlin');
SELECT zim_mimetype('wikipedia.zim', 'A/Berlin');
SELECT zim_redirect_target('wikipedia.zim', 'A/NYC');
SELECT zim_main_entry('wikipedia.zim');

-- Archive metadata and the self-describing mimetype histogram.
SELECT * FROM read_zim_metadata('wikipedia.zim');
SELECT zim_metadata('wikipedia.zim', 'Title');
SELECT zim_metadata_keys('wikipedia.zim');
SELECT zim_counter('wikipedia.zim');   -- mimetype -> count
SELECT zim_info('wikipedia.zim');      -- counts, flags, uuid

About zim

The zim extension reads .zim files — the archive format produced by openZIM and served by Kiwix — directly inside DuckDB. ZIM packages an entire website into a single compressed, content-addressed file: offline Wikipedia, WikiMed, Wiktionary, Project Gutenberg, Stack Exchange, iFixit, TED, and hundreds of other libraries. This extension turns any such archive into a SQL-queryable table without unpacking it.

Reading archives: read_zim() returns one row per content entry — path, title, mimetype, and the (lazily materialized) content. The scan supports projection pushdown and lazy content: a query that only inspects the listing never decompresses article bodies. Point it at a single file, a glob, or a LIST(VARCHAR) to read many archives at once, and a bare FROM 'archive.zim' is rewritten automatically via a replacement scan. Narrow the scan with exact lookup (path / title), prefix listing (path_prefix / title_prefix), a mimetype filter, or listing := 'path' | 'title'. Conflicting parameter modes are rejected rather than silently resolved.

Single-entry access: zim_get_text() and zim_get_content() (BLOB) fetch one entry by path; zim_has_entry(), zim_mimetype(), zim_redirect_target(), and zim_main_entry() answer the common point lookups without a full scan.

Metadata: read_zim_metadata() and the zim_metadata() / zim_metadata_keys() scalars expose the archive's metadata (title, language, publisher, illustration, …). zim_counter() returns the self-describing mimetype histogram embedded in the archive, and zim_info() reports entry counts, flags, and the archive UUID.

Correctness: every VARCHAR output is binary-safe — non-UTF-8 bytes come back as NULL rather than being mangled — so the extension is safe to point at arbitrary real-world archives.

Built on libzim. This release is search-less (built without Xapian); full-text search (zim_search), a zim:// virtual filesystem, and ATTACH support are planned for later phases. Licensed GPL-2.0-or-later, inherited from libzim.

Added Functions

function_name function_type description comment examples
read_zim table NULL NULL  
read_zim_metadata table NULL NULL  
zim_counter scalar NULL NULL  
zim_get_content scalar NULL NULL  
zim_get_text scalar NULL NULL  
zim_has_entry scalar NULL NULL  
zim_info scalar NULL NULL  
zim_main_entry scalar NULL NULL  
zim_metadata scalar NULL NULL  
zim_metadata_keys scalar NULL NULL  
zim_mimetype scalar NULL NULL  
zim_redirect_target scalar NULL NULL  

Overloaded Functions

This extension does not add any function overloads.

Added Types

This extension does not add any types.

Added Settings

This extension does not add any settings.