Read text, metadata, words, lines, and tables from PDF files, with optional Tesseract OCR for scanned pages.
Installing and Loading
INSTALL pdf FROM community;
LOAD pdf;
Example
-- Load the extension
LOAD pdf;
-- One row per page (filename, page, page_count, text, width, height).
-- Accepts a single file, a list, or a glob.
SELECT page, text
FROM read_pdf('report.pdf');
-- Search across many PDFs at once
SELECT filename, page
FROM read_pdf('reports/*.pdf')
WHERE text ILIKE '%revenue%';
-- Document metadata, one row per file
SELECT title, author, pages FROM read_pdf_meta('report.pdf');
-- Extract tabular regions from digital PDFs
SELECT * FROM read_pdf_tables('financial_report.pdf');
-- Whole document as plain text (scalar)
SELECT pdf_to_text('report.pdf') AS full_text;
About pdf
The pdf extension brings native PDF reading to DuckDB using the Poppler
library for rendering and Tesseract for OCR of scanned pages.
All table functions accept a single path, a list of paths, or a glob
('docs/*.pdf'). Shared named parameters: first_page, last_page,
password, layout ('reading' | 'physical' | 'raw'), and the OCR knobs
ocr, auto_ocr, ocr_language, ocr_dpi, ocr_psm, ocr_oem,
tessdata_dir.
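As a minimal sketch of the shared parameters, the queries below restrict the page range, choose a layout mode, pass a list of paths, and open an encrypted file. The file names and the password value are placeholders, and the sketch assumes that first_page and last_page are 1-based and inclusive, which the text above does not state.

```sql
-- Pages 2 through 5 only, using the 'physical' layout mode.
-- Assumes first_page/last_page are 1-based and inclusive.
SELECT page, text
FROM read_pdf('report.pdf', first_page := 2, last_page := 5, layout := 'physical');

-- A list of paths instead of a single file or a glob (hypothetical file names).
SELECT filename, page_count
FROM read_pdf(['q1.pdf', 'q2.pdf'])
WHERE page = 1;

-- An encrypted document; 'secret' is a placeholder password.
SELECT page, text
FROM read_pdf('protected.pdf', password := 'secret');
```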
Table functions
- read_pdf(path, ...): one row per page; columns: filename, page, page_count, text, width, height (page size in PDF points).
- read_pdf_lines(path, ...): one row per layout-preserving line of text; columns: filename, page, line, text. A PDF-aware analog to read_lines.
- read_pdf_meta(path): one row per file; columns: filename, title, author, subject, keywords, creator, producer, pages, pdf_version, encrypted.
- read_pdf_words(path, ...): one row per word; columns: filename, page, word, x0, y0, x1, y1 (bounding box in PDF user-space points, origin bottom-left), font_name, font_size.
- read_pdf_tables(path, ...): extracts tabular regions from digital PDFs; columns: filename, page, table_index, row_index, cells (VARCHAR[]).
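The word-level and line-level functions can be combined with ordinary SQL filters and aggregates. The sketch below uses only the columns listed above; the 14-point threshold and the file names are arbitrary illustration values, and ordering words by y0 and x0 is only a rough approximation of reading order.

```sql
-- Words set in a large font, gathered per page, as a rough heading detector.
-- The origin is bottom-left, so a larger y0 means higher on the page.
SELECT page,
       string_agg(word, ' ' ORDER BY y0 DESC, x0) AS large_text
FROM read_pdf_words('report.pdf')
WHERE font_size >= 14          -- arbitrary threshold for illustration
GROUP BY page
ORDER BY page;

-- Lines that mention a keyword, with their position in each document.
SELECT filename, page, line, text
FROM read_pdf_lines('reports/*.pdf')
WHERE text ILIKE '%revenue%'
ORDER BY filename, page, line;
```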
Scalar functions
- pdf_to_text(path[, layout]): entire document as a plain-text VARCHAR.
- pdf_to_html(path): document rendered to HTML.
- pdf_to_xml(path): document rendered to XML (Poppler pdftoxml format).
- pdf_to_svg(path, page): a single page rendered to SVG.
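Because the scalar functions take a path value, they can be applied to each row of another query. The sketch below assumes the optional layout argument of pdf_to_text is passed positionally, as the signature above suggests, and uses DuckDB's built-in glob table function to enumerate files; the folder name is a placeholder.

```sql
-- Whole document as text, using the 'physical' layout mode.
SELECT pdf_to_text('report.pdf', 'physical') AS full_text;

-- Character count per document in a folder.
-- glob() is a built-in DuckDB table function that returns a column named "file".
SELECT file, length(pdf_to_text(file)) AS n_chars
FROM glob('reports/*.pdf')
ORDER BY n_chars DESC;
```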
OCR (scanned / image-only PDFs)
Pages with no extractable text layer are OCR'd automatically (auto_ocr,
on by default); pass ocr := true to force OCR on every page. OCR requires a
Tesseract language model at runtime — package managers do not ship one — but
once you install one the usual way it works with no configuration: the
extension auto-detects the standard model directories used by Homebrew
(brew install tesseract tesseract-lang), apt
(apt-get install tesseract-ocr tesseract-ocr-eng), and the Windows
installer. To use a non-standard location, pass
tessdata_dir := '/path/to/tessdata' per query, or set the TESSDATA_PREFIX
environment variable (resolution order: tessdata_dir → TESSDATA_PREFIX →
auto-detected paths). Select the language with ocr_language
(e.g. ocr_language := 'deu'). If no model is found anywhere, OCR raises a
clear, actionable error rather than returning empty text.
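The OCR parameters described above might be used as follows. This is a sketch with placeholder file names; the last query assumes that a page without a text layer yields NULL or empty text when auto_ocr is switched off, which is not confirmed above.

```sql
-- Force OCR on every page, using the German language model.
SELECT page, text
FROM read_pdf('scan.pdf', ocr := true, ocr_language := 'deu');

-- Use a model directory in a non-standard location for this query only.
SELECT page, text
FROM read_pdf('scan.pdf', tessdata_dir := '/path/to/tessdata');

-- Turn automatic OCR off to find pages that have no text layer.
-- Assumes such pages come back with NULL or empty text.
SELECT page
FROM read_pdf('mixed.pdf', auto_ocr := false)
WHERE text IS NULL OR trim(text) = '';
```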
Table extraction: scope
read_pdf_tables uses a precision-first geometric heuristic (word
bounding-box column clustering with a regularity gate) on digital PDFs. It
reliably handles clean, aligned tables and avoids emitting spurious tables
from prose, but it does not do ML-based table-structure recognition — merged
cells, borderless/sparse tables, and scanned tables are out of scope. For
state-of-the-art document understanding, reach for tools like docling,
marker, or a cloud Document AI service; this extension targets the ~80% of
everyday text/word/line/metadata/simple-table extraction directly in SQL.
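Since each extracted row arrives as a VARCHAR[] in the cells column, individual columns can be pulled out by list index. The sketch below assumes a table with at least two columns; DuckDB list indexing is 1-based, and an out-of-range index returns NULL rather than raising an error.

```sql
-- One output row per table row, with the first two cells split into columns.
SELECT page,
       table_index,
       row_index,
       cells[1]   AS col_1,
       cells[2]   AS col_2,
       len(cells) AS n_cells
FROM read_pdf_tables('financial_report.pdf')
ORDER BY page, table_index, row_index;

-- Size of each detected table, useful for spotting partial extractions.
SELECT page, table_index, count(*) AS n_rows, max(len(cells)) AS n_cols
FROM read_pdf_tables('financial_report.pdf')
GROUP BY page, table_index
ORDER BY page, table_index;
```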
Platform support
Linux (x86_64, arm64), macOS (x86_64, arm64), and Windows (x64, MSVC) are supported — all dependencies are resolved through vcpkg and statically linked. The mingw/rtools Windows variants and windows_arm64 are excluded (different toolchain / untested), and WebAssembly is excluded because Poppler and Tesseract cannot be linked into the wasm target.
License
GPL-2.0-or-later. Poppler is GPL-2.0; statically linking it requires the combined work to be distributed under the GPL.
Added Functions
| function_name | function_type | description | comment | examples |
|---|---|---|---|---|
| pdf_to_text | scalar | NULL | NULL | |
| read_pdf | table | NULL | NULL | |
| read_pdf_lines | table | NULL | NULL | |
| read_pdf_meta | table | NULL | NULL | |
| read_pdf_tables | table | NULL | NULL | |
| read_pdf_words | table | NULL | NULL | |
Overloaded Functions
This extension does not add any function overloads.
Added Types
This extension does not add any types.
Added Settings
This extension does not add any settings.