pdf

Read text, metadata, words, lines, and tables from PDF files, with optional Tesseract OCR for scanned pages.

Maintainer(s): asubbarao

Installing and Loading

INSTALL pdf FROM community;
LOAD pdf;

Example

-- Load the extension
LOAD pdf;

-- One row per page (filename, page, page_count, text, width, height).
-- Accepts a single file, a list, or a glob.
SELECT page, text
FROM read_pdf('report.pdf');

-- Search across many PDFs at once
SELECT filename, page
FROM read_pdf('reports/*.pdf')
WHERE text ILIKE '%revenue%';

-- Document metadata, one row per file
SELECT title, author, pages FROM read_pdf_meta('report.pdf');

-- Extract tabular regions from digital PDFs
SELECT * FROM read_pdf_tables('financial_report.pdf');

-- Whole document as plain text (scalar)
SELECT pdf_to_text('report.pdf') AS full_text;

About pdf

The pdf extension brings native PDF reading to DuckDB using the Poppler library for rendering and Tesseract for OCR of scanned pages.

All table functions accept a single path, a list of paths, or a glob ('docs/*.pdf'). Shared named parameters: first_page, last_page, password, layout ('reading' | 'physical' | 'raw'), and the OCR knobs ocr, auto_ocr, ocr_language, ocr_dpi, ocr_psm, ocr_oem, tessdata_dir.
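
As a minimal sketch of how the shared parameters combine, the queries below restrict a read to a page range and open an encrypted file. The file names and the password value are hypothetical, and the sketch assumes that every shared parameter is passed with the same `:=` named-argument syntax this page uses for the OCR options.

```sql
-- Hypothetical files: read pages 2 through 5 of each, in physical layout.
SELECT filename, page, text
FROM read_pdf(
    ['q1.pdf', 'q2.pdf'],        -- a list of paths instead of a single path or glob
    first_page := 2,
    last_page  := 5,
    layout     := 'physical'
);

-- Encrypted document; the password value is a placeholder.
SELECT page, text
FROM read_pdf('protected.pdf', password := 'secret');
```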

Table functions

  • read_pdf(path, ...) — one row per page; columns: filename, page, page_count, text, width, height (page size in PDF points).
  • read_pdf_lines(path, ...) — one row per layout-preserving line of text; columns: filename, page, line, text. A PDF-aware analog to read_lines.
  • read_pdf_meta(path) — one row per file; columns: filename, title, author, subject, keywords, creator, producer, pages, pdf_version, encrypted.
  • read_pdf_words(path, ...) — one row per word; columns: filename, page, word, x0, y0, x1, y1 (bounding box in PDF user-space points, origin bottom-left), font_name, font_size.
  • read_pdf_tables(path, ...) — extracts tabular regions from digital PDFs; columns: filename, page, table_index, row_index, cells (VARCHAR[]).
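
The column lists above suggest a few everyday queries, sketched here. File names are hypothetical. The second and third queries assume that pages are numbered from 1 and that `filename` and `page` values match across the table functions, neither of which is stated on this page.

```sql
-- Lines that mention an invoice number, with their position in each document.
SELECT filename, page, line, text
FROM read_pdf_lines('invoices/*.pdf')
WHERE text ILIKE '%invoice no%';

-- Largest text on the first page, often a title or heading
-- (assumes the first page is numbered 1).
SELECT word, font_name, font_size
FROM read_pdf_words('report.pdf')
WHERE page = 1
ORDER BY font_size DESC
LIMIT 10;

-- Words in the top 10% of each page. The origin is bottom-left,
-- so the top of the page has the largest y values.
SELECT w.page, w.word, w.x0, w.y0
FROM read_pdf_words('report.pdf') AS w
JOIN read_pdf('report.pdf') AS p USING (filename, page)
WHERE w.y0 > 0.9 * p.height;
```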

Scalar functions

  • pdf_to_text(path[, layout]) — entire document as a plain-text VARCHAR.
  • pdf_to_html(path) — document rendered to HTML.
  • pdf_to_xml(path) — document rendered to XML (Poppler pdftoxml format).
  • pdf_to_svg(path, page) — a single page rendered to SVG.
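
A short sketch of the scalar functions in use, with hypothetical file names. It assumes that the optional `layout` argument of pdf_to_text is positional (as the signature above suggests), that pdf_to_svg counts pages from 1, and that the scalars accept a column as their path argument.

```sql
-- Whole document as text, using the 'physical' layout.
SELECT pdf_to_text('report.pdf', 'physical') AS full_text;

-- First page rendered to SVG (assumes 1-based page numbers).
SELECT pdf_to_svg('report.pdf', 1) AS svg;

-- Text length per file, driving the scalar from DuckDB's glob() table function.
SELECT file, length(pdf_to_text(file)) AS n_chars
FROM glob('reports/*.pdf');
```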

OCR (scanned / image-only PDFs)

Pages with no extractable text layer are OCR'd automatically (auto_ocr, on by default); pass ocr := true to force OCR on every page. OCR requires a Tesseract language model at runtime (package managers do not ship one), but once you install one the usual way it works with no configuration: the extension auto-detects the standard model directories used by Homebrew (brew install tesseract tesseract-lang), apt (apt-get install tesseract-ocr tesseract-ocr-eng), and the Windows installer.

To use a non-standard location, pass tessdata_dir := '/path/to/tessdata' per query, or set the TESSDATA_PREFIX environment variable (resolution order: tessdata_dir → TESSDATA_PREFIX → auto-detected paths). Select the language with ocr_language (e.g. ocr_language := 'deu'). If no model is found anywhere, OCR raises a clear, actionable error rather than returning empty text.
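
The OCR options might be combined as in the sketch below. The file names are hypothetical, and the value types are assumptions: ocr_dpi is assumed to take an integer and auto_ocr a boolean.

```sql
-- Force OCR on every page of a scanned German document.
SELECT page, text
FROM read_pdf('scan.pdf', ocr := true, ocr_language := 'deu', ocr_dpi := 300);

-- Use a model directory in a non-standard location for this query only.
SELECT page, text
FROM read_pdf('scan.pdf', tessdata_dir := '/path/to/tessdata');

-- Skip automatic OCR; pages without a text layer are then not OCR'd.
SELECT page, text
FROM read_pdf('mixed.pdf', auto_ocr := false);
```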

Table extraction: scope

read_pdf_tables uses a precision-first geometric heuristic (word bounding-box column clustering with a regularity gate) on digital PDFs. It reliably handles clean, aligned tables and avoids emitting spurious tables from prose, but it does not do ML-based table-structure recognition — merged cells, borderless/sparse tables, and scanned tables are out of scope. For state-of-the-art document understanding, reach for tools like docling, marker, or a cloud Document AI service; this extension targets the ~80% of everyday text/word/line/metadata/simple-table extraction directly in SQL.
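
Because each extracted row arrives as a VARCHAR[] in `cells`, a little post-processing is usually needed. The sketch below shows two options using standard DuckDB list operations; the file name is the one from the example at the top of this page, and the number of columns in the table is an assumption.

```sql
-- Wide format: pick cells by position (DuckDB lists are 1-indexed).
SELECT page, table_index, row_index,
       cells[1]   AS col_1,
       cells[2]   AS col_2,
       len(cells) AS n_cells
FROM read_pdf_tables('financial_report.pdf')
ORDER BY page, table_index, row_index;

-- Long format: one output row per cell.
SELECT page, table_index, row_index, unnest(cells) AS cell
FROM read_pdf_tables('financial_report.pdf');
```

Checking `n_cells` per table is a cheap way to spot rows where the heuristic split columns inconsistently.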

Platform support

Linux (x86_64, arm64), macOS (x86_64, arm64), and Windows (x64, MSVC) are supported — all dependencies are resolved through vcpkg and statically linked. The mingw/rtools Windows variants and windows_arm64 are excluded (different toolchain / untested), and WebAssembly is excluded because Poppler and Tesseract cannot be linked into the wasm target.

License

GPL-2.0-or-later. Poppler is GPL-2.0; statically linking it requires the combined work to be distributed under the GPL.

Added Functions

function_name    function_type  description  comment  examples
pdf_to_text      scalar         NULL         NULL
read_pdf         table          NULL         NULL
read_pdf_lines   table          NULL         NULL
read_pdf_meta    table          NULL         NULL
read_pdf_tables  table          NULL         NULL
read_pdf_words   table          NULL         NULL
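
To see which functions a given build actually registers, DuckDB's duckdb_functions() catalog function can be queried after loading the extension. The name filter here is only a heuristic assumption that every function added by the extension contains "pdf" in its name.

```sql
LOAD pdf;

-- List the functions registered under names containing 'pdf'.
SELECT DISTINCT function_name, function_type
FROM duckdb_functions()
WHERE function_name LIKE '%pdf%'
ORDER BY function_name;
```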

Overloaded Functions

This extension does not add any function overloads.

Added Types

This extension does not add any types.

Added Settings

This extension does not add any settings.