Search Shortcut cmd + k | ctrl + k
Full-Text Search

DuckDB supports full-text search via the fts extension. A full-text index allows for a query to quickly search for all occurrences of individual words within longer text strings.

Example: Shakespeare Corpus

Here’s an example of building a full-text index of Shakespeare’s plays.

CREATE TABLE corpus AS
    SELECT * FROM 'https://blobs.duckdb.org/data/shakespeare.parquet';
DESCRIBE corpus;
┌─────────────┬─────────────┬─────────┐
│ column_name │ column_type │  null   │
├─────────────┼─────────────┼─────────┤
│ line_id     │ VARCHAR     │ YES     │
│ play_name   │ VARCHAR     │ YES     │
│ line_number │ VARCHAR     │ YES     │
│ speaker     │ VARCHAR     │ YES     │
│ text_entry  │ VARCHAR     │ YES     │
└─────────────┴─────────────┴─────────┘

The text of each line is in text_entry, and a unique key for each line is in line_id.

Creating a Full-Text Search Index

First, we create the index, specifying the table name, the unique id column, and the column(s) to index. We will just index the single column text_entry, which contains the text of the lines in the play.

PRAGMA create_fts_index('corpus', 'line_id', 'text_entry');

The table is now ready to query using the Okapi BM25 ranking function. Rows with no match return a null score.

What does Shakespeare say about butter?

SELECT
    fts_main_corpus.match_bm25(line_id, 'butter') AS score,
    line_id, play_name, speaker, text_entry
FROM corpus
WHERE score IS NOT NULL
ORDER BY score DESC;
┌────────────────────┬─────────────┬──────────────────────────┬──────────────┬────────────────────────────────────────────────────┐
│       score        │   line_id   │        play_name         │   speaker    │                     text_entry                     │
│       double       │   varchar   │         varchar          │   varchar    │                      varchar                       │
├────────────────────┼─────────────┼──────────────────────────┼──────────────┼────────────────────────────────────────────────────┤
│  4.427313429798464 │ H4/2.4.494  │ Henry IV                 │ Carrier      │ As fat as butter.                                  │
│  3.836270302568675 │ H4/1.2.21   │ Henry IV                 │ FALSTAFF     │ prologue to an egg and butter.                     │
│  3.836270302568675 │ H4/2.1.55   │ Henry IV                 │ Chamberlain  │ They are up already, and call for eggs and butter; │
│ 3.3844488405497115 │ H4/4.2.21   │ Henry IV                 │ FALSTAFF     │ toasts-and-butter, with hearts in their bellies no │
│ 3.3844488405497115 │ H4/4.2.62   │ Henry IV                 │ PRINCE HENRY │ already made thee butter. But tell me, Jack, whose │
│ 3.3844488405497115 │ AWW/4.1.40  │ Alls well that ends well │ PAROLLES     │ butter-womans mouth and buy myself another of      │
│ 3.3844488405497115 │ AYLI/3.2.93 │ As you like it           │ TOUCHSTONE   │ right butter-womens rank to market.                │
│ 3.3844488405497115 │ KL/2.4.132  │ King Lear                │ Fool         │ kindness to his horse, buttered his hay.           │
│ 3.0278411214953107 │ AWW/5.2.9   │ Alls well that ends well │ Clown        │ henceforth eat no fish of fortunes buttering.      │
│ 3.0278411214953107 │ MWW/2.2.260 │ Merry Wives of Windsor   │ FALSTAFF     │ Hang him, mechanical salt-butter rogue! I will     │
│ 3.0278411214953107 │ MWW/2.2.284 │ Merry Wives of Windsor   │ FORD         │ rather trust a Fleming with my butter, Parson Hugh │
│ 3.0278411214953107 │ MWW/3.5.7   │ Merry Wives of Windsor   │ FALSTAFF     │ Ill have my brains taen out and buttered, and give │
│ 3.0278411214953107 │ MWW/3.5.102 │ Merry Wives of Windsor   │ FALSTAFF     │ to heat as butter; a man of continual dissolution  │
│  2.739219044070792 │ H4/2.4.115  │ Henry IV                 │ PRINCE HENRY │ Didst thou never see Titan kiss a dish of butter?  │
├────────────────────┴─────────────┴──────────────────────────┴──────────────┴────────────────────────────────────────────────────┤
│ 14 rows                                                                                                               5 columns │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Unlike standard indexes, full-text indexes don’t auto-update as the underlying data is changed, so you need to PRAGMA drop_fts_index(my_fts_index) and recreate it when appropriate.

Note on Generating the Corpus Table

For more details, see the “Generating a Shakespeare corpus for full-text searching from JSON” blog post

  • The Columns are: line_id, play_name, line_number, speaker, text_entry.
  • We need a unique key for each row in order for full-text searching to work.
  • The line_id “KL/2.4.132” means King Lear, Act 2, Scene 4, Line 132.
About this page

Last modified: 2024-03-02