Vector Similarity Search in DuckDB

Max Gabrielsson2024-05-03

TL;DR: This blog post shows a preview of DuckDB's new vss extension, which introduces support for HNSW (Hierarchical Navigable Small Worlds) indexes to accelerate vector similarity search.

In DuckDB v0.10.0, we introduced the ARRAY data type, which stores fixed-sized lists, to complement the existing variable-size LIST data type.

The initial motivation for adding this data type was to provide optimized operations for lists that can utilize the positional semantics of their child elements and avoid branching as all lists have the same length. Think e.g., the sort of array manipulations you'd do in NumPy: stacking, shifting, multiplying – you name it. Additionally, we wanted to improve our interoperability with Apache Arrow, as previously Arrow's fixed-size list types would be converted to regular variable-size lists when ingested into DuckDB, losing some type information.

However, as the hype for vector embeddings and semantic similarity search was growing, we also snuck in a couple of distance metric functions for this new ARRAY type: array_distance, array_inner_product and array_cosine_similarity

If you're one of today's lucky 10,000 and haven't heard of word embeddings or vector search, the short version is that it's a technique used to represent documents, images, entities – data as high-dimensional vectors and then search for similar vectors in a vector space, using some sort of mathematical "distance" expression to measure similarity. This is used in a wide range of applications, from natural language processing to recommendation systems and image recognition, and has recently seen a surge in popularity due to the advent of generative AI and availability of pre-trained models.

This got the community really excited! While we (DuckDB Labs) initially went on record saying that we would not be adding a vector similarity search index to DuckDB as we deemed it to be too far out of scope, we were very interested in supporting custom indexes through extensions in general. Shoot, I've been personally nagging on about wanting to plug-in an "R-Tree" index since the inception of DuckDBs spatial extension! So when one of our client projects evolved into creating a proof-of-concept custom "HNSW" index extension, we said that we'd give it a shot. And… well, one thing led to another.

Fast forward to now and we're happy to announce the availability of the vss vector similarity search extension for DuckDB! While some may say we're late to the vector search party, we'd like to think the party is just getting started!

Alright, so what's in vss?

The Vector Similarity Search (VSS) Extension

On the surface, vss seems like a comparatively small DuckDB extension. It does not provide any new data types, scalar functions or copy functions, but rather a single new index type: HNSW (Hierarchical Navigable Small Worlds), which is a graph-based index structure that is particularly well-suited for high-dimensional vector similarity search.

-- Create a table with an array column
CREATE TABLE embeddings (vec FLOAT[3]);

-- Create an HNSW index on the column
CREATE INDEX idx ON embeddings USING HNSW (vec);

This index type can't be used to enforce constraints or uniqueness like the built-in ART index, and can't be used to speed up joins or index regular columns either. Instead, the HNSW index is only applicable to columns of the ARRAY type containing FLOAT elements and will only be used to accelerate queries calculating the "distance" between a constant FLOAT ARRAY and the FLOAT ARRAY's in the indexed column, ordered by the resulting distance and returning the top-n results. That is, queries of the form:

SELECT *
FROM embeddings
ORDER BY array_distance(vec, [1, 2, 3]::FLOAT[3])
LIMIT 3;

will have their logical plan optimized to become a projection over a new HNSW index scan operator, removing the limit and sort altogether. We can verify this by checking the EXPLAIN output:

EXPLAIN
SELECT *
FROM embeddings
ORDER BY array_distance(vec, [1, 2, 3]::FLOAT[3])
LIMIT 3;

┌───────────────────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             #0            │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│            vec            │
│array_distance(vec, [1.0, 2│
│         .0, 3.0])         │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      HNSW_INDEX_SCAN      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│   t1 (HNSW INDEX SCAN :   │
│            idx)           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│            vec            │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           EC: 3           │
└───────────────────────────┘

You can pass the HNSW index creation statement a metric parameter to decide what kind of distance metric to use. The supported metrics are l2sq, cosine and inner_product, matching the three built-in distance functions: array_distance, array_cosine_similarity and array_inner_product. The default is l2sq, which uses Euclidean distance (array_distance):

CREATE INDEX l2sq_idx ON embeddings USING HNSW (vec)
WITH (metric = 'l2sq');

To use cosine distance (array_cosine_similarity):

CREATE INDEX cos_idx ON embeddings USING HNSW (vec)
WITH (metric = 'cosine');

To use inner product (array_inner_product):

CREATE INDEX ip_idx ON embeddings USING HNSW (vec)
WITH (metric = 'ip');

Implementation

The vss extension is based on the usearch library, which provides a flexible C++ implementation of the HNSW index data structure boasting very impressive performance benchmarks. While we currently only use a subset of all the functionality and tuning options provided by usearch, we're excited to explore how we can leverage more of its features in the future. So far we're mostly happy that it aligns so nicely with DuckDB's development ethos. Much like DuckDB itself, usearch is written in portable C++11 with no external dependencies and released under a permissive license, making it super smooth to integrate into our extension build and distribution pipeline.

Limitations

The big limitation as of now is that the HNSW index can only be created in in-memory databases, unless the SET hnsw_enable_experimental_persistence = ⟨bool⟩ configuration parameter is set to true. If this parameter is not set, any attempt to create an HNSW index in a disk-backed database will result in an error message, but if the parameter is set, the index will not only be created in memory, but also persisted to disk as part of the DuckDB database file during checkpointing. After restarting or loading a database file with a persisted HNSW index, the index will be lazily loaded back into memory whenever the associated table is first accessed, which is significantly faster than having to re-create the index from scratch.

The reasoning for locking this feature behind an experimental flag is that we still have some known issues related to persistence of custom indexes that we want to address before enabling it by default. In particular, WAL recovery is not yet properly implemented for custom indexes, meaning that if a crash occurs or the database is shut down unexpectedly while there are uncommited changes to a HNSW-indexed table, you can end up with data loss or corruption of the index. While it is technically possible to recover from a unexpected shutdown manually by first starting DuckDB separately, loading the vss extension and then ATTACHing the database file, which ensures that the HNSW index functionality is available during WAL-playback, you should not rely on this for production workloads.

We're actively working on addressing this and other issues related to index persistence, which will hopefully make it into DuckDB v0.10.3, but for now we recommend using the HNSW index in in-memory databases only.

At runtime however, much like the ART the HNSW index must be able to fit into RAM in its entirety, and the memory allocated by the HNSW at runtime is allocated "outside" of the DuckDB memory management system, meaning that it wont respect DuckDB's memory_limit configuration parameter.

Another current limitation with the HNSW index so far are that it only supports the FLOAT (a 32-bit, single-precision floating point) type for the array elements and only distance metrics corresponding to the three built in distance functions, array_distance, array_inner_product and array_cosine_similarity. But this is also something we're looking to expand upon in the near future as it is much less of a technical limitation and more of a "we haven't gotten around to it yet" limitation.

Conclusion

The vss extension for DuckDB is a new extension that adds support for creating HNSW indexes on fixed-size list columns in DuckDB, accelerating vector similarity search queries. The extension can currently be installed on DuckDB v0.10.2 on all supported platforms (including WASM!) by running INSTALL vss; LOAD vss. The vss extension treads new ground for DuckDB extensions by providing a custom index type and we're excited to refine and expand on this functionality going forward.

While we're still working on addressing some of the limitations above, particularly those related to persistence (and performance), we still really want to share this early version the vss extension as we believe this will open up a lot of cool opportunities for the community. So make sure to check out the vss extension documentation for more information on how to work with this extension!

This work was made possible by the sponsorship of a DuckDB Labs customer! If you are interested in similar work for specific capabilities, please reach out to DuckDB Labs. Alternatively, we're happy to welcome contributors! Please reach out to the DuckDB Labs team over on Discord or on the vss extension GitHub repository to keep up with the latest developments.

Vector Similarity Search in DuckDB

The Vector Similarity Search (VSS) Extension

Implementation

Limitations

Conclusion

Recent posts

DuckCon #5 in Seattle

Memory Management in DuckDB

DuckDB Community Extensions