USearch provides SIMD-accelerated distance functions for SQLite, aimed at accelerating vector search and fuzzy string matching. This includes:
- Cosine and Euclidean distances for float vectors,
- Jaccard and Hamming distances for binary vectors,
- Levenshtein and Hamming distances for strings,
- Haversine distance for geographical coordinates.
The SIMD acceleration covers AVX2, most subsets of AVX-512, Arm NEON, and Arm SVE instruction sets, which is broader coverage than most BLAS libraries provide. The implementations come from SimSIMD and StringZilla. They are most efficient when vectors are stored as BLOBs, but for broader compatibility they can also handle JSONs, and even separate columns containing vector elements.
USearch currently ships the SQLite extensions as part of its Python wheels, so they can be obtained from PyPI:
```sh
pip install usearch # brings the SQLite extensions
```
To load the USearch extensions into your SQLite database, you can use the following commands:
```py
import sqlite3

import usearch

conn = sqlite3.connect(":memory:")
conn.enable_load_extension(True)
conn.load_extension(usearch.sqlite_path())
```
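As a quick sanity check that the extension loaded, you can call one of the distance functions directly on JSON literals; a minimal sketch:

```py
# Cosine distance between two identical JSON-encoded vectors, expected to be ~0.0
row = conn.execute(
    "SELECT distance_cosine_f32('[1.0, 2.0, 3.0]', '[1.0, 2.0, 3.0]')"
).fetchone()
print(row[0])
```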
Afterwards, the following script should work fine.
```sql
-- Create a table with a JSON column for vectors
CREATE TABLE vectors_table (
    id INTEGER PRIMARY KEY,
    vector JSON NOT NULL
);

-- Insert two 3D vectors
INSERT INTO vectors_table (id, vector)
VALUES
    (42, '[1.0, 2.0, 3.0]'),
    (43, '[4.0, 5.0, 6.0]');

-- Compute the distances to [7.0, 8.0, 9.0] using
-- the `distance_cosine_f32` extension function
SELECT
    vt.id,
    distance_cosine_f32(vt.vector, '[7.0, 8.0, 9.0]') AS distance
FROM vectors_table AS vt;
```
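JSON is convenient, but as noted above, BLOBs are the more efficient representation. Below is a minimal, self-contained Python sketch of the same query, assuming the `_f32` functions expect vectors packed as little-endian 32-bit floats; that byte layout is an assumption here, not something the documentation above spells out:

```py
import sqlite3
import struct

import usearch

conn = sqlite3.connect(":memory:")
conn.enable_load_extension(True)
conn.load_extension(usearch.sqlite_path())


def to_blob(vector):
    # Assumed layout: densely packed little-endian float32 values
    return struct.pack(f"<{len(vector)}f", *vector)


conn.execute("CREATE TABLE vectors_table (id INTEGER PRIMARY KEY, vector BLOB NOT NULL)")
conn.executemany(
    "INSERT INTO vectors_table (id, vector) VALUES (?, ?)",
    [(42, to_blob([1.0, 2.0, 3.0])), (43, to_blob([4.0, 5.0, 6.0]))],
)

query = to_blob([7.0, 8.0, 9.0])
for row in conn.execute(
    "SELECT id, distance_cosine_f32(vector, ?) AS distance FROM vectors_table",
    (query,),
):
    print(row)
```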
String functions accept BLOBs and TEXTs and return the distance as an integer. This includes:

- `distance_levenshtein_bytes`,
- `distance_levenshtein_unicode`,
- `distance_hamming_bytes`,
- `distance_hamming_unicode`.
The Levenshtein distance reports the number of insertions, deletions, and substitutions required to transform one string into another, while the Hamming distance only reports the substitutions.
When applied to strings of different lengths, the Hamming distance evaluates the prefix of the longer string and adds the length difference to the result, as in the sketch below.
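A small illustration of both behaviors, using the connection opened earlier; the expected values follow from the rules just described:

```py
# Levenshtein: "kitten" -> "sitting" needs 2 substitutions and 1 insertion, so 3
print(conn.execute(
    "SELECT distance_levenshtein_bytes('kitten', 'sitting')"
).fetchone()[0])

# Hamming on different lengths: the prefix "kitten" matches, and the extra "s"
# adds the length difference, so the expected result is 1
print(conn.execute(
    "SELECT distance_hamming_bytes('kitten', 'kittens')"
).fetchone()[0])
```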
The `_bytes` variants compute the distance in bytes, while the `_unicode` variants compute it in Unicode code points, assuming the inputs are UTF-8 encoded.
Take a look at the following three similar-looking, but distinct strings:

- `école` - 6 codepoints (runes), 7 bytes.
- `école` - 5 codepoints (runes), 6 bytes.
- `écolé` - 5 codepoints (runes), 7 bytes.
Those contain three different variants of the letter "e": the accented letter "é" as a single character, the letter "e" followed by a combining accent "´", and the plain letter "e". This results in different distances between the strings, depending on the chosen metric.
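The codepoint and byte counts above can be reproduced in plain Python, without the extension, by constructing the decomposed and precomposed forms explicitly:

```py
import unicodedata

decomposed = unicodedata.normalize("NFD", "école")   # "e" + combining accent
precomposed = unicodedata.normalize("NFC", "école")  # single "é" codepoint
two_accents = precomposed[:-1] + "\u00e9"            # "écolé"

for word in (decomposed, precomposed, two_accents):
    print(len(word), len(word.encode("utf-8")))      # codepoints, then UTF-8 bytes
```

The distances themselves can then be computed through the extension: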
```sql
CREATE TABLE strings_table (
    id INTEGER PRIMARY KEY,
    word TEXT NOT NULL
);

INSERT INTO strings_table (id, word)
VALUES
    (42, 'école'),  -- decomposed form: "e" + combining accent
    (43, 'école');  -- precomposed form: single "é" codepoint

SELECT
    st.id,
    distance_levenshtein_bytes(st.word, 'écolé') AS lb,
    distance_levenshtein_unicode(st.word, 'écolé') AS lu,
    distance_hamming_bytes(st.word, 'écolé') AS hb,
    distance_hamming_unicode(st.word, 'écolé') AS hu,
    distance_levenshtein_bytes(st.word, 'écolé', 2) AS lbb,
    distance_levenshtein_unicode(st.word, 'écolé', 2) AS lub,
    distance_hamming_bytes(st.word, 'écolé', 2) AS hbb,
    distance_hamming_unicode(st.word, 'écolé', 2) AS hub
FROM strings_table AS st;
```
The last four columns contain the bounded versions of the distances, which are faster to compute thanks to the early-stopping condition. They might be handy if you want to accelerate search for something like an "autocomplete" feature in a search bar (see the sketch after the table below). The output will look like this:
| id | lb | lu | hb | hu | lbb | lub | hbb | hub |
|----|----|----|----|----|-----|-----|-----|-----|
| 42 | 5  | 3  | 7  | 6  | 2   | 2   | 2   | 2   |
| 43 | 2  | 1  | 2  | 1  | 2   | 1   | 2   | 1   |
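For the autocomplete-style use case mentioned above, the bound can double as a filter. A sketch, assuming the `strings_table` above was created on a connection with the extension loaded; the bound is set one above the threshold, because the bounded variants appear to cap their result at the bound (compare the `lu` and `lub` columns), so a bound equal to the threshold could not tell "exactly 2" apart from "more than 2":

```py
query = "écol"  # a hypothetical, partially typed search term
suggestions = conn.execute(
    """
    SELECT id, word
    FROM strings_table
    WHERE distance_levenshtein_unicode(word, ?, 3) <= 2
    """,
    (query,),
).fetchall()
print(suggestions)
```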
If using SQLite as a feature store, you might be dealing with single-bit vectors:

- `distance_hamming_binary` - the Hamming distance, meaning the number of bits that differ between two vectors,
- `distance_jaccard_binary` - the Jaccard distance, meaning the number of differing bits divided by the number of bits set in at least one of the vectors.
Here is an example:
```sql
CREATE TABLE binary_vectors (
    id INTEGER PRIMARY KEY,
    vector BLOB NOT NULL
);

INSERT INTO binary_vectors (id, vector)
VALUES
    (42, X'FFFFFF'), -- 111111111111111111111111 in binary
    (43, X'000000'); -- 000000000000000000000000 in binary

SELECT bv.id,
    distance_hamming_binary(bv.vector, X'FFFF00') AS hamming_distance,
    distance_jaccard_binary(bv.vector, X'FFFF00') AS jaccard_distance
FROM binary_vectors AS bv;
```
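From Python, such binary fingerprints are plain `bytes` objects. Assuming the `binary_vectors` table above exists on a connection with the extension loaded, the same query can be parameterized like this:

```py
query = bytes([0b11111111, 0b11111111, 0b00000000])  # same bits as X'FFFF00'

for row in conn.execute(
    """
    SELECT id,
           distance_hamming_binary(vector, ?) AS hamming_distance,
           distance_jaccard_binary(vector, ?) AS jaccard_distance
    FROM binary_vectors
    """,
    (query, query),
):
    print(row)
```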
Distance functions for dense vectors can be used on both BLOBs and JSONs.
Every name is structured as `distance_<metric>_<type>`, where:

- `<metric>` is the name of the metric, like `cosine`, `inner`, `sqeuclidean`, and `divergence`,
- `<type>` is the type of the vector elements, like `f64`, `f32`, `f16`, and `i8`.
The `cosine` metric is the cosine similarity, the `inner` metric is the inner (dot) product, the `sqeuclidean` metric is the squared Euclidean distance, and the `divergence` metric is the Jensen-Shannon divergence, the symmetric variant of the Kullback-Leibler divergence.
- `distance_sqeuclidean_f64`, `distance_cosine_f64`, `distance_inner_f64`, `distance_divergence_f64`
- `distance_sqeuclidean_f32`, `distance_cosine_f32`, `distance_inner_f32`, `distance_divergence_f32`
- `distance_sqeuclidean_f16`, `distance_cosine_f16`, `distance_inner_f16`, `distance_divergence_f16`
- `distance_sqeuclidean_i8`, `distance_cosine_i8`, `distance_inner_i8`, `distance_divergence_i8`
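To get a feel for how the metrics differ, you can evaluate several of them on the same pair of JSON vectors. A small sketch, reusing the connection from earlier; the example inputs are chosen to be non-negative and sum to one, since the Jensen-Shannon divergence is defined for probability distributions:

```py
a = "[0.1, 0.2, 0.7]"
b = "[0.3, 0.3, 0.4]"
for metric in ("sqeuclidean", "cosine", "inner", "divergence"):
    value = conn.execute(f"SELECT distance_{metric}_f32(?, ?)", (a, b)).fetchone()[0]
    print(metric, value)
```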
When dealing with geographical or other spherical coordinates, you can use:

- `distance_haversine_meters` - the Haversine distance on the sphere, multiplied by the Earth's radius in meters.
Feel free to contribute more examples to this file 🤗