Skip to content

Natural language search design #4

@franzaps

Description

@franzaps

Combine the existing FTS5/BM25 keyword search with vector similarity search to support natural language app queries. Scoring handles the fallback naturally.

Architecture

Query
  │
  ├──► FTS5 (BM25)         ── keyword/trigram matching
  │
  ├──► sqlite-vec (cosine)  ── semantic similarity
  │
  └──► Hybrid merge
         │
         score = α × norm(bm25) + (1 - α) × cosine_sim
         │
         ▼
       Ranked results

Components

Embedding model

all-MiniLM-L6-v2 — 384-dimensional vectors, ~80 MB.
Run via ONNX Runtime in-process or as a local HTTP embedding server.

Vector table (sqlite-vec)

CREATE VIRTUAL TABLE IF NOT EXISTS apps_vec USING vec0(
    id TEXT PRIMARY KEY,
    embedding float[384]
);

Loaded as a SQLite extension alongside FTS5.

Embedding pipeline

On every KindApp (32267) insert, concatenate name + summary + content,
compute the embedding, and upsert into apps_vec.

A one-time migration job embeds all existing apps.

Hybrid query

WITH fts_results AS (
    SELECT fts.id, bm25(apps_fts, 0, 20, 5, 1) AS score
    FROM apps_fts fts
    WHERE apps_fts MATCH ?
    LIMIT 50
),
vec_results AS (
    SELECT id, distance
    FROM apps_vec
    WHERE embedding MATCH ?
    ORDER BY distance
    LIMIT 50
)
SELECT COALESCE(f.id, v.id) AS id,
       COALESCE(f.score, 0) * :alpha + COALESCE(1.0 - v.distance, 0) * :beta AS combined
FROM fts_results f
FULL OUTER JOIN vec_results v ON f.id = v.id
ORDER BY combined DESC
LIMIT ?;

Existing tag filters (#f, #t, authors, date range) apply as additional
WHERE clauses on a final JOIN to the events table, same as today.

Tuning

α controls the BM25/vector balance. Start at 0.4 and tune empirically.

  • Short keyword queries (e.g. "signal") — BM25 dominates via exact match.
  • Natural language queries (e.g. "privacy focused messenger") — cosine
    similarity dominates when exact words don't appear in metadata.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions