Pipeline semantic search #19

Merged
AymanL merged 14 commits into main from pipeline_semantic_search on Mar 16, 2026

@AymanL (Collaborator) commented Mar 6, 2026

Summary

  • Search: The query string is embedded with the same E5 model used at ingestion (with the query: prefix). search_chunks(query, k) returns the top‑k chunks ordered by pgvector cosine distance; only chunks with a stored embedding are considered.
  • Implementation: embed_query() added to embedding.py; new search.py with search_chunks(), which uses pgvector's CosineDistance and select_related("source_file").
  • Tests: Unit tests for search_chunks (ordering, exclusion of null embeddings, k, k=0). Integration test: the pipeline runs with mocked parsing and embeddings, then a search is issued; the result order matches the test vectors. All search tests mock embed_query (no real model is loaded). Constants and helpers are clarified (e.g. _one_hot_vector, constant_vector, SECOND_CLOSEST*).
  • Postgres in docker-compose is now published on the host as 5432:5432, so DATABASE_URL=...@localhost:5432/... works for local and test runs.
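
For readers without the diff at hand, the ranking behaviour described in the summary can be sketched in plain Python. This is only an illustration of the semantics (cosine distance, null embeddings skipped, empty result for k <= 0), not the code in search.py, which runs the ordering in Postgres through pgvector. The names with_query_prefix, cosine_distance and rank_chunks are hypothetical, and the sketch assumes non-zero vectors of equal length.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]


def with_query_prefix(text: str) -> str:
    """Hypothetical helper: E5 models expect search queries to start with 'query: '."""
    return f"query: {text}"


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity (assumes both vectors are non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_chunks(
    query_vector: Vector,
    chunks: Sequence[Tuple[str, Optional[Vector]]],
    k: int,
) -> List[Tuple[str, float]]:
    """Return up to k (chunk_id, distance) pairs, closest first.

    Chunks whose embedding is None are ignored, mirroring the
    embedding__isnull=False filter described in the summary.
    """
    if k <= 0:
        return []
    scored = [
        (chunk_id, cosine_distance(query_vector, embedding))
        for chunk_id, embedding in chunks
        if embedding is not None
    ]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


if __name__ == "__main__":
    # Toy data in the spirit of the one-hot test vectors mentioned above.
    toy_chunks = [
        ("far", [0.0, 1.0, 0.0]),
        ("no_embedding", None),
        ("exact", [1.0, 0.0, 0.0]),
        ("near", [1.0, 1.0, 0.0]),
    ]
    for chunk_id, distance in rank_chunks([1.0, 0.0, 0.0], toy_chunks, k=2):
        print(chunk_id, round(distance, 4))
```

With toy vectors like these the expected order is easy to verify by hand, which is presumably why the tests rely on one-hot and constant vectors rather than real model output.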

Scope

Semantic search implementation and its tests only. No UI for search yet; no real-PDF E2E test.

Base automatically changed from pipeline-embedding to main March 8, 2026 16:31
@cgoudet (Collaborator) left a comment

Thanks for the work!

    Returns a list of (chunk, distance) tuples. Chunk includes source_file
    via the ORM relation for display (e.g. source_file.doi).
    """
    if k <= 0:

Maybe raise an error message here instead?

        return []
    query_vector = embed_query(query)
    qs = (
        DocumentChunk.objects.filter(embedding__isnull=False)

Does Postgres have an ANN (approximate nearest neighbor) mechanism, rather than testing every row?

@AymanL AymanL merged commit 74af3da into main Mar 16, 2026
1 check passed
@AymanL AymanL deleted the pipeline_semantic_search branch March 16, 2026 11:53