
Pipeline embedding #18

Merged

AymanL merged 10 commits into main from pipeline-embedding on Mar 8, 2026

Conversation

@AymanL (Collaborator) commented Mar 5, 2026

Summary

This PR adds embedding persistence on top of the Docling+chunking pipeline baseline.

Changes

  • Add pgvector-backed embedding storage on chunks
  • Implement real embedding step in ingestion:
    • add_embeddings() now uses intfloat/multilingual-e5-base
    • passage prefixing (passage: )
    • batched encoding (batch_size=32)
    • DB persistence via bulk_update
  • Add embedding-focused ingestion tests
  • Keep pipeline tests deterministic: capture and validate the embedding call inputs without running model inference
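The embedding step described above can be sketched as follows, with the SentenceTransformer replaced by an injected `encode` callable so the control flow (passage prefixing, batching, vector assignment) runs without model inference. This is an illustrative sketch, not the PR's code: chunks are stood in by dicts, and only `PASSAGE_PREFIX`, `EMBED_BATCH_SIZE`, and `add_embeddings` are names that appear in the diff.

```python
from itertools import islice

PASSAGE_PREFIX = "passage: "  # e5 models expect this prefix on indexed documents
EMBED_BATCH_SIZE = 32


def _iter_batches(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def add_embeddings(chunks, encode):
    """Attach an embedding to each chunk; `encode` maps list[str] -> list of vectors."""
    for batch in _iter_batches(chunks, EMBED_BATCH_SIZE):
        texts = [f"{PASSAGE_PREFIX}{chunk['content']}" for chunk in batch]
        vectors = encode(texts)
        for chunk, vector in zip(batch, vectors):
            chunk["embedding"] = list(vector)
    return chunks
```

In the real pipeline, `encode` would be the model's batched `encode` call and the loop would be followed by DB persistence via `bulk_update`.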

Test plan

  • pytest tests/ingestion/test_embedding.py -q
  • pytest tests/ingestion/test_chunking.py tests/ingestion/test_run_pipeline.py -q

@AymanL AymanL requested a review from cgoudet March 6, 2026 14:07
@cgoudet (Collaborator) left a comment


Thanks! Great work!

Could you get in touch with Hugo, who was supposed to integrate the vector type into the DB? Since you're doing it here, he can move on to something else.


model = _get_model()
for batch in _iter_batches(persisted_chunks, EMBED_BATCH_SIZE):
    texts = [f"{PASSAGE_PREFIX}{chunk.content}" for chunk in batch]

Why do you add this prefix to the text?
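For background on that prefix: the e5 family (including intfloat/multilingual-e5-base) was trained with asymmetric instruction prefixes, `"query: "` for search queries and `"passage: "` for indexed documents, and the model card recommends always using them; retrieval quality degrades when they are omitted. A trivial sketch of the convention (helper names are illustrative, not from the PR):

```python
PASSAGE_PREFIX = "passage: "  # applied at ingestion time, to documents being indexed
QUERY_PREFIX = "query: "      # applied at retrieval time, to the user's search query


def format_passage(text: str) -> str:
    """Prefix a document chunk before embedding it for the index."""
    return f"{PASSAGE_PREFIX}{text}"


def format_query(text: str) -> str:
    """Prefix a search query before embedding it for similarity search."""
    return f"{QUERY_PREFIX}{text}"
```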

for chunk, vector in zip(batch, vectors):
    chunk.embedding = vector.tolist() if hasattr(vector, "tolist") else list(vector)

DocumentChunk.objects.bulk_update(persisted_chunks, ["embedding"])

Since you're already batching, might as well run the update per batch each time, no?
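That suggestion can be sketched as follows, with DB persistence abstracted into an injected `persist` callable (in the PR this would be `DocumentChunk.objects.bulk_update(batch, ["embedding"])` on each batch rather than one `bulk_update` over all chunks at the end); names other than those in the diff are illustrative.

```python
from itertools import islice

EMBED_BATCH_SIZE = 32


def _iter_batches(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def add_embeddings(chunks, encode, persist):
    """Encode and persist one batch at a time instead of one bulk update at the end."""
    for batch in _iter_batches(chunks, EMBED_BATCH_SIZE):
        texts = [f"passage: {chunk['content']}" for chunk in batch]
        for chunk, vector in zip(batch, encode(texts)):
            chunk["embedding"] = list(vector)
        persist(batch)  # flush this batch immediately: bounds memory, and partial progress survives a crash
```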

@pytest.mark.django_db
def test_add_embeddings_updates_persisted_chunks(monkeypatch):
    source = SourceFile.objects.create(doi="x", s3_key="k", status=SourceFile.Status.STORED)
    chunk_1 = DocumentChunk.objects.create(source_file=source, content="alpha", order=1)

To be a bit cleaner, we could use Factories to generate the objects under test.
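A sketch of what that could look like with factory_boy's Django integration. Everything here is hypothetical: it assumes a `factory_boy` dependency, the `SourceFile`/`DocumentChunk` models from the diff, and an illustrative import path for them.

```python
# Sketch only: assumes `pip install factory_boy` and the app's models.
import factory

from ingestion.models import DocumentChunk, SourceFile  # hypothetical import path


class SourceFileFactory(factory.django.DjangoModelFactory):
    class Meta:
        model = SourceFile

    doi = factory.Sequence(lambda n: f"doi-{n}")
    s3_key = factory.Sequence(lambda n: f"key-{n}")
    status = SourceFile.Status.STORED


class DocumentChunkFactory(factory.django.DjangoModelFactory):
    class Meta:
        model = DocumentChunk

    source_file = factory.SubFactory(SourceFileFactory)
    content = "alpha"
    order = factory.Sequence(lambda n: n)
```

The test setup above would then shrink to something like `chunk_1 = DocumentChunkFactory(content="alpha")`, with the related `SourceFile` created automatically.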


embedding_module.add_embeddings([persisted, empty_chunk, unsaved_chunk])

persisted.refresh_from_db()

I'd say that here you need to fetch the whole table to check that empty and unsaved were in fact not saved.
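One way to phrase that check, as a sketch against the models and the `persisted`/`empty_chunk`/`unsaved_chunk` objects from the snippet above (it assumes `empty_chunk` is persisted but skipped by the embedding step, and `unsaved_chunk` was never saved):

```python
embedding_module.add_embeddings([persisted, empty_chunk, unsaved_chunk])

# Query the whole table rather than refreshing a single row:
assert DocumentChunk.objects.count() == 2            # unsaved_chunk was never inserted
embedded = DocumentChunk.objects.exclude(embedding=None)
assert list(embedded) == [persisted]                 # only the persisted, non-empty chunk got a vector
```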

Base automatically changed from pipeline-docling-chunking to main March 8, 2026 16:14
@AymanL AymanL merged commit 6ad758b into main Mar 8, 2026
1 check passed
@AymanL AymanL deleted the pipeline-embedding branch March 8, 2026 16:31