
Pipeline embedding #18

Merged

AymanL merged 10 commits into main from pipeline-embedding on Mar 8, 2026

Conversation

@AymanL (Collaborator) commented Mar 5, 2026

Summary

This PR adds embedding persistence on top of the Docling+chunking pipeline baseline.

Changes

  • Add pgvector-backed embedding storage on chunks
  • Implement real embedding step in ingestion:
    • add_embeddings() now uses intfloat/multilingual-e5-base
    • passage prefixing (passage: )
    • batched encoding (batch_size=32)
    • DB persistence via bulk_update
  • Add embedding-focused ingestion tests
  • Keep pipeline tests deterministic: capture and validate the embedding call inputs without running model inference
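The embedding step described above can be sketched as follows, with the SentenceTransformer replaced by an injected `encode` callable so the control flow (passage prefixing, batching, vector assignment) runs without model inference. This is an illustrative sketch, not the PR's code: chunks are stood in by dicts, and only `PASSAGE_PREFIX`, `EMBED_BATCH_SIZE`, and `add_embeddings` are names that appear in the diff.

```python
from itertools import islice

PASSAGE_PREFIX = "passage: "  # e5 models expect this prefix on indexed documents
EMBED_BATCH_SIZE = 32


def _iter_batches(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def add_embeddings(chunks, encode):
    """Attach an embedding to each chunk; `encode` maps list[str] -> list of vectors."""
    for batch in _iter_batches(chunks, EMBED_BATCH_SIZE):
        texts = [f"{PASSAGE_PREFIX}{chunk['content']}" for chunk in batch]
        vectors = encode(texts)
        for chunk, vector in zip(batch, vectors):
            chunk["embedding"] = list(vector)
    return chunks
```

In the real pipeline, `encode` would be the model's batched `encode` call and the loop would be followed by DB persistence via `bulk_update`.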

Test plan

  • pytest tests/ingestion/test_embedding.py -q
  • pytest tests/ingestion/test_chunking.py tests/ingestion/test_run_pipeline.py -q

@AymanL AymanL requested a review from cgoudet March 6, 2026 14:07
@cgoudet (Collaborator) left a comment


Thanks! Great work!

Could you get in touch with Hugo, who was supposed to integrate the vector type into the DB? Since you're doing it here, he can move on to something else.


model = _get_model()
for batch in _iter_batches(persisted_chunks, EMBED_BATCH_SIZE):
    texts = [f"{PASSAGE_PREFIX}{chunk.content}" for chunk in batch]

Why do you add this prefix to the text?
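For background on that prefix: the e5 family (including intfloat/multilingual-e5-base) was trained with asymmetric instruction prefixes, `"query: "` for search queries and `"passage: "` for indexed documents, and the model card recommends always using them; retrieval quality degrades when they are omitted. A trivial sketch of the convention (helper names are illustrative, not from the PR):

```python
PASSAGE_PREFIX = "passage: "  # applied at ingestion time, to documents being indexed
QUERY_PREFIX = "query: "      # applied at retrieval time, to the user's search query


def format_passage(text: str) -> str:
    """Prefix a document chunk before embedding it for the index."""
    return f"{PASSAGE_PREFIX}{text}"


def format_query(text: str) -> str:
    """Prefix a search query before embedding it for similarity search."""
    return f"{QUERY_PREFIX}{text}"
```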

for chunk, vector in zip(batch, vectors):
    chunk.embedding = vector.tolist() if hasattr(vector, "tolist") else list(vector)

DocumentChunk.objects.bulk_update(persisted_chunks, ["embedding"])

Since you're already batching, might as well run the update per batch each time, no?
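That suggestion can be sketched as follows, with DB persistence abstracted into an injected `persist` callable (in the PR this would be `DocumentChunk.objects.bulk_update(batch, ["embedding"])` on each batch rather than one `bulk_update` over all chunks at the end); names other than those in the diff are illustrative.

```python
from itertools import islice

EMBED_BATCH_SIZE = 32


def _iter_batches(items, size):
    """Yield successive slices of `items` with at most `size` elements."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def add_embeddings(chunks, encode, persist):
    """Encode and persist one batch at a time instead of one bulk update at the end."""
    for batch in _iter_batches(chunks, EMBED_BATCH_SIZE):
        texts = [f"passage: {chunk['content']}" for chunk in batch]
        for chunk, vector in zip(batch, encode(texts)):
            chunk["embedding"] = list(vector)
        persist(batch)  # flush this batch immediately: bounds memory, and partial progress survives a crash
```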

@pytest.mark.django_db
def test_add_embeddings_updates_persisted_chunks(monkeypatch):
    source = SourceFile.objects.create(doi="x", s3_key="k", status=SourceFile.Status.STORED)
    chunk_1 = DocumentChunk.objects.create(source_file=source, content="alpha", order=1)

To be a bit cleaner, we could use Factories to generate the objects under test.
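A sketch of what that could look like with factory_boy's Django integration. Everything here is hypothetical: it assumes a `factory_boy` dependency, the `SourceFile`/`DocumentChunk` models from the diff, and an illustrative import path for them.

```python
# Sketch only: assumes `pip install factory_boy` and the app's models.
import factory

from ingestion.models import DocumentChunk, SourceFile  # hypothetical import path


class SourceFileFactory(factory.django.DjangoModelFactory):
    class Meta:
        model = SourceFile

    doi = factory.Sequence(lambda n: f"doi-{n}")
    s3_key = factory.Sequence(lambda n: f"key-{n}")
    status = SourceFile.Status.STORED


class DocumentChunkFactory(factory.django.DjangoModelFactory):
    class Meta:
        model = DocumentChunk

    source_file = factory.SubFactory(SourceFileFactory)
    content = "alpha"
    order = factory.Sequence(lambda n: n)
```

The test setup above would then shrink to something like `chunk_1 = DocumentChunkFactory(content="alpha")`, with the related `SourceFile` created automatically.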


embedding_module.add_embeddings([persisted, empty_chunk, unsaved_chunk])

persisted.refresh_from_db()

I'd say that here you need to fetch the whole table to check that empty and unsaved were in fact not saved.
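One way to phrase that check, as a sketch against the models and the `persisted`/`empty_chunk`/`unsaved_chunk` objects from the snippet above (it assumes `empty_chunk` is persisted but skipped by the embedding step, and `unsaved_chunk` was never saved):

```python
embedding_module.add_embeddings([persisted, empty_chunk, unsaved_chunk])

# Query the whole table rather than refreshing a single row:
assert DocumentChunk.objects.count() == 2            # unsaved_chunk was never inserted
embedded = DocumentChunk.objects.exclude(embedding=None)
assert list(embedded) == [persisted]                 # only the persisted, non-empty chunk got a vector
```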

Base automatically changed from pipeline-docling-chunking to main March 8, 2026 16:14
@AymanL AymanL merged commit 6ad758b into main Mar 8, 2026
1 check passed
@AymanL AymanL deleted the pipeline-embedding branch March 8, 2026 16:31