Pipeline embedding #18
Merged
10 commits
6312bf8 integrate Docling parsing and paragraph chunking into ingestion pipeline (AymanL)
0af60af integrate Docling chunking pipeline and add chunk embedding column (AymanL)
7c1b36f add_embeddings with multilingual-e5-base, batching, and bulk_update. (AymanL)
f24fe7d test adaptation (AymanL)
2a8f555 update readme (AymanL)
72ea5f3 Merge branch 'main' into pipeline-embedding (AymanL)
b6e3fda explaining the PASSAGE_PREFIX (AymanL)
aedf8f5 update documents inside the batch (AymanL)
844876d create a document chunk factory (AymanL)
4cb4a7e make sure empty and unsaved have not been persisted (AymanL)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,8 +1,45 @@ | ||
| from eu_fact_force.ingestion.models import DocumentChunk | ||
| from typing import Iterator | ||
| | ||
| MODEL_ID = "intfloat/multilingual-e5-base" | ||
| # E5 models expect "passage: " for documents to index and "query: " for search queries (asymmetric retrieval). | ||
| PASSAGE_PREFIX = "passage: " | ||
| EMBED_BATCH_SIZE = 32 | ||
| _MODEL = None | ||
| | ||
| | ||
| def _get_model(): | ||
|     global _MODEL | ||
|     if _MODEL is None: | ||
|         from sentence_transformers import SentenceTransformer | ||
| | ||
|         _MODEL = SentenceTransformer(MODEL_ID) | ||
|     return _MODEL | ||
| | ||
| | ||
| def _iter_batches(items: list[DocumentChunk], batch_size: int) -> Iterator[list[DocumentChunk]]: | ||
|     for start in range(0, len(items), batch_size): | ||
|         yield items[start : start + batch_size] | ||
| | ||
| | ||
| def add_embeddings(chunks: list[DocumentChunk]): | ||
|     """ | ||
|     Add embeddings to the chunks and update in the DB. | ||
|     """ | ||
|     persisted_chunks = [ | ||
|         chunk for chunk in chunks if chunk.pk is not None and chunk.content.strip() | ||
|     ] | ||
|     if not persisted_chunks: | ||
|         return | ||
| | ||
|     model = _get_model() | ||
|     for batch in _iter_batches(persisted_chunks, EMBED_BATCH_SIZE): | ||
|         texts = [f"{PASSAGE_PREFIX}{chunk.content}" for chunk in batch] | ||
|         vectors = model.encode( | ||
|             texts, | ||
|             show_progress_bar=False, | ||
|             normalize_embeddings=True, | ||
|         ) | ||
|         for chunk, vector in zip(batch, vectors): | ||
|             chunk.embedding = vector.tolist() if hasattr(vector, "tolist") else list(vector) | ||
|         DocumentChunk.objects.bulk_update(batch, ["embedding"]) | ||
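The comment on `PASSAGE_PREFIX` says that search queries need the matching `query: ` prefix, but this diff only covers the indexing side. A minimal sketch of the query side might look like the following. `QUERY_PREFIX`, `embed_query`, `l2_normalize`, and `cosine_similarity` are hypothetical names that do not appear in the PR, and the encoder is passed in as a plain callable so the example runs without `sentence_transformers`. Since passages are stored L2-normalized (`normalize_embeddings=True`), normalizing the query as well lets cosine similarity reduce to a dot product.

```python
import math
from typing import Callable, Sequence

PASSAGE_PREFIX = "passage: "  # as defined in the diff above
QUERY_PREFIX = "query: "      # assumption: not defined anywhere in this PR

Vector = list[float]
Encoder = Callable[[Sequence[str]], list[Vector]]


def l2_normalize(vector: Vector) -> Vector:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def embed_query(query: str, encode: Encoder) -> Vector:
    """Embed a search query using the query-side prefix (hypothetical helper)."""
    return l2_normalize(encode([f"{QUERY_PREFIX}{query}"])[0])


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Dot product of two vectors; equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def toy_encode(texts: Sequence[str]) -> list[Vector]:
    """Stand-in for model.encode(): maps each text to a tiny 2-d vector."""
    return [[float(len(text)), 1.0] for text in texts]


query_vector = embed_query("vaccine safety", toy_encode)
# A unit-length vector compared with itself gives a similarity close to 1.0.
print(cosine_similarity(query_vector, query_vector))
```

In a real search path, `encode` would presumably wrap the same model returned by `_get_model()`, so that queries and passages share one embedding space.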
22 changes: 22 additions & 0 deletions
eu_fact_force/ingestion/migrations/0002_documentchunk_embedding.py
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,22 @@ | ||
| from django.db import migrations | ||
| from pgvector.django import VectorExtension, VectorField | ||
| | ||
| | ||
| class Migration(migrations.Migration): | ||
|     dependencies = [ | ||
|         ("ingestion", "0001_initial"), | ||
|     ] | ||
| | ||
|     operations = [ | ||
|         VectorExtension(), | ||
|         migrations.AddField( | ||
|             model_name="documentchunk", | ||
|             name="embedding", | ||
|             field=VectorField( | ||
|                 blank=True, | ||
|                 dimensions=768, | ||
|                 help_text="Dense embedding vector for semantic retrieval.", | ||
|                 null=True, | ||
|             ), | ||
|         ), | ||
|     ] |
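Once the `embedding` column exists, chunks can be ranked by vector distance. The sketch below is not part of the PR: `nearest_chunks` is a hypothetical helper, and it assumes the `CosineDistance` expression from pgvector's Django integration. The model is passed in as a parameter so the snippet has no project imports, and the pgvector import is deferred until the function is called.

```python
def nearest_chunks(chunk_model, query_vector, limit=5):
    """Return up to `limit` rows of `chunk_model` closest to `query_vector`.

    `chunk_model` is expected to be a Django model with a pgvector
    `embedding` field, such as DocumentChunk after this migration.
    `query_vector` should have the same dimension as the column (768 here).
    """
    # Deferred import: loading this module does not require pgvector.
    from pgvector.django import CosineDistance

    return (
        chunk_model.objects.exclude(embedding__isnull=True)  # skip chunks not embedded yet
        .annotate(distance=CosineDistance("embedding", query_vector))
        .order_by("distance")[:limit]
    )
```

A caller would pass something like `nearest_chunks(DocumentChunk, query_vector)`, where `query_vector` comes from the same model with the `query: ` prefix. The `exclude` clause matters because the column is nullable.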
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,24 @@ | ||
| """Factories for test data.""" | ||
| | ||
| import factory | ||
| from factory.django import DjangoModelFactory | ||
| | ||
| from eu_fact_force.ingestion.models import DocumentChunk, SourceFile | ||
| | ||
| | ||
| class SourceFileFactory(DjangoModelFactory): | ||
|     class Meta: | ||
|         model = SourceFile | ||
| | ||
|     doi = "" | ||
|     s3_key = "" | ||
|     status = SourceFile.Status.STORED | ||
| | ||
| | ||
| class DocumentChunkFactory(DjangoModelFactory): | ||
|     class Meta: | ||
|         model = DocumentChunk | ||
| | ||
|     source_file = factory.SubFactory(SourceFileFactory) | ||
|     content = "" | ||
|     order = 0 |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,57 @@ | ||
| """Tests for ingestion embedding persistence.""" | ||
| | ||
| import pytest | ||
| | ||
| from eu_fact_force.ingestion import embedding as embedding_module | ||
| from eu_fact_force.ingestion.models import DocumentChunk | ||
| from tests.factories import DocumentChunkFactory, SourceFileFactory | ||
| | ||
| | ||
| class _FakeModel: | ||
|     def __init__(self): | ||
|         self.calls: list[list[str]] = [] | ||
| | ||
|     def encode(self, texts, show_progress_bar, normalize_embeddings): | ||
|         self.calls.append(list(texts)) | ||
|         return [[0.1] * 768 for _ in texts] | ||
| | ||
| | ||
| @pytest.mark.django_db | ||
| def test_add_embeddings_updates_persisted_chunks(monkeypatch): | ||
|     source = SourceFileFactory() | ||
|     chunk_1 = DocumentChunkFactory(source_file=source, content="alpha", order=1) | ||
|     chunk_2 = DocumentChunkFactory(source_file=source, content="beta", order=2) | ||
| | ||
|     fake_model = _FakeModel() | ||
|     monkeypatch.setattr(embedding_module, "_get_model", lambda: fake_model) | ||
| | ||
|     embedding_module.add_embeddings([chunk_1, chunk_2]) | ||
| | ||
|     chunk_1.refresh_from_db() | ||
|     chunk_2.refresh_from_db() | ||
|     assert len(chunk_1.embedding) == 768 | ||
|     assert len(chunk_2.embedding) == 768 | ||
|     assert chunk_1.embedding[0] == pytest.approx(0.1) | ||
|     assert chunk_2.embedding[0] == pytest.approx(0.1) | ||
|     assert fake_model.calls == [["passage: alpha", "passage: beta"]] | ||
| | ||
| | ||
| @pytest.mark.django_db | ||
| def test_add_embeddings_skips_unsaved_and_empty_chunks(monkeypatch): | ||
|     """Only persisted, non-empty chunks are embedded; unsaved and empty-content chunks are ignored.""" | ||
|     source = SourceFileFactory() | ||
|     persisted = DocumentChunkFactory(source_file=source, content="ok", order=1) | ||
|     empty_chunk = DocumentChunk(source_file=source, content=" ", order=2) | ||
|     unsaved_chunk = DocumentChunk(source_file=source, content="temp", order=3) | ||
| | ||
|     fake_model = _FakeModel() | ||
|     monkeypatch.setattr(embedding_module, "_get_model", lambda: fake_model) | ||
| | ||
|     embedding_module.add_embeddings([persisted, empty_chunk, unsaved_chunk]) | ||
| | ||
|     # DB still has only the one persisted chunk; empty_chunk and unsaved_chunk were never saved | ||
|     assert DocumentChunk.objects.count() == 1 | ||
|     persisted.refresh_from_db() | ||
| Collaborator (review comment): I would say that here we need to query the whole database to check that empty and unsaved have indeed not been saved. | ||
|     assert len(persisted.embedding) == 768 | ||
|     assert persisted.embedding[0] == pytest.approx(0.1) | ||
|     assert fake_model.calls == [["passage: ok"]] | ||
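Both tests above pass at most two chunks, so they never cross the `EMBED_BATCH_SIZE` boundary of 32. One possible way to check batch boundaries in isolation is sketched below. `iter_batches` is a standalone copy of the `_iter_batches` slicing logic written for this example, not the project's function, and the `ValueError` guard is an addition that the PR does not contain.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        # Added guard: range() with a zero step would raise a less clear error.
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


# 70 items with a batch size of 32 give two full batches and a remainder of 6.
sizes = [len(batch) for batch in iter_batches(list(range(70)), 32)]
print(sizes)  # [32, 32, 6]
```

In the project's test suite, the same idea could be expressed by creating more than 32 chunks and asserting on the length of `fake_model.calls`.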
Review comment: Why are you adding this prefix to the text?