gnievesponce: Prompt tune embedding chunking #1826
Merged
Conversation
added 2 commits on March 24, 2025 at 23:41
AlonsoGuevara approved these changes on Mar 27, 2025
opensourcemukul pushed a commit to opensourcemukul/graphrag that referenced this pull request on Sep 13, 2025
* Added support for embeddings chunking as defined by the config.
* Ran semvisor -t patch
* Eliminated redundant code by using the embed_text strategy directly
* Added fix to support brackets within the corpus text; for example, inline LaTeX within a markdown file

Co-authored-by: Gabriel Nieves <gnievesponce@microsoft.com>
Brandsma pushed a commit to ThalamusLabs/MMGraphRAG that referenced this pull request on Nov 6, 2025
JonasReuter pushed a commit to JonasReuter/graphrag that referenced this pull request on Apr 13, 2026
Description
When running prompt tune with the automatic selection method, the system attempts to embed all text chunks in a single request, regardless of payload size.
By default, a batch should contain no more than 16 text chunks, and the total token count for the whole batch should stay below 8191.
Related Issues
#1825
Proposed Changes
Modify graphrag/prompt_tune/loader/input.py to add logic that splits large embedding jobs into batched requests, similar to how batching is handled in the indexing workflow. For an example workflow with a correct batching strategy, see graphrag/index/operations/embed_text/strategies/openai.py.
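As a rough illustration of the batching strategy described above, the sketch below greedily groups texts into batches capped at 16 chunks and 8191 tokens per request. This is not the actual code from openai.py; the function name `batch_texts` and the injected `token_count` callable are hypothetical, and a real implementation would likely use a tokenizer such as tiktoken for counting.

```python
from collections.abc import Callable, Iterable


def batch_texts(
    texts: Iterable[str],
    token_count: Callable[[str], int],
    max_batch_size: int = 16,
    max_batch_tokens: int = 8191,
) -> list[list[str]]:
    """Greedily group texts into batches that respect both limits.

    Hypothetical sketch: `token_count` is an injected tokenizer callable,
    not part of the graphrag API. A single text longer than
    `max_batch_tokens` still ends up alone in its own batch here.
    """
    batches: list[list[str]] = []
    current: list[str] = []
    current_tokens = 0
    for text in texts:
        tokens = token_count(text)
        # Close the current batch if adding this text would exceed
        # either the chunk-count limit or the token limit.
        if current and (
            len(current) >= max_batch_size
            or current_tokens + tokens > max_batch_tokens
        ):
            batches.append(current)
            current, current_tokens = [], 0
        current.append(text)
        current_tokens += tokens
    if current:
        batches.append(current)
    return batches
```

Each resulting batch can then be sent as one embedding request, so no single call exceeds the provider's per-request limits.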
Checklist
Additional Notes
No additional notes