Skip to content

Feature implementation from commits 2f2cfa7..ad5b512#4

Open
yashuatla wants to merge 15 commits intofeature-base-4from
feature-head-4
Open

Feature implementation from commits 2f2cfa7..ad5b512#4
yashuatla wants to merge 15 commits intofeature-base-4from
feature-head-4

Conversation

@yashuatla
Copy link
Copy Markdown
Owner

This PR contains changes from a range of commits from the original repository.

Commit Range: 2f2cfa7..ad5b512
Files Changed: 344 (145 programming files)
Programming Ratio: 42.2%

Commits included:

AlonsoGuevara and others added 15 commits January 13, 2025 17:41
* Add new inputs and missing vector store for retrieving vectors

* Format

* Semver

* Remove .Identifier files

* Fix spellcheck

* Remove unnecessary input file for notebooks
* Adding basic wrappes for reduce in drift

* Add response_type parameter to run_drift_search and enhance reduce response functionality

* Add streaming endpoint

* Semver

* Spellcheck

* Ruff checks

* Count tokens on reduce

* Use list comprehension and remove llm_params map in favor of just using kwargs
* Add vector field to retrievable fields for Azure AI Search

* Add DRIFT and Basic search to smoke tests

* Semver

* Format

* Remove DRIFT smoke test for now (brittle)
* Refactor config

- Add new ModelConfig to represent LLM settings
    - Combines LLMParameters, ParallelizationParameters, encoding_model, and async_mode
- Add top level models config that is a list of available LLM ModelConfigs
- Remove LLMConfig inheritance and delete LLMConfig
    - Replace the inheritance with a model_id reference to the ModelConfig listed in the top level models config
- Remove all fallbacks and hydration logic from create_graphrag_config
    - This removes the automatic env variable overrides
- Support env variables within config files using Templating
    - This requires "$" to be escaped with extra "$" so ".*\\.txt$" becomes ".*\\.txt$$"
- Update init content to initialize new config file with the ModelConfig structure

* Use dict of ModelConfig instead of list

* Add model validations and unit tests

* Fix ruff checks

* Add semversioner change

* Fix unit tests

* validate root_dir in pydantic model

* Rename ModelConfig to LanguageModelConfig

* Rename ModelConfigMissingError to LanguageModelConfigMissingError

* Add validationg for unexpected API keys

* Allow skipping pydantic validation for testing/mocking purposes.

* Add default lm configs to verb tests

* smoke test

* remove config from flows to fix llm arg mapping

* Fix embedding llm arg mapping

* Remove timestamp from smoke test outputs

* Remove unused "subworkflows" smoke test properties

* Add models to smoke test configs

* Update smoke test output path

* Send logs to logs folder

* Fix output path

* Fix csv test file pattern

* Update placeholder

* Format

* Instantiate default model configs

* Fix unit tests for config defaults

* Fix migration notebook

* Remove create_pipeline_config

* Remove several unused config models

* Remove indexing embedding and input configs

* Move embeddings function to config

* Remove skip_workflows

* Remove skip embeddings in favor of explicit naming

* fix unit test spelling mistake

* self.models[model_id] is already a language model. Remove redundant casting.

* update validation errors to instruct users to rerun graphrag init

* instantiate LanguageModelConfigs with validation

* skip validation in unit tests

* update verb tests to use default model settings instead of skipping validation

* test using llm settings

* cleanup verb tests

* remove unsafe default model config

* remove the ability to skip pydantic validation

* remove None union types when default values are set

* move vector_store from embeddings to top level of config and delete resolve_paths

* update vector store settings

* fix vector store and smoke tests

* fix serializing vector_store settings

* fix vector_store usage

* fix vector_store type

* support cli overrides for loading graphrag config

* rename storage to output

* Add --force flag to init

* Remove run_id and resume, fix Drift config assignment

* Ruff

---------

Co-authored-by: Nathan Evans <github@talkswithnumbers.com>
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
Updated the auto prompt tuning doc with `--selection-method` instead of only `--method` as per the latest API.

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
* added multi-global-query function header

* ported over code for merging dataframes

* added connection to global streaming api function

* added function header for update context helper

* implemented and incorperated update_context function

* Updated to make sure 'parent' column in final_communities gets incremented for multi index.

* first cut at multi_local_seach function

* several minor changes and fixes

* Updated multi index local search.

* Cleaned up code.

* fixed lambda function ruff errors

* fixed more ruff errors

* moved query api helpers to util file

* moved index api helpers to util file

* merged in code left out of conflict

* changed GraphRagConfig object to support lists of vector stores

* Updated with fixes for multi_local_search.

* Minor updates.

* Minor updates.

* Updates for ruff check.

* Minor updates.

* removed redundant vector_store_configs arg

* ruff formatting changes

* semversioner

* Minor fix.

* spellcheck fixes

* ruff

* test fix for cicd errors

* another test fix

* added explicit typing for ci tests

* added dict type check for vector_store during indexing

* more ruff fixes

* moved type check

* Removed streaming. Added multi drift and basic searches.

* Formatting changes.

* Updates for pyright.

* Update for ruff.

* Ruff formatted.

* first cut at fixing vector store typing errors

* got multi local search working with new config

* ruff and test fixes

* added fix for embeddings type error

* renamed multi index api functions

* ruff

* convert config model to dict[VectorStoreConfig]

* modified tests to support new vector_store model

* ruff fixes

* changed some test setups to match new model

* changed ci/cd settings files to match new structure

* Fix stderror check

* fixed bug in vector_store_config validation

* ruff

* add database_name field to vectorstoreconfig

* removed print statements

* small refactoring for PR comments

* modified default config in test

* modified vector store config unit test

---------

Co-authored-by: dorbaker <dorbaker@microsoft.com>
Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
* Add NLP extraction workflow

* Add text unit community summarization

* Add CLI flag for indexing method

* Regenerate poetry.lock

* Fix claims loading

* Merge fixes

* Add workflow overrides to config

* Semver

* Add graph pruning config

* Remove degree re-compute from pruning

* Switch to percentile for edge weight pruning

* Add NLP extraction config

* Add new NLP extractor options

* Add FGR workflows to util method

* Use a generator factory for workflows

* Update pruning defaults

---------

Co-authored-by: Alonso Guevara <alonsog@microsoft.com>
* Require explicit azure auth settings when using AOI.

- Must set LanguageModel.azure_auth_type to either
"api_key" or "managed_identity" when using AOI.

* Fix smoke tests

* Use general auth_type property instead of azure_auth_type

* Remove unused error type

* Update validation

* Update validation comment
…t#1672)

* remove unused columns and change property document_attribute_columns to metadata

* format file

* fix 'metadata' column on output

* run check

* fix test on nltk

* remove docs changes
Comment thread graphrag/api/index.py
if is_resume_run and is_update_run:
msg = "Cannot resume and update a run at the same time."
raise ValueError(msg)
is_update_run = bool(config.update_index_output)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐛 Correctness Issue

Removed critical validation check.

The validation preventing simultaneous resume and update operations was removed, which could lead to data corruption or unexpected behavior.

Current Code (Diff):

-    is_update_run = bool(config.update_index_output)
+    is_update_run = bool(config.update_index_output)
+
+    if is_resume_run and is_update_run:
+        msg = "Cannot resume and update a run at the same time."
+        raise ValueError(msg)
📝 Committable suggestion

‼️ IMPORTANT
Trust, but verify! 🕵️ Please review this suggestion with the care of a code archaeologist - check that it perfectly replaces the highlighted code, preserves all lines, maintains proper indentation, and won't break anything in production. Your future self will thank you! 🚀

Suggested change
is_update_run = bool(config.update_index_output)
is_update_run = bool(config.update_index_output)
if is_resume_run and is_update_run:
msg = "Cannot resume and update a run at the same time."
raise ValueError(msg)

Comment thread graphrag/api/query.py
Comment on lines +1203 to +1204
else:
vector_store_args = None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐛 Correctness Issue

Misplaced else clause nullifies vector_store_args.

The else clause after the for loop will always execute, setting vector_store_args to None and causing the application to crash when get_embedding_store tries to use it

Current Code (Diff):

- |4 spaces|else:
- |8 spaces|vector_store_args = None

🔄 Dependencies Affected

graphrag/api/query.py

Function: get_embedding_store

Issue: Function expects config_args to be a dictionary, but will receive None

Suggestion: Remove the else clause that sets vector_store_args to None


Comment thread graphrag/cli/main.py

match method:
case SearchType.LOCAL:
case SearchMethod.LOCAL:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐛 Correctness Issue

Type mismatch in match-case statement.

The match statement is using SearchType.LOCAL but the method parameter is of type SearchMethod, which would cause a runtime error when executed.

Current Code (Diff):

-         case SearchType.LOCAL:
+         case SearchMethod.LOCAL:
📝 Committable suggestion

‼️ IMPORTANT
Trust, but verify! 🕵️ Please review this suggestion with the care of a code archaeologist - check that it perfectly replaces the highlighted code, preserves all lines, maintains proper indentation, and won't break anything in production. Your future self will thank you! 🚀

Suggested change
case SearchMethod.LOCAL:
case SearchMethod.LOCAL:

match settings.embeddings.target:
case TextEmbeddingTarget.all:
return all_embeddings.difference(settings.embeddings.skip)
return all_embeddings
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐛 Correctness Issue

Breaking API Change: Removed Embedding Filtering.

The code removes the ability to skip specific embeddings when using TextEmbeddingTarget.all, which will break existing code that relies on this filtering behavior.

Current Code (Diff):

- return all_embeddings
+ return all_embeddings.difference(settings.embeddings.skip)
📝 Committable suggestion

‼️ IMPORTANT
Trust, but verify! 🕵️ Please review this suggestion with the care of a code archaeologist - check that it perfectly replaces the highlighted code, preserves all lines, maintains proper indentation, and won't break anything in production. Your future self will thank you! 🚀

Suggested change
return all_embeddings
return all_embeddings.difference(settings.embeddings.skip)


Note that we use dot notation in our names, but many vector stores do not support this - so we convert to dashes.
"""
if validate and embedding_name not in all_embeddings:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐛 Correctness Issue

Undefined variable 'all_embeddings' will cause NameError.

The function references 'all_embeddings' which is not defined or imported, causing a runtime exception when the function is called with validate=True.

Current Code (Diff):

    if validate and embedding_name not in all_embeddings:
📝 Committable suggestion

‼️ IMPORTANT
Trust, but verify! 🕵️ Please review this suggestion with the care of a code archaeologist - check that it perfectly replaces the highlighted code, preserves all lines, maintains proper indentation, and won't break anything in production. Your future self will thank you! 🚀

Suggested change
if validate and embedding_name not in all_embeddings:
if validate and embedding_name not in all_embeddings:

🔄 Dependencies Affected

graphrag/index/operations/embed_text/embed_text.py

Function: _get_collection_name

Issue: Will raise NameError when calling create_collection_name() due to undefined all_embeddings variable

Suggestion: Ensure all_embeddings is imported in the file where create_collection_name is defined


graphrag/utils/api.py

Function: get_embedding_store

Issue: Will raise NameError when calling create_collection_name() due to undefined all_embeddings variable

Suggestion: Ensure all_embeddings is imported in the file where create_collection_name is defined


Comment thread graphrag/config/enums.py

class StorageType(str, Enum):
"""The storage type for the pipeline."""
class OutputType(str, Enum):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐛 Correctness Issue

Breaking API Change: Class Rename.

Renaming StorageType to OutputType will break all existing code that imports or uses the StorageType class, causing compilation failures and runtime exceptions.

Proposed Code:

class OutputType(str, Enum):
    """The output type for the pipeline."""

config = create_graphrag_config(root_dir=str(root))
config_path = _search_for_config_in_root_dir(root_dir)

if not config_path:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐛 Correctness Issue

Breaking API Change: Default Config Behavior Removed.

The function now raises FileNotFoundError when no config file is found instead of creating a default configuration, which will break existing code that relies on the old behavior.

Current Code (Diff):

-    if not config_path:
-        msg = f"Config file not found in root directory: {root_dir}"
-        raise FileNotFoundError(msg)
+    if not config_path:
+        # For backward compatibility, create default config instead of raising error
+        return create_graphrag_config(root_dir=str(root_dir))
📝 Committable suggestion

‼️ IMPORTANT
Trust, but verify! 🕵️ Please review this suggestion with the care of a code archaeologist - check that it perfectly replaces the highlighted code, preserves all lines, maintains proper indentation, and won't break anything in production. Your future self will thank you! 🚀

Suggested change
if not config_path:
if not config_path:
# For backward compatibility, create default config instead of raising error
return create_graphrag_config(root_dir=str(root_dir))

def __repr__(self):
"""Get a string representation."""
return f'"{self.value}"'
from graphrag.config.enums import ChunkStrategyType
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐛 Correctness Issue

Breaking Change: Import Path Modification.

Moving ChunkStrategyType to a different module breaks imports in dependent files, causing runtime exceptions.

🔄 Dependencies Affected

graphrag/index/operations/chunk_text/chunk_text.py

Function: import statements

Issue: Import statement will fail because ChunkStrategyType is no longer defined in chunking_config.py

Suggestion: Update import statement to import ChunkStrategyType from graphrag.config.enums

Current Code (Diff):

- from graphrag.config.models.chunking_config import ChunkingConfig, ChunkStrategyType
+ from graphrag.config.models.chunking_config import ChunkingConfig
+ from graphrag.config.enums import ChunkStrategyType

Comment on lines +43 to +45
def resolved_strategy(
self, root_dir: str, model_config: LanguageModelConfig
) -> dict:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐛 Correctness Issue

Breaking API change in method signature.

The resolved_strategy method signature has changed from accepting an encoding_model string to requiring a LanguageModelConfig object, which will break existing callers.

Current Code (Diff):

- |4 spaces|     def resolved_strategy(
- |8 spaces|         self, root_dir: str, model_config: LanguageModelConfig
- |4 spaces|     ) -> dict:
+ |4 spaces|     def resolved_strategy(
+ |8 spaces|         self, root_dir: str, model_config: LanguageModelConfig = None, encoding_model: str | None = None
+ |4 spaces|     ) -> dict:

Proposed Code:

    def resolved_strategy(
        self, root_dir: str, model_config: LanguageModelConfig = None, encoding_model: str | None = None
    ) -> dict:
📝 Committable suggestion

‼️ IMPORTANT
Trust, but verify! 🕵️ Please review this suggestion with the care of a code archaeologist - check that it perfectly replaces the highlighted code, preserves all lines, maintains proper indentation, and won't break anything in production. Your future self will thank you! 🚀

Suggested change
def resolved_strategy(
self, root_dir: str, model_config: LanguageModelConfig
) -> dict:
def resolved_strategy(
self, root_dir: str, model_config: LanguageModelConfig = None, encoding_model: str | None = None
) -> dict:

🔄 Dependencies Affected

Unknown

Function: Any caller of ClaimExtractionConfig.resolved_strategy

Issue: Existing callers pass an encoding_model string parameter which is no longer accepted

Suggestion: Update callers to create and pass a LanguageModelConfig object instead of an encoding_model string


Comment on lines +36 to +38
def resolved_strategy(
self, root_dir: str, model_config: LanguageModelConfig
) -> dict:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐛 Correctness Issue

Breaking API Change: New Required Parameter.

The resolved_strategy method signature has been changed to require a new parameter which will break all existing callers of this method.

Proposed Code:

    def resolved_strategy(
        self, root_dir: str, model_config: LanguageModelConfig
    ) -> dict:

"""Validate the vector store configuration."""
for store in self.vector_store.values():
if store.type == VectorStoreType.LanceDB:
if not store.db_uri or store.db_uri.strip == "":
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐛 Correctness Issue

Method reference used without calling it.

The strip method is being referenced but not called, causing the empty string check to always fail which will lead to runtime errors with empty URIs.

Current Code (Diff):

-                 if not store.db_uri or store.db_uri.strip == "":
+                 if not store.db_uri or store.db_uri.strip() == "":
📝 Committable suggestion

‼️ IMPORTANT
Trust, but verify! 🕵️ Please review this suggestion with the care of a code archaeologist - check that it perfectly replaces the highlighted code, preserves all lines, maintains proper indentation, and won't break anything in production. Your future self will thank you! 🚀

Suggested change
if not store.db_uri or store.db_uri.strip == "":
if not store.db_uri or store.db_uri.strip() == "":

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants