a3dd470 to ab68b9f
Pull request overview
This PR introduces the GraphRAG LLM package, which consolidates and refactors the language model infrastructure. The changes replace the existing graphrag.language_model module with a new graphrag-llm package, standardizing the interface for completion and embedding models across the codebase.
Changes:
- Removed the existing language model implementation and replaced it with the new `graphrag-llm` package
- Updated configuration models to use `completion_model_id` and `embedding_model_id` instead of generic `model_id`
- Refactored all workflows and operations to use the new LLM interfaces (`LLMCompletion` and `LLMEmbedding`)
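The field rename described in the second bullet can be illustrated with a small migration helper. This is only a sketch: `migrate_model_ids` and its `kind` argument are hypothetical and not part of graphrag or graphrag-llm; only the field names `model_id`, `completion_model_id`, and `embedding_model_id` come from the summary above, and the sample values are made up.

```python
from typing import Any


def migrate_model_ids(section: dict[str, Any], kind: str) -> dict[str, Any]:
    """Return a copy of a config section with `model_id` renamed.

    `kind` is "completion" or "embedding". It is a hypothetical argument
    used only in this sketch to pick the new field name.
    """
    if kind not in ("completion", "embedding"):
        raise ValueError(f"unknown model kind: {kind!r}")
    migrated = dict(section)  # shallow copy, the input is left untouched
    if "model_id" in migrated:
        # Move the old generic value under the new, explicit key.
        migrated[f"{kind}_model_id"] = migrated.pop("model_id")
    return migrated


# Illustrative values only; real config sections have more fields.
old = {"model_id": "default_chat_model", "max_length": 2000}
new = migrate_model_ids(old, "completion")
```

Returning a copy rather than mutating in place keeps the old section available for comparison while a configuration is being migrated.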
Reviewed changes
Copilot reviewed 257 out of 260 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `tests/verbs/util.py` | Removed deprecated model configuration constants |
| `tests/verbs/test_*.py` | Updated test files to use new configuration utility functions |
| `tests/unit/config/utils.py` | Added new model config assertion helpers and updated default configs |
| `tests/unit/config/test_config.py` | Removed validation tests that are now handled by the LLM package |
| `packages/graphrag/graphrag/config/models/*_config.py` | Updated field names from `model_id` to `completion_model_id`/`embedding_model_id` |
| `packages/graphrag/graphrag/index/workflows/*.py` | Refactored to use new LLM creation functions and interfaces |
| `packages/graphrag/graphrag/query/**/*.py` | Updated query components to use new LLM interfaces |
| `packages/graphrag/graphrag/language_model/**` | Removed old language model implementation |
Copilot reviewed 251 out of 265 changed files in this pull request and generated no new comments.
* Remove graph embedding and UMAP (#2048) * Remove umap/layout operation * Remove graph embedding * Bump unified-search to GR 2.5.0 * Remove graph vis from unified-search * Remove file filtering (#2050) * Remove document filtering * Semver * Fix integ tests * Fix file find tuple * Fix another dangling find tuple * Remove text unit grouping (#2052) * Remove text unit group_by_columns * Semver * Fix default token split test * Fix models in config test samples * Fix token length in context sort test * Fix document sort * Re-implement hierarchical Leiden (#2049) * Use graspologic-native hierarchical leiden * Re-implement largest_connected_component * Copy in modularity * Use graspologic-native directly in pyproject * Remove directed graph tests (we don't use this) * Semver * Remove graspologic dep * Use 4.1 and text-embedding-3-large as defaults * Update comment * Clean vector store (#2077) * clean vector store code * fix * fix launch.json --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Update v3/main missing config + functions (#2082) * reduce schema fields (#2089) * reduce schema fields * fix launch.json --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Remove strategy dicts (#2090) * Remove "strategy" from community reports config/workflow * Remove extraction strategy from extract_graph * Remove summarization strategy from extract_graph * Remove strategy from claim extraction * Strongly type prompt templates * Remove strategy from embed_text * Push hydrated params into community report workflows * Push hyrdated params into extract covariates * Push hydrated params into extract graph NLP * Push hydrated params into extract graph * Push hydrated params into text embeddings * Remove a few more low-level defaults * Semver * Remove configurable prompt delimiters * Update smoke tests * Remove fnllm (#2095) * Sort deps alpha * Remove multi search (#2093) * Remove multi-search from 
CLI * Remove multi-search from API * Flatten vector_store config * Push hydrated vector store down to embed_text * Remove outputs from config * Remove multi-search notebook/docs * Add missing response_type in basic search API * Fix basic search context and id mapping * Fix v1 migration notebook * Fix query entity search tests * V3 docs and cleanup (#2100) * Remove community contrib notebooks * Add migration notebook and breaking changes page edits * Update/polish docs * Make model instance name configurable * Add vector schema updates to v3 migration notebook * Spellcheck * Bump smoke test runtimes * Remove document overwrite (#2101) * remove document overwrite from vector store configuration * remove document overwrite and refactor load documents method * fix test * fix test * fix test --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Unified factory (#2105) * Simplify Factory interface * Migrate CacheFactory to standard base class * Migrate LoggerFactory to standard base class * Migrate StorageFactory to standard base class * Migrate VectorStoreFactory to standard base class * Update vector store example notebook * Delete notebook outputs * Move default providers into factories * Move retry/limit tests into integ * Split language model factories * Set smoke test tpm/rpm * Fix factory integ tests * Add method to smoke test, switch text to 'fast' * Fix text smoke config for fast workflow * Add new workflows to text smoke test * Convert input readers to a proper factory * Remove covariates from fast smoke test * Update docs for input factory * Bump smoke runtime * Even longer runtime * min-csv timeout * Remove unnecessary lambdas * Prefix vector store (#2106) * add prefix to vector store configuration and removal of container name * docs updated * change prefix property name * change prefix property name * feedback implemented --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * 
fix for container name * Restructure project as monorepo. (#2111) * Restructure project as monorepo. * Fix formatting * Storage fixes and cleanup (#2118) * Fix pipeline recursion * Remove base_dir from storage.find * Remove max_count from storage.find * Remove prefix on storage integ test * Add base_dir in creation_date test * Wrap base_dir in Path * Use constants for input/update directories * Nov 2025 housekeeping (#2120) * Remove gensim sideload * Split CI build/type checks from unit tests * Thorough review of docs to align with v3 * Format * Fix version * Fix type * Graphrag config (#2119) * Add load_config to graphrag-common package. * Empty graph guards (#2126) * Remove networkx from graph_extractor and clean out redundancy * Bubble pipeline error to console * Remove embeddings optional new (#2128) * remove optional embeddings * fix test * fix tests * fix pipeline * fix test * fix test * fix test * fix tests --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Format * Add empty checks for NLP graphs (#2133) * Init command asks for models (#2137) * Add init prompting for models * Remove hard-coded model config validation * Switch to typer option prompt for full CLI use with models * Update getting started for init model input * Bump request timeout and overall smoke test timeout * Add graphrag-storage. (#2127) * Add graphrag-storage. 
* Python update (3.13) (#2149) * Update to python 3.14 as default, with range down to 3.10 * Fix enum value in query cli * Update pyarrow * Update py version for storage package * Remove 3.10 * add fastuuid * Update Python support to 3.11-3.14 with stricter dependency constraints - Set minimum Python version to 3.11 (removed 3.10 support) - Added support for Python 3.14 - Updated CI workflows: single-version jobs use 3.14, matrix jobs use 3.11 and 3.14 - Fixed license format to use SPDX-compatible format for Python 3.14 - Updated pyarrow to >=22.0.0 for Python 3.14 wheel support - Added explicit fastuuid~=0.14 and blis~=1.3 for Python 3.14 compatibility - Replaced all loose version constraints (>=) with compatible release (~=) for better lock file control - Applied stricter versioning to all packages: graphrag, graphrag-common, graphrag-storage, unified-search-app * update uv lock * Pin blis to ~=1.3.3 to ensure Python 3.14 wheel availability * Update uv lock * Update numpy to >=2.0.0 for Python 3.14 Windows compatibility Numpy 1.25.x has access violation issues on Python 3.14 Windows. Numpy 2.x has proper Python 3.14 support including Windows wheels. * update uv lock * Update pandas to >=2.3.0 for numpy 2.x compatibility Pandas 2.2.x was compiled against numpy 1.x and causes ABI incompatibility errors with numpy 2.x. Pandas 2.3.0+ supports numpy 2.x properly. * update uv.lock * Add scipy>=1.15.0 for numpy 2.x compatibility Scipy versions < 1.15.0 have C extensions built against numpy 1.x and are incompatible with numpy 2.x, causing dtype size errors. 
* update uv lock * Update Python support to 3.11-3.13 with compatible dependencies - Set Python version range to 3.11-3.13 (removed 3.14 support) - Updated CI workflows: single-version jobs use 3.13, matrix jobs use 3.11 and 3.13 - Dependencies optimized for Python 3.13 compatibility: - pyarrow~=22.0 (has Python 3.13 wheels) - numpy~=1.26 - pandas~=2.2 - blis~=1.0 - fastuuid~=0.13 - Applied stricter version constraints using ~= operator throughout - Updated uv.lock with resolved dependencies * Update numpy to 2.1+ and pandas to 2.3+ for Python 3.13 Windows compatibility Numpy 1.26.x causes access violations on Python 3.13 Windows. Numpy 2.1+ has proper Python 3.13 support with Windows wheels. Pandas 2.3+ is required for numpy 2.x compatibility. * update vsts.yml python version * Add GraphRAG Cache package. (#2153) * Add GraphRAG Cache package. * Fix a bunch of module comments and function visibility (#2154) * Issue #2004 fix (#2159) * fix issue #2004 using KeenhoChu idea in his PR * add unit test for dynamic community selection * add unit test for dynamic community selection implementing #2158 logic --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Mismatch between header in community report generation prompt examples and input data (id vs human_readable_id) (#2161) * fix issue #860 for mismatch in prompts and input * fix format --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Chunker factory (#2156) * Delete NoopTextSplitter * Delete unused check_token_limit * Add base chunking factory and migrate workflow to use it * Split apart chunker module * Co-locate chunking/splitting * Collapse token splitting functionality into one class/function * Restore create_base_text_units parameterization * Move Tokenizer base class to common package * Move pre-pending into chunkers * Streamline config * Fix defaults construction * Add prepending tests * Remove chunk_size_includes_metadata 
config * Revert ChunkingDocument interface * Move metadata prepending to a util * Move Tokenizer back to GR core * Fix tokenizer removal from chunker * Set defaults for chunking config * Move chunking to monorepo package * Format * Typo * Add ChunkResult model * Streamline chunking config * Add missing version updates for graphrag_chunking * Input factory (#2168) * Update input factory to match other factories * Move input config alongside input readers * Move file pattern logic into InputReader * Set encoding default * Clean up optional column configs * Combine structured data extraction * Remove pandas from input loading * Throw if empty documents * Add json lines (jsonl) input support * Store raw data * Fix merge imports * Move metadata handling entirely to chunking * Nicer automatic title * Typo * Add get_property utility for nested dictionary access with dot notation * Update structured_file_reader to use get_property utility * Extract input module into new graphrag-input monorepo package - Create new graphrag-input package with input loading utilities - Move InputConfig, InputFileType, InputReader, TextDocument, and file readers (CSV, JSON, JSONL, Text) - Add get_property utility for nested dictionary access with dot notation - Include hashing utility for document ID generation - Update all imports throughout codebase to use graphrag_input - Add package to workspace configuration and release tasks - Remove old graphrag.index.input module * Rename ChunkResult to TextChunk and add transformer support - Rename chunk_result.py to text_chunk.py with ChunkResult -> TextChunk - Add 'original' field to TextChunk to track pre-transform text - Add optional transform callback to chunker.chunk() method - Add add_metadata transformer for prepending metadata to chunks - Update create_chunk_results to apply transforms and populate original - Update sentence_chunker and token_chunker with transform support - Refactor create_base_text_units to use new transformer pattern - 
Rename pluck_metadata to get/collect methods on TextDocument * Back-compat comment * Align input config type name with other factory configs * Add MarkItDown support * Remove pattern default from MarkItDown reader * Remove plugins flag (implicit disabled) * Format * Update verb tests * Separate storage from input config * Add empty objects for NaN raw_data * Fix smoke tests * Fix BOM in csv smoke * Format * DRIFT fixes (#2171) * Use stable ids for community reports * Remove deprecated title from embedding flow * Remove embedding column from df loaders * Fix lancedb insertion * Add drift back to smoke tests * Fix mock embedder to match default embedding length * Fix DRIFT notebook * Push drift_k_followups through to prompt * Format * Vector package (#2172) * Extract graphrag-vectors package * Simplify vector factory usage and config defaults * Update factory integ initializers * Fix mock patch * Format * Register vector stores in tests * Set a default vector store name * Update vector readme * Remove impls from init * Move some validation into impls * Remove index_prefix * Move duplicate method to base class * Fix smoke vector config * Update index bug (#2173) * fix update index bug * blob storage bug fix --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Add GraphRAG LLM package. (#2174) * Update documentation for v3 release (#2176) update documentation for v3 release Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Graphrag llm cleanup (#2181) * Migration update (#2180) * fix formatting. --------- Co-authored-by: Nathan Evans <github@talkswithnumbers.com> Co-authored-by: gaudyb <85708998+gaudyb@users.noreply.github.com> Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> Co-authored-by: Andres Morales <86074752+andresmor-ms@users.noreply.github.com>
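The input factory commit listed above mentions a `get_property` utility for nested dictionary access with dot notation. A minimal sketch of that idea follows. The signature `get_property(data, path, default=None)` and the fall-back-to-default behaviour are assumptions made for illustration; the actual utility in graphrag-input may differ (for example, it may raise on a missing key).

```python
from typing import Any

# Sentinel so that a stored value of None is not confused with a missing key.
_MISSING = object()


def get_property(data: dict[str, Any], path: str, default: Any = None) -> Any:
    """Look up a nested value using a dot-separated path such as "a.b.c".

    Assumed behaviour for this sketch: return `default` when any segment
    of the path is missing or when an intermediate value is not a dict.
    """
    current: Any = data
    for key in path.split("."):
        if not isinstance(current, dict):
            return default
        current = current.get(key, _MISSING)
        if current is _MISSING:
            return default
    return current


# Hypothetical record shape, used only to exercise the helper.
record = {"meta": {"author": {"name": "Ada"}, "year": None}}
author = get_property(record, "meta.author.name")
missing = get_property(record, "meta.publisher.name", default="unknown")
```

A structured file reader could use a helper like this to let a configured column name such as `meta.author.name` address a field inside nested JSON.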
* Pin pandas (microsoft#2179) * Release v2.7.1 (microsoft#2186) * Release v2.7.1 (microsoft#2187) * Update Python publish workflow for PyPI (microsoft#2188) Debug publish workflow * V3/main (microsoft#2190) * Remove graph embedding and UMAP (microsoft#2048) * Remove umap/layout operation * Remove graph embedding * Bump unified-search to GR 2.5.0 * Remove graph vis from unified-search * Remove file filtering (microsoft#2050) * Remove document filtering * Semver * Fix integ tests * Fix file find tuple * Fix another dangling find tuple * Remove text unit grouping (microsoft#2052) * Remove text unit group_by_columns * Semver * Fix default token split test * Fix models in config test samples * Fix token length in context sort test * Fix document sort * Re-implement hierarchical Leiden (microsoft#2049) * Use graspologic-native hierarchical leiden * Re-implement largest_connected_component * Copy in modularity * Use graspologic-native directly in pyproject * Remove directed graph tests (we don't use this) * Semver * Remove graspologic dep * Use 4.1 and text-embedding-3-large as defaults * Update comment * Clean vector store (microsoft#2077) * clean vector store code * fix * fix launch.json --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Update v3/main missing config + functions (microsoft#2082) * reduce schema fields (microsoft#2089) * reduce schema fields * fix launch.json --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Remove strategy dicts (microsoft#2090) * Remove "strategy" from community reports config/workflow * Remove extraction strategy from extract_graph * Remove summarization strategy from extract_graph * Remove strategy from claim extraction * Strongly type prompt templates * Remove strategy from embed_text * Push hydrated params into community report workflows * Push hyrdated params into extract covariates * Push hydrated params into extract graph NLP * Push hydrated 
params into extract graph * Push hydrated params into text embeddings * Remove a few more low-level defaults * Semver * Remove configurable prompt delimiters * Update smoke tests * Remove fnllm (microsoft#2095) * Sort deps alpha * Remove multi search (microsoft#2093) * Remove multi-search from CLI * Remove multi-search from API * Flatten vector_store config * Push hydrated vector store down to embed_text * Remove outputs from config * Remove multi-search notebook/docs * Add missing response_type in basic search API * Fix basic search context and id mapping * Fix v1 migration notebook * Fix query entity search tests * V3 docs and cleanup (microsoft#2100) * Remove community contrib notebooks * Add migration notebook and breaking changes page edits * Update/polish docs * Make model instance name configurable * Add vector schema updates to v3 migration notebook * Spellcheck * Bump smoke test runtimes * Remove document overwrite (microsoft#2101) * remove document overwrite from vector store configuration * remove document overwrite and refactor load documents method * fix test * fix test * fix test --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Unified factory (microsoft#2105) * Simplify Factory interface * Migrate CacheFactory to standard base class * Migrate LoggerFactory to standard base class * Migrate StorageFactory to standard base class * Migrate VectorStoreFactory to standard base class * Update vector store example notebook * Delete notebook outputs * Move default providers into factories * Move retry/limit tests into integ * Split language model factories * Set smoke test tpm/rpm * Fix factory integ tests * Add method to smoke test, switch text to 'fast' * Fix text smoke config for fast workflow * Add new workflows to text smoke test * Convert input readers to a proper factory * Remove covariates from fast smoke test * Update docs for input factory * Bump smoke runtime * Even longer runtime * min-csv timeout * 
Remove unnecessary lambdas * Prefix vector store (microsoft#2106) * add prefix to vector store configuration and removal of container name * docs updated * change prefix property name * change prefix property name * feedback implemented --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * fix for container name * Restructure project as monorepo. (microsoft#2111) * Restructure project as monorepo. * Fix formatting * Storage fixes and cleanup (microsoft#2118) * Fix pipeline recursion * Remove base_dir from storage.find * Remove max_count from storage.find * Remove prefix on storage integ test * Add base_dir in creation_date test * Wrap base_dir in Path * Use constants for input/update directories * Nov 2025 housekeeping (microsoft#2120) * Remove gensim sideload * Split CI build/type checks from unit tests * Thorough review of docs to align with v3 * Format * Fix version * Fix type * Graphrag config (microsoft#2119) * Add load_config to graphrag-common package. * Empty graph guards (microsoft#2126) * Remove networkx from graph_extractor and clean out redundancy * Bubble pipeline error to console * Remove embeddings optional new (microsoft#2128) * remove optional embeddings * fix test * fix tests * fix pipeline * fix test * fix test * fix test * fix tests --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Format * Add empty checks for NLP graphs (microsoft#2133) * Init command asks for models (microsoft#2137) * Add init prompting for models * Remove hard-coded model config validation * Switch to typer option prompt for full CLI use with models * Update getting started for init model input * Bump request timeout and overall smoke test timeout * Add graphrag-storage. (microsoft#2127) * Add graphrag-storage. 
* Python update (3.13) (microsoft#2149) * Update to python 3.14 as default, with range down to 3.10 * Fix enum value in query cli * Update pyarrow * Update py version for storage package * Remove 3.10 * add fastuuid * Update Python support to 3.11-3.14 with stricter dependency constraints - Set minimum Python version to 3.11 (removed 3.10 support) - Added support for Python 3.14 - Updated CI workflows: single-version jobs use 3.14, matrix jobs use 3.11 and 3.14 - Fixed license format to use SPDX-compatible format for Python 3.14 - Updated pyarrow to >=22.0.0 for Python 3.14 wheel support - Added explicit fastuuid~=0.14 and blis~=1.3 for Python 3.14 compatibility - Replaced all loose version constraints (>=) with compatible release (~=) for better lock file control - Applied stricter versioning to all packages: graphrag, graphrag-common, graphrag-storage, unified-search-app * update uv lock * Pin blis to ~=1.3.3 to ensure Python 3.14 wheel availability * Update uv lock * Update numpy to >=2.0.0 for Python 3.14 Windows compatibility Numpy 1.25.x has access violation issues on Python 3.14 Windows. Numpy 2.x has proper Python 3.14 support including Windows wheels. * update uv lock * Update pandas to >=2.3.0 for numpy 2.x compatibility Pandas 2.2.x was compiled against numpy 1.x and causes ABI incompatibility errors with numpy 2.x. Pandas 2.3.0+ supports numpy 2.x properly. * update uv.lock * Add scipy>=1.15.0 for numpy 2.x compatibility Scipy versions < 1.15.0 have C extensions built against numpy 1.x and are incompatible with numpy 2.x, causing dtype size errors. 
* update uv lock * Update Python support to 3.11-3.13 with compatible dependencies - Set Python version range to 3.11-3.13 (removed 3.14 support) - Updated CI workflows: single-version jobs use 3.13, matrix jobs use 3.11 and 3.13 - Dependencies optimized for Python 3.13 compatibility: - pyarrow~=22.0 (has Python 3.13 wheels) - numpy~=1.26 - pandas~=2.2 - blis~=1.0 - fastuuid~=0.13 - Applied stricter version constraints using ~= operator throughout - Updated uv.lock with resolved dependencies * Update numpy to 2.1+ and pandas to 2.3+ for Python 3.13 Windows compatibility Numpy 1.26.x causes access violations on Python 3.13 Windows. Numpy 2.1+ has proper Python 3.13 support with Windows wheels. Pandas 2.3+ is required for numpy 2.x compatibility. * update vsts.yml python version * Add GraphRAG Cache package. (microsoft#2153) * Add GraphRAG Cache package. * Fix a bunch of module comments and function visibility (microsoft#2154) * Issue microsoft#2004 fix (microsoft#2159) * fix issue microsoft#2004 using KeenhoChu idea in his PR * add unit test for dynamic community selection * add unit test for dynamic community selection implementing microsoft#2158 logic --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Mismatch between header in community report generation prompt examples and input data (id vs human_readable_id) (microsoft#2161) * fix issue microsoft#860 for mismatch in prompts and input * fix format --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Chunker factory (microsoft#2156) * Delete NoopTextSplitter * Delete unused check_token_limit * Add base chunking factory and migrate workflow to use it * Split apart chunker module * Co-locate chunking/splitting * Collapse token splitting functionality into one class/function * Restore create_base_text_units parameterization * Move Tokenizer base class to common package * Move pre-pending into chunkers * Streamline config * Fix 
defaults construction * Add prepending tests * Remove chunk_size_includes_metadata config * Revert ChunkingDocument interface * Move metadata prepending to a util * Move Tokenizer back to GR core * Fix tokenizer removal from chunker * Set defaults for chunking config * Move chunking to monorepo package * Format * Typo * Add ChunkResult model * Streamline chunking config * Add missing version updates for graphrag_chunking * Input factory (microsoft#2168) * Update input factory to match other factories * Move input config alongside input readers * Move file pattern logic into InputReader * Set encoding default * Clean up optional column configs * Combine structured data extraction * Remove pandas from input loading * Throw if empty documents * Add json lines (jsonl) input support * Store raw data * Fix merge imports * Move metadata handling entirely to chunking * Nicer automatic title * Typo * Add get_property utility for nested dictionary access with dot notation * Update structured_file_reader to use get_property utility * Extract input module into new graphrag-input monorepo package - Create new graphrag-input package with input loading utilities - Move InputConfig, InputFileType, InputReader, TextDocument, and file readers (CSV, JSON, JSONL, Text) - Add get_property utility for nested dictionary access with dot notation - Include hashing utility for document ID generation - Update all imports throughout codebase to use graphrag_input - Add package to workspace configuration and release tasks - Remove old graphrag.index.input module * Rename ChunkResult to TextChunk and add transformer support - Rename chunk_result.py to text_chunk.py with ChunkResult -> TextChunk - Add 'original' field to TextChunk to track pre-transform text - Add optional transform callback to chunker.chunk() method - Add add_metadata transformer for prepending metadata to chunks - Update create_chunk_results to apply transforms and populate original - Update sentence_chunker and token_chunker 
with transform support - Refactor create_base_text_units to use new transformer pattern - Rename pluck_metadata to get/collect methods on TextDocument * Back-compat comment * Align input config type name with other factory configs * Add MarkItDown support * Remove pattern default from MarkItDown reader * Remove plugins flag (implicit disabled) * Format * Update verb tests * Separate storage from input config * Add empty objects for NaN raw_data * Fix smoke tests * Fix BOM in csv smoke * Format * DRIFT fixes (microsoft#2171) * Use stable ids for community reports * Remove deprecated title from embedding flow * Remove embedding column from df loaders * Fix lancedb insertion * Add drift back to smoke tests * Fix mock embedder to match default embedding length * Fix DRIFT notebook * Push drift_k_followups through to prompt * Format * Vector package (microsoft#2172) * Extract graphrag-vectors package * Simplify vector factory usage and config defaults * Update factory integ initializers * Fix mock patch * Format * Register vector stores in tests * Set a default vector store name * Update vector readme * Remove impls from init * Move some validation into impls * Remove index_prefix * Move duplicate method to base class * Fix smoke vector config * Update index bug (microsoft#2173) * fix update index bug * blob storage bug fix --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Add GraphRAG LLM package. (microsoft#2174) * Update documentation for v3 release (microsoft#2176) update documentation for v3 release Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Graphrag llm cleanup (microsoft#2181) * Migration update (microsoft#2180) * fix formatting. 
--------- Co-authored-by: Nathan Evans <github@talkswithnumbers.com> Co-authored-by: gaudyb <85708998+gaudyb@users.noreply.github.com> Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> Co-authored-by: Andres Morales <86074752+andresmor-ms@users.noreply.github.com> * Release v3.0.0 (microsoft#2191) * Fix deps (microsoft#2193) * fix missing project urls * fix missing deps. * Release v3.0.1 (microsoft#2195) * add TableProvider to enable future row-by-row streaming (microsoft#2189) * write dataframe * changed some workflows * 1a * add fixed files * add versioning * add patch and remove utility * pr changes * Python 3.13 (microsoft#2208) * make graphrag-llm supports 3.13 * Semver --------- Co-authored-by: Deo <liangzhanzhao@metrodata.cn> Co-authored-by: Zhanzhao (Deo) Liang <liangzhanzhao1985@gmail.com> * update vector store example. (microsoft#2202) * Table factory (microsoft#2214) * Add table provider factory * Semver * Remove unnecessary response format check. (microsoft#2213) - Fixes: microsoft#2203 * add csv table provider (microsoft#2215) * add csv table provider * add in provider * add semver * change list_tables to list() * Add DataReader class for typed dataframe loading (microsoft#2220) * Add DataReader class for typed dataframe loading Introduce DataReader that wraps TableProvider and applies type coercion functions when loading dataframes from weakly-typed formats (e.g. CSV). 
- Add DataReader class with methods for each table type: entities, relationships, communities, community_reports, covariates, text_units, and documents - Add typed loading functions in dfs.py for community_reports, covariates, text_units, and documents (entities, relationships, communities already existed) - Integrate DataReader into all 17 indexing workflows replacing raw read_dataframe calls - Integrate DataReader into CLI query's _resolve_output_files for typed loading across all search types (global, local, drift, basic) - Export DataReader from data_model package __init__ * Fix column check * Add notebook example support for each package (microsoft#2205) * add notebook example support for each package * add notebook example support for each package * semversioner change * feedback implemented for notebooks * feedback implemented for notebooks * feedback implemented for notebooks --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Streamline workflows (microsoft#2225) * Move document ID, human_readable_id, and raw_data setup from create_final_documents into load workflows Consolidates core document field initialization (id string cast, human_readable_id index, raw_data default) into load_input_documents and load_update_documents so that create_final_documents only handles the text unit join. Also applies the same setup in the run_pipeline input_documents bypass paths. * Remove overzealous input document assignment * Semver * Format * Add async iterator support to InputReader and use in load workflows (microsoft#2226) * Add async iterator support to InputReader and use in load workflows InputReader now implements __aiter__ so it can be used as `async for doc in reader`. The core iteration logic is in _iterate_files(), and read_files() delegates to the iterator for batch loading. Both load_input_documents and load_update_documents workflows now use the async iterator with dataclasses.asdict for DataFrame construction. 
* Format * add memory profiling (microsoft#2227) * add profiling * add unit test for profiling * fix property name --------- Co-authored-by: Nathan Evans <github@talkswithnumbers.com> Co-authored-by: Derek Worthen <worthend.derek@gmail.com> Co-authored-by: Alonso Guevara <alonsog@microsoft.com> Co-authored-by: gaudyb <85708998+gaudyb@users.noreply.github.com> Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> Co-authored-by: Andres Morales <86074752+andresmor-ms@users.noreply.github.com> Co-authored-by: Dayenne Souza <ddesouza@microsoft.com> Co-authored-by: Deo <liangzhanzhao@metrodata.cn> Co-authored-by: Zhanzhao (Deo) Liang <liangzhanzhao1985@gmail.com>
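The async iterator commit above describes a reader that implements `__aiter__`, keeps its core loop in `_iterate_files()`, has `read_files()` delegate to the iterator, and builds rows with `dataclasses.asdict`. The following is a rough sketch of that pattern, not the graphrag-input implementation: `SketchReader`, `Doc`, and `load_rows` are hypothetical names, and the in-memory `files` dict stands in for real storage.

```python
import asyncio
from collections.abc import AsyncIterator
from dataclasses import asdict, dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded text document."""

    id: str
    text: str


class SketchReader:
    """Hypothetical reader that yields documents one at a time."""

    def __init__(self, files: dict[str, str]) -> None:
        self._files = files

    async def _iterate_files(self) -> AsyncIterator[Doc]:
        # Core iteration logic lives here, one document per file.
        for name, text in self._files.items():
            await asyncio.sleep(0)  # stand-in for real async file I/O
            yield Doc(id=name, text=text)

    def __aiter__(self) -> AsyncIterator[Doc]:
        # Enables `async for doc in reader`.
        return self._iterate_files()

    async def read_files(self) -> list[Doc]:
        # Batch loading delegates to the iterator.
        return [doc async for doc in self]


async def load_rows(reader: SketchReader) -> list[dict[str, str]]:
    """Collect plain dict rows, ready for DataFrame construction."""
    rows = []
    async for doc in reader:
        rows.append(asdict(doc))
    return rows


rows = asyncio.run(load_rows(SketchReader({"a.txt": "alpha", "b.txt": "beta"})))
```

Because `__aiter__` returns a fresh async generator on each call, the same reader can be iterated more than once, and a caller that processes documents row by row never needs the full batch in memory.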
* initial version * update version * implementation of CommunityDetection * add vector search * improve community detection * add Sample program * fix the prompts * update community summarization * improvements * Use structured AiServices * small improvements * Query (#2) * small improvements * add initial version of query part * add advanced methods * update prompts * update * add global and drift mode. * drift mode and global mode improvement * improve the query part * getting closer * small implement * parameterized * Sample query program * small improvements * add reading parquet. * update * improvement * update * CLI improvement * default values * add question generator * getting closer * get closer * Add progress logging * update based on review * improvements based on review * 📝 Add docstrings to `query` Docstrings generation was requested by @takanori-ugai. * #2 (comment) The following files were modified: * `kotlin/src/main/kotlin/com/microsoft/graphrag/SampleIndexer.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/SampleQueries.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/cli/GraphRagCli.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/index/CommunityReportWorkflow.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/index/EmbedWorkflow.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/index/ExtractGraphWorkflow.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/index/LocalVectorStore.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/index/PipelineTypes.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/index/RunPipeline.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/index/StateCodec.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/index/Workflows.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/logger/Progress.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/query/AdvancedQueryEngines.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/query/BasicQueryEngine.kt` * 
`kotlin/src/main/kotlin/com/microsoft/graphrag/query/BasicSearchContextBuilder.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/query/CollectingQueryCallbacks.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/query/ContextRecords.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/query/DriftSearchEngine.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/query/GlobalSearchEngine.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/query/LocalSearchContextBuilder.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/query/NameUtils.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/query/QueryCallbacks.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/query/QueryConfigLoader.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/query/QueryIndexLoader.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/query/QuestionGen.kt` --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * improvement based on review * fix some warnings * rest of implementation * add AGENTS.md * Catch up (#4) * Pin pandas (microsoft#2179) * Release v2.7.1 (microsoft#2186) * Release v2.7.1 (microsoft#2187) * Update Python publish workflow for PyPI (microsoft#2188) Debug publish workflow * V3/main (microsoft#2190) * Remove graph embedding and UMAP (microsoft#2048) * Remove umap/layout operation * Remove graph embedding * Bump unified-search to GR 2.5.0 * Remove graph vis from unified-search * Remove file filtering (microsoft#2050) * Remove document filtering * Semver * Fix integ tests * Fix file find tuple * Fix another dangling find tuple * Remove text unit grouping (microsoft#2052) * Remove text unit group_by_columns * Semver * Fix default token split test * Fix models in config test samples * Fix token length in context sort test * Fix document sort * Re-implement hierarchical Leiden (microsoft#2049) * Use graspologic-native hierarchical leiden * Re-implement largest_connected_component * Copy in modularity * Use graspologic-native 
directly in pyproject * Remove directed graph tests (we don't use this) * Semver * Remove graspologic dep * Use 4.1 and text-embedding-3-large as defaults * Update comment * Clean vector store (microsoft#2077) * clean vector store code * fix * fix launch.json --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Update v3/main missing config + functions (microsoft#2082) * reduce schema fields (microsoft#2089) * reduce schema fields * fix launch.json --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Remove strategy dicts (microsoft#2090) * Remove "strategy" from community reports config/workflow * Remove extraction strategy from extract_graph * Remove summarization strategy from extract_graph * Remove strategy from claim extraction * Strongly type prompt templates * Remove strategy from embed_text * Push hydrated params into community report workflows * Push hydrated params into extract covariates * Push hydrated params into extract graph NLP * Push hydrated params into extract graph * Push hydrated params into text embeddings * Remove a few more low-level defaults * Semver * Remove configurable prompt delimiters * Update smoke tests * Remove fnllm (microsoft#2095) * Sort deps alpha * Remove multi search (microsoft#2093) * Remove multi-search from CLI * Remove multi-search from API * Flatten vector_store config * Push hydrated vector store down to embed_text * Remove outputs from config * Remove multi-search notebook/docs * Add missing response_type in basic search API * Fix basic search context and id mapping * Fix v1 migration notebook * Fix query entity search tests * V3 docs and cleanup (microsoft#2100) * Remove community contrib notebooks * Add migration notebook and breaking changes page edits * Update/polish docs * Make model instance name configurable * Add vector schema updates to v3 migration notebook * Spellcheck * Bump smoke test runtimes * Remove document overwrite 
(microsoft#2101) * remove document overwrite from vector store configuration * remove document overwrite and refactor load documents method * fix test * fix test * fix test --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Unified factory (microsoft#2105) * Simplify Factory interface * Migrate CacheFactory to standard base class * Migrate LoggerFactory to standard base class * Migrate StorageFactory to standard base class * Migrate VectorStoreFactory to standard base class * Update vector store example notebook * Delete notebook outputs * Move default providers into factories * Move retry/limit tests into integ * Split language model factories * Set smoke test tpm/rpm * Fix factory integ tests * Add method to smoke test, switch text to 'fast' * Fix text smoke config for fast workflow * Add new workflows to text smoke test * Convert input readers to a proper factory * Remove covariates from fast smoke test * Update docs for input factory * Bump smoke runtime * Even longer runtime * min-csv timeout * Remove unnecessary lambdas * Prefix vector store (microsoft#2106) * add prefix to vector store configuration and removal of container name * docs updated * change prefix property name * change prefix property name * feedback implemented --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * fix for container name * Restructure project as monorepo. (microsoft#2111) * Restructure project as monorepo. 
* Fix formatting * Storage fixes and cleanup (microsoft#2118) * Fix pipeline recursion * Remove base_dir from storage.find * Remove max_count from storage.find * Remove prefix on storage integ test * Add base_dir in creation_date test * Wrap base_dir in Path * Use constants for input/update directories * Nov 2025 housekeeping (microsoft#2120) * Remove gensim sideload * Split CI build/type checks from unit tests * Thorough review of docs to align with v3 * Format * Fix version * Fix type * Graphrag config (microsoft#2119) * Add load_config to graphrag-common package. * Empty graph guards (microsoft#2126) * Remove networkx from graph_extractor and clean out redundancy * Bubble pipeline error to console * Remove embeddings optional new (microsoft#2128) * remove optional embeddings * fix test * fix tests * fix pipeline * fix test * fix test * fix test * fix tests --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Format * Add empty checks for NLP graphs (microsoft#2133) * Init command asks for models (microsoft#2137) * Add init prompting for models * Remove hard-coded model config validation * Switch to typer option prompt for full CLI use with models * Update getting started for init model input * Bump request timeout and overall smoke test timeout * Add graphrag-storage. (microsoft#2127) * Add graphrag-storage. 
* Python update (3.13) (microsoft#2149) * Update to python 3.14 as default, with range down to 3.10 * Fix enum value in query cli * Update pyarrow * Update py version for storage package * Remove 3.10 * add fastuuid * Update Python support to 3.11-3.14 with stricter dependency constraints - Set minimum Python version to 3.11 (removed 3.10 support) - Added support for Python 3.14 - Updated CI workflows: single-version jobs use 3.14, matrix jobs use 3.11 and 3.14 - Fixed license format to use SPDX-compatible format for Python 3.14 - Updated pyarrow to >=22.0.0 for Python 3.14 wheel support - Added explicit fastuuid~=0.14 and blis~=1.3 for Python 3.14 compatibility - Replaced all loose version constraints (>=) with compatible release (~=) for better lock file control - Applied stricter versioning to all packages: graphrag, graphrag-common, graphrag-storage, unified-search-app * update uv lock * Pin blis to ~=1.3.3 to ensure Python 3.14 wheel availability * Update uv lock * Update numpy to >=2.0.0 for Python 3.14 Windows compatibility Numpy 1.25.x has access violation issues on Python 3.14 Windows. Numpy 2.x has proper Python 3.14 support including Windows wheels. * update uv lock * Update pandas to >=2.3.0 for numpy 2.x compatibility Pandas 2.2.x was compiled against numpy 1.x and causes ABI incompatibility errors with numpy 2.x. Pandas 2.3.0+ supports numpy 2.x properly. * update uv.lock * Add scipy>=1.15.0 for numpy 2.x compatibility Scipy versions < 1.15.0 have C extensions built against numpy 1.x and are incompatible with numpy 2.x, causing dtype size errors. 
* update uv lock * Update Python support to 3.11-3.13 with compatible dependencies - Set Python version range to 3.11-3.13 (removed 3.14 support) - Updated CI workflows: single-version jobs use 3.13, matrix jobs use 3.11 and 3.13 - Dependencies optimized for Python 3.13 compatibility: - pyarrow~=22.0 (has Python 3.13 wheels) - numpy~=1.26 - pandas~=2.2 - blis~=1.0 - fastuuid~=0.13 - Applied stricter version constraints using ~= operator throughout - Updated uv.lock with resolved dependencies * Update numpy to 2.1+ and pandas to 2.3+ for Python 3.13 Windows compatibility Numpy 1.26.x causes access violations on Python 3.13 Windows. Numpy 2.1+ has proper Python 3.13 support with Windows wheels. Pandas 2.3+ is required for numpy 2.x compatibility. * update vsts.yml python version * Add GraphRAG Cache package. (microsoft#2153) * Add GraphRAG Cache package. * Fix a bunch of module comments and function visibility (microsoft#2154) * Issue microsoft#2004 fix (microsoft#2159) * fix issue microsoft#2004 using KeenhoChu idea in his PR * add unit test for dynamic community selection * add unit test for dynamic community selection implementing microsoft#2158 logic --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Mismatch between header in community report generation prompt examples and input data (id vs human_readable_id) (microsoft#2161) * fix issue microsoft#860 for mismatch in prompts and input * fix format --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Chunker factory (microsoft#2156) * Delete NoopTextSplitter * Delete unused check_token_limit * Add base chunking factory and migrate workflow to use it * Split apart chunker module * Co-locate chunking/splitting * Collapse token splitting functionality into one class/function * Restore create_base_text_units parameterization * Move Tokenizer base class to common package * Move pre-pending into chunkers * Streamline config * Fix 
defaults construction * Add prepending tests * Remove chunk_size_includes_metadata config * Revert ChunkingDocument interface * Move metadata prepending to a util * Move Tokenizer back to GR core * Fix tokenizer removal from chunker * Set defaults for chunking config * Move chunking to monorepo package * Format * Typo * Add ChunkResult model * Streamline chunking config * Add missing version updates for graphrag_chunking * Input factory (microsoft#2168) * Update input factory to match other factories * Move input config alongside input readers * Move file pattern logic into InputReader * Set encoding default * Clean up optional column configs * Combine structured data extraction * Remove pandas from input loading * Throw if empty documents * Add json lines (jsonl) input support * Store raw data * Fix merge imports * Move metadata handling entirely to chunking * Nicer automatic title * Typo * Add get_property utility for nested dictionary access with dot notation * Update structured_file_reader to use get_property utility * Extract input module into new graphrag-input monorepo package - Create new graphrag-input package with input loading utilities - Move InputConfig, InputFileType, InputReader, TextDocument, and file readers (CSV, JSON, JSONL, Text) - Add get_property utility for nested dictionary access with dot notation - Include hashing utility for document ID generation - Update all imports throughout codebase to use graphrag_input - Add package to workspace configuration and release tasks - Remove old graphrag.index.input module * Rename ChunkResult to TextChunk and add transformer support - Rename chunk_result.py to text_chunk.py with ChunkResult -> TextChunk - Add 'original' field to TextChunk to track pre-transform text - Add optional transform callback to chunker.chunk() method - Add add_metadata transformer for prepending metadata to chunks - Update create_chunk_results to apply transforms and populate original - Update sentence_chunker and token_chunker 
with transform support - Refactor create_base_text_units to use new transformer pattern - Rename pluck_metadata to get/collect methods on TextDocument * Back-compat comment * Align input config type name with other factory configs * Add MarkItDown support * Remove pattern default from MarkItDown reader * Remove plugins flag (implicit disabled) * Format * Update verb tests * Separate storage from input config * Add empty objects for NaN raw_data * Fix smoke tests * Fix BOM in csv smoke * Format * DRIFT fixes (microsoft#2171) * Use stable ids for community reports * Remove deprecated title from embedding flow * Remove embedding column from df loaders * Fix lancedb insertion * Add drift back to smoke tests * Fix mock embedder to match default embedding length * Fix DRIFT notebook * Push drift_k_followups through to prompt * Format * Vector package (microsoft#2172) * Extract graphrag-vectors package * Simplify vector factory usage and config defaults * Update factory integ initializers * Fix mock patch * Format * Register vector stores in tests * Set a default vector store name * Update vector readme * Remove impls from init * Move some validation into impls * Remove index_prefix * Move duplicate method to base class * Fix smoke vector config * Update index bug (microsoft#2173) * fix update index bug * blob storage bug fix --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Add GraphRAG LLM package. (microsoft#2174) * Update documentation for v3 release (microsoft#2176) update documentation for v3 release Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Graphrag llm cleanup (microsoft#2181) * Migration update (microsoft#2180) * fix formatting. 
--------- Co-authored-by: Nathan Evans <github@talkswithnumbers.com> Co-authored-by: gaudyb <85708998+gaudyb@users.noreply.github.com> Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> Co-authored-by: Andres Morales <86074752+andresmor-ms@users.noreply.github.com> * Release v3.0.0 (microsoft#2191) * Fix deps (microsoft#2193) * fix missing project urls * fix missing deps. * Release v3.0.1 (microsoft#2195) * add TableProvider to enable future row-by-row streaming (microsoft#2189) * write dataframe * changed some workflows * 1a * add fixed files * add versioning * add patch and remove utility * pr changes * Python 3.13 (microsoft#2208) * make graphrag-llm support 3.13 * Semver --------- Co-authored-by: Deo <liangzhanzhao@metrodata.cn> Co-authored-by: Zhanzhao (Deo) Liang <liangzhanzhao1985@gmail.com> * update vector store example. (microsoft#2202) * Table factory (microsoft#2214) * Add table provider factory * Semver * Remove unnecessary response format check. (microsoft#2213) - Fixes: microsoft#2203 * add csv table provider (microsoft#2215) * add csv table provider * add in provider * add semver * change list_tables to list() * Add DataReader class for typed dataframe loading (microsoft#2220) * Add DataReader class for typed dataframe loading Introduce DataReader that wraps TableProvider and applies type coercion functions when loading dataframes from weakly-typed formats (e.g. CSV). 
- Add DataReader class with methods for each table type: entities, relationships, communities, community_reports, covariates, text_units, and documents - Add typed loading functions in dfs.py for community_reports, covariates, text_units, and documents (entities, relationships, communities already existed) - Integrate DataReader into all 17 indexing workflows replacing raw read_dataframe calls - Integrate DataReader into CLI query's _resolve_output_files for typed loading across all search types (global, local, drift, basic) - Export DataReader from data_model package __init__ * Fix column check * Add notebook example support for each package (microsoft#2205) * add notebook example support for each package * add notebook example support for each package * semversioner change * feedback implemented for notebooks * feedback implemented for notebooks * feedback implemented for notebooks --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Streamline workflows (microsoft#2225) * Move document ID, human_readable_id, and raw_data setup from create_final_documents into load workflows Consolidates core document field initialization (id string cast, human_readable_id index, raw_data default) into load_input_documents and load_update_documents so that create_final_documents only handles the text unit join. Also applies the same setup in the run_pipeline input_documents bypass paths. * Remove overzealous input document assignment * Semver * Format * Add async iterator support to InputReader and use in load workflows (microsoft#2226) * Add async iterator support to InputReader and use in load workflows InputReader now implements __aiter__ so it can be used as `async for doc in reader`. The core iteration logic is in _iterate_files(), and read_files() delegates to the iterator for batch loading. Both load_input_documents and load_update_documents workflows now use the async iterator with dataclasses.asdict for DataFrame construction. 
* Format * add memory profiling (microsoft#2227) * add profiling * add unit test for profiling * fix property name --------- Co-authored-by: Nathan Evans <github@talkswithnumbers.com> Co-authored-by: Derek Worthen <worthend.derek@gmail.com> Co-authored-by: Alonso Guevara <alonsog@microsoft.com> Co-authored-by: gaudyb <85708998+gaudyb@users.noreply.github.com> Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> Co-authored-by: Andres Morales <86074752+andresmor-ms@users.noreply.github.com> Co-authored-by: Dayenne Souza <ddesouza@microsoft.com> Co-authored-by: Deo <liangzhanzhao@metrodata.cn> Co-authored-by: Zhanzhao (Deo) Liang <liangzhanzhao1985@gmail.com> * update with review * update with review --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: Nathan Evans <github@talkswithnumbers.com> Co-authored-by: Derek Worthen <worthend.derek@gmail.com> Co-authored-by: Alonso Guevara <alonsog@microsoft.com> Co-authored-by: gaudyb <85708998+gaudyb@users.noreply.github.com> Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> Co-authored-by: Andres Morales <86074752+andresmor-ms@users.noreply.github.com> Co-authored-by: Dayenne Souza <ddesouza@microsoft.com> Co-authored-by: Deo <liangzhanzhao@metrodata.cn> Co-authored-by: Zhanzhao (Deo) Liang <liangzhanzhao1985@gmail.com>
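The "get_property utility for nested dictionary access with dot notation" listed in the commit message above can be illustrated with a short sketch. The signature and the choice to raise `KeyError` on a missing segment are assumptions made for illustration; the actual helper in graphrag-input may behave differently.

```python
from typing import Any


def get_property(data: dict[str, Any], path: str) -> Any:
    """Resolve a dot-separated path such as "meta.author.name" in nested dicts.

    Hypothetical sketch: raises KeyError if any segment is missing or if an
    intermediate value is not a dictionary.
    """
    current: Any = data
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            raise KeyError(f"Property '{path}' not found")
        current = current[key]
    return current
```

For example, `get_property({"meta": {"author": {"name": "Ada"}}}, "meta.author.name")` walks three levels and returns the innermost value. A path with no dots is an ordinary top-level lookup.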
* Pin pandas (microsoft#2179) * Release v2.7.1 (microsoft#2186) * Release v2.7.1 (microsoft#2187) * Update Python publish workflow for PyPI (microsoft#2188) Debug publish workflow * V3/main (microsoft#2190) * Remove graph embedding and UMAP (microsoft#2048) * Remove umap/layout operation * Remove graph embedding * Bump unified-search to GR 2.5.0 * Remove graph vis from unified-search * Remove file filtering (microsoft#2050) * Remove document filtering * Semver * Fix integ tests * Fix file find tuple * Fix another dangling find tuple * Remove text unit grouping (microsoft#2052) * Remove text unit group_by_columns * Semver * Fix default token split test * Fix models in config test samples * Fix token length in context sort test * Fix document sort * Re-implement hierarchical Leiden (microsoft#2049) * Use graspologic-native hierarchical leiden * Re-implement largest_connected_component * Copy in modularity * Use graspologic-native directly in pyproject * Remove directed graph tests (we don't use this) * Semver * Remove graspologic dep * Use 4.1 and text-embedding-3-large as defaults * Update comment * Clean vector store (microsoft#2077) * clean vector store code * fix * fix launch.json --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Update v3/main missing config + functions (microsoft#2082) * reduce schema fields (microsoft#2089) * reduce schema fields * fix launch.json --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Remove strategy dicts (microsoft#2090) * Remove "strategy" from community reports config/workflow * Remove extraction strategy from extract_graph * Remove summarization strategy from extract_graph * Remove strategy from claim extraction * Strongly type prompt templates * Remove strategy from embed_text * Push hydrated params into community report workflows * Push hydrated params into extract covariates * Push hydrated params into extract graph NLP * Push hydrated 
params into extract graph * Push hydrated params into text embeddings * Remove a few more low-level defaults * Semver * Remove configurable prompt delimiters * Update smoke tests * Remove fnllm (microsoft#2095) * Sort deps alpha * Remove multi search (microsoft#2093) * Remove multi-search from CLI * Remove multi-search from API * Flatten vector_store config * Push hydrated vector store down to embed_text * Remove outputs from config * Remove multi-search notebook/docs * Add missing response_type in basic search API * Fix basic search context and id mapping * Fix v1 migration notebook * Fix query entity search tests * V3 docs and cleanup (microsoft#2100) * Remove community contrib notebooks * Add migration notebook and breaking changes page edits * Update/polish docs * Make model instance name configurable * Add vector schema updates to v3 migration notebook * Spellcheck * Bump smoke test runtimes * Remove document overwrite (microsoft#2101) * remove document overwrite from vector store configuration * remove document overwrite and refactor load documents method * fix test * fix test * fix test --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Unified factory (microsoft#2105) * Simplify Factory interface * Migrate CacheFactory to standard base class * Migrate LoggerFactory to standard base class * Migrate StorageFactory to standard base class * Migrate VectorStoreFactory to standard base class * Update vector store example notebook * Delete notebook outputs * Move default providers into factories * Move retry/limit tests into integ * Split language model factories * Set smoke test tpm/rpm * Fix factory integ tests * Add method to smoke test, switch text to 'fast' * Fix text smoke config for fast workflow * Add new workflows to text smoke test * Convert input readers to a proper factory * Remove covariates from fast smoke test * Update docs for input factory * Bump smoke runtime * Even longer runtime * min-csv timeout * 
Remove unnecessary lambdas * Prefix vector store (microsoft#2106) * add prefix to vector store configuration and removal of container name * docs updated * change prefix property name * change prefix property name * feedback implemented --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * fix for container name * Restructure project as monorepo. (microsoft#2111) * Restructure project as monorepo. * Fix formatting * Storage fixes and cleanup (microsoft#2118) * Fix pipeline recursion * Remove base_dir from storage.find * Remove max_count from storage.find * Remove prefix on storage integ test * Add base_dir in creation_date test * Wrap base_dir in Path * Use constants for input/update directories * Nov 2025 housekeeping (microsoft#2120) * Remove gensim sideload * Split CI build/type checks from unit tests * Thorough review of docs to align with v3 * Format * Fix version * Fix type * Graphrag config (microsoft#2119) * Add load_config to graphrag-common package. * Empty graph guards (microsoft#2126) * Remove networkx from graph_extractor and clean out redundancy * Bubble pipeline error to console * Remove embeddings optional new (microsoft#2128) * remove optional embeddings * fix test * fix tests * fix pipeline * fix test * fix test * fix test * fix tests --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Format * Add empty checks for NLP graphs (microsoft#2133) * Init command asks for models (microsoft#2137) * Add init prompting for models * Remove hard-coded model config validation * Switch to typer option prompt for full CLI use with models * Update getting started for init model input * Bump request timeout and overall smoke test timeout * Add graphrag-storage. (microsoft#2127) * Add graphrag-storage. 
* Python update (3.13) (microsoft#2149) * Update to python 3.14 as default, with range down to 3.10 * Fix enum value in query cli * Update pyarrow * Update py version for storage package * Remove 3.10 * add fastuuid * Update Python support to 3.11-3.14 with stricter dependency constraints - Set minimum Python version to 3.11 (removed 3.10 support) - Added support for Python 3.14 - Updated CI workflows: single-version jobs use 3.14, matrix jobs use 3.11 and 3.14 - Fixed license format to use SPDX-compatible format for Python 3.14 - Updated pyarrow to >=22.0.0 for Python 3.14 wheel support - Added explicit fastuuid~=0.14 and blis~=1.3 for Python 3.14 compatibility - Replaced all loose version constraints (>=) with compatible release (~=) for better lock file control - Applied stricter versioning to all packages: graphrag, graphrag-common, graphrag-storage, unified-search-app * update uv lock * Pin blis to ~=1.3.3 to ensure Python 3.14 wheel availability * Update uv lock * Update numpy to >=2.0.0 for Python 3.14 Windows compatibility Numpy 1.25.x has access violation issues on Python 3.14 Windows. Numpy 2.x has proper Python 3.14 support including Windows wheels. * update uv lock * Update pandas to >=2.3.0 for numpy 2.x compatibility Pandas 2.2.x was compiled against numpy 1.x and causes ABI incompatibility errors with numpy 2.x. Pandas 2.3.0+ supports numpy 2.x properly. * update uv.lock * Add scipy>=1.15.0 for numpy 2.x compatibility Scipy versions < 1.15.0 have C extensions built against numpy 1.x and are incompatible with numpy 2.x, causing dtype size errors. 
* update uv lock * Update Python support to 3.11-3.13 with compatible dependencies - Set Python version range to 3.11-3.13 (removed 3.14 support) - Updated CI workflows: single-version jobs use 3.13, matrix jobs use 3.11 and 3.13 - Dependencies optimized for Python 3.13 compatibility: - pyarrow~=22.0 (has Python 3.13 wheels) - numpy~=1.26 - pandas~=2.2 - blis~=1.0 - fastuuid~=0.13 - Applied stricter version constraints using ~= operator throughout - Updated uv.lock with resolved dependencies * Update numpy to 2.1+ and pandas to 2.3+ for Python 3.13 Windows compatibility Numpy 1.26.x causes access violations on Python 3.13 Windows. Numpy 2.1+ has proper Python 3.13 support with Windows wheels. Pandas 2.3+ is required for numpy 2.x compatibility. * update vsts.yml python version * Add GraphRAG Cache package. (microsoft#2153) * Add GraphRAG Cache package. * Fix a bunch of module comments and function visibility (microsoft#2154) * Issue microsoft#2004 fix (microsoft#2159) * fix issue microsoft#2004 using KeenhoChu idea in his PR * add unit test for dynamic community selection * add unit test for dynamic community selection implementing microsoft#2158 logic --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Mismatch between header in community report generation prompt examples and input data (id vs human_readable_id) (microsoft#2161) * fix issue microsoft#860 for mismatch in prompts and input * fix format --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Chunker factory (microsoft#2156) * Delete NoopTextSplitter * Delete unused check_token_limit * Add base chunking factory and migrate workflow to use it * Split apart chunker module * Co-locate chunking/splitting * Collapse token splitting functionality into one class/function * Restore create_base_text_units parameterization * Move Tokenizer base class to common package * Move pre-pending into chunkers * Streamline config * Fix 
defaults construction * Add prepending tests * Remove chunk_size_includes_metadata config * Revert ChunkingDocument interface * Move metadata prepending to a util * Move Tokenizer back to GR core * Fix tokenizer removal from chunker * Set defaults for chunking config * Move chunking to monorepo package * Format * Typo * Add ChunkResult model * Streamline chunking config * Add missing version updates for graphrag_chunking * Input factory (microsoft#2168) * Update input factory to match other factories * Move input config alongside input readers * Move file pattern logic into InputReader * Set encoding default * Clean up optional column configs * Combine structured data extraction * Remove pandas from input loading * Throw if empty documents * Add json lines (jsonl) input support * Store raw data * Fix merge imports * Move metadata handling entirely to chunking * Nicer automatic title * Typo * Add get_property utility for nested dictionary access with dot notation * Update structured_file_reader to use get_property utility * Extract input module into new graphrag-input monorepo package - Create new graphrag-input package with input loading utilities - Move InputConfig, InputFileType, InputReader, TextDocument, and file readers (CSV, JSON, JSONL, Text) - Add get_property utility for nested dictionary access with dot notation - Include hashing utility for document ID generation - Update all imports throughout codebase to use graphrag_input - Add package to workspace configuration and release tasks - Remove old graphrag.index.input module * Rename ChunkResult to TextChunk and add transformer support - Rename chunk_result.py to text_chunk.py with ChunkResult -> TextChunk - Add 'original' field to TextChunk to track pre-transform text - Add optional transform callback to chunker.chunk() method - Add add_metadata transformer for prepending metadata to chunks - Update create_chunk_results to apply transforms and populate original - Update sentence_chunker and token_chunker 
with transform support - Refactor create_base_text_units to use new transformer pattern - Rename pluck_metadata to get/collect methods on TextDocument * Back-compat comment * Align input config type name with other factory configs * Add MarkItDown support * Remove pattern default from MarkItDown reader * Remove plugins flag (implicit disabled) * Format * Update verb tests * Separate storage from input config * Add empty objects for NaN raw_data * Fix smoke tests * Fix BOM in csv smoke * Format * DRIFT fixes (microsoft#2171) * Use stable ids for community reports * Remove deprecated title from embedding flow * Remove embedding column from df loaders * Fix lancedb insertion * Add drift back to smoke tests * Fix mock embedder to match default embedding length * Fix DRIFT notebook * Push drift_k_followups through to prompt * Format * Vector package (microsoft#2172) * Extract graphrag-vectors package * Simplify vector factory usage and config defaults * Update factory integ initializers * Fix mock patch * Format * Register vector stores in tests * Set a default vector store name * Update vector readme * Remove impls from init * Move some validation into impls * Remove index_prefix * Move duplicate method to base class * Fix smoke vector config * Update index bug (microsoft#2173) * fix update index bug * blob storage bug fix --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Add GraphRAG LLM package. (microsoft#2174) * Update documentation for v3 release (microsoft#2176) update documentation for v3 release Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Graphrag llm cleanup (microsoft#2181) * Migration update (microsoft#2180) * fix formatting. 
--------- Co-authored-by: Nathan Evans <github@talkswithnumbers.com> Co-authored-by: gaudyb <85708998+gaudyb@users.noreply.github.com> Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> Co-authored-by: Andres Morales <86074752+andresmor-ms@users.noreply.github.com> * Release v3.0.0 (microsoft#2191) * Fix deps (microsoft#2193) * fix missing project urls * fix missing deps. * Release v3.0.1 (microsoft#2195) * add TableProvider to enable future row-by-row streaming (microsoft#2189) * write dataframe * changed some workflows * 1a * add fixed files * add versioning * add patch and remove utility * pr changes * Python 3.13 (microsoft#2208) * make graphrag-llm supports 3.13 * Semver --------- Co-authored-by: Deo <liangzhanzhao@metrodata.cn> Co-authored-by: Zhanzhao (Deo) Liang <liangzhanzhao1985@gmail.com> * update vector store example. (microsoft#2202) * Table factory (microsoft#2214) * Add table provider factory * Semver * Remove unnecessary response format check. (microsoft#2213) - Fixes: microsoft#2203 * add csv table provider (microsoft#2215) * add csv table provider * add in provider * add semver * change list_tables to list() * Add DataReader class for typed dataframe loading (microsoft#2220) * Add DataReader class for typed dataframe loading Introduce DataReader that wraps TableProvider and applies type coercion functions when loading dataframes from weakly-typed formats (e.g. CSV). 
- Add DataReader class with methods for each table type: entities, relationships, communities, community_reports, covariates, text_units, and documents - Add typed loading functions in dfs.py for community_reports, covariates, text_units, and documents (entities, relationships, communities already existed) - Integrate DataReader into all 17 indexing workflows replacing raw read_dataframe calls - Integrate DataReader into CLI query's _resolve_output_files for typed loading across all search types (global, local, drift, basic) - Export DataReader from data_model package __init__ * Fix column check * Add notebook example support for each package (microsoft#2205) * add notebook example support for each package * add notebook example support for each package * semversioner change * feedback implemented for notebooks * feedback implemented for notebooks * feedback implemented for notebooks --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Streamline workflows (microsoft#2225) * Move document ID, human_readable_id, and raw_data setup from create_final_documents into load workflows Consolidates core document field initialization (id string cast, human_readable_id index, raw_data default) into load_input_documents and load_update_documents so that create_final_documents only handles the text unit join. Also applies the same setup in the run_pipeline input_documents bypass paths. * Remove overzealous input document assignment * Semver * Format * Add async iterator support to InputReader and use in load workflows (microsoft#2226) * Add async iterator support to InputReader and use in load workflows InputReader now implements __aiter__ so it can be used as `async for doc in reader`. The core iteration logic is in _iterate_files(), and read_files() delegates to the iterator for batch loading. Both load_input_documents and load_update_documents workflows now use the async iterator with dataclasses.asdict for DataFrame construction. 
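A minimal sketch of the async-iterator pattern described in the commit message above, assuming a simplified reader. `SketchReader`, the `TextDocument` fields used here, and `load_rows` are hypothetical stand-ins for illustration and are not the actual graphrag-input implementation; only the names `__aiter__`, `_iterate_files`, `read_files`, and `dataclasses.asdict` come from the commit text.

```python
import asyncio
from collections.abc import AsyncIterator
from dataclasses import asdict, dataclass


@dataclass
class TextDocument:
    """Hypothetical stand-in for the real document type."""

    id: str
    title: str
    text: str


class SketchReader:
    """Illustrative reader only; not the real InputReader."""

    def __init__(self, files: dict[str, str]) -> None:
        self._files = files

    async def _iterate_files(self) -> AsyncIterator[TextDocument]:
        # Core iteration logic: yield one document per file.
        for index, (name, text) in enumerate(self._files.items()):
            await asyncio.sleep(0)  # placeholder for real async file I/O
            yield TextDocument(id=str(index), title=name, text=text)

    def __aiter__(self) -> AsyncIterator[TextDocument]:
        # Enables `async for doc in reader`.
        return self._iterate_files()

    async def read_files(self) -> list[TextDocument]:
        # Batch loading delegates to the iterator.
        return [doc async for doc in self]


async def load_rows(reader: SketchReader) -> list[dict[str, str]]:
    # Convert each dataclass to a plain dict, ready for DataFrame construction.
    rows = []
    async for doc in reader:
        rows.append(asdict(doc))
    return rows


rows = asyncio.run(load_rows(SketchReader({"a.txt": "alpha", "b.txt": "beta"})))
```

The resulting list of dicts could be passed to a DataFrame constructor; that step is left out here to keep the sketch free of third-party dependencies.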
* Format * add memory profiling (microsoft#2227) * add profiling * add unit test for profiling * fix property name --------- Co-authored-by: Nathan Evans <github@talkswithnumbers.com> Co-authored-by: Derek Worthen <worthend.derek@gmail.com> Co-authored-by: Alonso Guevara <alonsog@microsoft.com> Co-authored-by: gaudyb <85708998+gaudyb@users.noreply.github.com> Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> Co-authored-by: Andres Morales <86074752+andresmor-ms@users.noreply.github.com> Co-authored-by: Dayenne Souza <ddesouza@microsoft.com> Co-authored-by: Deo <liangzhanzhao@metrodata.cn> Co-authored-by: Zhanzhao (Deo) Liang <liangzhanzhao1985@gmail.com>
* Remove graph embedding and UMAP (microsoft#2048) * Remove umap/layout operation * Remove graph embedding * Bump unified-search to GR 2.5.0 * Remove graph vis from unified-search * Remove file filtering (microsoft#2050) * Remove document filtering * Semver * Fix integ tests * Fix file find tuple * Fix another dangling find tuple * Remove text unit grouping (microsoft#2052) * Remove text unit group_by_columns * Semver * Fix default token split test * Fix models in config test samples * Fix token length in context sort test * Fix document sort * Re-implement hierarchical Leiden (microsoft#2049) * Use graspologic-native hierarchical leiden * Re-implement largest_connected_component * Copy in modularity * Use graspologic-native directly in pyproject * Remove directed graph tests (we don't use this) * Semver * Remove graspologic dep * Use 4.1 and text-embedding-3-large as defaults * Update comment * Clean vector store (microsoft#2077) * clean vector store code * fix * fix launch.json --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Update v3/main missing config + functions (microsoft#2082) * reduce schema fields (microsoft#2089) * reduce schema fields * fix launch.json --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Remove strategy dicts (microsoft#2090) * Remove "strategy" from community reports config/workflow * Remove extraction strategy from extract_graph * Remove summarization strategy from extract_graph * Remove strategy from claim extraction * Strongly type prompt templates * Remove strategy from embed_text * Push hydrated params into community report workflows * Push hyrdated params into extract covariates * Push hydrated params into extract graph NLP * Push hydrated params into extract graph * Push hydrated params into text embeddings * Remove a few more low-level defaults * Semver * Remove configurable prompt delimiters * Update smoke tests * Remove fnllm 
(microsoft#2095) * Sort deps alpha * Remove multi search (microsoft#2093) * Remove multi-search from CLI * Remove multi-search from API * Flatten vector_store config * Push hydrated vector store down to embed_text * Remove outputs from config * Remove multi-search notebook/docs * Add missing response_type in basic search API * Fix basic search context and id mapping * Fix v1 migration notebook * Fix query entity search tests * V3 docs and cleanup (microsoft#2100) * Remove community contrib notebooks * Add migration notebook and breaking changes page edits * Update/polish docs * Make model instance name configurable * Add vector schema updates to v3 migration notebook * Spellcheck * Bump smoke test runtimes * Remove document overwrite (microsoft#2101) * remove document overwrite from vector store configuration * remove document overwrite and refactor load documents method * fix test * fix test * fix test --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Unified factory (microsoft#2105) * Simplify Factory interface * Migrate CacheFactory to standard base class * Migrate LoggerFactory to standard base class * Migrate StorageFactory to standard base class * Migrate VectorStoreFactory to standard base class * Update vector store example notebook * Delete notebook outputs * Move default providers into factories * Move retry/limit tests into integ * Split language model factories * Set smoke test tpm/rpm * Fix factory integ tests * Add method to smoke test, switch text to 'fast' * Fix text smoke config for fast workflow * Add new workflows to text smoke test * Convert input readers to a proper factory * Remove covariates from fast smoke test * Update docs for input factory * Bump smoke runtime * Even longer runtime * min-csv timeout * Remove unnecessary lambdas * Prefix vector store (microsoft#2106) * add prefix to vector store configuration and removal of container name * docs updated * change prefix property name * change 
prefix property name * feedback implemented --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * fix for container name * Restructure project as monorepo. (microsoft#2111) * Restructure project as monorepo. * Fix formatting * Storage fixes and cleanup (microsoft#2118) * Fix pipeline recursion * Remove base_dir from storage.find * Remove max_count from storage.find * Remove prefix on storage integ test * Add base_dir in creation_date test * Wrap base_dir in Path * Use constants for input/update directories * Nov 2025 housekeeping (microsoft#2120) * Remove gensim sideload * Split CI build/type checks from unit tests * Thorough review of docs to align with v3 * Format * Fix version * Fix type * Graphrag config (microsoft#2119) * Add load_config to graphrag-common package. * Empty graph guards (microsoft#2126) * Remove networkx from graph_extractor and clean out redundancy * Bubble pipeline error to console * Remove embeddings optional new (microsoft#2128) * remove optional embeddings * fix test * fix tests * fix pipeline * fix test * fix test * fix test * fix tests --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Format * Add empty checks for NLP graphs (microsoft#2133) * Init command asks for models (microsoft#2137) * Add init prompting for models * Remove hard-coded model config validation * Switch to typer option prompt for full CLI use with models * Update getting started for init model input * Bump request timeout and overall smoke test timeout * Add graphrag-storage. (microsoft#2127) * Add graphrag-storage. 
* Python update (3.13) (microsoft#2149) * Update to python 3.14 as default, with range down to 3.10 * Fix enum value in query cli * Update pyarrow * Update py version for storage package * Remove 3.10 * add fastuuid * Update Python support to 3.11-3.14 with stricter dependency constraints - Set minimum Python version to 3.11 (removed 3.10 support) - Added support for Python 3.14 - Updated CI workflows: single-version jobs use 3.14, matrix jobs use 3.11 and 3.14 - Fixed license format to use SPDX-compatible format for Python 3.14 - Updated pyarrow to >=22.0.0 for Python 3.14 wheel support - Added explicit fastuuid~=0.14 and blis~=1.3 for Python 3.14 compatibility - Replaced all loose version constraints (>=) with compatible release (~=) for better lock file control - Applied stricter versioning to all packages: graphrag, graphrag-common, graphrag-storage, unified-search-app * update uv lock * Pin blis to ~=1.3.3 to ensure Python 3.14 wheel availability * Update uv lock * Update numpy to >=2.0.0 for Python 3.14 Windows compatibility Numpy 1.25.x has access violation issues on Python 3.14 Windows. Numpy 2.x has proper Python 3.14 support including Windows wheels. * update uv lock * Update pandas to >=2.3.0 for numpy 2.x compatibility Pandas 2.2.x was compiled against numpy 1.x and causes ABI incompatibility errors with numpy 2.x. Pandas 2.3.0+ supports numpy 2.x properly. * update uv.lock * Add scipy>=1.15.0 for numpy 2.x compatibility Scipy versions < 1.15.0 have C extensions built against numpy 1.x and are incompatible with numpy 2.x, causing dtype size errors. 
* update uv lock * Update Python support to 3.11-3.13 with compatible dependencies - Set Python version range to 3.11-3.13 (removed 3.14 support) - Updated CI workflows: single-version jobs use 3.13, matrix jobs use 3.11 and 3.13 - Dependencies optimized for Python 3.13 compatibility: - pyarrow~=22.0 (has Python 3.13 wheels) - numpy~=1.26 - pandas~=2.2 - blis~=1.0 - fastuuid~=0.13 - Applied stricter version constraints using ~= operator throughout - Updated uv.lock with resolved dependencies * Update numpy to 2.1+ and pandas to 2.3+ for Python 3.13 Windows compatibility Numpy 1.26.x causes access violations on Python 3.13 Windows. Numpy 2.1+ has proper Python 3.13 support with Windows wheels. Pandas 2.3+ is required for numpy 2.x compatibility. * update vsts.yml python version * Add GraphRAG Cache package. (microsoft#2153) * Add GraphRAG Cache package. * Fix a bunch of module comments and function visibility (microsoft#2154) * Issue microsoft#2004 fix (microsoft#2159) * fix issue microsoft#2004 using KeenhoChu idea in his PR * add unit test for dynamic community selection * add unit test for dynamic community selection implementing microsoft#2158 logic --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Mismatch between header in community report generation prompt examples and input data (id vs human_readable_id) (microsoft#2161) * fix issue microsoft#860 for mismatch in prompts and input * fix format --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Chunker factory (microsoft#2156) * Delete NoopTextSplitter * Delete unused check_token_limit * Add base chunking factory and migrate workflow to use it * Split apart chunker module * Co-locate chunking/splitting * Collapse token splitting functionality into one class/function * Restore create_base_text_units parameterization * Move Tokenizer base class to common package * Move pre-pending into chunkers * Streamline config * Fix 
defaults construction
Add GraphRAG LLM package.