Add GraphRAG LLM package. by dworthen · Pull Request #2174 · microsoft/graphrag

dworthen · 2026-01-14T13:13:49Z

Add GraphRAG LLM package.

Copilot

Pull request overview

This PR introduces the GraphRAG LLM package, which consolidates and refactors the language model infrastructure. The changes replace the existing graphrag.language_model module with a new graphrag-llm package, standardizing the interface for completion and embedding models across the codebase.

Changes:

* Removed the existing language model implementation and replaced it with the new graphrag-llm package
* Updated configuration models to use `completion_model_id` and `embedding_model_id` instead of the generic `model_id`
* Refactored all workflows and operations to use the new LLM interfaces (`LLMCompletion` and `LLMEmbedding`)
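
For readers updating their own settings, the field rename listed above can be sketched as a small migration helper. This is only an illustration under stated assumptions: the helper name `migrate_model_id`, the `EMBEDDING_SECTIONS` set, and the choice of which sections count as embedding workflows are hypothetical and are not taken from the graphrag or graphrag-llm packages.

```python
from typing import Any

# Assumption: sections listed here use an embedding model; every other
# section is treated as using a completion model.
EMBEDDING_SECTIONS = {"embed_text"}


def migrate_model_id(section_name: str, section: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `section` with the legacy `model_id` key renamed."""
    migrated = dict(section)  # never mutate the caller's dict
    if "model_id" not in migrated:
        return migrated
    new_key = (
        "embedding_model_id"
        if section_name in EMBEDDING_SECTIONS
        else "completion_model_id"
    )
    legacy_value = migrated.pop("model_id")
    # Keep an explicitly set new-style key if one is already present.
    migrated.setdefault(new_key, legacy_value)
    return migrated


old_settings = {
    "extract_graph": {"model_id": "default_chat_model"},
    "embed_text": {"model_id": "default_embedding_model"},
}
new_settings = {
    name: migrate_model_id(name, section) for name, section in old_settings.items()
}
```

The helper copies each section rather than editing it in place, so the original settings stay available for comparison while migrating.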

Reviewed changes

Copilot reviewed 257 out of 260 changed files in this pull request and generated 4 comments.


| File | Description |
| --- | --- |
| `tests/verbs/util.py` | Removed deprecated model configuration constants |
| `tests/verbs/test_*.py` | Updated test files to use new configuration utility functions |
| `tests/unit/config/utils.py` | Added new model config assertion helpers and updated default configs |
| `tests/unit/config/test_config.py` | Removed validation tests that are now handled by the LLM package |
| `packages/graphrag/graphrag/config/models/*_config.py` | Updated field names from `model_id` to `completion_model_id`/`embedding_model_id` |
| `packages/graphrag/graphrag/index/workflows/*.py` | Refactored to use new LLM creation functions and interfaces |
| `packages/graphrag/graphrag/query/*/.py` | Updated query components to use new LLM interfaces |
| `packages/graphrag/graphrag/language_model/**` | Removed old language model implementation |


Copilot

Pull request overview

Copilot reviewed 251 out of 265 changed files in this pull request and generated no new comments.


* Remove graph embedding and UMAP (#2048) * Remove umap/layout operation * Remove graph embedding * Bump unified-search to GR 2.5.0 * Remove graph vis from unified-search * Remove file filtering (#2050) * Remove document filtering * Semver * Fix integ tests * Fix file find tuple * Fix another dangling find tuple * Remove text unit grouping (#2052) * Remove text unit group_by_columns * Semver * Fix default token split test * Fix models in config test samples * Fix token length in context sort test * Fix document sort * Re-implement hierarchical Leiden (#2049) * Use graspologic-native hierarchical leiden * Re-implement largest_connected_component * Copy in modularity * Use graspologic-native directly in pyproject * Remove directed graph tests (we don't use this) * Semver * Remove graspologic dep * Use 4.1 and text-embedding-3-large as defaults * Update comment * Clean vector store (#2077) * clean vector store code * fix * fix launch.json --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Update v3/main missing config + functions (#2082) * reduce schema fields (#2089) * reduce schema fields * fix launch.json --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Remove strategy dicts (#2090) * Remove "strategy" from community reports config/workflow * Remove extraction strategy from extract_graph * Remove summarization strategy from extract_graph * Remove strategy from claim extraction * Strongly type prompt templates * Remove strategy from embed_text * Push hydrated params into community report workflows * Push hyrdated params into extract covariates * Push hydrated params into extract graph NLP * Push hydrated params into extract graph * Push hydrated params into text embeddings * Remove a few more low-level defaults * Semver * Remove configurable prompt delimiters * Update smoke tests * Remove fnllm (#2095) * Sort deps alpha * Remove multi search (#2093) * Remove multi-search from 
CLI * Remove multi-search from API * Flatten vector_store config * Push hydrated vector store down to embed_text * Remove outputs from config * Remove multi-search notebook/docs * Add missing response_type in basic search API * Fix basic search context and id mapping * Fix v1 migration notebook * Fix query entity search tests * V3 docs and cleanup (#2100) * Remove community contrib notebooks * Add migration notebook and breaking changes page edits * Update/polish docs * Make model instance name configurable * Add vector schema updates to v3 migration notebook * Spellcheck * Bump smoke test runtimes * Remove document overwrite (#2101) * remove document overwrite from vector store configuration * remove document overwrite and refactor load documents method * fix test * fix test * fix test --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Unified factory (#2105) * Simplify Factory interface * Migrate CacheFactory to standard base class * Migrate LoggerFactory to standard base class * Migrate StorageFactory to standard base class * Migrate VectorStoreFactory to standard base class * Update vector store example notebook * Delete notebook outputs * Move default providers into factories * Move retry/limit tests into integ * Split language model factories * Set smoke test tpm/rpm * Fix factory integ tests * Add method to smoke test, switch text to 'fast' * Fix text smoke config for fast workflow * Add new workflows to text smoke test * Convert input readers to a proper factory * Remove covariates from fast smoke test * Update docs for input factory * Bump smoke runtime * Even longer runtime * min-csv timeout * Remove unnecessary lambdas * Prefix vector store (#2106) * add prefix to vector store configuration and removal of container name * docs updated * change prefix property name * change prefix property name * feedback implemented --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * 
fix for container name * Restructure project as monorepo. (#2111) * Restructure project as monorepo. * Fix formatting * Storage fixes and cleanup (#2118) * Fix pipeline recursion * Remove base_dir from storage.find * Remove max_count from storage.find * Remove prefix on storage integ test * Add base_dir in creation_date test * Wrap base_dir in Path * Use constants for input/update directories * Nov 2025 housekeeping (#2120) * Remove gensim sideload * Split CI build/type checks from unit tests * Thorough review of docs to align with v3 * Format * Fix version * Fix type * Graphrag config (#2119) * Add load_config to graphrag-common package. * Empty graph guards (#2126) * Remove networkx from graph_extractor and clean out redundancy * Bubble pipeline error to console * Remove embeddings optional new (#2128) * remove optional embeddings * fix test * fix tests * fix pipeline * fix test * fix test * fix test * fix tests --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Format * Add empty checks for NLP graphs (#2133) * Init command asks for models (#2137) * Add init prompting for models * Remove hard-coded model config validation * Switch to typer option prompt for full CLI use with models * Update getting started for init model input * Bump request timeout and overall smoke test timeout * Add graphrag-storage. (#2127) * Add graphrag-storage. 
* Python update (3.13) (#2149) * Update to python 3.14 as default, with range down to 3.10 * Fix enum value in query cli * Update pyarrow * Update py version for storage package * Remove 3.10 * add fastuuid * Update Python support to 3.11-3.14 with stricter dependency constraints - Set minimum Python version to 3.11 (removed 3.10 support) - Added support for Python 3.14 - Updated CI workflows: single-version jobs use 3.14, matrix jobs use 3.11 and 3.14 - Fixed license format to use SPDX-compatible format for Python 3.14 - Updated pyarrow to >=22.0.0 for Python 3.14 wheel support - Added explicit fastuuid~=0.14 and blis~=1.3 for Python 3.14 compatibility - Replaced all loose version constraints (>=) with compatible release (~=) for better lock file control - Applied stricter versioning to all packages: graphrag, graphrag-common, graphrag-storage, unified-search-app * update uv lock * Pin blis to ~=1.3.3 to ensure Python 3.14 wheel availability * Update uv lock * Update numpy to >=2.0.0 for Python 3.14 Windows compatibility Numpy 1.25.x has access violation issues on Python 3.14 Windows. Numpy 2.x has proper Python 3.14 support including Windows wheels. * update uv lock * Update pandas to >=2.3.0 for numpy 2.x compatibility Pandas 2.2.x was compiled against numpy 1.x and causes ABI incompatibility errors with numpy 2.x. Pandas 2.3.0+ supports numpy 2.x properly. * update uv.lock * Add scipy>=1.15.0 for numpy 2.x compatibility Scipy versions < 1.15.0 have C extensions built against numpy 1.x and are incompatible with numpy 2.x, causing dtype size errors. 
* update uv lock * Update Python support to 3.11-3.13 with compatible dependencies - Set Python version range to 3.11-3.13 (removed 3.14 support) - Updated CI workflows: single-version jobs use 3.13, matrix jobs use 3.11 and 3.13 - Dependencies optimized for Python 3.13 compatibility: - pyarrow~=22.0 (has Python 3.13 wheels) - numpy~=1.26 - pandas~=2.2 - blis~=1.0 - fastuuid~=0.13 - Applied stricter version constraints using ~= operator throughout - Updated uv.lock with resolved dependencies * Update numpy to 2.1+ and pandas to 2.3+ for Python 3.13 Windows compatibility Numpy 1.26.x causes access violations on Python 3.13 Windows. Numpy 2.1+ has proper Python 3.13 support with Windows wheels. Pandas 2.3+ is required for numpy 2.x compatibility. * update vsts.yml python version * Add GraphRAG Cache package. (#2153) * Add GraphRAG Cache package. * Fix a bunch of module comments and function visibility (#2154) * Issue #2004 fix (#2159) * fix issue #2004 using KeenhoChu idea in his PR * add unit test for dynamic community selection * add unit test for dynamic community selection implementing #2158 logic --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Mismatch between header in community report generation prompt examples and input data (id vs human_readable_id) (#2161) * fix issue #860 for mismatch in prompts and input * fix format --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Chunker factory (#2156) * Delete NoopTextSplitter * Delete unused check_token_limit * Add base chunking factory and migrate workflow to use it * Split apart chunker module * Co-locate chunking/splitting * Collapse token splitting functionality into one class/function * Restore create_base_text_units parameterization * Move Tokenizer base class to common package * Move pre-pending into chunkers * Streamline config * Fix defaults construction * Add prepending tests * Remove chunk_size_includes_metadata 
config * Revert ChunkingDocument interface * Move metadata prepending to a util * Move Tokenizer back to GR core * Fix tokenizer removal from chunker * Set defaults for chunking config * Move chunking to monorepo package * Format * Typo * Add ChunkResult model * Streamline chunking config * Add missing version updates for graphrag_chunking * Input factory (#2168) * Update input factory to match other factories * Move input config alongside input readers * Move file pattern logic into InputReader * Set encoding default * Clean up optional column configs * Combine structured data extraction * Remove pandas from input loading * Throw if empty documents * Add json lines (jsonl) input support * Store raw data * Fix merge imports * Move metadata handling entirely to chunking * Nicer automatic title * Typo * Add get_property utility for nested dictionary access with dot notation * Update structured_file_reader to use get_property utility * Extract input module into new graphrag-input monorepo package - Create new graphrag-input package with input loading utilities - Move InputConfig, InputFileType, InputReader, TextDocument, and file readers (CSV, JSON, JSONL, Text) - Add get_property utility for nested dictionary access with dot notation - Include hashing utility for document ID generation - Update all imports throughout codebase to use graphrag_input - Add package to workspace configuration and release tasks - Remove old graphrag.index.input module * Rename ChunkResult to TextChunk and add transformer support - Rename chunk_result.py to text_chunk.py with ChunkResult -> TextChunk - Add 'original' field to TextChunk to track pre-transform text - Add optional transform callback to chunker.chunk() method - Add add_metadata transformer for prepending metadata to chunks - Update create_chunk_results to apply transforms and populate original - Update sentence_chunker and token_chunker with transform support - Refactor create_base_text_units to use new transformer pattern - 
Rename pluck_metadata to get/collect methods on TextDocument * Back-compat comment * Align input config type name with other factory configs * Add MarkItDown support * Remove pattern default from MarkItDown reader * Remove plugins flag (implicit disabled) * Format * Update verb tests * Separate storage from input config * Add empty objects for NaN raw_data * Fix smoke tests * Fix BOM in csv smoke * Format * DRIFT fixes (#2171) * Use stable ids for community reports * Remove deprecated title from embedding flow * Remove embedding column from df loaders * Fix lancedb insertion * Add drift back to smoke tests * Fix mock embedder to match default embedding length * Fix DRIFT notebook * Push drift_k_followups through to prompt * Format * Vector package (#2172) * Extract graphrag-vectors package * Simplify vector factory usage and config defaults * Update factory integ initializers * Fix mock patch * Format * Register vector stores in tests * Set a default vector store name * Update vector readme * Remove impls from init * Move some validation into impls * Remove index_prefix * Move duplicate method to base class * Fix smoke vector config * Update index bug (#2173) * fix update index bug * blob storage bug fix --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Add GraphRAG LLM package. (#2174) * Update documentation for v3 release (#2176) update documentation for v3 release Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Graphrag llm cleanup (#2181) * Migration update (#2180) * fix formatting. --------- Co-authored-by: Nathan Evans <github@talkswithnumbers.com> Co-authored-by: gaudyb <85708998+gaudyb@users.noreply.github.com> Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> Co-authored-by: Andres Morales <86074752+andresmor-ms@users.noreply.github.com>
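
The commit log above mentions a `get_property` utility for nested dictionary access with dot notation. The general idea can be sketched as follows. This is not the graphrag-input implementation: the name `get_property_sketch`, its signature, and the choice to return a default on a missing key are all assumptions made for illustration.

```python
from typing import Any


def get_property_sketch(data: dict[str, Any], path: str, default: Any = None) -> Any:
    """Walk `data` along a dot-separated `path`, e.g. "meta.author.name".

    Assumption: a missing key (or a non-dict value part-way along the path)
    yields `default`; the real utility may behave differently.
    """
    current: Any = data
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


record = {"title": "Report", "meta": {"author": {"name": "Ada"}}}
author = get_property_sketch(record, "meta.author.name")
missing = get_property_sketch(record, "meta.editor.name", default="unknown")
```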

* Pin pandas (microsoft#2179) * Release v2.7.1 (microsoft#2186) * Release v2.7.1 (microsoft#2187) * Update Python publish workflow for PyPI (microsoft#2188) Debug publish workflow * V3/main (microsoft#2190) * Remove graph embedding and UMAP (microsoft#2048) * Remove umap/layout operation * Remove graph embedding * Bump unified-search to GR 2.5.0 * Remove graph vis from unified-search * Remove file filtering (microsoft#2050) * Remove document filtering * Semver * Fix integ tests * Fix file find tuple * Fix another dangling find tuple * Remove text unit grouping (microsoft#2052) * Remove text unit group_by_columns * Semver * Fix default token split test * Fix models in config test samples * Fix token length in context sort test * Fix document sort * Re-implement hierarchical Leiden (microsoft#2049) * Use graspologic-native hierarchical leiden * Re-implement largest_connected_component * Copy in modularity * Use graspologic-native directly in pyproject * Remove directed graph tests (we don't use this) * Semver * Remove graspologic dep * Use 4.1 and text-embedding-3-large as defaults * Update comment * Clean vector store (microsoft#2077) * clean vector store code * fix * fix launch.json --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Update v3/main missing config + functions (microsoft#2082) * reduce schema fields (microsoft#2089) * reduce schema fields * fix launch.json --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Remove strategy dicts (microsoft#2090) * Remove "strategy" from community reports config/workflow * Remove extraction strategy from extract_graph * Remove summarization strategy from extract_graph * Remove strategy from claim extraction * Strongly type prompt templates * Remove strategy from embed_text * Push hydrated params into community report workflows * Push hyrdated params into extract covariates * Push hydrated params into extract graph NLP * Push hydrated 
params into extract graph * Push hydrated params into text embeddings * Remove a few more low-level defaults * Semver * Remove configurable prompt delimiters * Update smoke tests * Remove fnllm (microsoft#2095) * Sort deps alpha * Remove multi search (microsoft#2093) * Remove multi-search from CLI * Remove multi-search from API * Flatten vector_store config * Push hydrated vector store down to embed_text * Remove outputs from config * Remove multi-search notebook/docs * Add missing response_type in basic search API * Fix basic search context and id mapping * Fix v1 migration notebook * Fix query entity search tests * V3 docs and cleanup (microsoft#2100) * Remove community contrib notebooks * Add migration notebook and breaking changes page edits * Update/polish docs * Make model instance name configurable * Add vector schema updates to v3 migration notebook * Spellcheck * Bump smoke test runtimes * Remove document overwrite (microsoft#2101) * remove document overwrite from vector store configuration * remove document overwrite and refactor load documents method * fix test * fix test * fix test --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Unified factory (microsoft#2105) * Simplify Factory interface * Migrate CacheFactory to standard base class * Migrate LoggerFactory to standard base class * Migrate StorageFactory to standard base class * Migrate VectorStoreFactory to standard base class * Update vector store example notebook * Delete notebook outputs * Move default providers into factories * Move retry/limit tests into integ * Split language model factories * Set smoke test tpm/rpm * Fix factory integ tests * Add method to smoke test, switch text to 'fast' * Fix text smoke config for fast workflow * Add new workflows to text smoke test * Convert input readers to a proper factory * Remove covariates from fast smoke test * Update docs for input factory * Bump smoke runtime * Even longer runtime * min-csv timeout * 
Remove unnecessary lambdas * Prefix vector store (microsoft#2106) * add prefix to vector store configuration and removal of container name * docs updated * change prefix property name * change prefix property name * feedback implemented --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * fix for container name * Restructure project as monorepo. (microsoft#2111) * Restructure project as monorepo. * Fix formatting * Storage fixes and cleanup (microsoft#2118) * Fix pipeline recursion * Remove base_dir from storage.find * Remove max_count from storage.find * Remove prefix on storage integ test * Add base_dir in creation_date test * Wrap base_dir in Path * Use constants for input/update directories * Nov 2025 housekeeping (microsoft#2120) * Remove gensim sideload * Split CI build/type checks from unit tests * Thorough review of docs to align with v3 * Format * Fix version * Fix type * Graphrag config (microsoft#2119) * Add load_config to graphrag-common package. * Empty graph guards (microsoft#2126) * Remove networkx from graph_extractor and clean out redundancy * Bubble pipeline error to console * Remove embeddings optional new (microsoft#2128) * remove optional embeddings * fix test * fix tests * fix pipeline * fix test * fix test * fix test * fix tests --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Format * Add empty checks for NLP graphs (microsoft#2133) * Init command asks for models (microsoft#2137) * Add init prompting for models * Remove hard-coded model config validation * Switch to typer option prompt for full CLI use with models * Update getting started for init model input * Bump request timeout and overall smoke test timeout * Add graphrag-storage. (microsoft#2127) * Add graphrag-storage. 
* Python update (3.13) (microsoft#2149) * Update to python 3.14 as default, with range down to 3.10 * Fix enum value in query cli * Update pyarrow * Update py version for storage package * Remove 3.10 * add fastuuid * Update Python support to 3.11-3.14 with stricter dependency constraints - Set minimum Python version to 3.11 (removed 3.10 support) - Added support for Python 3.14 - Updated CI workflows: single-version jobs use 3.14, matrix jobs use 3.11 and 3.14 - Fixed license format to use SPDX-compatible format for Python 3.14 - Updated pyarrow to >=22.0.0 for Python 3.14 wheel support - Added explicit fastuuid~=0.14 and blis~=1.3 for Python 3.14 compatibility - Replaced all loose version constraints (>=) with compatible release (~=) for better lock file control - Applied stricter versioning to all packages: graphrag, graphrag-common, graphrag-storage, unified-search-app * update uv lock * Pin blis to ~=1.3.3 to ensure Python 3.14 wheel availability * Update uv lock * Update numpy to >=2.0.0 for Python 3.14 Windows compatibility Numpy 1.25.x has access violation issues on Python 3.14 Windows. Numpy 2.x has proper Python 3.14 support including Windows wheels. * update uv lock * Update pandas to >=2.3.0 for numpy 2.x compatibility Pandas 2.2.x was compiled against numpy 1.x and causes ABI incompatibility errors with numpy 2.x. Pandas 2.3.0+ supports numpy 2.x properly. * update uv.lock * Add scipy>=1.15.0 for numpy 2.x compatibility Scipy versions < 1.15.0 have C extensions built against numpy 1.x and are incompatible with numpy 2.x, causing dtype size errors. 
* update uv lock * Update Python support to 3.11-3.13 with compatible dependencies - Set Python version range to 3.11-3.13 (removed 3.14 support) - Updated CI workflows: single-version jobs use 3.13, matrix jobs use 3.11 and 3.13 - Dependencies optimized for Python 3.13 compatibility: - pyarrow~=22.0 (has Python 3.13 wheels) - numpy~=1.26 - pandas~=2.2 - blis~=1.0 - fastuuid~=0.13 - Applied stricter version constraints using ~= operator throughout - Updated uv.lock with resolved dependencies * Update numpy to 2.1+ and pandas to 2.3+ for Python 3.13 Windows compatibility Numpy 1.26.x causes access violations on Python 3.13 Windows. Numpy 2.1+ has proper Python 3.13 support with Windows wheels. Pandas 2.3+ is required for numpy 2.x compatibility. * update vsts.yml python version * Add GraphRAG Cache package. (microsoft#2153) * Add GraphRAG Cache package. * Fix a bunch of module comments and function visibility (microsoft#2154) * Issue microsoft#2004 fix (microsoft#2159) * fix issue microsoft#2004 using KeenhoChu idea in his PR * add unit test for dynamic community selection * add unit test for dynamic community selection implementing microsoft#2158 logic --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Mismatch between header in community report generation prompt examples and input data (id vs human_readable_id) (microsoft#2161) * fix issue microsoft#860 for mismatch in prompts and input * fix format --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Chunker factory (microsoft#2156) * Delete NoopTextSplitter * Delete unused check_token_limit * Add base chunking factory and migrate workflow to use it * Split apart chunker module * Co-locate chunking/splitting * Collapse token splitting functionality into one class/function * Restore create_base_text_units parameterization * Move Tokenizer base class to common package * Move pre-pending into chunkers * Streamline config * Fix 
defaults construction * Add prepending tests * Remove chunk_size_includes_metadata config * Revert ChunkingDocument interface * Move metadata prepending to a util * Move Tokenizer back to GR core * Fix tokenizer removal from chunker * Set defaults for chunking config * Move chunking to monorepo package * Format * Typo * Add ChunkResult model * Streamline chunking config * Add missing version updates for graphrag_chunking * Input factory (microsoft#2168) * Update input factory to match other factories * Move input config alongside input readers * Move file pattern logic into InputReader * Set encoding default * Clean up optional column configs * Combine structured data extraction * Remove pandas from input loading * Throw if empty documents * Add json lines (jsonl) input support * Store raw data * Fix merge imports * Move metadata handling entirely to chunking * Nicer automatic title * Typo * Add get_property utility for nested dictionary access with dot notation * Update structured_file_reader to use get_property utility * Extract input module into new graphrag-input monorepo package - Create new graphrag-input package with input loading utilities - Move InputConfig, InputFileType, InputReader, TextDocument, and file readers (CSV, JSON, JSONL, Text) - Add get_property utility for nested dictionary access with dot notation - Include hashing utility for document ID generation - Update all imports throughout codebase to use graphrag_input - Add package to workspace configuration and release tasks - Remove old graphrag.index.input module * Rename ChunkResult to TextChunk and add transformer support - Rename chunk_result.py to text_chunk.py with ChunkResult -> TextChunk - Add 'original' field to TextChunk to track pre-transform text - Add optional transform callback to chunker.chunk() method - Add add_metadata transformer for prepending metadata to chunks - Update create_chunk_results to apply transforms and populate original - Update sentence_chunker and token_chunker 
with transform support - Refactor create_base_text_units to use new transformer pattern - Rename pluck_metadata to get/collect methods on TextDocument * Back-compat comment * Align input config type name with other factory configs * Add MarkItDown support * Remove pattern default from MarkItDown reader * Remove plugins flag (implicit disabled) * Format * Update verb tests * Separate storage from input config * Add empty objects for NaN raw_data * Fix smoke tests * Fix BOM in csv smoke * Format * DRIFT fixes (microsoft#2171) * Use stable ids for community reports * Remove deprecated title from embedding flow * Remove embedding column from df loaders * Fix lancedb insertion * Add drift back to smoke tests * Fix mock embedder to match default embedding length * Fix DRIFT notebook * Push drift_k_followups through to prompt * Format * Vector package (microsoft#2172) * Extract graphrag-vectors package * Simplify vector factory usage and config defaults * Update factory integ initializers * Fix mock patch * Format * Register vector stores in tests * Set a default vector store name * Update vector readme * Remove impls from init * Move some validation into impls * Remove index_prefix * Move duplicate method to base class * Fix smoke vector config * Update index bug (microsoft#2173) * fix update index bug * blob storage bug fix --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Add GraphRAG LLM package. (microsoft#2174) * Update documentation for v3 release (microsoft#2176) update documentation for v3 release Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Graphrag llm cleanup (microsoft#2181) * Migration update (microsoft#2180) * fix formatting. 
--------- Co-authored-by: Nathan Evans <github@talkswithnumbers.com> Co-authored-by: gaudyb <85708998+gaudyb@users.noreply.github.com> Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> Co-authored-by: Andres Morales <86074752+andresmor-ms@users.noreply.github.com> * Release v3.0.0 (microsoft#2191) * Fix deps (microsoft#2193) * fix missing project urls * fix missing deps. * Release v3.0.1 (microsoft#2195) * add TableProvider to enable future row-by-row streaming (microsoft#2189) * write dataframe * changed some workflows * 1a * add fixed files * add versioning * add patch and remove utility * pr changes * Python 3.13 (microsoft#2208) * make graphrag-llm supports 3.13 * Semver --------- Co-authored-by: Deo <liangzhanzhao@metrodata.cn> Co-authored-by: Zhanzhao (Deo) Liang <liangzhanzhao1985@gmail.com> * update vector store example. (microsoft#2202) * Table factory (microsoft#2214) * Add table provider factory * Semver * Remove unnecessary response format check. (microsoft#2213) - Fixes: microsoft#2203 * add csv table provider (microsoft#2215) * add csv table provider * add in provider * add semver * change list_tables to list() * Add DataReader class for typed dataframe loading (microsoft#2220) * Add DataReader class for typed dataframe loading Introduce DataReader that wraps TableProvider and applies type coercion functions when loading dataframes from weakly-typed formats (e.g. CSV). 
- Add DataReader class with methods for each table type: entities, relationships, communities, community_reports, covariates, text_units, and documents - Add typed loading functions in dfs.py for community_reports, covariates, text_units, and documents (entities, relationships, communities already existed) - Integrate DataReader into all 17 indexing workflows replacing raw read_dataframe calls - Integrate DataReader into CLI query's _resolve_output_files for typed loading across all search types (global, local, drift, basic) - Export DataReader from data_model package __init__ * Fix column check * Add notebook example support for each package (microsoft#2205) * add notebook example support for each package * add notebook example support for each package * semversioner change * feedback implemented for notebooks * feedback implemented for notebooks * feedback implemented for notebooks --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Streamline workflows (microsoft#2225) * Move document ID, human_readable_id, and raw_data setup from create_final_documents into load workflows Consolidates core document field initialization (id string cast, human_readable_id index, raw_data default) into load_input_documents and load_update_documents so that create_final_documents only handles the text unit join. Also applies the same setup in the run_pipeline input_documents bypass paths. * Remove overzealous input document assignment * Semver * Format * Add async iterator support to InputReader and use in load workflows (microsoft#2226) * Add async iterator support to InputReader and use in load workflows InputReader now implements __aiter__ so it can be used as `async for doc in reader`. The core iteration logic is in _iterate_files(), and read_files() delegates to the iterator for batch loading. Both load_input_documents and load_update_documents workflows now use the async iterator with dataclasses.asdict for DataFrame construction. 
* Format * add memory profiling (microsoft#2227) * add profiling * add unit test for profiling * fix property name --------- Co-authored-by: Nathan Evans <github@talkswithnumbers.com> Co-authored-by: Derek Worthen <worthend.derek@gmail.com> Co-authored-by: Alonso Guevara <alonsog@microsoft.com> Co-authored-by: gaudyb <85708998+gaudyb@users.noreply.github.com> Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> Co-authored-by: Andres Morales <86074752+andresmor-ms@users.noreply.github.com> Co-authored-by: Dayenne Souza <ddesouza@microsoft.com> Co-authored-by: Deo <liangzhanzhao@metrodata.cn> Co-authored-by: Zhanzhao (Deo) Liang <liangzhanzhao1985@gmail.com>
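
The commit message above describes `InputReader` implementing `__aiter__`, with `_iterate_files()` holding the iteration logic, `read_files()` delegating to the iterator, and the load workflows building rows with `dataclasses.asdict`. A minimal sketch of that pattern follows. The classes below are toy stand-ins that reuse the names from the commit message. The constructor, the in-memory file mapping, the `TextDocument` fields, and the `load_rows` helper are assumptions, not the graphrag-input API.

```python
import asyncio
from collections.abc import AsyncIterator
from dataclasses import asdict, dataclass


@dataclass
class TextDocument:
    # Assumed fields, chosen only to keep the example small.
    id: str
    text: str


class InputReader:
    def __init__(self, files: dict[str, str]) -> None:
        # In-memory stand-in for a real storage backend.
        self._files = files

    async def _iterate_files(self) -> AsyncIterator[TextDocument]:
        """Core iteration logic: yield one document per file."""
        for name, text in self._files.items():
            await asyncio.sleep(0)  # stands in for real asynchronous I/O
            yield TextDocument(id=name, text=text)

    def __aiter__(self) -> AsyncIterator[TextDocument]:
        # Enables `async for doc in reader`.
        return self._iterate_files()

    async def read_files(self) -> list[TextDocument]:
        """Batch loading delegates to the async iterator."""
        return [doc async for doc in self]


async def load_rows(reader: InputReader) -> list[dict[str, str]]:
    """Hypothetical workflow step: turn streamed documents into row dicts."""
    rows = []
    async for doc in reader:
        rows.append(asdict(doc))
    return rows


reader = InputReader({"a.txt": "hello", "b.txt": "world"})
rows = asyncio.run(load_rows(reader))
docs = asyncio.run(reader.read_files())
```

Keeping the iteration logic in a single async generator means the streaming path and the batch path cannot drift apart, which appears to be the motivation given in the commit message.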

@takanori-ugai

* initial version * update version * implementation of CommunityDetection * add vector search * improve community detection * add Sample program * fix the prompts * update community summarization * improvements * Use structured AiServices * small improvements * Query (#2) * small improvements * add initial version of query part * add advanced methods * update prompts * update * add global and drift mode. * drift mode and global mode improvement * improve the query part * getting closer * small implement * parameterized * Sample query program * small improvements * add reading parquet. * update * improvement * update * CLI improvement * default values * add question generator * getting closer * get closer * Add progress logging * update based on review * improvements based on review * 📝 Add docstrings to `query` Docstrings generation was requested by @takanori-ugai. * #2 (comment) The following files were modified: * `kotlin/src/main/kotlin/com/microsoft/graphrag/SampleIndexer.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/SampleQueries.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/cli/GraphRagCli.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/index/CommunityReportWorkflow.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/index/EmbedWorkflow.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/index/ExtractGraphWorkflow.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/index/LocalVectorStore.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/index/PipelineTypes.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/index/RunPipeline.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/index/StateCodec.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/index/Workflows.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/logger/Progress.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/query/AdvancedQueryEngines.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/query/BasicQueryEngine.kt` * 
`kotlin/src/main/kotlin/com/microsoft/graphrag/query/BasicSearchContextBuilder.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/query/CollectingQueryCallbacks.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/query/ContextRecords.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/query/DriftSearchEngine.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/query/GlobalSearchEngine.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/query/LocalSearchContextBuilder.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/query/NameUtils.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/query/QueryCallbacks.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/query/QueryConfigLoader.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/query/QueryIndexLoader.kt` * `kotlin/src/main/kotlin/com/microsoft/graphrag/query/QuestionGen.kt` --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * improvement based on review * fix some warnings * rest of implementation * add AGENTS.md * Catch up (#4) * Pin pandas (microsoft#2179) * Release v2.7.1 (microsoft#2186) * Release v2.7.1 (microsoft#2187) * Update Python publish workflow for PyPI (microsoft#2188) Debug publish workflow * V3/main (microsoft#2190) * Remove graph embedding and UMAP (microsoft#2048) * Remove umap/layout operation * Remove graph embedding * Bump unified-search to GR 2.5.0 * Remove graph vis from unified-search * Remove file filtering (microsoft#2050) * Remove document filtering * Semver * Fix integ tests * Fix file find tuple * Fix another dangling find tuple * Remove text unit grouping (microsoft#2052) * Remove text unit group_by_columns * Semver * Fix default token split test * Fix models in config test samples * Fix token length in context sort test * Fix document sort * Re-implement hierarchical Leiden (microsoft#2049) * Use graspologic-native hierarchical leiden * Re-implement largest_connected_component * Copy in modularity * Use graspologic-native 
directly in pyproject * Remove directed graph tests (we don't use this) * Semver * Remove graspologic dep * Use 4.1 and text-embedding-3-large as defaults * Update comment * Clean vector store (microsoft#2077) * clean vector store code * fix * fix launch.json --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Update v3/main missing config + functions (microsoft#2082) * reduce schema fields (microsoft#2089) * reduce schema fields * fix launch.json --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Remove strategy dicts (microsoft#2090) * Remove "strategy" from community reports config/workflow * Remove extraction strategy from extract_graph * Remove summarization strategy from extract_graph * Remove strategy from claim extraction * Strongly type prompt templates * Remove strategy from embed_text * Push hydrated params into community report workflows * Push hydrated params into extract covariates * Push hydrated params into extract graph NLP * Push hydrated params into extract graph * Push hydrated params into text embeddings * Remove a few more low-level defaults * Semver * Remove configurable prompt delimiters * Update smoke tests * Remove fnllm (microsoft#2095) * Sort deps alpha * Remove multi search (microsoft#2093) * Remove multi-search from CLI * Remove multi-search from API * Flatten vector_store config * Push hydrated vector store down to embed_text * Remove outputs from config * Remove multi-search notebook/docs * Add missing response_type in basic search API * Fix basic search context and id mapping * Fix v1 migration notebook * Fix query entity search tests * V3 docs and cleanup (microsoft#2100) * Remove community contrib notebooks * Add migration notebook and breaking changes page edits * Update/polish docs * Make model instance name configurable * Add vector schema updates to v3 migration notebook * Spellcheck * Bump smoke test runtimes * Remove document overwrite 
(microsoft#2101) * remove document overwrite from vector store configuration * remove document overwrite and refactor load documents method * fix test * fix test * fix test --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Unified factory (microsoft#2105) * Simplify Factory interface * Migrate CacheFactory to standard base class * Migrate LoggerFactory to standard base class * Migrate StorageFactory to standard base class * Migrate VectorStoreFactory to standard base class * Update vector store example notebook * Delete notebook outputs * Move default providers into factories * Move retry/limit tests into integ * Split language model factories * Set smoke test tpm/rpm * Fix factory integ tests * Add method to smoke test, switch text to 'fast' * Fix text smoke config for fast workflow * Add new workflows to text smoke test * Convert input readers to a proper factory * Remove covariates from fast smoke test * Update docs for input factory * Bump smoke runtime * Even longer runtime * min-csv timeout * Remove unnecessary lambdas * Prefix vector store (microsoft#2106) * add prefix to vector store configuration and removal of container name * docs updated * change prefix property name * change prefix property name * feedback implemented --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * fix for container name * Restructure project as monorepo. (microsoft#2111) * Restructure project as monorepo. 
* Fix formatting * Storage fixes and cleanup (microsoft#2118) * Fix pipeline recursion * Remove base_dir from storage.find * Remove max_count from storage.find * Remove prefix on storage integ test * Add base_dir in creation_date test * Wrap base_dir in Path * Use constants for input/update directories * Nov 2025 housekeeping (microsoft#2120) * Remove gensim sideload * Split CI build/type checks from unit tests * Thorough review of docs to align with v3 * Format * Fix version * Fix type * Graphrag config (microsoft#2119) * Add load_config to graphrag-common package. * Empty graph guards (microsoft#2126) * Remove networkx from graph_extractor and clean out redundancy * Bubble pipeline error to console * Remove embeddings optional new (microsoft#2128) * remove optional embeddings * fix test * fix tests * fix pipeline * fix test * fix test * fix test * fix tests --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Format * Add empty checks for NLP graphs (microsoft#2133) * Init command asks for models (microsoft#2137) * Add init prompting for models * Remove hard-coded model config validation * Switch to typer option prompt for full CLI use with models * Update getting started for init model input * Bump request timeout and overall smoke test timeout * Add graphrag-storage. (microsoft#2127) * Add graphrag-storage. 
* Python update (3.13) (microsoft#2149) * Update to python 3.14 as default, with range down to 3.10 * Fix enum value in query cli * Update pyarrow * Update py version for storage package * Remove 3.10 * add fastuuid * Update Python support to 3.11-3.14 with stricter dependency constraints - Set minimum Python version to 3.11 (removed 3.10 support) - Added support for Python 3.14 - Updated CI workflows: single-version jobs use 3.14, matrix jobs use 3.11 and 3.14 - Fixed license format to use SPDX-compatible format for Python 3.14 - Updated pyarrow to >=22.0.0 for Python 3.14 wheel support - Added explicit fastuuid~=0.14 and blis~=1.3 for Python 3.14 compatibility - Replaced all loose version constraints (>=) with compatible release (~=) for better lock file control - Applied stricter versioning to all packages: graphrag, graphrag-common, graphrag-storage, unified-search-app * update uv lock * Pin blis to ~=1.3.3 to ensure Python 3.14 wheel availability * Update uv lock * Update numpy to >=2.0.0 for Python 3.14 Windows compatibility Numpy 1.25.x has access violation issues on Python 3.14 Windows. Numpy 2.x has proper Python 3.14 support including Windows wheels. * update uv lock * Update pandas to >=2.3.0 for numpy 2.x compatibility Pandas 2.2.x was compiled against numpy 1.x and causes ABI incompatibility errors with numpy 2.x. Pandas 2.3.0+ supports numpy 2.x properly. * update uv.lock * Add scipy>=1.15.0 for numpy 2.x compatibility Scipy versions < 1.15.0 have C extensions built against numpy 1.x and are incompatible with numpy 2.x, causing dtype size errors. 
* update uv lock * Update Python support to 3.11-3.13 with compatible dependencies - Set Python version range to 3.11-3.13 (removed 3.14 support) - Updated CI workflows: single-version jobs use 3.13, matrix jobs use 3.11 and 3.13 - Dependencies optimized for Python 3.13 compatibility: - pyarrow~=22.0 (has Python 3.13 wheels) - numpy~=1.26 - pandas~=2.2 - blis~=1.0 - fastuuid~=0.13 - Applied stricter version constraints using ~= operator throughout - Updated uv.lock with resolved dependencies * Update numpy to 2.1+ and pandas to 2.3+ for Python 3.13 Windows compatibility Numpy 1.26.x causes access violations on Python 3.13 Windows. Numpy 2.1+ has proper Python 3.13 support with Windows wheels. Pandas 2.3+ is required for numpy 2.x compatibility. * update vsts.yml python version * Add GraphRAG Cache package. (microsoft#2153) * Add GraphRAG Cache package. * Fix a bunch of module comments and function visibility (microsoft#2154) * Issue microsoft#2004 fix (microsoft#2159) * fix issue microsoft#2004 using KeenhoChu idea in his PR * add unit test for dynamic community selection * add unit test for dynamic community selection implementing microsoft#2158 logic --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Mismatch between header in community report generation prompt examples and input data (id vs human_readable_id) (microsoft#2161) * fix issue microsoft#860 for mismatch in prompts and input * fix format --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Chunker factory (microsoft#2156) * Delete NoopTextSplitter * Delete unused check_token_limit * Add base chunking factory and migrate workflow to use it * Split apart chunker module * Co-locate chunking/splitting * Collapse token splitting functionality into one class/function * Restore create_base_text_units parameterization * Move Tokenizer base class to common package * Move pre-pending into chunkers * Streamline config * Fix 
defaults construction * Add prepending tests * Remove chunk_size_includes_metadata config * Revert ChunkingDocument interface * Move metadata prepending to a util * Move Tokenizer back to GR core * Fix tokenizer removal from chunker * Set defaults for chunking config * Move chunking to monorepo package * Format * Typo * Add ChunkResult model * Streamline chunking config * Add missing version updates for graphrag_chunking * Input factory (microsoft#2168) * Update input factory to match other factories * Move input config alongside input readers * Move file pattern logic into InputReader * Set encoding default * Clean up optional column configs * Combine structured data extraction * Remove pandas from input loading * Throw if empty documents * Add json lines (jsonl) input support * Store raw data * Fix merge imports * Move metadata handling entirely to chunking * Nicer automatic title * Typo * Add get_property utility for nested dictionary access with dot notation * Update structured_file_reader to use get_property utility * Extract input module into new graphrag-input monorepo package - Create new graphrag-input package with input loading utilities - Move InputConfig, InputFileType, InputReader, TextDocument, and file readers (CSV, JSON, JSONL, Text) - Add get_property utility for nested dictionary access with dot notation - Include hashing utility for document ID generation - Update all imports throughout codebase to use graphrag_input - Add package to workspace configuration and release tasks - Remove old graphrag.index.input module * Rename ChunkResult to TextChunk and add transformer support - Rename chunk_result.py to text_chunk.py with ChunkResult -> TextChunk - Add 'original' field to TextChunk to track pre-transform text - Add optional transform callback to chunker.chunk() method - Add add_metadata transformer for prepending metadata to chunks - Update create_chunk_results to apply transforms and populate original - Update sentence_chunker and token_chunker 
with transform support - Refactor create_base_text_units to use new transformer pattern - Rename pluck_metadata to get/collect methods on TextDocument * Back-compat comment * Align input config type name with other factory configs * Add MarkItDown support * Remove pattern default from MarkItDown reader * Remove plugins flag (implicit disabled) * Format * Update verb tests * Separate storage from input config * Add empty objects for NaN raw_data * Fix smoke tests * Fix BOM in csv smoke * Format * DRIFT fixes (microsoft#2171) * Use stable ids for community reports * Remove deprecated title from embedding flow * Remove embedding column from df loaders * Fix lancedb insertion * Add drift back to smoke tests * Fix mock embedder to match default embedding length * Fix DRIFT notebook * Push drift_k_followups through to prompt * Format * Vector package (microsoft#2172) * Extract graphrag-vectors package * Simplify vector factory usage and config defaults * Update factory integ initializers * Fix mock patch * Format * Register vector stores in tests * Set a default vector store name * Update vector readme * Remove impls from init * Move some validation into impls * Remove index_prefix * Move duplicate method to base class * Fix smoke vector config * Update index bug (microsoft#2173) * fix update index bug * blob storage bug fix --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Add GraphRAG LLM package. (microsoft#2174) * Update documentation for v3 release (microsoft#2176) update documentation for v3 release Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Graphrag llm cleanup (microsoft#2181) * Migration update (microsoft#2180) * fix formatting. 
--------- Co-authored-by: Nathan Evans <github@talkswithnumbers.com> Co-authored-by: gaudyb <85708998+gaudyb@users.noreply.github.com> Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> Co-authored-by: Andres Morales <86074752+andresmor-ms@users.noreply.github.com> * Release v3.0.0 (microsoft#2191) * Fix deps (microsoft#2193) * fix missing project urls * fix missing deps. * Release v3.0.1 (microsoft#2195) * add TableProvider to enable future row-by-row streaming (microsoft#2189) * write dataframe * changed some workflows * 1a * add fixed files * add versioning * add patch and remove utility * pr changes * Python 3.13 (microsoft#2208) * make graphrag-llm supports 3.13 * Semver --------- Co-authored-by: Deo <liangzhanzhao@metrodata.cn> Co-authored-by: Zhanzhao (Deo) Liang <liangzhanzhao1985@gmail.com> * update vector store example. (microsoft#2202) * Table factory (microsoft#2214) * Add table provider factory * Semver * Remove unnecessary response format check. (microsoft#2213) - Fixes: microsoft#2203 * add csv table provider (microsoft#2215) * add csv table provider * add in provider * add semver * change list_tables to list() * Add DataReader class for typed dataframe loading (microsoft#2220) * Add DataReader class for typed dataframe loading Introduce DataReader that wraps TableProvider and applies type coercion functions when loading dataframes from weakly-typed formats (e.g. CSV). 
- Add DataReader class with methods for each table type: entities, relationships, communities, community_reports, covariates, text_units, and documents - Add typed loading functions in dfs.py for community_reports, covariates, text_units, and documents (entities, relationships, communities already existed) - Integrate DataReader into all 17 indexing workflows replacing raw read_dataframe calls - Integrate DataReader into CLI query's _resolve_output_files for typed loading across all search types (global, local, drift, basic) - Export DataReader from data_model package __init__ * Fix column check * Add notebook example support for each package (microsoft#2205) * add notebook example support for each package * add notebook example support for each package * semversioner change * feedback implemented for notebooks * feedback implemented for notebooks * feedback implemented for notebooks --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Streamline workflows (microsoft#2225) * Move document ID, human_readable_id, and raw_data setup from create_final_documents into load workflows Consolidates core document field initialization (id string cast, human_readable_id index, raw_data default) into load_input_documents and load_update_documents so that create_final_documents only handles the text unit join. Also applies the same setup in the run_pipeline input_documents bypass paths. * Remove overzealous input document assignment * Semver * Format * Add async iterator support to InputReader and use in load workflows (microsoft#2226) * Add async iterator support to InputReader and use in load workflows InputReader now implements __aiter__ so it can be used as `async for doc in reader`. The core iteration logic is in _iterate_files(), and read_files() delegates to the iterator for batch loading. Both load_input_documents and load_update_documents workflows now use the async iterator with dataclasses.asdict for DataFrame construction. 
* Format * add memory profiling (microsoft#2227) * add profiling * add unit test for profiling * fix property name --------- Co-authored-by: Nathan Evans <github@talkswithnumbers.com> Co-authored-by: Derek Worthen <worthend.derek@gmail.com> Co-authored-by: Alonso Guevara <alonsog@microsoft.com> Co-authored-by: gaudyb <85708998+gaudyb@users.noreply.github.com> Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> Co-authored-by: Andres Morales <86074752+andresmor-ms@users.noreply.github.com> Co-authored-by: Dayenne Souza <ddesouza@microsoft.com> Co-authored-by: Deo <liangzhanzhao@metrodata.cn> Co-authored-by: Zhanzhao (Deo) Liang <liangzhanzhao1985@gmail.com> * update with review * update with review --------- Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by: Nathan Evans <github@talkswithnumbers.com> Co-authored-by: Derek Worthen <worthend.derek@gmail.com> Co-authored-by: Alonso Guevara <alonsog@microsoft.com> Co-authored-by: gaudyb <85708998+gaudyb@users.noreply.github.com> Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> Co-authored-by: Andres Morales <86074752+andresmor-ms@users.noreply.github.com> Co-authored-by: Dayenne Souza <ddesouza@microsoft.com> Co-authored-by: Deo <liangzhanzhao@metrodata.cn> Co-authored-by: Zhanzhao (Deo) Liang <liangzhanzhao1985@gmail.com>
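The `get_property` utility named in the Input factory entry (microsoft#2168) is described only as nested dictionary access with dot notation. A minimal sketch of that idea follows; the function name `get_property_sketch`, its signature, and the choice to raise `KeyError` on a missing segment are assumptions made for illustration, not the graphrag-input API.

```python
from typing import Any


def get_property_sketch(data: dict[str, Any], path: str) -> Any:
    """Follow a dot-separated path through nested dicts (hypothetical helper)."""
    current: Any = data
    for key in path.split("."):
        # Stop as soon as a segment cannot be resolved.
        if not isinstance(current, dict) or key not in current:
            raise KeyError(path)
        current = current[key]
    return current


record = {"title": "Report", "metadata": {"author": {"name": "Ada"}}}
print(get_property_sketch(record, "metadata.author.name"))  # prints: Ada
```

A lookup like this lets a structured-file reader accept a column setting such as `metadata.author.name` without flattening the record first.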

- Add DataReader class with methods for each table type: entities, relationships, communities, community_reports, covariates, text_units, and documents - Add typed loading functions in dfs.py for community_reports, covariates, text_units, and documents (entities, relationships, communities already existed) - Integrate DataReader into all 17 indexing workflows replacing raw read_dataframe calls - Integrate DataReader into CLI query's _resolve_output_files for typed loading across all search types (global, local, drift, basic) - Export DataReader from data_model package __init__ * Fix column check * Add notebook example support for each package (microsoft#2205) * add notebook example support for each package * add notebook example support for each package * semversioner change * feedback implemented for notebooks * feedback implemented for notebooks * feedback implemented for notebooks --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Streamline workflows (microsoft#2225) * Move document ID, human_readable_id, and raw_data setup from create_final_documents into load workflows Consolidates core document field initialization (id string cast, human_readable_id index, raw_data default) into load_input_documents and load_update_documents so that create_final_documents only handles the text unit join. Also applies the same setup in the run_pipeline input_documents bypass paths. * Remove overzealous input document assignment * Semver * Format * Add async iterator support to InputReader and use in load workflows (microsoft#2226) * Add async iterator support to InputReader and use in load workflows InputReader now implements __aiter__ so it can be used as `async for doc in reader`. The core iteration logic is in _iterate_files(), and read_files() delegates to the iterator for batch loading. Both load_input_documents and load_update_documents workflows now use the async iterator with dataclasses.asdict for DataFrame construction. 
* Format * add memory profiling (microsoft#2227) * add profiling * add unit test for profiling * fix property name --------- Co-authored-by: Nathan Evans <github@talkswithnumbers.com> Co-authored-by: Derek Worthen <worthend.derek@gmail.com> Co-authored-by: Alonso Guevara <alonsog@microsoft.com> Co-authored-by: gaudyb <85708998+gaudyb@users.noreply.github.com> Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> Co-authored-by: Andres Morales <86074752+andresmor-ms@users.noreply.github.com> Co-authored-by: Dayenne Souza <ddesouza@microsoft.com> Co-authored-by: Deo <liangzhanzhao@metrodata.cn> Co-authored-by: Zhanzhao (Deo) Liang <liangzhanzhao1985@gmail.com>
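The commit message above notes that InputReader now implements `__aiter__`, with the core iteration in `_iterate_files()`, `read_files()` delegating to the iterator, and the load workflows building rows with `dataclasses.asdict`. A minimal sketch of that pattern follows. `Doc`, `SketchReader`, and `load_rows` are hypothetical stand-ins invented for illustration, not the actual `graphrag_input` classes, and the in-memory list replaces real file I/O.

```python
import asyncio
from collections.abc import AsyncIterator
from dataclasses import asdict, dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a loaded document."""

    id: str
    text: str


class SketchReader:
    """Hypothetical reader; not the real InputReader from graphrag_input."""

    def __init__(self, texts: list[str]) -> None:
        self._texts = texts

    async def _iterate_files(self) -> AsyncIterator[Doc]:
        # Core iteration logic: yield one document at a time.
        for index, text in enumerate(self._texts):
            await asyncio.sleep(0)  # placeholder for asynchronous file I/O
            yield Doc(id=str(index), text=text)

    def __aiter__(self) -> AsyncIterator[Doc]:
        # Returning the async generator enables `async for doc in reader`.
        return self._iterate_files()

    async def read_files(self) -> list[Doc]:
        # Batch loading delegates to the iterator.
        return [doc async for doc in self]


async def load_rows(reader: SketchReader) -> list[dict]:
    # Convert each dataclass to a plain dict, ready for DataFrame construction.
    return [asdict(doc) async for doc in reader]


rows = asyncio.run(load_rows(SketchReader(["first", "second"])))
```

Keeping the iteration in a single async generator means batch and streaming callers share one code path, which appears to be the intent behind having `read_files()` delegate to the iterator.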

* Remove graph embedding and UMAP (microsoft#2048) * Remove umap/layout operation * Remove graph embedding * Bump unified-search to GR 2.5.0 * Remove graph vis from unified-search * Remove file filtering (microsoft#2050) * Remove document filtering * Semver * Fix integ tests * Fix file find tuple * Fix another dangling find tuple * Remove text unit grouping (microsoft#2052) * Remove text unit group_by_columns * Semver * Fix default token split test * Fix models in config test samples * Fix token length in context sort test * Fix document sort * Re-implement hierarchical Leiden (microsoft#2049) * Use graspologic-native hierarchical leiden * Re-implement largest_connected_component * Copy in modularity * Use graspologic-native directly in pyproject * Remove directed graph tests (we don't use this) * Semver * Remove graspologic dep * Use 4.1 and text-embedding-3-large as defaults * Update comment * Clean vector store (microsoft#2077) * clean vector store code * fix * fix launch.json --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Update v3/main missing config + functions (microsoft#2082) * reduce schema fields (microsoft#2089) * reduce schema fields * fix launch.json --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Remove strategy dicts (microsoft#2090) * Remove "strategy" from community reports config/workflow * Remove extraction strategy from extract_graph * Remove summarization strategy from extract_graph * Remove strategy from claim extraction * Strongly type prompt templates * Remove strategy from embed_text * Push hydrated params into community report workflows * Push hyrdated params into extract covariates * Push hydrated params into extract graph NLP * Push hydrated params into extract graph * Push hydrated params into text embeddings * Remove a few more low-level defaults * Semver * Remove configurable prompt delimiters * Update smoke tests * Remove fnllm 
(microsoft#2095) * Sort deps alpha * Remove multi search (microsoft#2093) * Remove multi-search from CLI * Remove multi-search from API * Flatten vector_store config * Push hydrated vector store down to embed_text * Remove outputs from config * Remove multi-search notebook/docs * Add missing response_type in basic search API * Fix basic search context and id mapping * Fix v1 migration notebook * Fix query entity search tests * V3 docs and cleanup (microsoft#2100) * Remove community contrib notebooks * Add migration notebook and breaking changes page edits * Update/polish docs * Make model instance name configurable * Add vector schema updates to v3 migration notebook * Spellcheck * Bump smoke test runtimes * Remove document overwrite (microsoft#2101) * remove document overwrite from vector store configuration * remove document overwrite and refactor load documents method * fix test * fix test * fix test --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Unified factory (microsoft#2105) * Simplify Factory interface * Migrate CacheFactory to standard base class * Migrate LoggerFactory to standard base class * Migrate StorageFactory to standard base class * Migrate VectorStoreFactory to standard base class * Update vector store example notebook * Delete notebook outputs * Move default providers into factories * Move retry/limit tests into integ * Split language model factories * Set smoke test tpm/rpm * Fix factory integ tests * Add method to smoke test, switch text to 'fast' * Fix text smoke config for fast workflow * Add new workflows to text smoke test * Convert input readers to a proper factory * Remove covariates from fast smoke test * Update docs for input factory * Bump smoke runtime * Even longer runtime * min-csv timeout * Remove unnecessary lambdas * Prefix vector store (microsoft#2106) * add prefix to vector store configuration and removal of container name * docs updated * change prefix property name * change 
prefix property name * feedback implemented --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * fix for container name * Restructure project as monorepo. (microsoft#2111) * Restructure project as monorepo. * Fix formatting * Storage fixes and cleanup (microsoft#2118) * Fix pipeline recursion * Remove base_dir from storage.find * Remove max_count from storage.find * Remove prefix on storage integ test * Add base_dir in creation_date test * Wrap base_dir in Path * Use constants for input/update directories * Nov 2025 housekeeping (microsoft#2120) * Remove gensim sideload * Split CI build/type checks from unit tests * Thorough review of docs to align with v3 * Format * Fix version * Fix type * Graphrag config (microsoft#2119) * Add load_config to graphrag-common package. * Empty graph guards (microsoft#2126) * Remove networkx from graph_extractor and clean out redundancy * Bubble pipeline error to console * Remove embeddings optional new (microsoft#2128) * remove optional embeddings * fix test * fix tests * fix pipeline * fix test * fix test * fix test * fix tests --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Format * Add empty checks for NLP graphs (microsoft#2133) * Init command asks for models (microsoft#2137) * Add init prompting for models * Remove hard-coded model config validation * Switch to typer option prompt for full CLI use with models * Update getting started for init model input * Bump request timeout and overall smoke test timeout * Add graphrag-storage. (microsoft#2127) * Add graphrag-storage. 
* Python update (3.13) (microsoft#2149) * Update to python 3.14 as default, with range down to 3.10 * Fix enum value in query cli * Update pyarrow * Update py version for storage package * Remove 3.10 * add fastuuid * Update Python support to 3.11-3.14 with stricter dependency constraints - Set minimum Python version to 3.11 (removed 3.10 support) - Added support for Python 3.14 - Updated CI workflows: single-version jobs use 3.14, matrix jobs use 3.11 and 3.14 - Fixed license format to use SPDX-compatible format for Python 3.14 - Updated pyarrow to >=22.0.0 for Python 3.14 wheel support - Added explicit fastuuid~=0.14 and blis~=1.3 for Python 3.14 compatibility - Replaced all loose version constraints (>=) with compatible release (~=) for better lock file control - Applied stricter versioning to all packages: graphrag, graphrag-common, graphrag-storage, unified-search-app * update uv lock * Pin blis to ~=1.3.3 to ensure Python 3.14 wheel availability * Update uv lock * Update numpy to >=2.0.0 for Python 3.14 Windows compatibility Numpy 1.25.x has access violation issues on Python 3.14 Windows. Numpy 2.x has proper Python 3.14 support including Windows wheels. * update uv lock * Update pandas to >=2.3.0 for numpy 2.x compatibility Pandas 2.2.x was compiled against numpy 1.x and causes ABI incompatibility errors with numpy 2.x. Pandas 2.3.0+ supports numpy 2.x properly. * update uv.lock * Add scipy>=1.15.0 for numpy 2.x compatibility Scipy versions < 1.15.0 have C extensions built against numpy 1.x and are incompatible with numpy 2.x, causing dtype size errors. 
* update uv lock * Update Python support to 3.11-3.13 with compatible dependencies - Set Python version range to 3.11-3.13 (removed 3.14 support) - Updated CI workflows: single-version jobs use 3.13, matrix jobs use 3.11 and 3.13 - Dependencies optimized for Python 3.13 compatibility: - pyarrow~=22.0 (has Python 3.13 wheels) - numpy~=1.26 - pandas~=2.2 - blis~=1.0 - fastuuid~=0.13 - Applied stricter version constraints using ~= operator throughout - Updated uv.lock with resolved dependencies * Update numpy to 2.1+ and pandas to 2.3+ for Python 3.13 Windows compatibility Numpy 1.26.x causes access violations on Python 3.13 Windows. Numpy 2.1+ has proper Python 3.13 support with Windows wheels. Pandas 2.3+ is required for numpy 2.x compatibility. * update vsts.yml python version * Add GraphRAG Cache package. (microsoft#2153) * Add GraphRAG Cache package. * Fix a bunch of module comments and function visibility (microsoft#2154) * Issue microsoft#2004 fix (microsoft#2159) * fix issue microsoft#2004 using KeenhoChu idea in his PR * add unit test for dynamic community selection * add unit test for dynamic community selection implementing microsoft#2158 logic --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Mismatch between header in community report generation prompt examples and input data (id vs human_readable_id) (microsoft#2161) * fix issue microsoft#860 for mismatch in prompts and input * fix format --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Chunker factory (microsoft#2156) * Delete NoopTextSplitter * Delete unused check_token_limit * Add base chunking factory and migrate workflow to use it * Split apart chunker module * Co-locate chunking/splitting * Collapse token splitting functionality into one class/function * Restore create_base_text_units parameterization * Move Tokenizer base class to common package * Move pre-pending into chunkers * Streamline config * Fix 
defaults construction * Add prepending tests * Remove chunk_size_includes_metadata config * Revert ChunkingDocument interface * Move metadata prepending to a util * Move Tokenizer back to GR core * Fix tokenizer removal from chunker * Set defaults for chunking config * Move chunking to monorepo package * Format * Typo * Add ChunkResult model * Streamline chunking config * Add missing version updates for graphrag_chunking * Input factory (microsoft#2168) * Update input factory to match other factories * Move input config alongside input readers * Move file pattern logic into InputReader * Set encoding default * Clean up optional column configs * Combine structured data extraction * Remove pandas from input loading * Throw if empty documents * Add json lines (jsonl) input support * Store raw data * Fix merge imports * Move metadata handling entirely to chunking * Nicer automatic title * Typo * Add get_property utility for nested dictionary access with dot notation * Update structured_file_reader to use get_property utility * Extract input module into new graphrag-input monorepo package - Create new graphrag-input package with input loading utilities - Move InputConfig, InputFileType, InputReader, TextDocument, and file readers (CSV, JSON, JSONL, Text) - Add get_property utility for nested dictionary access with dot notation - Include hashing utility for document ID generation - Update all imports throughout codebase to use graphrag_input - Add package to workspace configuration and release tasks - Remove old graphrag.index.input module * Rename ChunkResult to TextChunk and add transformer support - Rename chunk_result.py to text_chunk.py with ChunkResult -> TextChunk - Add 'original' field to TextChunk to track pre-transform text - Add optional transform callback to chunker.chunk() method - Add add_metadata transformer for prepending metadata to chunks - Update create_chunk_results to apply transforms and populate original - Update sentence_chunker and token_chunker 
with transform support - Refactor create_base_text_units to use new transformer pattern - Rename pluck_metadata to get/collect methods on TextDocument * Back-compat comment * Align input config type name with other factory configs * Add MarkItDown support * Remove pattern default from MarkItDown reader * Remove plugins flag (implicit disabled) * Format * Update verb tests * Separate storage from input config * Add empty objects for NaN raw_data * Fix smoke tests * Fix BOM in csv smoke * Format * DRIFT fixes (microsoft#2171) * Use stable ids for community reports * Remove deprecated title from embedding flow * Remove embedding column from df loaders * Fix lancedb insertion * Add drift back to smoke tests * Fix mock embedder to match default embedding length * Fix DRIFT notebook * Push drift_k_followups through to prompt * Format * Vector package (microsoft#2172) * Extract graphrag-vectors package * Simplify vector factory usage and config defaults * Update factory integ initializers * Fix mock patch * Format * Register vector stores in tests * Set a default vector store name * Update vector readme * Remove impls from init * Move some validation into impls * Remove index_prefix * Move duplicate method to base class * Fix smoke vector config * Update index bug (microsoft#2173) * fix update index bug * blob storage bug fix --------- Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Add GraphRAG LLM package. (microsoft#2174) * Update documentation for v3 release (microsoft#2176) update documentation for v3 release Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> * Graphrag llm cleanup (microsoft#2181) * Migration update (microsoft#2180) * fix formatting. 
--------- Co-authored-by: Nathan Evans <github@talkswithnumbers.com> Co-authored-by: gaudyb <85708998+gaudyb@users.noreply.github.com> Co-authored-by: Gaudy Blanco <gaudy-microsoft@MacBook-Pro-m4-Gaudy-For-Work.local> Co-authored-by: Andres Morales <86074752+andresmor-ms@users.noreply.github.com>
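The commit message above mentions a `get_property` utility for nested dictionary access with dot notation. The sketch below illustrates the general idea only; the signature and the choice to raise `KeyError` on a missing path are assumptions, and the real utility in graphrag-input may behave differently.

```python
from typing import Any


def get_property(data: dict[str, Any], path: str) -> Any:
    """Resolve a dot-separated path such as "meta.author.name" (hypothetical signature)."""
    current: Any = data
    for key in path.split("."):
        # Stop as soon as a segment is missing or the value is not a dict.
        if not isinstance(current, dict) or key not in current:
            raise KeyError(path)
        current = current[key]
    return current


record = {"title": "Report", "meta": {"author": {"name": "Ada"}}}
author = get_property(record, "meta.author.name")
```

A structured file reader could use a helper like this to let column settings such as a title or text field point at nested JSON values.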

dworthen requested a review from a team as a code owner January 14, 2026 13:13

Add GraphRAG LLM package.

ab68b9f

dworthen force-pushed the graphrag-llm branch from a3dd470 to ab68b9f Compare January 15, 2026 17:30

dworthen requested a review from Copilot January 15, 2026 17:34

Copilot AI reviewed Jan 15, 2026

Comment thread tests/unit/graphrag_factory/test_factory.py

Comment thread tests/verbs/test_extract_covariates.py

Comment thread tests/verbs/test_create_community_reports.py

Comment thread packages/graphrag/graphrag/config/models/embed_text_config.py

dworthen added 9 commits January 15, 2026 13:19

disable retries by default.

8cbb7f6

add LLMCompletionResponse.content

ef7d6a7

add LLMEmbeddingResponse utilities

4bd00d0

update validation.

8f2a923

fix test

2f8d37a

update dependencies

f0d400a

fix checks

1e82072

update validation

e400fee

Merge branch 'v3/main' into graphrag-llm

b529577

AlonsoGuevara requested a review from Copilot January 20, 2026 18:10

Copilot AI reviewed Jan 20, 2026

AlonsoGuevara approved these changes Jan 21, 2026

dworthen merged commit c0a06ba into v3/main Jan 22, 2026
14 checks passed

dworthen deleted the graphrag-llm branch January 22, 2026 18:28

Add GraphRAG LLM package. #2174

dworthen merged 10 commits into v3/main from graphrag-llm

3 participants
