This repository was archived by the owner on Dec 28, 2025. It is now read-only.

Refactor: Refactor Scheduler to Support Dynamic Workflow Scheduling and Pipeline Pooling #46

Open
weijinglin wants to merge 9 commits into main from agentic

Conversation

@weijinglin
Collaborator

@weijinglin weijinglin commented Sep 13, 2025

This PR refactors the Scheduler class to introduce a more flexible and extensible workflow scheduling mechanism. The main changes include:

  • Introduced a pipeline pool using a dictionary to manage different workflow types (e.g., build_vector_index, graph_extract), each with its own GPipelineManager, flow, prepare, and post-processing functions.
  • Added a schedule_flow method to dynamically select and execute workflows based on the flow name, supporting pipeline reuse and resource management.
  • Refactored the build_vector_index and graph_extract flows to separate preparation, execution, and post-processing logic, improving modularity and maintainability.
  • Updated related utility functions (graph_index_utils.py, vector_index_utils.py) to use the new schedule_flow interface.
  • Improved error handling and logging for schema parsing and pipeline execution.

These changes lay the foundation for supporting more complex and agentic workflows in the future, while also improving the efficiency and scalability of the current pipeline execution framework.
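The pipeline-pool pattern described above can be sketched as follows. This is a minimal illustration, not the PR's actual implementation: names such as `PipelineEntry`, `register`, and the `echo` flow are invented stand-ins, while the real Scheduler wraps PyCGraph's `GPipelineManager` and node classes.

```python
# Sketch of a scheduler holding a pool of named flows, each with its own
# prepare / execute / post-processing functions. All names are illustrative.
from typing import Any, Callable, Dict, NamedTuple


class PipelineEntry(NamedTuple):
    prepare: Callable[..., Dict[str, Any]]              # build the flow's input state
    execute: Callable[[Dict[str, Any]], Dict[str, Any]]  # run the pipeline
    post: Callable[[Dict[str, Any]], Any]               # extract the final result


class Scheduler:
    def __init__(self) -> None:
        self.pipeline_pool: Dict[str, PipelineEntry] = {}

    def register(self, flow: str, entry: PipelineEntry) -> None:
        self.pipeline_pool[flow] = entry

    def schedule_flow(self, flow: str, *args: Any, **kwargs: Any) -> Any:
        # Dynamically select a workflow by name, mirroring the PR's schedule_flow.
        if flow not in self.pipeline_pool:
            raise ValueError(f"Unsupported workflow: {flow}")
        entry = self.pipeline_pool[flow]
        state = entry.prepare(*args, **kwargs)
        state = entry.execute(state)
        return entry.post(state)


# Usage: register a trivial flow and dispatch it by name.
sched = Scheduler()
sched.register("echo", PipelineEntry(
    prepare=lambda texts: {"texts": texts},
    execute=lambda s: {**s, "chunks": [t.strip() for t in s["texts"]]},
    post=lambda s: s["chunks"],
))
print(sched.schedule_flow("echo", [" a ", "b "]))  # ['a', 'b']
```

Registering new workflow types then only requires adding another entry to the pool, which is the extensibility the refactor is after.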

Summary by CodeRabbit

  • New Features
    • Introduced reusable node-plus-scheduler workflows: the graph extraction and vector index construction flows run end to end and return results
    • Added node-based components for chunk splitting, information extraction, property graph extraction, schema management, and vector index construction
    • Provided unified LLM/Embedding factories and state models, with English tokenization and document-level splitting
  • Improvements
    • Stricter JSON/schema validation; thread-safe context and result serialization; clearer logging and error messages
  • Chores
    • Added the pycgraph dependency (Linux only)

@coderabbitai

coderabbitai bot commented Sep 13, 2025

Walkthrough

This change introduces a node-and-scheduler-based workflow system: new state objects and context-initialization utilities, several node-based operators (schema management/validation, chunk splitting, information extraction, property graph extraction, vector index construction), embedding/LLM factory functions, two flow implementations (graph extraction and vector index construction), and a unified Scheduler for management and dispatch. It also adds a git-based pycgraph dependency source in pyproject.

Changes

Cohort / File(s) Change Summary
Dependency and source configuration
hugegraph-llm/pyproject.toml
Adds the pycgraph dependency and configures a git-based source in [tool.uv.sources] (pointing to https://github.com/ChunelFeng/CGraph.git, subdirectory "python", revision "main", with a Linux platform marker).
Workflow state and initialization
.../state/__init__.py, .../state/ai_state.py, .../operators/util.py
Adds WkFlowInput / WkFlowState (with reset/setup/to_json and related methods); adds init_context(obj) -> CStatus to initialize context and wk_input from node parameters.
Schema validation and integration node
.../operators/common_op/check_schema.py
Splits schema validation into several helper functions; adds check_type, the CheckSchema class, and a node-based CheckSchemaNode(GNode) implementation; strengthens error handling, auto-fills missing properties, and persists them to schema/context.
Document chunking (node-based)
.../operators/document_op/chunk_split.py
Adds ChunkSplitNode(GNode) with language-aware separators (including LANGUAGE_EN) and multiple split types (including SPLIT_TYPE_DOCUMENT); the node validates input and writes to the context under a lock; the original ChunkSplit class is retained.
HugeGraph schema management (node-based)
.../operators/hugegraph_op/schema_manager.py
Adds SchemaManagerNode(GNode); the node path builds a client from wk_input.graph_name, fetches and validates the schema, writes schema and simple_schema into the context, and provides get_result.
Vector index construction (node-based)
.../operators/index_op/build_vector_index.py
Adds BuildVectorIndexNode(GNode); the node path initializes the context, computes chunk embeddings in parallel, and updates and writes back the VectorIndex; the original BuildVectorIndex is retained with minor formatting tweaks.
Information extraction (node-based)
.../operators/llm_op/info_extract.py
Adds InfoExtractNode(GNode), a node-based extraction flow built on WkFlowState/WkFlowInput, integrating get_chat_llm/llm_settings, context-driven storage, and long-ID filtering; the original InfoExtract logic and signatures are retained with adjustments.
Property graph extraction (node-based)
.../operators/llm_op/property_graph_extract.py
Adds PropertyGraphExtractNode(GNode), a node-based flow with stricter JSON validation and label filtering; introduces context and LLM factories; the old path is retained.
Embedding and LLM factories
.../models/embeddings/init_embedding.py, .../models/llms/init_llm.py
Adds get_embedding(llm_settings) with a model_map mapping; adds the get_chat_llm, get_extract_llm, and get_text2gql_llm factory functions plus matching LLMs instance methods that select and build the concrete client/embedding implementation by type.
Utilities: graph and vector index
.../utils/graph_index_utils.py, .../utils/vector_index_utils.py
Introduces SchedulerSingleton and adds scheduler-based entry points: extract_graph (scheduler path) while keeping the original extract_graph_origin; build_vector_index now dispatches through the scheduler, removing the direct KgBuilder/Embeddings construction path.
Flows and scheduling
.../flows/__init__.py, .../flows/common.py, .../flows/build_vector_index.py, .../flows/graph_extract.py, .../flows/scheduler.py
Adds the abstract BaseFlow; adds BuildVectorIndexFlow and GraphExtractFlow (which assemble the GPipeline and register state and node dependencies); adds Scheduler and SchedulerSingleton, implementing GPipeline-based scheduling, reuse, and error/lifecycle management.
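Based on the summary in the first row above, the pycgraph source entry in pyproject.toml plausibly looks something like the following sketch. This is reconstructed from the description, not copied from the diff, so the exact key names and marker placement may differ:

```toml
[project]
dependencies = [
    # Linux-only, per the platform marker mentioned in the summary
    "pycgraph; sys_platform == 'linux'",
]

[tool.uv.sources]
pycgraph = { git = "https://github.com/ChunelFeng/CGraph.git", subdirectory = "python", rev = "main" }
```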

Sequence Diagram(s)

sequenceDiagram
    autonumber
    actor Client
    participant Scheduler
    participant Flow as GraphExtractFlow
    participant Pipe as GPipeline
    participant S as SchemaNode
    participant CS as ChunkSplitNode
    participant GE as ExtractNode

    Client->>Scheduler: schedule_flow("graph_extract", schema, texts, example_prompt, type)
    Scheduler->>Flow: build_flow(...)
    Flow->>Pipe: create and register params/nodes
    Note right of Flow: Select SchemaManagerNode or CheckSchemaNode\nand InfoExtractNode or PropertyGraphExtractNode\nbased on the input
    Scheduler->>Pipe: init()/run()
    Pipe->>S: run()
    S-->>Pipe: CStatus
    Pipe->>CS: run()
    CS-->>Pipe: CStatus (chunks written to context)
    Pipe->>GE: run()
    GE-->>Pipe: CStatus
    Scheduler->>Flow: post_deal(pipeline)
    Flow-->>Scheduler: {vertices, edges}
    Scheduler-->>Client: return result
sequenceDiagram
    autonumber
    actor Client
    participant Scheduler
    participant Flow as BuildVectorIndexFlow
    participant Pipe as GPipeline
    participant CS as ChunkSplitNode
    participant BV as BuildVectorIndexNode
    participant Store as VectorIndex

    Client->>Scheduler: schedule_flow("build_vector_index", texts)
    Scheduler->>Flow: build_flow(texts)
    Flow->>Pipe: register ChunkSplitNode -> BuildVectorIndexNode
    Scheduler->>Pipe: init()/run()
    Pipe->>CS: run() split into chunks
    CS-->>Pipe: CStatus (chunks written to context)
    Pipe->>BV: run() generate embeddings in parallel
    BV->>Store: write_to_disk()
    BV-->>Pipe: CStatus
    Scheduler->>Flow: post_deal(pipeline)
    Flow-->>Scheduler: wkflow_state JSON
    Scheduler-->>Client: return result
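The flow contract implied by the two diagrams can be sketched in miniature as follows. A plain dict stands in for PyCGraph's GPipeline, and everything beyond the `build_flow`/`post_deal` names from the summary is an assumption:

```python
# Sketch of the BaseFlow contract: a flow assembles a pipeline, the scheduler
# runs it, then calls post_deal to collect results. Illustrative only.
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class BaseFlow(ABC):
    @abstractmethod
    def build_flow(self, pipeline: Dict[str, Any], *args: Any) -> None:
        """Register nodes and parameters on the pipeline."""

    @abstractmethod
    def post_deal(self, pipeline: Dict[str, Any]) -> Any:
        """Extract the flow's result after the pipeline has run."""


class BuildVectorIndexFlow(BaseFlow):
    def build_flow(self, pipeline: Dict[str, Any], texts: List[str]) -> None:
        # Stand-in for the ChunkSplitNode -> BuildVectorIndexNode chain.
        pipeline["chunks"] = [c for text in texts for c in text.split("\n")]

    def post_deal(self, pipeline: Dict[str, Any]) -> Any:
        return {"chunk_count": len(pipeline["chunks"])}


pipe: Dict[str, Any] = {}
flow = BuildVectorIndexFlow()
flow.build_flow(pipe, ["a\nb", "c"])
print(flow.post_deal(pipe))  # {'chunk_count': 3}
```

Separating `build_flow` from `post_deal` is what lets the scheduler cache and reuse a pipeline across invocations while each call still gets its own result extraction.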

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Poem

I string the flow into a line, nodes queued in a row,
Texts split into chunks, vectors carved in stone.
The scheduler taps a light drumbeat, the orchestra follows along,
Schemas sift fine and fast, results come safely home.
(=^・ェ・^=) A rabbit taps the keys, then hops off to play.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title clearly reflects the core of this change: refactoring the Scheduler to support dynamic workflow scheduling and pipeline pooling. It is specific, consistent with the PR's goal, and lets colleagues quickly identify the main changes; the slight redundancy ("Refactor: Refactor") does not hurt readability or accuracy.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch agentic

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Up to 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).


@github-actions

@codecov-ai-reviewer review

@github-actions github-actions bot added the llm label Sep 13, 2025
@codecov-ai

This comment has been minimized.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/schema_manager.py (1)

39-64: Duplicated simple_schema method

The simple_schema method is identical in the SchemaManager and SchemaManagerNode classes.

Consider extracting it into a module-level function or a base-class method:

+def simplify_schema(schema: Dict[str, Any]) -> Dict[str, Any]:
+    """Simplify the schema structure, keeping only the required fields."""
+    mini_schema = {}
+
+    if "vertexlabels" in schema:
+        mini_schema["vertexlabels"] = []
+        for vertex in schema["vertexlabels"]:
+            new_vertex = {
+                key: vertex[key]
+                for key in ["id", "name", "properties"]
+                if key in vertex
+            }
+            mini_schema["vertexlabels"].append(new_vertex)
+
+    if "edgelabels" in schema:
+        mini_schema["edgelabels"] = []
+        for edge in schema["edgelabels"]:
+            new_edge = {
+                key: edge[key]
+                for key in ["name", "source_label", "target_label", "properties"]
+                if key in edge
+            }
+            mini_schema["edgelabels"].append(new_edge)
+
+    return mini_schema

 class SchemaManager:
     def simple_schema(self, schema: Dict[str, Any]) -> Dict[str, Any]:
-        # remove the duplicated implementation
+        return simplify_schema(schema)

 class SchemaManagerNode(GNode):
     def simple_schema(self, schema: Dict[str, Any]) -> Dict[str, Any]:
-        # remove the duplicated implementation
+        return simplify_schema(schema)

Also applies to: 101-126

♻️ Duplicate comments (14)
hugegraph-llm/src/hugegraph_llm/operators/document_op/chunk_split.py (1)

33-87: Node class duplicates existing code

The _get_separators and _get_text_splitter methods and the core logic of ChunkSplitNode fully duplicate the ChunkSplit class below, violating the DRY principle.

Consider extracting the shared logic into a base class or utility functions:

+class ChunkSplitBase:
+    """Shared text-splitting logic."""
+
+    def _get_separators(self, language: str) -> List[str]:
+        if language == LANGUAGE_ZH:
+            return ["\n\n", "\n", "。", ",", ""]
+        if language == LANGUAGE_EN:
+            return ["\n\n", "\n", ".", ",", " ", ""]
+        raise ValueError("language must be zh or en")
+
+    def _get_text_splitter(self, split_type: str, separators: List[str]):
+        if split_type == SPLIT_TYPE_DOCUMENT:
+            return lambda text: [text]
+        if split_type == SPLIT_TYPE_PARAGRAPH:
+            return RecursiveCharacterTextSplitter(
+                chunk_size=500, chunk_overlap=30, separators=separators
+            ).split_text
+        if split_type == SPLIT_TYPE_SENTENCE:
+            return RecursiveCharacterTextSplitter(
+                chunk_size=50, chunk_overlap=0, separators=separators
+            ).split_text
+        raise ValueError("Type must be paragraph, sentence, html or markdown")

-class ChunkSplitNode(GNode):
+class ChunkSplitNode(GNode, ChunkSplitBase):
     # ... remove the duplicated _get_separators and _get_text_splitter methods

-class ChunkSplit:
+class ChunkSplit(ChunkSplitBase):
     # ... remove the duplicated _get_separators and _get_text_splitter methods
hugegraph-llm/src/hugegraph_llm/operators/common_op/check_schema.py (2)

40-177: Severe code duplication

The CheckSchema and CheckSchemaNode classes share a large amount of duplicated code, including all validation and processing methods. This is a serious DRY violation.

Consider a shared base class or mixin:

+class SchemaValidatorMixin:
+    """Shared schema validation and processing logic."""
+
+    def _validate_schema(self, schema: Dict[str, Any]) -> None:
+        check_type(schema, dict, "Input data is not a dictionary.")
+        if "vertexlabels" not in schema or "edgelabels" not in schema:
+            log_and_raise("Input data does not contain 'vertexlabels' or 'edgelabels'.")
+        check_type(
+            schema["vertexlabels"], list, "'vertexlabels' in input data is not a list."
+        )
+        check_type(
+            schema["edgelabels"], list, "'edgelabels' in input data is not a list."
+        )
+
+    # ... other shared methods ...

-class CheckSchema:
+class CheckSchema(SchemaValidatorMixin):
     def __init__(self, data: Dict[str, Any]):
         self.result = None
         self.data = data
-
-    # remove all duplicated validation methods

-class CheckSchemaNode(GNode):
+class CheckSchemaNode(GNode, SchemaValidatorMixin):
     context: WkFlowState = None
     wk_input: WkFlowInput = None
-
-    # remove all duplicated validation methods

Also applies to: 179-333


26-27: PyCGraph import problem must be resolved

The pipeline errors show that the PyCGraph module exposes neither GNode nor CStatus. This is a blocking issue.

Verify the correct import path for PyCGraph:

#!/bin/bash
# Check whether the project defines or aliases PyCGraph
fd -e py "PyCGraph" --exec grep -l "class GNode\|class CStatus" {} \;

# Check whether the symbols should be imported from another module
rg -n "GNode|CStatus" --type py -B2 -A2 | grep -E "^[^:]+\.py.*from|^[^:]+\.py.*import"
hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py (1)

35-35: PyCGraph import error

As in the other files, the PyCGraph module import is broken.

hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/schema_manager.py (1)

24-24: PyCGraph import problem

Consistent with the other files, the PyCGraph module import needs to be fixed.

hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py (3)

225-225: Fix Pylint warning: attributes defined outside __init__

Same issue as in PropertyGraphExtractNode.

Declare the attributes in the class definition:

 class InfoExtractNode(GNode):
     context: WkFlowState = None
     wk_input: WkFlowInput = None
+    llm = None
+    example_prompt: str = None

Also applies to: 228-228


324-327: Simplify the call_count update

Same optimization as suggested for PropertyGraphExtractNode.

-        if self.context.call_count:
-            self.context.call_count += len(chunks)
-        else:
-            self.context.call_count = len(chunks)
+        self.context.call_count = (self.context.call_count or 0) + len(chunks)

28-28: Fix the PyCGraph module import error

Same import problem as in property_graph_extract.py.

hugegraph-llm/src/hugegraph_llm/operators/scheduler.py (6)

18-18: Fix the PyCGraph module import error

Same import problem as in the other files; verify that PyCGraph is correctly installed and importable.


218-225: Singleton implementation is not thread-safe

The current singleton implementation can create multiple instances in a multithreaded environment.

Use a thread-safe implementation:

+import threading
+
 class SchedulerSingleton:
     _instance = None
+    _lock = threading.Lock()

     @classmethod
     def get_instance(cls):
         if cls._instance is None:
-            cls._instance = Scheduler()
+            with cls._lock:
+                if cls._instance is None:
+                    cls._instance = Scheduler()
         return cls._instance

149-152: Inconsistent error handling

The handler returns an error string instead of signaling failure, which renders the error handling ineffective.

         except json.JSONDecodeError:
             log.error("Invalid JSON format in schema. Please check it again.")
-            return (
+            raise ValueError(
                 "ERROR: Invalid JSON format in schema. Please check it carefully."
             )

61-90: Improve error-handling consistency

schedule_flow returns string messages on error, making it hard for callers to distinguish success from failure.

Consider raising exceptions or returning a structured response with a status code:

from typing import Union, Tuple

def schedule_flow(self, flow: str, *args, **kwargs) -> Union[dict, Tuple[bool, str]]:
    if flow not in self.pipeline_pool:
        raise ValueError(f"Unsupported workflow: {flow}")
    
    # ... rest of the method ...
    
    if status.isErr():
        raise RuntimeError(f"Error in flow execution: {status.getInfo()}")

134-156: Fix inconsistent return statements

graph_extract_prepare sometimes returns an error string and sometimes returns nothing, which is hard for callers to handle.

Unify the return type or use exceptions:

     def graph_extract_prepare(
         self, prepared_input: WkFlowInput, schema, texts, example_prompt, extract_type
-    ):
+    ) -> None:
         # prepare input data
         prepared_input.texts = texts
         prepared_input.language = "zh"
         prepared_input.split_type = "document"
         prepared_input.example_prompt = example_prompt
         prepared_input.schema = schema
         schema = schema.strip()
         if schema.startswith("{"):
             try:
                 schema = json.loads(schema)
                 prepared_input.schema = schema
             except json.JSONDecodeError:
                 log.error("Invalid JSON format in schema. Please check it again.")
-                return (
-                    "ERROR: Invalid JSON format in schema. Please check it carefully."
-                )
+                raise ValueError(
+                    "Invalid JSON format in schema. Please check it carefully."
+                )
         else:
             log.info("Get schema '%s' from graphdb.", schema)
             prepared_input.graph_name = schema
-        return

177-179: Fix the raise statement

Raise an exception object, not a string.

         except json.JSONDecodeError:
             log.error("Invalid JSON format in schema. Please check it again.")
-            raise (
+            raise ValueError(
                 "ERROR: Invalid JSON format in schema. Please check it carefully."
             )
🧹 Nitpick comments (13)
hugegraph-llm/src/hugegraph_llm/state/__init__.py (1)

1-17: Export public types to simplify package imports

To let external modules write from hugegraph_llm.state import WkFlowInput, WkFlowState, export them explicitly here:

+# Re-export the common workflow state types
+from .ai_state import WkFlowInput, WkFlowState
+
+__all__ = ["WkFlowInput", "WkFlowState"]
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/vector_graph_block.py (1)

376-382: The commented-out legacy path eases rollback, but consider a UI-level toggle

Switching to extract_graph_origin currently requires editing code; a Checkbox/Dropdown would let users toggle between the old and new paths at runtime without a redeploy.

Example (for reference only; the control must be added in a suitable place and its boolean value passed through):

-        graph_extract_bt.click(
-            extract_graph,
+        use_legacy = gr.Checkbox(label="Use legacy extractor", value=False)
+        graph_extract_bt.click(
+            fn=lambda *args: (extract_graph_origin if use_legacy.value else extract_graph)(*args),
             inputs=[input_file, input_text, input_schema, info_extract_template],
             outputs=[out],
         )
hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py (1)

108-114: Optionally normalize schedule_flow results

schedule_flow returns a string when the flow is unsupported or fails; consider converting that here into a readable Gradio error or a unified success JSON.

-    return scheduler.schedule_flow("build_vector_index", texts)
+    res = scheduler.schedule_flow("build_vector_index", texts)
+    if isinstance(res, str) and res.lower().startswith("error"):
+        raise gr.Error(res)
+    return res
hugegraph-llm/src/hugegraph_llm/operators/document_op/chunk_split.py (1)

49-51: Attributes defined outside __init__

Pylint warns that the texts, separators, and text_splitter attributes are defined outside the __init__ method. This may be intentional in the node architecture, but declaring them at class level improves clarity.

 class ChunkSplitNode(GNode):
+    texts: List[str] = None
+    separators: List[str] = None
+    text_splitter = None
+    
     def init(self):
         return init_context(self)
hugegraph-llm/src/hugegraph_llm/operators/common_op/check_schema.py (1)

189-189: Attribute defined outside initialization

The data attribute is defined in node_init rather than __init__. This may be by design in the node architecture, but it should still be declared at class level.

 class CheckSchemaNode(GNode):
     context: WkFlowState = None
     wk_input: WkFlowInput = None
+    data: Dict[str, Any] = None
hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py (2)

46-54: Attribute definition issue

Several attributes are defined in node_init rather than at class initialization.

 class BuildVectorIndexNode(GNode):
     context: WkFlowState = None
     wk_input: WkFlowInput = None
+    embedding: BaseEmbedding = None
+    folder_name: str = None
+    index_dir: str = None
+    filename_prefix: str = None
+    vector_index: VectorIndex = None

72-73: Track the TODO

A TODO comment notes that the single synchronous call should be replaced with async_get_texts_embedding.

This indicates a pending async optimization. Shall I open an issue to track this improvement?

hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/schema_manager.py (2)

134-134: Raise a specific exception type

Raise something more specific than the generic Exception:

-            raise Exception(f"Can not get {self.graph_name}'s schema from HugeGraph!")
+            raise ValueError(f"Can not get {self.graph_name}'s schema from HugeGraph!")

90-98: Attributes defined outside initialization

The graph_name, client, and schema attributes are defined in node_init.

 class SchemaManagerNode(GNode):
     context: WkFlowState = None
     wk_input: WkFlowInput = None
+    graph_name: str = None
+    client: PyHugeClient = None
+    schema = None
hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py (3)

74-76: Simplify the set difference

Plain set subtraction is more concise:

-            non_nullable_keys = set(
-                properties_map[item_type][label]["properties"]
-            ).difference(set(properties_map[item_type][label]["nullable_keys"]))
+            non_nullable_keys = (
+                set(properties_map[item_type][label]["properties"]) 
+                - set(properties_map[item_type][label]["nullable_keys"])
+            )

191-191: Fix Pylint warning: attributes defined outside __init__

The NECESSARY_ITEM_KEYS, llm, and example_prompt attributes are defined in init and node_init rather than __init__. This follows the GNode design pattern, but class-level type annotations would silence the warning.

Declare the attributes in the class definition:

 class PropertyGraphExtractNode(GNode):
     context: WkFlowState = None
     wk_input: WkFlowInput = None
+    NECESSARY_ITEM_KEYS: set = None  
+    llm = None
+    example_prompt: str = None

Also applies to: 195-195, 198-198


232-235: Simplify the call_count update

call_count can be updated more concisely.

-        if self.context.call_count:
-            self.context.call_count += len(chunks)
-        else:
-            self.context.call_count = len(chunks)
+        self.context.call_count = (self.context.call_count or 0) + len(chunks)
hugegraph-llm/src/hugegraph_llm/operators/scheduler.py (1)

107-107: Remove the useless return

The bare return at the end of the method is unnecessary.

     def build_vector_index_prepare(self, prepared_input: WkFlowInput, texts):
         prepared_input.texts = texts
         prepared_input.language = "zh"
         prepared_input.split_type = "paragraph"
-        return
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9e76c2a and 226d805.

📒 Files selected for processing (14)
  • hugegraph-llm/pyproject.toml (2 hunks)
  • hugegraph-llm/src/hugegraph_llm/demo/rag_demo/vector_graph_block.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/common_op/check_schema.py (2 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/document_op/chunk_split.py (2 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/schema_manager.py (3 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py (6 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py (6 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/scheduler.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/util.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/state/__init__.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/state/ai_state.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/utils/graph_index_utils.py (9 hunks)
  • hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py (3 hunks)
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-05-27T06:55:13.779Z
Learnt from: cgwer
PR: hugegraph/hugegraph-ai#10
File: hugegraph-python-client/pyproject.toml:0-0
Timestamp: 2025-05-27T06:55:13.779Z
Learning: The hugegraph-python-client is a component within the hugegraph-ai project repository (apache/incubator-hugegraph-ai), not a standalone repository. When reviewing project URLs in pyproject.toml files within this project, they should point to the main hugegraph-ai repository.

Applied to files:

  • hugegraph-llm/pyproject.toml
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: NLTKHelper in hugegraph-llm uses lazy loading for stopwords and calls nltk.corpus.stopwords.words(lang) directly with the provided language parameter. It does not preload both English and Chinese stopwords - each language is loaded on first access. The lang parameter must match NLTK's expected language codes ("english", "chinese") or it will fail.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/document_op/chunk_split.py
📚 Learning: 2025-06-25T09:50:06.213Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py
📚 Learning: 2025-06-25T09:45:10.751Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:100-116
Timestamp: 2025-06-25T09:45:10.751Z
Learning: In hugegraph-llm BasePromptConfig class, llm_settings is a runtime property that is loaded from config through dependency injection during object initialization, not a static class attribute. Static analysis tools may flag this as missing but it's intentional design.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py
🧬 Code graph analysis (11)
hugegraph-llm/src/hugegraph_llm/operators/util.py (7)
hugegraph-llm/src/hugegraph_llm/config/llm_config.py (1)
  • LLMConfig (25-81)
hugegraph-llm/src/hugegraph_llm/models/embeddings/litellm.py (1)
  • LiteLLMEmbedding (32-93)
hugegraph-llm/src/hugegraph_llm/models/embeddings/ollama.py (1)
  • OllamaEmbedding (25-71)
hugegraph-llm/src/hugegraph_llm/models/embeddings/openai.py (1)
  • OpenAIEmbedding (24-84)
hugegraph-llm/src/hugegraph_llm/models/llms/ollama.py (1)
  • OllamaClient (29-154)
hugegraph-llm/src/hugegraph_llm/models/llms/openai.py (1)
  • OpenAIClient (34-226)
hugegraph-llm/src/hugegraph_llm/models/llms/litellm.py (1)
  • LiteLLMClient (34-191)
hugegraph-llm/src/hugegraph_llm/state/ai_state.py (2)
vermeer-python-client/src/pyvermeer/structure/task_data.py (1)
  • graph_name (102-104)
hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/schema_manager.py (2)
  • simple_schema (39-64)
  • simple_schema (101-126)
hugegraph-llm/src/hugegraph_llm/operators/scheduler.py (7)
hugegraph-llm/src/hugegraph_llm/operators/common_op/check_schema.py (4)
  • CheckSchemaNode (179-332)
  • init (183-184)
  • run (45-60)
  • run (192-210)
hugegraph-llm/src/hugegraph_llm/operators/document_op/chunk_split.py (4)
  • ChunkSplitNode (33-86)
  • init (34-35)
  • run (74-86)
  • run (122-131)
hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/schema_manager.py (4)
  • SchemaManagerNode (79-147)
  • init (83-84)
  • run (66-76)
  • run (128-141)
hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py (4)
  • BuildVectorIndexNode (38-77)
  • init (42-43)
  • run (59-77)
  • run (94-105)
hugegraph-llm/src/hugegraph_llm/state/ai_state.py (3)
  • WkFlowState (45-87)
  • WkFlowInput (21-42)
  • to_json (75-87)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py (4)
  • InfoExtractNode (217-353)
  • init (221-222)
  • run (166-190)
  • run (294-330)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py (4)
  • PropertyGraphExtractNode (186-299)
  • init (190-192)
  • run (96-121)
  • run (201-237)
hugegraph-llm/src/hugegraph_llm/utils/graph_index_utils.py (3)
hugegraph-llm/src/hugegraph_llm/operators/scheduler.py (3)
  • SchedulerSingleton (218-225)
  • get_instance (222-225)
  • schedule_flow (61-90)
hugegraph-llm/src/hugegraph_llm/operators/kg_construction_task.py (5)
  • KgBuilder (39-113)
  • fetch_graph_data (58-60)
  • run (106-109)
  • chunk_split (62-69)
  • extract_info (71-78)
hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py (1)
  • read_documents (34-60)
hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py (4)
hugegraph-llm/src/hugegraph_llm/utils/embedding_utils.py (3)
  • get_embeddings_parallel (33-73)
  • get_filename_prefix (76-83)
  • get_index_folder_name (86-91)
hugegraph-llm/src/hugegraph_llm/operators/util.py (2)
  • get_embedding (32-52)
  • init_context (26-29)
hugegraph-llm/src/hugegraph_llm/state/ai_state.py (2)
  • WkFlowInput (21-42)
  • WkFlowState (45-87)
hugegraph-llm/src/hugegraph_llm/indices/vector_index.py (4)
  • VectorIndex (32-153)
  • from_index_file (40-80)
  • add (95-102)
  • to_index_file (82-93)
hugegraph-llm/src/hugegraph_llm/operators/document_op/chunk_split.py (3)
hugegraph-llm/src/hugegraph_llm/operators/util.py (1)
  • init_context (26-29)
hugegraph-llm/src/hugegraph_llm/operators/common_op/check_schema.py (4)
  • init (183-184)
  • node_init (186-190)
  • run (45-60)
  • run (192-210)
hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py (4)
  • init (42-43)
  • node_init (45-57)
  • run (59-77)
  • run (94-105)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py (3)
hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2)
  • BaseLLM (22-74)
  • generate (26-31)
hugegraph-llm/src/hugegraph_llm/operators/util.py (2)
  • get_chat_llm (55-76)
  • init_context (26-29)
hugegraph-llm/src/hugegraph_llm/state/ai_state.py (2)
  • WkFlowState (45-87)
  • WkFlowInput (21-42)
hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/schema_manager.py (6)
hugegraph-llm/src/hugegraph_llm/operators/util.py (1)
  • init_context (26-29)
hugegraph-llm/src/hugegraph_llm/state/ai_state.py (3)
  • WkFlowInput (21-42)
  • WkFlowState (45-87)
  • to_json (75-87)
hugegraph-python-client/src/pyhugegraph/client.py (3)
  • PyHugeClient (48-101)
  • schema (61-62)
  • graph (69-70)
hugegraph-llm/src/hugegraph_llm/operators/common_op/check_schema.py (5)
  • init (183-184)
  • node_init (186-190)
  • run (45-60)
  • run (192-210)
  • get_result (328-332)
hugegraph-llm/src/hugegraph_llm/demo/rag_demo/text2gremlin_block.py (1)
  • simple_schema (209-226)
hugegraph-python-client/src/pyhugegraph/api/schema.py (1)
  • getSchema (67-70)
hugegraph-llm/src/hugegraph_llm/operators/common_op/check_schema.py (2)
hugegraph-llm/src/hugegraph_llm/operators/util.py (1)
  • init_context (26-29)
hugegraph-llm/src/hugegraph_llm/state/ai_state.py (3)
  • WkFlowInput (21-42)
  • WkFlowState (45-87)
  • to_json (75-87)
hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py (5)
hugegraph-llm/src/hugegraph_llm/operators/scheduler.py (3)
  • SchedulerSingleton (218-225)
  • get_instance (222-225)
  • schedule_flow (61-90)
hugegraph-llm/src/hugegraph_llm/utils/embedding_utils.py (2)
  • get_filename_prefix (76-83)
  • get_index_folder_name (86-91)
hugegraph-llm/src/hugegraph_llm/models/embeddings/init_embedding.py (2)
  • Embeddings (25-49)
  • get_embedding (29-49)
hugegraph-llm/src/hugegraph_llm/indices/vector_index.py (3)
  • VectorIndex (32-153)
  • from_index_file (40-80)
  • clean (136-153)
hugegraph-llm/src/hugegraph_llm/operators/kg_construction_task.py (1)
  • build_vector_index (92-94)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py (4)
hugegraph-llm/src/hugegraph_llm/document/chunk_split.py (1)
  • ChunkSplitter (23-56)
hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2)
  • BaseLLM (22-74)
  • generate (26-31)
hugegraph-llm/src/hugegraph_llm/operators/util.py (2)
  • get_chat_llm (55-76)
  • init_context (26-29)
hugegraph-llm/src/hugegraph_llm/state/ai_state.py (2)
  • WkFlowInput (21-42)
  • WkFlowState (45-87)
🪛 GitHub Actions: Pylint
hugegraph-llm/src/hugegraph_llm/operators/util.py

[error] 16-16: E0611: No name 'CStatus' in module 'PyCGraph' (no-name-in-module)

hugegraph-llm/src/hugegraph_llm/state/ai_state.py

[error] 16-16: E0611: No name 'GParam' in module 'PyCGraph' (no-name-in-module)


[error] 16-16: E0611: No name 'CStatus' in module 'PyCGraph' (no-name-in-module)


[warning] 32-32: R1711: Useless return at end of function or method (useless-return)


[warning] 18-18: C0411: standard import "from typing import Union, List" should be placed before "from PyCGraph import GParam, CStatus" (wrong-import-order)

hugegraph-llm/src/hugegraph_llm/operators/scheduler.py

[error] 18-18: E0611: No name 'GPipeline' in module 'PyCGraph' (no-name-in-module)


[error] 18-18: E0611: No name 'GPipelineManager' in module 'PyCGraph' (no-name-in-module)


[warning] 69-69: R1705: Unnecessary else after return (no-else-return)


[warning] 98-98: R1705: Unnecessary elif after return (no-else-return)


[warning] 107-107: R1711: Useless return at end of function or method (useless-return)


[warning] 134-134: R1710: Either all return statements in a function should return an expression, or none of them should. (inconsistent-return-statements)


[warning] 177-177: W0707: Consider re-raising with from to preserve exception (raise-missing-from)


[error] 177-177: E0702: Raising a string is not allowed (raising-bad-type)


[warning] 26-26: C0411: Standard import issue (wrong-import-order)


[warning] 30-30: C0411: Import order issue (wrong-import-order)


[warning] 31-31: C0411: Import order issue (wrong-import-order)

hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py

[error] 35-35: E0611: No name 'GNode' in module 'PyCGraph' (no-name-in-module)


[error] 35-35: E0611: No name 'CStatus' in module 'PyCGraph' (no-name-in-module)


[warning] 46-46: W0201: Attribute 'embedding' defined outside __init__ (attribute-defined-outside-init)


[warning] 47-47: W0201: Attribute 'folder_name' defined outside __init__ (attribute-defined-outside-init)


[warning] 50-50: W0201: Attribute 'index_dir' defined outside __init__ (attribute-defined-outside-init)


[warning] 51-51: W0201: Attribute 'filename_prefix' defined outside __init__ (attribute-defined-outside-init)


[warning] 54-54: W0201: Attribute 'vector_index' defined outside __init__ (attribute-defined-outside-init)

hugegraph-llm/src/hugegraph_llm/operators/document_op/chunk_split.py

[error] 23-23: E0611: No name 'GNode' in module 'PyCGraph' (no-name-in-module)


[error] 23-23: E0611: No name 'CStatus' in module 'PyCGraph' (no-name-in-module)


[warning] 49-49: W0201: Attribute 'texts' defined outside __init__ (attribute-defined-outside-init)


[warning] 50-50: W0201: Attribute 'separators' defined outside __init__ (attribute-defined-outside-init)


[warning] 51-51: W0201: Attribute 'text_splitter' defined outside __init__ (attribute-defined-outside-init)

hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py

[error] 31-31: E0611: No name 'GNode' in module 'PyCGraph' (no-name-in-module)


[error] 31-31: E0611: No name 'CStatus' in module 'PyCGraph' (no-name-in-module)


[warning] 191-191: W0201: Attribute 'NECESSARY_ITEM_KEYS' defined outside __init__ (attribute-defined-outside-init)


[warning] 195-195: W0201: Attribute 'llm' defined outside __init__ (attribute-defined-outside-init)


[warning] 198-198: W0201: Attribute 'example_prompt' defined outside __init__ (attribute-defined-outside-init)

hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/schema_manager.py

[error] 24-24: E0611: No name 'GNode' in module 'PyCGraph' (no-name-in-module)


[error] 24-24: E0611: No name 'CStatus' in module 'PyCGraph' (no-name-in-module)


[warning] 90-90: W0201: Attribute 'graph_name' defined outside __init__ (attribute-defined-outside-init)


[warning] 91-91: W0201: Attribute 'client' defined outside __init__ (attribute-defined-outside-init)


[warning] 98-98: W0201: Attribute 'schema' defined outside __init__ (attribute-defined-outside-init)

hugegraph-llm/src/hugegraph_llm/operators/common_op/check_schema.py

[error] 26-26: E0611: No name 'GNode' in module 'PyCGraph' (no-name-in-module)


[error] 26-26: E0611: No name 'CStatus' in module 'PyCGraph' (no-name-in-module)


[warning] 189-189: W0201: Attribute 'data' defined outside __init__ (attribute-defined-outside-init)


[warning] 27-27: C0412: Imports from package hugegraph_llm are not grouped (ungrouped-imports)

hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py

[error] 28-28: E0611: No name 'GNode' in module 'PyCGraph' (no-name-in-module)


[error] 28-28: E0611: No name 'CStatus' in module 'PyCGraph' (no-name-in-module)


[warning] 225-225: W0201: Attribute 'llm' defined outside __init__ (attribute-defined-outside-init)


[warning] 228-228: W0201: Attribute 'example_prompt' defined outside __init__ (attribute-defined-outside-init)

🔇 Additional comments (13)
hugegraph-llm/pyproject.toml (1)

61-62: Make pycgraph a PEP 508 direct dependency with a platform marker and a pinned version

Location: hugegraph-llm/pyproject.toml (dependencies section, around lines 61–62)

  • Rationale: a bare "pycgraph" fails to resolve/install when the uv source is not configured or on non-Linux platforms, and it conflicts with the repo's Linux-only source mapping, hurting portability and reproducibility.
  • Suggestion: switch to a PEP 508 direct reference, add a platform marker, and pin to an immutable commit or tag.

Suggested change (replace <commit-sha> with an actual immutable commit):

-    "pycgraph",
+    "pycgraph @ git+https://github.com/ChunelFeng/CGraph.git@<commit-sha>#subdirectory=python; sys_platform == 'linux'",

Optional: if this is an on-demand feature dependency, move it into extras and use lazy/optional imports in the relevant modules to reduce the default install footprint.

hugegraph-llm/src/hugegraph_llm/state/ai_state.py (3)

59-73: setup consistency

Consistent with the type changes above; resetting fields to None in setup is already handled uniformly, so no extra change is needed.


75-87: to_json implementation is concise and effective

Only instance-level, non-None attributes are exported, matching the intended serialization semantics.
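A minimal sketch of that serialization pattern (the class and field names are illustrative, not the actual WkFlowState API): iterating `vars(self)` naturally skips class attributes and drops any field still set to None.

```python
class State:
    shared = "class-level"  # class attribute: never exported

    def __init__(self):
        self.vertices = ["v1"]
        self.schema = None  # None fields are skipped on export

    def to_json(self):
        # Export only instance-level attributes whose value is not None.
        return {k: v for k, v in vars(self).items() if v is not None}

s = State()
print(s.to_json())  # {'vertices': ['v1']}
```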


16-19: Fix the failing PyCGraph imports and the import order (typing before third-party imports)

Two imports from PyCGraph trip CI with E0611: hugegraph-llm/src/hugegraph_llm/state/ai_state.py:16 and hugegraph-llm/src/hugegraph_llm/operators/util.py:16. After confirming the actual installed package name (usually lowercase pycgraph on PyPI), change the imports to pycgraph and move the typing imports before the third-party imports; the CStatus import in operators/util.py must be updated in step.

hugegraph-llm/src/hugegraph_llm/operators/util.py (2)

26-30: init_context return value and semantics are OK

Initializing the context and input and returning CStatus() is consistent.


16-16: Confirm and unify the CStatus import (PyCGraph to pycgraph)

Line 16 of hugegraph-llm/src/hugegraph_llm/operators/util.py reads from PyCGraph import CStatus; no other pycgraph/PyCGraph references were found in the repo. Confirm the actual installed package name and unify the import; if the package name is lowercase pycgraph, replace it with:

-from PyCGraph import CStatus
+from pycgraph import CStatus  # noqa: E0611
hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py (2)

27-31: Introducing SchedulerSingleton correctly aligns with the new scheduling architecture

Consistent with the PR goals.
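For readers unfamiliar with the pattern, a minimal sketch of a thread-safe singleton accessor like SchedulerSingleton.get_instance() (the class name here is an illustrative stand-in, not the real implementation):

```python
import threading

class SchedulerSingletonSketch:
    """Illustrative stand-in for a SchedulerSingleton-style accessor."""
    _instance = None
    _lock = threading.Lock()

    @classmethod
    def get_instance(cls):
        # Double-checked locking: the fast path skips the lock once initialized.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls()
        return cls._instance

a = SchedulerSingletonSketch.get_instance()
b = SchedulerSingletonSketch.get_instance()
print(a is b)  # True
```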


72-91: Harden against a missing index (prevent AttributeError)

When the index file is missing or from_index_file returns an empty index object, index may be None, and accessing .d/.ntotal directly raises AttributeError; use getattr with defaults and compute the properties length safely.
Location: hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py lines 72-91

-    return json.dumps(
-        {
-            "embed_dim": chunk_vector_index.index.d,
-            "vector_info": {
-                "chunk_vector_num": chunk_vector_index.index.ntotal,
-                "graph_vid_vector_num": graph_vid_vector_index.index.ntotal,
-                "graph_properties_vector_num": len(chunk_vector_index.properties),
-            },
-        },
-        ensure_ascii=False,
-        indent=2,
-    )
+    chunk_idx = getattr(chunk_vector_index, "index", None)
+    graph_idx = getattr(graph_vid_vector_index, "index", None)
+    embed_dim = getattr(chunk_idx, "d", 0)
+    chunk_num = getattr(chunk_idx, "ntotal", 0)
+    graph_num = getattr(graph_idx, "ntotal", 0)
+    props_num = len(chunk_vector_index.properties) if getattr(chunk_vector_index, "properties", None) else 0
+    return json.dumps(
+        {
+            "embed_dim": embed_dim,
+            "vector_info": {
+                "chunk_vector_num": chunk_num,
+                "graph_vid_vector_num": graph_num,
+                "graph_properties_vector_num": props_num,
+            },
+        },
+        ensure_ascii=False,
+        indent=2,
+    )
hugegraph-llm/src/hugegraph_llm/operators/document_op/chunk_split.py (2)

27-28: The new language and split-type constants are good

Adding the LANGUAGE_EN and SPLIT_TYPE_DOCUMENT constants improves the code's readability and maintainability.


22-23: Verify at runtime that PyCGraph exports GNode and CStatus

Findings: hugegraph-llm/pyproject.toml declares pycgraph; the following files all import GNode/CStatus from PyCGraph (which may fail at import time): hugegraph-llm/src/hugegraph_llm/state/ai_state.py, hugegraph-llm/src/hugegraph_llm/operators/util.py, hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py, hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py, hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/schema_manager.py, hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py, hugegraph-llm/src/hugegraph_llm/operators/document_op/chunk_split.py, hugegraph-llm/src/hugegraph_llm/operators/common_op/check_schema.py.

Run the following in CI or the runtime environment to verify the package and its exported symbols: python -c "import PyCGraph; print('GNode' in dir(PyCGraph), 'CStatus' in dir(PyCGraph), PyCGraph.__file__)" && (pip show pycgraph || pip show PyCGraph)

hugegraph-llm/src/hugegraph_llm/utils/graph_index_utils.py (2)

25-25: The new scheduler integration path is well designed

Introducing SchedulerSingleton and adding the new extract_graph function delegates graph extraction to the scheduler system; this design supports workflow reuse and resource management.

Also applies to: 139-139


99-135: Renaming the original function preserves backward compatibility

Renaming the original extract_graph to extract_graph_origin is good practice: it keeps the original KgBuilder path available for comparison and fallback. Error handling and logging are also improved.

hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py (1)

38-77: BuildVectorIndexNode is well implemented

The new node class is well designed, correctly uses the context lock to protect shared state, and has solid error handling.


[tool.uv.sources]
hugegraph-python-client = { workspace = true }
pycgraph = { git = "https://github.com/ChunelFeng/CGraph.git", subdirectory = "python", rev = "main", marker = "sys_platform == 'linux'" }

💡 Verification agent

🧩 Analysis chain

The Linux-only pycgraph source under [tool.uv.sources] causes distribution inconsistency; remove it or align it with the dependency declaration above

  • If the PEP 508 direct-reference approach above is adopted, delete this line to avoid divergent resolution paths between uv and pip.
  • If the uv source is kept for now, at minimum add the same sys_platform == 'linux' marker to the "pycgraph" entry on the dependencies side, and pin rev from the unstable main to a specific commit/tag to guarantee reproducible builds and compliance audits.

Suggested: remove this mapping directly (together with the change above):

 [tool.uv.sources]
 hugegraph-python-client = { workspace = true }
-pycgraph = { git = "https://github.com/ChunelFeng/CGraph.git", subdirectory = "python", rev = "main", marker = "sys_platform == 'linux'"  }

Note: please confirm CGraph's license is compliant for ASF releases (it must appear in the NOTICE/third-party dependency list). I can help fetch the upstream repository's license and latest commit info automatically:


🏁 Script executed:

#!/bin/bash
# Fetch the upstream license and the recommended pinned commit
set -euo pipefail
curl -s https://api.github.com/repos/ChunelFeng/CGraph/license | jq '.license.spdx_id, .html_url'
curl -s "https://api.github.com/repos/ChunelFeng/CGraph/commits?sha=main&per_page=1" | jq -r '.[0].sha, .[0].commit.message'

Length of output: 207


Remove or align the Linux-only pycgraph source and pin rev (required)

The Linux-only git source for pycgraph in tool.uv.sources makes uv and pip/PEP 508 resolution diverge. Prefer removing the line from hugegraph-llm/pyproject.toml; if it is kept, the dependencies entry for "pycgraph" must carry the same marker and rev must be pinned to the specific commit 04fc967f58622be04f2e9bcb1d7b8b7ac27e1d8c (verified). CGraph's license is MIT (https://github.com/ChunelFeng/CGraph/blob/main/LICENSE) and must be listed in the NOTICE/third-party dependency list.

Location: hugegraph-llm/pyproject.toml (around line 89)

 [tool.uv.sources]
 hugegraph-python-client = { workspace = true }
-pycgraph = { git = "https://github.com/ChunelFeng/CGraph.git", subdirectory = "python", rev = "main", marker = "sys_platform == 'linux'"  }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
pycgraph = { git = "https://github.com/ChunelFeng/CGraph.git", subdirectory = "python", rev = "main", marker = "sys_platform == 'linux'" }
[tool.uv.sources]
hugegraph-python-client = { workspace = true }
🤖 Prompt for AI Agents
In hugegraph-llm/pyproject.toml around line 89, the tool.uv.sources entry for
pycgraph is a Linux-only git source causing resolver inconsistency; either
remove that pycgraph line from tool.uv.sources entirely, or if you must keep it
add an identical marker to the dependencies entry for "pycgraph" and change rev
to the fixed commit 04fc967f58622be04f2e9bcb1d7b8b7ac27e1d8c (pin exact commit),
and also add CGraph (MIT) to NOTICE/third-party dependencies list per licensing
requirements.

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 11

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py (1)

95-106: Must fix: several synchronous paths call asyncio.run directly (raises RuntimeError when an event loop is already running)

Scan results (locations to change):

  • hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py: line 63
  • hugegraph-llm/src/hugegraph_llm/operators/index_op/build_gremlin_example_index.py: line 42
  • hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py: lines 74 and 102
  • hugegraph-llm/src/hugegraph_llm/operators/index_op/gremlin_example_index_query.py: line 67
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/answer_synthesize.py: line 74

Suggested fixes (pick one):

  • Preferred: make the call chain async (async def / await) and let the caller own the event loop.
  • Alternative: keep the sync interface and add a single "safe sync wrapper" (run_coroutine_sync) that calls asyncio.run when no loop is running, and otherwise waits on a dedicated event-loop thread via asyncio.run_coroutine_threadsafe (avoid calling asyncio.run at each site, and avoid nest_asyncio).

Do not repeat direct asyncio.run calls across call sites; fix them uniformly or extract a shared utility.
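One possible shape of the run_coroutine_sync wrapper suggested above (this variant runs a throwaway loop in a helper thread instead of keeping a long-lived loop thread; the embed coroutine is a hypothetical example, not a real API):

```python
import asyncio
import concurrent.futures

def run_coroutine_sync(coro):
    """Run a coroutine from sync code, whether or not an event loop is already running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop running in this thread: asyncio.run is safe.
        return asyncio.run(coro)
    # A loop is already running (e.g. called from async framework code):
    # run the coroutine on a fresh loop in a helper thread and block for the result.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()

async def embed(text):
    await asyncio.sleep(0)
    return len(text)

print(run_coroutine_sync(embed("chunk")))  # 5
```

Blocking inside a running loop still stalls that loop while waiting, so the async-first refactor remains the better long-term fix.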

hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py (1)

110-112: Fix: never raise strings in the scheduler; raise explicit Exception types instead

hugegraph-llm/src/hugegraph_llm/operators/scheduler.py contains raise "…" statements (lines 60, 71, 74), which raise TypeError in Python 3. Replace the string raises with explicit exceptions that carry context.

Suggested change (example diff):

- if flow not in self.pipeline_pool:
-     raise "Unsupported workflow"
+ if flow not in self.pipeline_pool:
+     raise ValueError(f"Unsupported workflow: {flow}")
@@
- if status.isErr():
-     raise "Error in flow init"
+ if status.isErr():
+     raise RuntimeError(f"Error in flow init: {status.getInfo()}")
@@
- if status.isErr():
-     raise "Error in flow execution"
+ if status.isErr():
+     raise RuntimeError(f"Error in flow execution: {status.getInfo()}")

Location: hugegraph-llm/src/hugegraph_llm/operators/scheduler.py (lines ~58–76; the exact issues are at lines 60, 71, and 74).

♻️ Duplicate comments (6)
hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (3)

49-70: Same as above: rename the parameter and update references; also switch to ValueError

-def get_extract_llm(llm_settings: LLMConfig):
-    if llm_settings.extract_llm_type == "openai":
+def get_extract_llm(config: LLMConfig):
+    if config.extract_llm_type == "openai":
         return OpenAIClient(
-            api_key=llm_settings.openai_extract_api_key,
-            api_base=llm_settings.openai_extract_api_base,
-            model_name=llm_settings.openai_extract_language_model,
-            max_tokens=llm_settings.openai_extract_tokens,
+            api_key=config.openai_extract_api_key,
+            api_base=config.openai_extract_api_base,
+            model_name=config.openai_extract_language_model,
+            max_tokens=config.openai_extract_tokens,
         )
-    if llm_settings.extract_llm_type == "ollama/local":
+    if config.extract_llm_type == "ollama/local":
         return OllamaClient(
-            model=llm_settings.ollama_extract_language_model,
-            host=llm_settings.ollama_extract_host,
-            port=llm_settings.ollama_extract_port,
+            model=config.ollama_extract_language_model,
+            host=config.ollama_extract_host,
+            port=config.ollama_extract_port,
         )
-    if llm_settings.extract_llm_type == "litellm":
+    if config.extract_llm_type == "litellm":
         return LiteLLMClient(
-            api_key=llm_settings.litellm_extract_api_key,
-            api_base=llm_settings.litellm_extract_api_base,
-            model_name=llm_settings.litellm_extract_language_model,
-            max_tokens=llm_settings.litellm_extract_tokens,
+            api_key=config.litellm_extract_api_key,
+            api_base=config.litellm_extract_api_base,
+            model_name=config.litellm_extract_language_model,
+            max_tokens=config.litellm_extract_tokens,
         )
-    raise Exception("extract llm type is not supported !")
+    raise ValueError(f"Unsupported extract llm type: {config.extract_llm_type}")

73-94: Same as above: rename the parameter and update references; also switch to ValueError

-def get_text2gql_llm(llm_settings: LLMConfig):
-    if llm_settings.text2gql_llm_type == "openai":
+def get_text2gql_llm(config: LLMConfig):
+    if config.text2gql_llm_type == "openai":
         return OpenAIClient(
-            api_key=llm_settings.openai_text2gql_api_key,
-            api_base=llm_settings.openai_text2gql_api_base,
-            model_name=llm_settings.openai_text2gql_language_model,
-            max_tokens=llm_settings.openai_text2gql_tokens,
+            api_key=config.openai_text2gql_api_key,
+            api_base=config.openai_text2gql_api_base,
+            model_name=config.openai_text2gql_language_model,
+            max_tokens=config.openai_text2gql_tokens,
         )
-    if llm_settings.text2gql_llm_type == "ollama/local":
+    if config.text2gql_llm_type == "ollama/local":
         return OllamaClient(
-            model=llm_settings.ollama_text2gql_language_model,
-            host=llm_settings.ollama_text2gql_host,
-            port=llm_settings.ollama_text2gql_port,
+            model=config.ollama_text2gql_language_model,
+            host=config.ollama_text2gql_host,
+            port=config.ollama_text2gql_port,
         )
-    if llm_settings.text2gql_llm_type == "litellm":
+    if config.text2gql_llm_type == "litellm":
         return LiteLLMClient(
-            api_key=llm_settings.litellm_text2gql_api_key,
-            api_base=llm_settings.litellm_text2gql_api_base,
-            model_name=llm_settings.litellm_text2gql_language_model,
-            max_tokens=llm_settings.litellm_text2gql_tokens,
+            api_key=config.litellm_text2gql_api_key,
+            api_base=config.litellm_text2gql_api_base,
+            model_name=config.litellm_text2gql_language_model,
+            max_tokens=config.litellm_text2gql_tokens,
         )
-    raise Exception("text2gql llm type is not supported !")
+    raise ValueError(f"Unsupported text2gql llm type: {config.text2gql_llm_type}")

124-124: Replace the generic Exception with a more semantic ValueError that includes the offending type value

Consistent with the earlier review; makes it easier for callers to handle and surface configuration errors.

-        raise Exception("chat llm type is not supported !")
+        raise ValueError(f"Unsupported chat llm type: {self.chat_llm_type}")
@@
-        raise Exception("extract llm type is not supported !")
+        raise ValueError(f"Unsupported extract llm type: {self.extract_llm_type}")
@@
-        raise Exception("text2gql llm type is not supported !")
+        raise ValueError(f"Unsupported text2gql llm type: {self.text2gql_llm_type}")

Also applies to: 147-147, 170-170

hugegraph-llm/src/hugegraph_llm/models/embeddings/init_embedding.py (1)

52-52: Replace the generic Exception with ValueError carrying the specific type value

Consistent with the existing review; helps callers handle configuration errors.

-    raise Exception("embedding type is not supported !")
+    raise ValueError(f"Unsupported embedding type: {config.embedding_type}")
hugegraph-llm/src/hugegraph_llm/operators/document_op/chunk_split.py (1)

120-120: Inconsistent exception message: drop the unsupported html/markdown (repeated feedback)

The legacy implementation still says "html or markdown", which no longer matches current capabilities.

Suggested change:

-        raise ValueError("Type must be paragraph, sentence, html or markdown")
+        raise ValueError("Type must be document, paragraph or sentence")
hugegraph-llm/src/hugegraph_llm/operators/scheduler.py (1)

58-76: Exception handling: never raise strings; unify exception types and messages

Several raise "..." statements trigger CI error E0702, and the return/exception style is inconsistent. Use concrete exceptions (ValueError/RuntimeError) that include the status info.

     def schedule_flow(self, flow: str, *args, **kwargs):
         if flow not in self.pipeline_pool:
-            raise "Unsupported workflow"
+            raise ValueError(f"Unsupported workflow: {flow}")
         manager = self.pipeline_pool[flow]["manager"]
         flow_func = self.pipeline_pool[flow]["flow_func"]
         prepare_func = self.pipeline_pool[flow]["prepare_func"]
         post_func = self.pipeline_pool[flow]["post_func"]
         pipeline = manager.fetch()
         if pipeline is None:
             # call coresponding flow_func to create new workflow
             pipeline = flow_func(*args, **kwargs)
             status = pipeline.init()
             if status.isErr():
-                raise "Error in flow init"
+                raise RuntimeError(f"Error in flow init: {status.getInfo()}")
             status = pipeline.run()
             if status.isErr():
-                raise "Error in flow execution"
+                raise RuntimeError(f"Error in flow execution: {status.getInfo()}")
             res = post_func(pipeline)
             manager.add(pipeline)
             return res
🧹 Nitpick comments (18)
hugegraph-llm/src/hugegraph_llm/operators/util.py (2)

19-22: Add minimal docs and type annotations for readability

Annotate obj and add a short docstring so callers understand the input contract (obj must provide getGParamWithNoEmpty).

-def init_context(obj) -> CStatus:
-    obj.context = obj.getGParamWithNoEmpty("wkflow_state")
-    obj.wk_input = obj.getGParamWithNoEmpty("wkflow_input")
-    return CStatus()
+def init_context(obj) -> CStatus:
+    """从 GParam 初始化工作流上下文与输入;要求 obj 提供 getGParamWithNoEmpty。"""
+    obj.context = obj.getGParamWithNoEmpty("wkflow_state")
+    obj.wk_input = obj.getGParamWithNoEmpty("wkflow_input")
+    return CStatus()

16-16: Confirm the PyCGraph dependency and unify imports behind a wrapper

Many places import directly from PyCGraph; first confirm the actual package name and install source (PyPI / private package / local module). Consider routing the import through a single adapter module to centralize fault tolerance and future replacement.
Occurrences (examples):

  • hugegraph-llm/src/hugegraph_llm/state/ai_state.py: from PyCGraph import GParam, CStatus (line 16)
  • hugegraph-llm/src/hugegraph_llm/operators/scheduler.py: from PyCGraph import GPipeline, GPipelineManager (line 19)
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py: from PyCGraph import GNode, CStatus (line 29)
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py: from PyCGraph import GNode, CStatus (line 32)
  • hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/schema_manager.py: from PyCGraph import GNode, CStatus (line 24)
  • hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py: from PyCGraph import GNode, CStatus (line 36)
  • hugegraph-llm/src/hugegraph_llm/operators/document_op/chunk_split.py: from PyCGraph import GNode, CStatus (line 23)
  • hugegraph-llm/src/hugegraph_llm/operators/common_op/check_schema.py: from PyCGraph import GNode, CStatus (line 26)
  • hugegraph-llm/src/hugegraph_llm/operators/util.py: from PyCGraph import CStatus (line 16)

Suggestion: implement a unified adapter module (e.g. hugegraph_llm.external.pycgraph) that handles import compatibility/exceptions in one place, and declare the dependency source explicitly in requirements.txt or pyproject.toml.

hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (2)

25-46: Avoid shadowing the global llm_settings: rename the parameter to fix Pylint W0621

Rename the factory function's parameter from llm_settings to config and update the internal references accordingly.

-def get_chat_llm(llm_settings: LLMConfig):
-    if llm_settings.chat_llm_type == "openai":
+def get_chat_llm(config: LLMConfig):
+    if config.chat_llm_type == "openai":
         return OpenAIClient(
-            api_key=llm_settings.openai_chat_api_key,
-            api_base=llm_settings.openai_chat_api_base,
-            model_name=llm_settings.openai_chat_language_model,
-            max_tokens=llm_settings.openai_chat_tokens,
+            api_key=config.openai_chat_api_key,
+            api_base=config.openai_chat_api_base,
+            model_name=config.openai_chat_language_model,
+            max_tokens=config.openai_chat_tokens,
         )
-    if llm_settings.chat_llm_type == "ollama/local":
+    if config.chat_llm_type == "ollama/local":
         return OllamaClient(
-            model=llm_settings.ollama_chat_language_model,
-            host=llm_settings.ollama_chat_host,
-            port=llm_settings.ollama_chat_port,
+            model=config.ollama_chat_language_model,
+            host=config.ollama_chat_host,
+            port=config.ollama_chat_port,
         )
-    if llm_settings.chat_llm_type == "litellm":
+    if config.chat_llm_type == "litellm":
         return LiteLLMClient(
-            api_key=llm_settings.litellm_chat_api_key,
-            api_base=llm_settings.litellm_chat_api_base,
-            model_name=llm_settings.litellm_chat_language_model,
-            max_tokens=llm_settings.litellm_chat_tokens,
+            api_key=config.litellm_chat_api_key,
+            api_base=config.litellm_chat_api_base,
+            model_name=config.litellm_chat_language_model,
+            max_tokens=config.litellm_chat_tokens,
         )
-    raise Exception("chat llm type is not supported !")
+    raise ValueError(f"Unsupported chat llm type: {config.chat_llm_type}")

33-38: The Ollama branch lacks required-field validation (model may be None)

An empty ollama_*_language_model silently builds an invalid client; validate it explicitly and raise a configuration error.

-    if config.chat_llm_type == "ollama/local":
+    if config.chat_llm_type == "ollama/local":
+        if not config.ollama_chat_language_model:
+            raise ValueError("ollama chat model is required (ollama_chat_language_model).")
         return OllamaClient(
             model=config.ollama_chat_language_model,
             host=config.ollama_chat_host,
             port=config.ollama_chat_port,
         )

(The extract/text2gql branches can get the same check.)

hugegraph-llm/src/hugegraph_llm/models/embeddings/init_embedding.py (1)

32-51: Avoid shadowing the global llm_settings: rename the parameter and fix references; also add a return type annotation

-def get_embedding(llm_settings: LLMConfig):
-    if llm_settings.embedding_type == "openai":
+from typing import Union
+def get_embedding(config: LLMConfig) -> Union[OpenAIEmbedding, OllamaEmbedding, LiteLLMEmbedding]:
+    if config.embedding_type == "openai":
         return OpenAIEmbedding(
-            model_name=llm_settings.openai_embedding_model,
-            api_key=llm_settings.openai_embedding_api_key,
-            api_base=llm_settings.openai_embedding_api_base,
+            model_name=config.openai_embedding_model,
+            api_key=config.openai_embedding_api_key,
+            api_base=config.openai_embedding_api_base,
         )
-    if llm_settings.embedding_type == "ollama/local":
+    if config.embedding_type == "ollama/local":
         return OllamaEmbedding(
-            model_name=llm_settings.ollama_embedding_model,
-            host=llm_settings.ollama_embedding_host,
-            port=llm_settings.ollama_embedding_port,
+            model_name=config.ollama_embedding_model,
+            host=config.ollama_embedding_host,
+            port=config.ollama_embedding_port,
         )
-    if llm_settings.embedding_type == "litellm":
+    if config.embedding_type == "litellm":
         return LiteLLMEmbedding(
-            model_name=llm_settings.litellm_embedding_model,
-            api_key=llm_settings.litellm_embedding_api_key,
-            api_base=llm_settings.litellm_embedding_api_base,
+            model_name=config.litellm_embedding_model,
+            api_key=config.litellm_embedding_api_key,
+            api_base=config.litellm_embedding_api_base,
         )
hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py (1)

53-55: Unify the tone and punctuation of user-facing messages

Drop the exclamation mark and end with a period, consistent with the other error messages.

-                raise gr.Error(
-                    "PDF will be supported later! Try to upload text/docx now"
-                )
+                raise gr.Error("PDF will be supported later. Please upload txt/docx for now.")
hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py (2)

39-58: Fix W0201: declare and initialize the attributes in __init__

Declare the attributes that are assigned at runtime up front; this helps static analysis and readability.

-class BuildVectorIndexNode(GNode):
-    context: WkFlowState = None
-    wk_input: WkFlowInput = None
+class BuildVectorIndexNode(GNode):
+    context: WkFlowState = None
+    wk_input: WkFlowInput = None
+    def __init__(self) -> None:
+        super().__init__()
+        self.embedding = None
+        self.folder_name = None
+        self.index_dir = None
+        self.filename_prefix = None
+        self.vector_index = None

76-78: Minor: a plain truthiness check is simpler

Functionally equivalent and reads more naturally.

-        if len(chunks_embedding) > 0:
+        if chunks_embedding:
             self.vector_index.add(chunks_embedding, chunks)
             self.vector_index.to_index_file(self.index_dir, self.filename_prefix)
hugegraph-llm/src/hugegraph_llm/operators/document_op/chunk_split.py (1)

33-53: Eliminate the attribute-defined-outside-__init__ warning (W0201)

Initialize the node members explicitly to avoid the risk of unassigned attributes at runtime and to satisfy Pylint.

Apply the following patch:

 class ChunkSplitNode(GNode):
+    def __init__(self):
+        super().__init__()
+        self.texts: List[str] | None = None
+        self.separators: List[str] | None = None
+        self.text_splitter = None
     def init(self):
         return init_context(self)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py (3)

225-231: Failing hard on an optional example prompt hurts usability

Make example_prompt optional: when it is absent, log a warning and continue.

 def node_init(self):
     self.llm = get_chat_llm(llm_settings)
-    if self.wk_input.example_prompt is None:
-        return CStatus(-1, "Error occurs when prepare for workflow input")
-    self.example_prompt = self.wk_input.example_prompt
-    return CStatus()
+    if self.wk_input.example_prompt is None:
+        log.warning("example_prompt is None; continue without prefix.")
+        self.example_prompt = None
+        return CStatus()
+    self.example_prompt = self.wk_input.example_prompt
+    return CStatus()

314-319: Log message still says "Legacy", which no longer matches the Node implementation

Drop the "[Legacy]" tag everywhere to avoid confusion.

-            log.debug(
-                "[Legacy] %s input: %s \n output:%s",
-                self.__class__.__name__,
-                sentence,
-                proceeded_chunk,
-            )
+            log.debug("%s input: %s \n output:%s", self.__class__.__name__, sentence, proceeded_chunk)

218-221: Add explicit initialization for the Pylint warning

Eliminates W0201 (llm/example_prompt defined outside __init__).

 class InfoExtractNode(GNode):
     context: WkFlowState = None
     wk_input: WkFlowInput = None
+
+    def __init__(self):
+        super().__init__()
+        self.llm: Optional[BaseLLM] = None
+        self.example_prompt: Optional[str] = None
hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py (3)

132-139: The JSON-extraction regex is too greedy and can span multiple text segments

Non-greedy matching is safer; alternatively, implement brace-pair parsing.

-        json_match = re.search(r"({.*})", text, re.DOTALL)
+        json_match = re.search(r"({.*?})", text, re.DOTALL)

The change above must be applied at both implementation sites.

Also applies to: 248-255
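A minimal sketch of the brace-pair parsing alternative mentioned above (function name is illustrative; it does not handle braces inside JSON string literals, which a full parser would):

```python
import json

def extract_first_json(text):
    """Return the first balanced {...} object in text as a dict, or None."""
    start = text.find("{")
    while start != -1:
        depth = 0
        for i in range(start, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start:i + 1])
                    except json.JSONDecodeError:
                        break  # balanced but not valid JSON; try the next '{'
        start = text.find("{", start + 1)
    return None

reply = 'Sure! {"vertices": [], "edges": []} Hope that helps. {"x": 1}'
print(extract_first_json(reply))  # {'vertices': [], 'edges': []}
```

Unlike `re.search(r"({.*})", text, re.DOTALL)`, this cannot capture text between two unrelated objects.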


195-201: The example prompt should not be a hard requirement

Keep consistent with InfoExtractNode: when it is absent, continue and log a warning.

 def node_init(self):
     self.llm = get_chat_llm(llm_settings)
-    if self.wk_input.example_prompt is None:
-        return CStatus(-1, "Error occurs when prepare for workflow input")
-    self.example_prompt = self.wk_input.example_prompt
-    return CStatus()
+    if self.wk_input.example_prompt is None:
+        log.warning("example_prompt is None; continue without prefix.")
+        self.example_prompt = None
+        return CStatus()
+    self.example_prompt = self.wk_input.example_prompt
+    return CStatus()

187-194: Initialize the members explicitly for Pylint

Eliminates W0201 (llm/example_prompt).

 class PropertyGraphExtractNode(GNode):
     context: WkFlowState = None
     wk_input: WkFlowInput = None
 
+    def __init__(self):
+        super().__init__()
+        self.llm: Optional[BaseLLM] = None
+        self.example_prompt: Optional[str] = None
     def init(self):
         self.NECESSARY_ITEM_KEYS = {"label", "type", "properties"}  # pylint: disable=invalid-name
         return init_context(self)
hugegraph-llm/src/hugegraph_llm/operators/scheduler.py (3)

82-88: no-else-return readability improvement

Drop the pointless else block to simplify the control flow.

-        else:
-            # fetch pipeline & prepare input for flow
-            prepared_input = pipeline.getGParamWithNoEmpty("wkflow_input")
-            prepare_func(prepared_input, *args, **kwargs)
-            status = pipeline.run()
-            if status.isErr():
-                raise f"Error in flow execution {status.getInfo()}"
-            res = post_func(pipeline)
-            manager.release(pipeline)
-            return res
+        # fetch pipeline & prepare input for flow
+        prepared_input = pipeline.getGParamWithNoEmpty("wkflow_input")
+        prepare_func(prepared_input, *args, **kwargs)
+        status = pipeline.run()
+        if status.isErr():
+            raise RuntimeError(f"Error in flow execution: {status.getInfo()}")
+        res = post_func(pipeline)
+        manager.release(pipeline)
+        return res

89-104: Unify _import_schema style and exception types

Remove the redundant elif after return; raise explicit exception types.

     def _import_schema(
         self,
         from_hugegraph=None,
         from_extraction=None,
         from_user_defined=None,
     ):
         if from_hugegraph:
             return SchemaManagerNode()
-        elif from_user_defined:
+        if from_user_defined:
             return CheckSchemaNode()
-        elif from_extraction:
-            raise NotImplementedError("Not implemented yet")
-        else:
-            raise ValueError("No input data / invalid schema type")
+        if from_extraction:
+            raise NotImplementedError("Schema import from_extraction is not implemented")
+        raise ValueError("No input data / invalid schema type")

33-53: Unused max_pipeline and concurrency safety

  • max_pipeline is never used; consider using it to cap the pool size, or remove the parameter.
  • Concurrent access to pipeline_pool/manager is not explicitly locked; please confirm whether GPipelineManager is thread-safe internally. If not, lock at the schedule_flow entry point or use fine-grained per-flow locks.

Should I draft a concurrency-safe version based on per-flow RLocks?
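A minimal sketch of that per-flow RLock idea (all names here — FlowLockRegistry and the toy schedule_flow body — are hypothetical, not the PR's actual API):

```python
import threading


class FlowLockRegistry:
    """Hands out one reentrant lock per flow name.

    The registry itself is guarded by a plain mutex so that two threads
    asking for the same (not-yet-seen) flow name never get different locks.
    """

    def __init__(self):
        self._guard = threading.Lock()
        self._locks = {}

    def lock_for(self, flow_name):
        with self._guard:
            return self._locks.setdefault(flow_name, threading.RLock())


registry = FlowLockRegistry()
counters = {"build_vector_index": 0}


def schedule_flow(flow_name):
    # fetch / prepare / run / release would go here, serialized per flow;
    # unrelated flows never contend with each other
    with registry.lock_for(flow_name):
        counters[flow_name] += 1


threads = [threading.Thread(target=schedule_flow, args=("build_vector_index",))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counters["build_vector_index"])  # 8
```

Because the lock is per flow name, a long-running graph_extract run would not block a concurrent build_vector_index run, while two callers of the same flow are serialized.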

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 226d805 and b497d14.

📒 Files selected for processing (10)
  • hugegraph-llm/src/hugegraph_llm/models/embeddings/init_embedding.py (2 hunks)
  • hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (2 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/document_op/chunk_split.py (2 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py (6 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py (6 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/scheduler.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/util.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/state/ai_state.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • hugegraph-llm/src/hugegraph_llm/state/ai_state.py
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-06-25T09:45:10.751Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:100-116
Timestamp: 2025-06-25T09:45:10.751Z
Learning: In hugegraph-llm BasePromptConfig class, llm_settings is a runtime property that is loaded from config through dependency injection during object initialization, not a static class attribute. Static analysis tools may flag this as missing but it's intentional design.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/models/embeddings/init_embedding.py
  • hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py
📚 Learning: 2025-07-31T12:32:32.542Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#43
File: hugegraph-llm/src/hugegraph_llm/utils/embedding_utils.py:76-83
Timestamp: 2025-07-31T12:32:32.542Z
Learning: Embedding model names in HugeGraph-AI follow strict naming conventions and typically only contain letters (a-z, A-Z), numbers (0-9), hyphens (-), and periods (.) for version numbers. They do not contain other extraneous characters that would require extensive filename sanitization beyond basic slash replacement.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py
📚 Learning: 2025-06-25T09:50:06.213Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py
📚 Learning: 2025-08-18T13:20:30.343Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#32
File: hugegraph-llm/src/hugegraph_llm/operators/llm_op/keyword_extract.py:61-63
Timestamp: 2025-08-18T13:20:30.343Z
Learning: NLTKHelper in hugegraph-llm uses lazy loading for stopwords and calls nltk.corpus.stopwords.words(lang) directly with the provided language parameter. It does not preload both English and Chinese stopwords - each language is loaded on first access. The lang parameter must match NLTK's expected language codes ("english", "chinese") or it will fail.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/document_op/chunk_split.py
🧬 Code graph analysis (8)
hugegraph-llm/src/hugegraph_llm/models/embeddings/init_embedding.py (4)
hugegraph-llm/src/hugegraph_llm/config/llm_config.py (1)
  • LLMConfig (25-81)
hugegraph-llm/src/hugegraph_llm/models/embeddings/litellm.py (1)
  • LiteLLMEmbedding (32-93)
hugegraph-llm/src/hugegraph_llm/models/embeddings/ollama.py (1)
  • OllamaEmbedding (25-71)
hugegraph-llm/src/hugegraph_llm/models/embeddings/openai.py (1)
  • OpenAIEmbedding (24-84)
hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (4)
hugegraph-llm/src/hugegraph_llm/config/llm_config.py (1)
  • LLMConfig (25-81)
hugegraph-llm/src/hugegraph_llm/models/llms/ollama.py (1)
  • OllamaClient (29-154)
hugegraph-llm/src/hugegraph_llm/models/llms/openai.py (1)
  • OpenAIClient (34-226)
hugegraph-llm/src/hugegraph_llm/models/llms/litellm.py (1)
  • LiteLLMClient (34-191)
hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py (4)
hugegraph-llm/src/hugegraph_llm/operators/scheduler.py (3)
  • SchedulerSingleton (215-225)
  • get_instance (220-225)
  • schedule_flow (58-87)
hugegraph-llm/src/hugegraph_llm/utils/embedding_utils.py (2)
  • get_filename_prefix (76-83)
  • get_index_folder_name (86-91)
hugegraph-llm/src/hugegraph_llm/indices/vector_index.py (3)
  • VectorIndex (32-153)
  • from_index_file (40-80)
  • clean (136-153)
hugegraph-llm/src/hugegraph_llm/operators/kg_construction_task.py (1)
  • build_vector_index (92-94)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py (3)
hugegraph-llm/src/hugegraph_llm/operators/util.py (1)
  • init_context (19-22)
hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (2)
  • get_chat_llm (25-46)
  • get_chat_llm (103-124)
hugegraph-llm/src/hugegraph_llm/state/ai_state.py (2)
  • WkFlowState (38-81)
  • WkFlowInput (21-35)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py (5)
hugegraph-llm/src/hugegraph_llm/document/chunk_split.py (1)
  • ChunkSplitter (23-56)
hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2)
  • BaseLLM (22-74)
  • generate (26-31)
hugegraph-llm/src/hugegraph_llm/operators/util.py (1)
  • init_context (19-22)
hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (2)
  • get_chat_llm (25-46)
  • get_chat_llm (103-124)
hugegraph-llm/src/hugegraph_llm/state/ai_state.py (2)
  • WkFlowInput (21-35)
  • WkFlowState (38-81)
hugegraph-llm/src/hugegraph_llm/operators/document_op/chunk_split.py (3)
hugegraph-llm/src/hugegraph_llm/operators/util.py (1)
  • init_context (19-22)
hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py (4)
  • init (43-44)
  • node_init (46-58)
  • run (60-78)
  • run (95-106)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py (4)
  • init (222-223)
  • node_init (225-230)
  • run (167-191)
  • run (295-331)
hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py (5)
hugegraph-llm/src/hugegraph_llm/utils/embedding_utils.py (3)
  • get_embeddings_parallel (33-73)
  • get_filename_prefix (76-83)
  • get_index_folder_name (86-91)
hugegraph-llm/src/hugegraph_llm/operators/util.py (1)
  • init_context (19-22)
hugegraph-llm/src/hugegraph_llm/models/embeddings/init_embedding.py (2)
  • get_embedding (32-52)
  • get_embedding (59-79)
hugegraph-llm/src/hugegraph_llm/state/ai_state.py (2)
  • WkFlowInput (21-35)
  • WkFlowState (38-81)
hugegraph-llm/src/hugegraph_llm/indices/vector_index.py (4)
  • VectorIndex (32-153)
  • from_index_file (40-80)
  • add (95-102)
  • to_index_file (82-93)
hugegraph-llm/src/hugegraph_llm/operators/scheduler.py (8)
hugegraph-llm/src/hugegraph_llm/operators/common_op/check_schema.py (4)
  • CheckSchemaNode (179-332)
  • init (183-184)
  • run (45-60)
  • run (192-210)
hugegraph-llm/src/hugegraph_llm/operators/document_op/chunk_split.py (4)
  • ChunkSplitNode (33-86)
  • init (34-35)
  • run (74-86)
  • run (122-131)
hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/schema_manager.py (4)
  • SchemaManagerNode (79-147)
  • init (83-84)
  • run (66-76)
  • run (128-141)
hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py (1)
  • build_vector_index (106-111)
hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py (4)
  • BuildVectorIndexNode (39-78)
  • init (43-44)
  • run (60-78)
  • run (95-106)
hugegraph-llm/src/hugegraph_llm/state/ai_state.py (3)
  • WkFlowState (38-81)
  • WkFlowInput (21-35)
  • to_json (68-81)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py (4)
  • InfoExtractNode (218-354)
  • init (222-223)
  • run (167-191)
  • run (295-331)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py (4)
  • PropertyGraphExtractNode (187-300)
  • init (191-193)
  • run (97-122)
  • run (202-238)
🪛 GitHub Actions: Pylint
hugegraph-llm/src/hugegraph_llm/operators/util.py

[error] 16-16: E0611: No name 'CStatus' in module 'PyCGraph' (no-name-in-module)

hugegraph-llm/src/hugegraph_llm/models/embeddings/init_embedding.py

[warning] 32-32: Pylint: W0621: Redefining name 'llm_settings' from outer scope (redefined-outer-name)

hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py

[warning] 25-25: Pylint: W0621: Redefining name 'llm_settings' from outer scope (redefined-outer-name)


[warning] 49-49: Pylint: W0621: Redefining name 'llm_settings' from outer scope (redefined-outer-name)


[warning] 73-73: Pylint: W0621: Redefining name 'llm_settings' from outer scope (redefined-outer-name)

hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py

[error] 32-32: E0611: No name 'GNode' in module 'PyCGraph' (no-name-in-module)


[error] 32-32: E0611: No name 'CStatus' in module 'PyCGraph' (no-name-in-module)

hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py

[error] 29-29: E0611: No name 'GNode' in module 'PyCGraph' (no-name-in-module)


[error] 29-29: E0611: No name 'CStatus' in module 'PyCGraph' (no-name-in-module)


[warning] 226-226: W0201: Attribute 'llm' defined outside init (attribute-defined-outside-init)


[warning] 229-229: W0201: Attribute 'example_prompt' defined outside init (attribute-defined-outside-init)

hugegraph-llm/src/hugegraph_llm/operators/document_op/chunk_split.py

[error] 23-23: E0611: No name 'GNode' in module 'PyCGraph' (no-name-in-module)


[error] 23-23: E0611: No name 'CStatus' in module 'PyCGraph' (no-name-in-module)


[warning] 49-49: W0201: Attribute 'texts' defined outside init (attribute-defined-outside-init)


[warning] 50-50: W0201: Attribute 'separators' defined outside init (attribute-defined-outside-init)


[warning] 51-51: W0201: Attribute 'text_splitter' defined outside init (attribute-defined-outside-init)

hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py

[error] 36-36: Pylint: E0611: No name 'GNode' in module 'PyCGraph' (no-name-in-module)


[error] 36-36: Pylint: E0611: No name 'CStatus' in module 'PyCGraph' (no-name-in-module)


[warning] 47-47: Pylint: W0201: Attribute 'embedding' defined outside init (attribute-defined-outside-init)


[warning] 48-48: Pylint: W0201: Attribute 'folder_name' defined outside init (attribute-defined-outside-init)


[warning] 51-51: Pylint: W0201: Attribute 'index_dir' defined outside init (attribute-defined-outside-init)


[warning] 52-52: Pylint: W0201: Attribute 'filename_prefix' defined outside init (attribute-defined-outside-init)


[warning] 55-55: Pylint: W0201: Attribute 'vector_index' defined outside init (attribute-defined-outside-init)

hugegraph-llm/src/hugegraph_llm/operators/scheduler.py

[error] 19-19: E0611: No name 'GPipeline' in module 'PyCGraph' (no-name-in-module)


[error] 19-19: E0611: No name 'GPipelineManager' in module 'PyCGraph' (no-name-in-module)


[error] 60-60: E0702: Raising str while only classes or instances are allowed (raising-bad-type)


[warning] 66-66: R1705: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it (no-else-return)


[error] 71-71: E0702: Raising str while only classes or instances are allowed (raising-bad-type)


[error] 74-74: E0702: Raising str while only classes or instances are allowed (raising-bad-type)


[warning] 95-95: R1705: Unnecessary "elif" after "return", remove the leading "el" from "elif" (no-else-return)


[warning] 104-104: R1711: Useless return at end of function or method (useless-return)


[warning] 131-131: R1711: Useless return at end of function or method (useless-return)


[warning] 147-147: W0707: Consider explicitly re-raising using 'except Exception as exc' and 'raise 'ERROR: Invalid JSON format in schema. Please check it carefully.' from exc' (raise-missing-from)


[error] 147-147: E0702: Raising str while only classes or instances are allowed (raising-bad-type)


[warning] 174-174: W0707: Consider explicitly re-raising using 'except Exception as exc' and 'raise 'ERROR: Invalid JSON format in schema. Please check it carefully.' from exc' (raise-missing-from)


[error] 174-174: E0702: Raising str while only classes or instances are allowed (raising-bad-type)

🔇 Additional comments (2)
hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py (1)

79-87: Verify field semantics: is graph_properties_vector_num sourced correctly?

The current implementation uses len(chunk_vector_index.properties), which reads more like "number of chunk properties" than "number of graph property vectors"; the only reference to this field in the repo is in hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py (around line 83). Confirm the semantics expected by downstream consumers and, if needed, rename the field or point it at the correct data source.

hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py (1)

29-29: Fix: PyCGraph import failure (missing GNode / CStatus) and make sure CI installs the dependency

  • Symptom: many places in the repo import PyCGraph directly and use GNode/GParam/CStatus, but neither PyCGraph sources nor definitions of these symbols exist in the repo; pyproject.toml declares the dependency as a git subdirectory: pycgraph = { git = "https://github.com/ChunelFeng/CGraph.git", subdirectory = "python", rev = "main", marker = "sys_platform == 'linux'" }.
  • Affected files (examples): hugegraph-llm/src/hugegraph_llm/state/ai_state.py:16, hugegraph-llm/src/hugegraph_llm/operators/util.py:16, hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py:29, hugegraph-llm/src/hugegraph_llm/operators/document_op/chunk_split.py:23, hugegraph-llm/src/hugegraph_llm/operators/scheduler.py:19, hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py:36, hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/schema_manager.py:24, hugegraph-llm/src/hugegraph_llm/operators/common_op/check_schema.py:26, hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py:32.
  • Suggested actions: 1) confirm CI actually installs the git-subdirectory dependency declared in pyproject (note the marker sys_platform == 'linux'); 2) verify the upstream package exports a PyCGraph module with the symbols GNode/GParam/CStatus; if the exported names differ, fix the imports or add an adapter layer; 3) if the dependency is not installed correctly, switch to a cross-platform published package or install the subdirectory dependency explicitly in CI; 4) do not mask the import errors by disabling pylint/static checks — fix the dependency or imports first, then adjust the linter config.


from langchain_text_splitters import RecursiveCharacterTextSplitter
from hugegraph_llm.operators.util import init_context
from PyCGraph import GNode, CStatus

💡 Verification agent

🧩 Analysis chain

CI import failure: PyCGraph does not export GNode/CStatus, or the dependency is not installed

Pylint reports E0611. Declare/fix the PyCGraph package in the dependencies and confirm its exported symbols, or switch to the correct in-project path. Prefer fixing the dependency over suppressing the check.

Runnable script to check the dependency declaration and every import site in the repo:


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Check whether dependency manifests mention PyCGraph/pycgraph =="
fd -t f -a 'requirements*.txt' 'pyproject.toml' 'setup.cfg' 2>/dev/null | xargs -r -I{} sh -c 'echo "--- {}"; cat "{}"' | rg -nPi 'pycgraph|pyc-?graph|PyCGraph' || true

echo -e "\n== Count imports of PyCGraph across the codebase =="
rg -nP --type=py '\bfrom\s+PyCGraph\s+import\b'

echo -e "\n== Note: if this is a private package, pip install the exact source and version explicitly in CI, and pin the version in the manifest. =="

Length of output: 1057


Fix the missing dependency: add PyCGraph to the dependencies or correct the imports

PyCGraph was not found in requirements*/pyproject.toml/setup.cfg, so CI/static checks will fail on the imports below. Affected files (to handle):

  • hugegraph-llm/src/hugegraph_llm/state/ai_state.py:16 — from PyCGraph import GParam, CStatus
  • hugegraph-llm/src/hugegraph_llm/operators/util.py:16 — from PyCGraph import CStatus
  • hugegraph-llm/src/hugegraph_llm/operators/common_op/check_schema.py:26 — from PyCGraph import GNode, CStatus
  • hugegraph-llm/src/hugegraph_llm/operators/scheduler.py:19 — from PyCGraph import GPipeline, GPipelineManager
  • hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py:36 — from PyCGraph import GNode, CStatus
  • hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/schema_manager.py:24 — from PyCGraph import GNode, CStatus
  • hugegraph-llm/src/hugegraph_llm/operators/document_op/chunk_split.py:23 — from PyCGraph import GNode, CStatus
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py:32 — from PyCGraph import GNode, CStatus
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py:29 — from PyCGraph import GNode, CStatus

Required action (any one of):

  • Declare and pin PyCGraph (or the correct PyPI package name/version) in pyproject.toml / requirements.txt / setup.cfg, and make sure pip install can reach it during CI (including any private index/credentials); or
  • If PyCGraph is actually an in-repo module, switch to the correct relative/local import or vendor it into the repository; or
  • If the package exists but does not export the required symbols, fix the imports to match its public API (confirm whether GNode/GParam/CStatus/GPipeline/GPipelineManager are exported).

Fix the dependency and imports first; do not paper over the problem by suppressing Pylint or ignoring import errors.

🧰 Tools
🪛 GitHub Actions: Pylint

[error] 23-23: E0611: No name 'GNode' in module 'PyCGraph' (no-name-in-module)


[error] 23-23: E0611: No name 'CStatus' in module 'PyCGraph' (no-name-in-module)

Comment on lines +33 to +37
from hugegraph_llm.operators.util import init_context
from hugegraph_llm.models.embeddings.init_embedding import get_embedding
from hugegraph_llm.state.ai_state import WkFlowInput, WkFlowState
from PyCGraph import GNode, CStatus


⚠️ Potential issue

Provide a fault-tolerant import for PyCGraph to resolve E0611 and avoid a hard runtime dependency

As in util.py, add a try/except with minimal stub classes so CI passes and the type names remain available.

-from PyCGraph import GNode, CStatus
+try:
+    from PyCGraph import GNode, CStatus  # type: ignore[import]
+except Exception:  # pragma: no cover
+    class CStatus:
+        def __init__(self, ok: bool = True, info: str = "") -> None:
+            self._ok = ok
+            self._info = info
+        def isErr(self) -> bool:
+            return not self._ok
+        def getInfo(self) -> str:
+            return self._info
+    class GNode:  # minimal fallback
+        ...
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from hugegraph_llm.operators.util import init_context
from hugegraph_llm.models.embeddings.init_embedding import get_embedding
from hugegraph_llm.state.ai_state import WkFlowInput, WkFlowState
from PyCGraph import GNode, CStatus
from hugegraph_llm.operators.util import init_context
from hugegraph_llm.models.embeddings.init_embedding import get_embedding
from hugegraph_llm.state.ai_state import WkFlowInput, WkFlowState
try:
    from PyCGraph import GNode, CStatus  # type: ignore[import]
except Exception:  # pragma: no cover
    class CStatus:
        def __init__(self, ok: bool = True, info: str = "") -> None:
            self._ok = ok
            self._info = info
        def isErr(self) -> bool:
            return not self._ok
        def getInfo(self) -> str:
            return self._info
    class GNode:  # minimal fallback
        ...
🧰 Tools
🪛 GitHub Actions: Pylint

[error] 36-36: Pylint: E0611: No name 'GNode' in module 'PyCGraph' (no-name-in-module)


[error] 36-36: Pylint: E0611: No name 'CStatus' in module 'PyCGraph' (no-name-in-module)

🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py
around lines 33 to 37, the direct import of PyCGraph causes E0611 and creates a
hard runtime dependency; wrap the import in a try/except ImportError and provide
minimal fallback stub definitions for GNode and CStatus (matching the
names/shape used in this module) so type checks and CI pass when PyCGraph is
unavailable; ensure the except branch defines the stub classes with the minimal
attributes/methods referenced elsewhere and keep the original import when
available.

Comment on lines +73 to +75
# TODO: use async_get_texts_embedding instead of single sync method
chunks_embedding = asyncio.run(get_embeddings_parallel(self.embedding, chunks))
if len(chunks_embedding) > 0:
💡 Verification agent

🧩 Analysis chain

Avoid calling asyncio.run inside an already-running event loop (potential runtime crash)

asyncio.run() raises if an event loop is already running. Either detect the loop and pick a code path accordingly, or push the async-to-sync wrapper down into a utility function.

# A sync wrapper could be added to embedding_utils.py that detects the event loop:
def get_embeddings_parallel_sync(embedding, texts):
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        return asyncio.run(get_embeddings_parallel(embedding, texts))
    else:
        # Already inside an event loop: the result must be awaited, so the caller
        # should also be async (or the pipeline layer above must support async)
        return loop.create_task(get_embeddings_parallel(embedding, texts))

Adjust the call site based on the return type: if a Task is returned, await it at the upper layer, or consider making Node.run async. Please confirm GPipeline's current policy for async nodes.


🏁 Script executed:

#!/bin/bash
# Check whether the pipeline runs nodes inside an event loop
rg -nP 'asyncio\.(run|get_running_loop|new_event_loop)' -C2 hugegraph-llm/src/hugegraph_llm/operators

Length of output: 4242


Do not use asyncio.run inside an already-running event loop (multiple sites need fixing)

Several places in the repo call asyncio.run(get_embeddings_parallel(...)) directly; inside an existing event loop this raises RuntimeError and crashes at runtime.

Locations (must be handled):

  • hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py: lines 74, 102
  • hugegraph-llm/src/hugegraph_llm/operators/index_op/gremlin_example_index_query.py: line 67
  • hugegraph-llm/src/hugegraph_llm/operators/index_op/build_semantic_index.py: line 63
  • hugegraph-llm/src/hugegraph_llm/operators/index_op/build_gremlin_example_index.py: line 42
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/answer_synthesize.py: line 74

Suggestions (pick one; the project should standardize):

  • Recommended (long term): make GPipeline / Node.run and the call chain above them async, and use await consistently;
  • Alternative (compatibility): add a sync wrapper get_embeddings_parallel_sync to embedding_utils.py that detects a running loop — with no loop, use asyncio.run(coro); with a loop, either return an awaitable/Task (the caller must await it) or run the coroutine on a dedicated thread's loop via asyncio.run_coroutine_threadsafe and wait for the result synchronously; then replace/remove every direct asyncio.run(...) call in the repo.

To confirm: decide whether GPipeline/Node should (and can) support async nodes, then apply option one (recommended) or option two consistently across the files above.

🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py
lines 73-75, the code uses asyncio.run(get_embeddings_parallel(...)) which will
raise RuntimeError if an event loop is already running; replace this direct
asyncio.run call with a synchronous wrapper from embedding_utils (e.g.,
get_embeddings_parallel_sync) that internally detects an existing event loop and
either runs the coroutine with asyncio.run when no loop exists or executes it
safely on a separate thread/event-loop (using asyncio.run_coroutine_threadsafe
or by returning an awaitable the caller must await). Update the call here to use
that sync wrapper (or await the coroutine if this function is made async), and
apply the same replacement pattern to the other listed files (lines noted in the
review) so no code directly calls asyncio.run in a running loop.
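The detect-then-branch pattern recommended above can be exercised end-to-end; fake_embed stands in for the real get_embeddings_parallel coroutine, and run_coro_sync is a hypothetical helper name:

```python
import asyncio
import concurrent.futures


async def fake_embed(texts):
    # stand-in for get_embeddings_parallel(embedding, texts)
    await asyncio.sleep(0)
    return [len(t) for t in texts]


def run_coro_sync(coro_factory, *args):
    """Run an async callable from sync code, with or without a running loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # no loop in this thread: asyncio.run is safe
        return asyncio.run(coro_factory(*args))
    # a loop is already running here: execute on a fresh loop in a worker thread
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as ex:
        return ex.submit(lambda: asyncio.run(coro_factory(*args))).result()


# called with no loop running
no_loop_result = run_coro_sync(fake_embed, ["ab", "cde"])


# called from inside a running loop (where a bare asyncio.run would raise)
async def caller():
    return run_coro_sync(fake_embed, ["ab", "cde"])


in_loop_result = asyncio.run(caller())
print(no_loop_result, in_loop_result)  # [2, 3] [2, 3]
```

Blocking the loop thread on .result() is tolerable in a demo but defeats concurrency, which is why the async-first refactor suggested above remains the cleaner long-term fix.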

from hugegraph_llm.operators.util import init_context
from hugegraph_llm.models.llms.init_llm import get_chat_llm
from hugegraph_llm.state.ai_state import WkFlowState, WkFlowInput
from PyCGraph import GNode, CStatus

💡 Verification agent

🧩 Analysis chain

CI import failure: PyCGraph (GNode/CStatus)

Same as above: add the dependency and make sure the symbols are exported; do not hide this behind disabled lint rules.

Reusing the dependency-check script.


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "PWD: $(pwd)"
echo "---- search for 'PyCGraph' in repo ----"
if command -v rg >/dev/null 2>&1; then
  rg -n --hidden -S "PyCGraph" -C3 || true
else
  grep -RIn --exclude-dir=.git "PyCGraph" || true
fi

echo "---- search for 'GNode' or 'CStatus' usages/definitions ----"
if command -v rg >/dev/null 2>&1; then
  rg -n --hidden -S -P "\bGNode\b|\bCStatus\b" -C3 || true
else
  grep -RIn --exclude-dir=.git -E "\bGNode\b|\bCStatus\b" || true
fi

echo "---- show the reviewed file (up to 240 lines) ----"
file="hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py"
if [ -f "$file" ]; then
  sed -n '1,240p' "$file" || true
else
  echo "File not found: $file"
fi

echo "---- list common packaging files if present ----"
for f in pyproject.toml setup.py setup.cfg requirements.txt Pipfile poetry.lock; do
  if [ -f "$f" ]; then
    echo "=== $f ==="
    sed -n '1,200p' "$f"
  fi
done

echo "---- search packaging / requirements for PyCGraph ----"
if command -v rg >/dev/null 2>&1; then
  rg -n --hidden -S "PyCGraph" || true
else
  grep -RIn --exclude-dir=.git "PyCGraph" || true
fi

echo "---- done ----"

Length of output: 45308


CI import failure: missing PyCGraph dependency (GNode/CStatus, etc.)

Many places in the repo import directly with from PyCGraph import ..., but the package is not declared in the project dependencies, so CI fails on import.

  • Affected files (examples): hugegraph-llm/src/hugegraph_llm/state/ai_state.py, hugegraph-llm/src/hugegraph_llm/operators/util.py, hugegraph-llm/src/hugegraph_llm/operators/scheduler.py, hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py, hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/schema_manager.py, hugegraph-llm/src/hugegraph_llm/operators/document_op/chunk_split.py, hugegraph-llm/src/hugegraph_llm/operators/common_op/check_schema.py, hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py, hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py
  • Action: explicitly add PyCGraph to pyproject.toml or the relevant subpackage's dependency config, or move the required framework adapter/implementation into the repo and make sure it installs cleanly.
  • Make sure the PyCGraph package exports the required symbols (at least GNode, CStatus, GParam, GPipeline, GPipelineManager), or switch the imports to an internal compatibility adapter.
  • Do not hide the ImportError behind lint disables or a silent try/except; reuse this dependency check (rg/grep) to confirm every reference and re-verify in CI.
🧰 Tools
🪛 GitHub Actions: Pylint

[error] 32-32: E0611: No name 'GNode' in module 'PyCGraph' (no-name-in-module)


[error] 32-32: E0611: No name 'CStatus' in module 'PyCGraph' (no-name-in-module)

🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py
around line 32, imports from PyCGraph (e.g., GNode, CStatus) cause CI import
failures because PyCGraph is not declared as a project dependency; fix by adding
PyCGraph to the project dependencies (pyproject.toml or the relevant subpackage
requirements) or by implementing and importing an internal compatibility adapter
that exports the required symbols (GNode, CStatus, GParam, GPipeline,
GPipelineManager, etc.); ensure the declared package actually exports those
names, update all other files that import PyCGraph the same way, verify via
rg/grep that no references remain unhandled, and do not silence ImportError with
try/except—run CI to validate the imports.

# See the License for the specific language governing permissions and
# limitations under the License.

from PyCGraph import CStatus

⚠️ Potential issue

Fix the import failure: give PyCGraph a fault-tolerant import with a minimal stub to eliminate Pylint E0611

CI reports E0611. Add a try/except with a minimal CStatus stub class so local/CI checks pass, without affecting behavior in environments where PyCGraph is actually present.

-from PyCGraph import CStatus
+try:
+    from PyCGraph import CStatus  # type: ignore[import]
+except Exception:  # pragma: no cover
+    class CStatus:  # minimal fallback for lint/CI
+        def __init__(self, ok: bool = True, info: str = "") -> None:
+            self._ok = ok
+            self._info = info
+        def isErr(self) -> bool:
+            return not self._ok
+        def getInfo(self) -> str:
+            return self._info
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from PyCGraph import CStatus
try:
    from PyCGraph import CStatus  # type: ignore[import]
except Exception:  # pragma: no cover
    class CStatus:  # minimal fallback for lint/CI
        def __init__(self, ok: bool = True, info: str = "") -> None:
            self._ok = ok
            self._info = info
        def isErr(self) -> bool:
            return not self._ok
        def getInfo(self) -> str:
            return self._info
🧰 Tools
🪛 GitHub Actions: Pylint

[error] 16-16: E0611: No name 'CStatus' in module 'PyCGraph' (no-name-in-module)

🤖 Prompt for AI Agents
In hugegraph-llm/src/hugegraph_llm/operators/util.py around line 16, the direct
import from PyCGraph causes CI Pylint E0611 when the package is missing; wrap
the import in a try/except ImportError and, in the except block, define a
minimal CStatus stub class that provides the attributes/methods your module
expects (e.g., status codes, simple initializer, and any used
methods/properties) so tests and lint pass locally/CI while preserving the real
PyCGraph behavior when available.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py (1)

62-67: KeyError risk on schema fields nullable_keys/primary_keys/properties (fields may be missing in schemas fetched from the graph store or in simplified schemas)

Provide safe defaults for missing fields to avoid KeyError, and tidy the non-nullable key computation.

-    for vertex in schema["vertexlabels"]:
-        properties_map["vertex"][vertex["name"]] = {
-            "primary_keys": vertex["primary_keys"],
-            "nullable_keys": vertex["nullable_keys"],
-            "properties": vertex["properties"],
-        }
+    for vertex in schema["vertexlabels"]:
+        vname = vertex["name"]
+        vprops = vertex.get("properties", [])
+        properties_map["vertex"][vname] = {
+            "primary_keys": vertex.get("primary_keys", vprops[:1] if vprops else []),
+            "nullable_keys": vertex.get("nullable_keys", vprops[1:] if vprops else []),
+            "properties": vprops,
+        }
@@
-    for edge in schema["edgelabels"]:
-        properties_map["edge"][edge["name"]] = {"properties": edge["properties"]}
+    for edge in schema["edgelabels"]:
+        properties_map["edge"][edge["name"]] = {"properties": edge.get("properties", [])}
@@
-            non_nullable_keys = set(
-                properties_map[item_type][label]["properties"]
-            ).difference(set(properties_map[item_type][label]["nullable_keys"]))
+            vmeta = properties_map[item_type][label]
+            non_nullable_keys = set(vmeta.get("properties", [])).difference(
+                set(vmeta.get("nullable_keys", []))
+            )

Also applies to: 69-70, 75-77
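The defaulting rule from the diff above can be exercised in isolation; the vertex dict below is a made-up simplified schema entry, not data from the PR:

```python
# hypothetical simplified vertex label (no primary_keys / nullable_keys given)
vertex = {"name": "person", "properties": ["name", "age", "city"]}

vprops = vertex.get("properties", [])
meta = {
    # default: first property is primary, the rest are nullable
    "primary_keys": vertex.get("primary_keys", vprops[:1] if vprops else []),
    "nullable_keys": vertex.get("nullable_keys", vprops[1:] if vprops else []),
    "properties": vprops,
}
# non-nullable = all properties minus the nullable ones
non_nullable = set(meta["properties"]) - set(meta["nullable_keys"])
print(meta["primary_keys"], sorted(non_nullable))  # ['name'] ['name']
```

With .get() defaults, the same code handles both a full HugeGraph schema (which supplies primary_keys/nullable_keys) and a hand-written one that only lists properties.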

♻️ Duplicate comments (5)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py (1)

32-32: Direct PyCGraph imports may still break CI (E0611/ImportError)

Follow the project-wide policy: either declare the dependency explicitly and guarantee these symbols are exported, or provide minimal stubs via try/except so type-name resolution survives when the package is missing.

Suggested script to confirm the repo declares and can import PyCGraph, and that GNode/CStatus etc. are exported:

#!/bin/bash
rg -nP 'from\s+PyCGraph\s+import\s+(GNode|CStatus|GPipeline|GPipelineManager)' -C1
fd -a 'pyproject.toml' -X sed -n '1,200p' {} 2>/dev/null
python - <<'PY'
try:
    import PyCGraph
    print("PyCGraph OK:", dir(PyCGraph)[:20])
except Exception as e:
    print("PyCGraph import failed:", e)
PY
hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py (2)

36-36: Direct PyCGraph import: handle fault tolerance or add the dependency, consistent with the rest of the project

Apply the unified policy (dependency declaration or minimal stubs). Flagged previously.


73-76: Do not use asyncio.run where an event loop may already be running (runtime crash risk)

Both Node.run and the sync-style run call asyncio.run(...) directly; if an outer event loop exists, this raises RuntimeError.

-from hugegraph_llm.utils.embedding_utils import (
-    get_embeddings_parallel,
-    get_filename_prefix,
-    get_index_folder_name,
-)
+from hugegraph_llm.utils.embedding_utils import (
+    get_embeddings_parallel,            # kept (used internally by the sync wrapper)
+    get_filename_prefix,
+    get_index_folder_name,
+    get_embeddings_parallel_sync,       # new: event-loop-safe sync wrapper
+)
@@
-        # TODO: use async_get_texts_embedding instead of single sync method
-        chunks_embedding = asyncio.run(get_embeddings_parallel(self.embedding, chunks))
+        # use the sync-safe wrapper, which handles presence/absence of an event loop
+        chunks_embedding = get_embeddings_parallel_sync(self.embedding, chunks)
@@
-        # TODO: use async_get_texts_embedding instead of single sync method
-        chunks_embedding = asyncio.run(get_embeddings_parallel(self.embedding, chunks))
+        chunks_embedding = get_embeddings_parallel_sync(self.embedding, chunks)

Additionally, add a sync wrapper in hugegraph_llm/utils/embedding_utils.py (example):

def get_embeddings_parallel_sync(embedding, texts):
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        return asyncio.run(get_embeddings_parallel(embedding, texts))
    # An event loop is already running: execute in a separate thread and wait synchronously
    import concurrent.futures
    def _work():
        return asyncio.run(get_embeddings_parallel(embedding, texts))
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as ex:
        return ex.submit(_work).result()

Batch-replace every call to asyncio.run(get_embeddings_parallel(...)) across the repository:

#!/bin/bash
rg -nP 'asyncio\.run\(\s*get_embeddings_parallel\(' -C2

Also applies to: 101-104

hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py (2)

29-29: PyCGraph import fallback / dependency declaration still needs to be unified

Handle it consistently with the other files to avoid CI import failures.


295-333: Unsafe lock usage: early unlock on the exception path, long-held locks, and unlocked writebacks mixed together

  • Line 301 unlocks manually on error without try/finally.
  • The LLM call should run without the lock; context reads/writes should use short-lived locks.
  • The call_count update should happen while holding the lock.
     def run(self) -> CStatus:
         sts = self.node_init()
         if sts.isErr():
             return sts
-        self.context.lock()
-        if self.context.chunks is None:
-            self.context.unlock()
-            raise ValueError("parameter required by extract node not found in context.")
-        schema = self.context.schema
-        chunks = self.context.chunks
-
-        if schema:
-            self.context.vertices = []
-            self.context.edges = []
-        else:
-            self.context.triples = []
-
-        self.context.unlock()
+        # 1) Snapshot inputs (short lock)
+        self.context.lock()
+        try:
+            if self.context.chunks is None:
+                raise ValueError("parameter required by extract node not found in context.")
+            schema = self.context.schema
+            chunks = list(self.context.chunks)
+            if schema:
+                self.context.vertices = []
+                self.context.edges = []
+            else:
+                self.context.triples = []
+        finally:
+            self.context.unlock()
 
-        for sentence in chunks:
+        # 2) Inference (no lock), write back in segments as needed
+        for sentence in chunks:
             proceeded_chunk = self.extract_triples_by_llm(schema, sentence)
             log.debug(
                 "[Legacy] %s input: %s \n output:%s",
                 self.__class__.__name__,
                 sentence,
                 proceeded_chunk,
             )
-            if schema:
-                self.extract_triples_by_regex_with_schema(schema, proceeded_chunk)
-            else:
-                self.extract_triples_by_regex(proceeded_chunk)
+            self.context.lock()
+            try:
+                if schema:
+                    self.extract_triples_by_regex_with_schema(schema, proceeded_chunk)
+                else:
+                    self.extract_triples_by_regex(proceeded_chunk)
+            finally:
+                self.context.unlock()
 
-        if self.context.call_count:
-            self.context.call_count += len(chunks)
-        else:
-            self.context.call_count = len(chunks)
-        self._filter_long_id()
+        # 3) Stats and filtering (short lock)
+        self.context.lock()
+        try:
+            self.context.call_count = (self.context.call_count or 0) + len(chunks)
+            self._filter_long_id()
+        finally:
+            self.context.unlock()
         return CStatus()
🧹 Nitpick comments (6)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py (3)

34-37: SCHEMA_EXAMPLE_PROMPT is fixed at initialization, so later prompt changes take no effect

Read the config on demand or provide a getter instead, so the module-level constant does not drift from the latest configuration at runtime (following your BasePromptConfig runtime-injection design).
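A lazy getter is one way to keep the value current. A minimal sketch, assuming a prompt_source callable that stands in for the project's prompt config access:

```python
class PromptHolder:
    """Re-read the prompt on every access instead of freezing it at import time."""

    def __init__(self, prompt_source):
        # prompt_source: hypothetical zero-arg callable returning the current prompt
        self._source = prompt_source

    @property
    def schema_example_prompt(self):
        return self._source()  # evaluated lazily on each access
```

Callers keep reading `holder.schema_example_prompt`, but configuration updates made after import are now observed.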


245-249: Local variable shadows the module name (prompt), hurting readability

The local variable shares its name with the top-level prompt module import; rename it to avoid confusion.

-        prompt = generate_extract_property_graph_prompt(chunk, schema)
-        if self.example_prompt is not None:
-            prompt = self.example_prompt + prompt
-        return self.llm.generate(prompt=prompt)
+        prompt_text = generate_extract_property_graph_prompt(chunk, schema)
+        if self.example_prompt is not None:
+            prompt_text = self.example_prompt + prompt_text
+        return self.llm.generate(prompt=prompt_text)

250-253: JSON extraction uses a greedy regex, which easily over-matches across multiple sections or format noise

Suggestions: first try parsing JSON found inside code blocks; or use a more robust extraction strategy (character-by-character brace counting), or a jsonfix-style library for tolerant parsing. The current implementation easily mis-matches when the output contains examples or multiple JSON objects.
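The brace-counting strategy mentioned above can be sketched as a self-contained helper (not the project's implementation; it assumes the LLM output uses ordinary JSON string escaping):

```python
import json

def extract_first_json(text: str):
    """Return the first balanced top-level JSON object in text, or None."""
    depth, start = 0, -1
    in_str, escaped = False, False
    for i, ch in enumerate(text):
        if in_str:
            # skip characters inside string literals, honoring backslash escapes
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_str = False
            continue
        if ch == '"':
            in_str = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                try:
                    return json.loads(text[start : i + 1])
                except json.JSONDecodeError:
                    start = -1  # malformed candidate; keep scanning
    return None
```

Unlike `re.search(r"({.*})", text, re.DOTALL)`, this stops at the first balanced object instead of spanning from the first `{` to the last `}`.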

hugegraph-llm/src/hugegraph_llm/operators/scheduler.py (2)

33-52: Thread-safety semantics of concurrent pool management need confirmation

schedule_flow takes no lock, assuming GPipelineManager is internally thread-safe. If concurrent calls grow, add a fine-grained mutex per flow's manager, or serialize fetch/add/release uniformly, to avoid duplicate creation or double release.

Also applies to: 58-88
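If explicit serialization is preferred over relying on GPipelineManager internals, a lock around fetch/release can be sketched like this (PipelinePool and build_fn are hypothetical stand-ins, not the project's API):

```python
import threading
from collections import defaultdict

class PipelinePool:
    """Serialized fetch/release for cached pipelines, keyed by flow name."""

    def __init__(self, build_fn, max_idle=4):
        self._build_fn = build_fn          # hypothetical factory: flow name -> pipeline
        self._max_idle = max_idle
        self._idle = defaultdict(list)     # flow name -> cached idle pipelines
        self._lock = threading.Lock()      # one coarse lock keeps the sketch simple

    def fetch(self, flow_name):
        with self._lock:
            if self._idle[flow_name]:
                return self._idle[flow_name].pop()
        # build outside the lock so slow construction does not block other flows
        return self._build_fn(flow_name)

    def release(self, flow_name, pipeline):
        with self._lock:
            if len(self._idle[flow_name]) < self._max_idle:
                self._idle[flow_name].append(pipeline)
```

The `max_idle` bound also illustrates one way the currently unused `max_pipeline` parameter could be given a purpose.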


104-108: Language/split strategy is hardcoded to Chinese paragraphs when building the vector index

Infer them from the texts/configuration or expose them as parameters, to avoid degraded results for English input.

hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py (1)

347-356: Filtering long IDs without a lock creates a race condition

Filter in place under a short lock, or copy first and then replace.

-    def _filter_long_id(self):
-        self.context.vertices = [
-            vertex for vertex in self.context.vertices if self.valid(vertex["id"])
-        ]
-        self.context.edges = [
-            edge
-            for edge in self.context.edges
-            if self.valid(edge["start"]) and self.valid(edge["end"])
-        ]
+    def _filter_long_id(self):
+        self.context.lock()
+        try:
+            self.context.vertices = [
+                v for v in self.context.vertices if self.valid(v["id"])
+            ]
+            self.context.edges = [
+                e for e in self.context.edges if self.valid(e["start"]) and self.valid(e["end"])
+            ]
+        finally:
+            self.context.unlock()
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b497d14 and a353eec.

📒 Files selected for processing (4)
  • hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py (6 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py (6 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/scheduler.py (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-06-25T09:50:06.213Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py
📚 Learning: 2025-06-25T09:45:10.751Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:100-116
Timestamp: 2025-06-25T09:45:10.751Z
Learning: In hugegraph-llm BasePromptConfig class, llm_settings is a runtime property that is loaded from config through dependency injection during object initialization, not a static class attribute. Static analysis tools may flag this as missing but it's intentional design.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py
🧬 Code graph analysis (4)
hugegraph-llm/src/hugegraph_llm/operators/scheduler.py (8)
hugegraph-llm/src/hugegraph_llm/operators/common_op/check_schema.py (4)
  • CheckSchemaNode (179-332)
  • init (183-184)
  • run (45-60)
  • run (192-210)
hugegraph-llm/src/hugegraph_llm/operators/document_op/chunk_split.py (4)
  • ChunkSplitNode (33-86)
  • init (34-35)
  • run (74-86)
  • run (122-131)
hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/schema_manager.py (4)
  • SchemaManagerNode (79-147)
  • init (83-84)
  • run (66-76)
  • run (128-141)
hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py (1)
  • build_vector_index (106-111)
hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py (4)
  • BuildVectorIndexNode (39-79)
  • init (43-44)
  • run (60-79)
  • run (96-107)
hugegraph-llm/src/hugegraph_llm/state/ai_state.py (3)
  • WkFlowState (38-81)
  • WkFlowInput (21-35)
  • to_json (68-81)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py (4)
  • InfoExtractNode (218-355)
  • init (222-223)
  • run (167-191)
  • run (295-332)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py (4)
  • PropertyGraphExtractNode (187-304)
  • init (191-193)
  • run (97-122)
  • run (202-242)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py (3)
hugegraph-llm/src/hugegraph_llm/operators/util.py (1)
  • init_context (19-22)
hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (2)
  • get_chat_llm (25-46)
  • get_chat_llm (103-124)
hugegraph-llm/src/hugegraph_llm/state/ai_state.py (2)
  • WkFlowState (38-81)
  • WkFlowInput (21-35)
hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py (5)
hugegraph-llm/src/hugegraph_llm/utils/embedding_utils.py (3)
  • get_embeddings_parallel (33-73)
  • get_filename_prefix (76-83)
  • get_index_folder_name (86-91)
hugegraph-llm/src/hugegraph_llm/operators/util.py (1)
  • init_context (19-22)
hugegraph-llm/src/hugegraph_llm/models/embeddings/init_embedding.py (2)
  • get_embedding (32-52)
  • get_embedding (59-79)
hugegraph-llm/src/hugegraph_llm/state/ai_state.py (2)
  • WkFlowInput (21-35)
  • WkFlowState (38-81)
hugegraph-llm/src/hugegraph_llm/indices/vector_index.py (4)
  • VectorIndex (32-153)
  • from_index_file (40-80)
  • add (95-102)
  • to_index_file (82-93)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py (4)
hugegraph-llm/src/hugegraph_llm/document/chunk_split.py (1)
  • ChunkSplitter (23-56)
hugegraph-llm/src/hugegraph_llm/operators/util.py (1)
  • init_context (19-22)
hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (2)
  • get_chat_llm (25-46)
  • get_chat_llm (103-124)
hugegraph-llm/src/hugegraph_llm/state/ai_state.py (2)
  • WkFlowInput (21-35)
  • WkFlowState (38-81)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build (3.10)
  • GitHub Check: build (3.11)
🔇 Additional comments (2)
hugegraph-llm/src/hugegraph_llm/operators/index_op/build_vector_index.py (1)

65-71: Lock usage now has try/finally, LGTM

Ensures exception branches cannot leave the lock held.

hugegraph-llm/src/hugegraph_llm/operators/scheduler.py (1)

191-210: Return structure for empty results has been hardened, LGTM

Avoids KeyError and returns a consistent JSON structure.


@github-actions github-actions bot left a comment


license-eye has checked 332 files.

| Valid | Invalid | Ignored | Fixed |
| ----- | ------- | ------- | ----- |
| 265 | 3 | 64 | 0 |
Click to see the invalid file list
  • hugegraph-llm/src/hugegraph_llm/flows/build_vector_index.py
  • hugegraph-llm/src/hugegraph_llm/flows/common.py
  • hugegraph-llm/src/hugegraph_llm/flows/graph_extract.py
Use this command to fix any missing license headers
```bash
docker run -it --rm -v $(pwd):/github/workspace apache/skywalking-eyes header fix
```

</details>


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (4)
hugegraph-llm/src/hugegraph_llm/utils/graph_index_utils.py (2)

47-49: Fetching the filename prefix does not need an already-constructed Embedding instance.

For consistency and to avoid an unnecessary dependency, use the model name from the config mapping directly.

-    filename_prefix = get_filename_prefix(
-        llm_settings.embedding_type, getattr(builder.embedding, "model_name", None)
-    )
+    from ..models.embeddings.init_embedding import model_map
+    filename_prefix = get_filename_prefix(
+        llm_settings.embedding_type, model_map.get(llm_settings.embedding_type)
+    )

63-66: Same as above: avoid re-instantiating Embeddings when cleaning the index.

-    filename_prefix = get_filename_prefix(
-        llm_settings.embedding_type,
-        getattr(Embeddings().get_embedding(), "model_name", None),
-    )
+    from ..models.embeddings.init_embedding import model_map
+    filename_prefix = get_filename_prefix(
+        llm_settings.embedding_type, model_map.get(llm_settings.embedding_type)
+    )
hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py (2)

32-32: Missing PyCGraph dependency declaration

PyCGraph is imported in multiple places but never declared in the dependency configuration, causing CI failures. Add PyCGraph to pyproject.toml or the corresponding dependency file.


206-241: Lock granularity too coarse, hurting concurrency

The run method holds the lock across the LLM call (lines 222-231), which severely limits concurrency. LLM calls are typically long-running and should not execute while the lock is held.

Additionally, the call_count update (line 241) happens outside the lock, creating a race condition.

Adopt a "snapshot-process-writeback" pattern to shrink the lock scope:

 def run(self) -> CStatus:
     sts = self.node_init()
     if sts.isErr():
         return sts
+    
+    # 1) Read a context snapshot (short lock)
     self.context.lock()
     try:
         if self.context.schema is None or self.context.chunks is None:
             raise ValueError(
                 "parameter required by extract node not found in context."
             )
         schema = self.context.schema
-        chunks = self.context.chunks
+        chunks = list(self.context.chunks)  # copy to avoid mutation outside the lock
         if self.context.vertices is None:
             self.context.vertices = []
         if self.context.edges is None:
             self.context.edges = []
     finally:
         self.context.unlock()
 
+    # 2) LLM processing (no lock)
     items = []
     for chunk in chunks:
         proceeded_chunk = self.extract_property_graph_by_llm(schema, chunk)
         log.debug(
             "[LLM] %s input: %s \n output:%s",
             self.__class__.__name__,
             chunk,
             proceeded_chunk,
         )
         items.extend(self._extract_and_filter_label(schema, proceeded_chunk))
     items = filter_item(schema, items)
+    
+    # 3) Write back results (short lock)
     self.context.lock()
     try:
         for item in items:
             if item["type"] == "vertex":
                 self.context.vertices.append(item)
             elif item["type"] == "edge":
                 self.context.edges.append(item)
+        self.context.call_count = (self.context.call_count or 0) + len(chunks)
     finally:
         self.context.unlock()
-    self.context.call_count = (self.context.call_count or 0) + len(chunks)
     return CStatus()
🧹 Nitpick comments (15)
hugegraph-llm/src/hugegraph_llm/flows/common.py (1)

18-18: Add return type annotations to improve readability and static analysis.

Annotate the abstract methods' return types to avoid ambiguity for downstream implementations.

 from hugegraph_llm.state.ai_state import WkFlowInput
+from typing import Any

 class BaseFlow(ABC):
@@
-    def prepare(self, prepared_input: WkFlowInput, *args, **kwargs):
+    def prepare(self, prepared_input: WkFlowInput, *args, **kwargs) -> None:
@@
-    def build_flow(self, *args, **kwargs):
+    def build_flow(self, *args, **kwargs) -> Any:
@@
-    def post_deal(self, *args, **kwargs):
+    def post_deal(self, *args, **kwargs) -> Any:

Also applies to: 26-31, 33-39, 40-45

hugegraph-llm/src/hugegraph_llm/flows/build_vector_index.py (2)

31-35: prepare does not validate the empty-texts case, which can break downstream operators.

Add a lightweight check at the entry point to fail fast.

 def prepare(self, prepared_input: WkFlowInput, texts):
+    if not texts or (isinstance(texts, list) and len(texts) == 0):
+        raise ValueError("texts cannot be empty")
     prepared_input.texts = texts
     prepared_input.language = "zh"
     prepared_input.split_type = "paragraph"
     return

16-25: Add type annotations and parameter semantics for maintainability and IDE diagnostics.

Add typing to the methods; optionally, make language/split_type configurable parameters later (keeping current defaults, with no behavior change).

 from hugegraph_llm.flows.common import BaseFlow
 from hugegraph_llm.state.ai_state import WkFlowInput
+from typing import List, Union, Optional

@@
-class BuildVectorIndexFlow(BaseFlow):
+class BuildVectorIndexFlow(BaseFlow):
     def __init__(self):
         pass

-    def prepare(self, prepared_input: WkFlowInput, texts):
+    def prepare(self, prepared_input: WkFlowInput, texts: Union[str, List[str]]) -> None:
@@
-    def build_flow(self, texts):
+    def build_flow(self, texts: Union[str, List[str]]) -> GPipeline:
@@
-    def post_deal(self, pipeline=None):
+    def post_deal(self, pipeline: Optional[GPipeline] = None) -> str:

Also applies to: 27-55

hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py (2)

65-90: Returning a default embed_dim of 1024 for an empty index is misleading.

When the chunks index file is missing and record_miss=False, an empty index is returned whose d=1024 is not the real dimension. Return None for an empty index instead.

     chunk_vector_index = VectorIndex.from_index_file(
         str(os.path.join(resource_path, folder_name, "chunks")),
         filename_prefix,
         record_miss=False,
     )
@@
-    return json.dumps(
-        {
-            "embed_dim": chunk_vector_index.index.d,
+    empty_chunk = (
+        chunk_vector_index.index.ntotal == 0
+        and len(chunk_vector_index.properties) == 0
+    )
+    embed_dim = None if empty_chunk else chunk_vector_index.index.d
+    return json.dumps(
+        {
+            "embed_dim": embed_dim,
             "vector_info": {
                 "chunk_vector_num": chunk_vector_index.index.ntotal,
                 "graph_vid_vector_num": graph_vid_vector_index.index.ntotal,
                 "graph_properties_vector_num": len(chunk_vector_index.properties),
             },
         },
         ensure_ascii=False,
         indent=2,
     )

106-112: Match graph_index_utils: catch scheduling exceptions and convert them to UI-friendly errors.

Prevents unhandled exceptions from bubbling straight up to the frontend.

 def build_vector_index(input_file, input_text):
@@
-    scheduler = SchedulerSingleton.get_instance()
-    return scheduler.schedule_flow("build_vector_index", texts)
+    scheduler = SchedulerSingleton.get_instance()
+    try:
+        return scheduler.schedule_flow("build_vector_index", texts)
+    except Exception as e:  # pylint: disable=broad-exception-caught
+        raise gr.Error(str(e))
hugegraph-llm/src/hugegraph_llm/utils/graph_index_utils.py (1)

181-183: Remove leftover debug output.

Avoid printing sensitive or redundant information on the backend.

-        print(context)
hugegraph-llm/src/hugegraph_llm/flows/scheduler.py (3)

55-56: Initialization failure lacks context, making troubleshooting hard.

Attach status.getInfo():

-            if status.isErr():
-                raise RuntimeError("Error in flow init")
+            if status.isErr():
+                raise RuntimeError(f"Error in flow init {status.getInfo()}")

52-53: Spelling fix.

-            # call coresponding flow_func to create new workflow
+            # call corresponding flow_func to create new workflow

24-40: max_pipeline is unused.

Either implement the cap (e.g., limit GPipelineManager capacity) or remove the parameter to avoid confusion.

Do you plan to use max_pipeline to bound the number of active/cached pipelines in the manager?

hugegraph-llm/src/hugegraph_llm/flows/graph_extract.py (3)

32-32: Unused __init__ method

The __init__ method contains only pass, with no initialization logic. If no initialization is needed, delete the method.

 class GraphExtractFlow(BaseFlow):
-    def __init__(self):
-        pass
-
     def _import_schema(

59-65: Duplicated JSON-parsing logic

The JSON parsing in prepare (lines 59-65) duplicates the logic in build_flow (lines 81-87). prepare already parses the schema and sets prepared_input.schema, yet build_flow parses it again.

Refactor to avoid the duplicate parse:

 def prepare(
     self, prepared_input: WkFlowInput, schema, texts, example_prompt, extract_type
 ):
     # prepare input data
     prepared_input.texts = texts
     prepared_input.language = "zh"
     prepared_input.split_type = "document"
     prepared_input.example_prompt = example_prompt
-    prepared_input.schema = schema
     schema = schema.strip()
     if schema.startswith("{"):
         try:
             schema = json.loads(schema)
             prepared_input.schema = schema
+            prepared_input.is_json_schema = True
         except json.JSONDecodeError as exc:
             log.error("Invalid JSON format in schema. Please check it again.")
             raise ValueError("Invalid JSON format in schema.") from exc
     else:
         log.info("Get schema '%s' from graphdb.", schema)
         prepared_input.graph_name = schema
+        prepared_input.is_json_schema = False
     return

Then use prepared_input's state directly in build_flow:

 def build_flow(self, schema, texts, example_prompt, extract_type):
     pipeline = GPipeline()
     prepared_input = WkFlowInput()
     # prepare input data
     self.prepare(prepared_input, schema, texts, example_prompt, extract_type)
     
     pipeline.createGParam(prepared_input, "wkflow_input")
     pipeline.createGParam(WkFlowState(), "wkflow_state")
-    schema = schema.strip()
     schema_node = None
-    if schema.startswith("{"):
-        try:
-            schema = json.loads(schema)
-            schema_node = self._import_schema(from_user_defined=schema)
-        except json.JSONDecodeError as exc:
-            log.error("Invalid JSON format in schema. Please check it again.")
-            raise ValueError("Invalid JSON format in schema.") from exc
+    if getattr(prepared_input, 'is_json_schema', False):
+        schema_node = self._import_schema(from_user_defined=prepared_input.schema)
     else:
-        log.info("Get schema '%s' from graphdb.", schema)
-        schema_node = self._import_schema(from_hugegraph=schema)
+        schema_node = self._import_schema(from_hugegraph=prepared_input.graph_name)

113-122: The warning message may mislead users

When no vertices or edges are extracted, the warning "The schema may not match the Doc" may be too assertive. Other causes are possible: the document simply contains no relevant information, LLM extraction limitations, and so on.

Provide more comprehensive diagnostics:

 if not vertices and not edges:
-    log.info("Please check the schema.(The schema may not match the Doc)")
+    log.info("No vertices or edges extracted. Possible reasons: 1) Schema mismatch, 2) Document contains no extractable entities, 3) LLM extraction limitations")
     return json.dumps(
         {
             "vertices": vertices,
             "edges": edges,
-            "warning": "The schema may not match the Doc",
+            "warning": "No data extracted. Check: schema compatibility, document content, or extraction parameters",
         },
         ensure_ascii=False,
         indent=2,
     )
hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py (3)

244-249: Duplicated method implementation

PropertyGraphExtractNode.extract_property_graph_by_llm (lines 244-249) is identical to PropertyGraphExtract.extract_property_graph_by_llm (lines 124-129), duplicating code.

Extract it as a module-level function or base-class method to avoid duplication:

+def extract_property_graph_by_llm(llm, schema, chunk, example_prompt=None):
+    """提取属性图的通用方法"""
+    prompt = generate_extract_property_graph_prompt(chunk, schema)
+    if example_prompt is not None:
+        prompt = example_prompt + prompt
+    return llm.generate(prompt=prompt)

 class PropertyGraphExtract:
     # ...
     def extract_property_graph_by_llm(self, schema, chunk):
-        prompt = generate_extract_property_graph_prompt(chunk, schema)
-        if self.example_prompt is not None:
-            prompt = self.example_prompt + prompt
-        return self.llm.generate(prompt=prompt)
+        return extract_property_graph_by_llm(self.llm, schema, chunk, self.example_prompt)

 class PropertyGraphExtractNode(GNode):
     # ...
     def extract_property_graph_by_llm(self, schema, chunk):
-        prompt = generate_extract_property_graph_prompt(chunk, schema)
-        if self.example_prompt is not None:
-            prompt = self.example_prompt + prompt
-        return self.llm.generate(prompt=prompt)
+        return extract_property_graph_by_llm(self.llm, schema, chunk, self.example_prompt)

250-304: Duplicated _extract_and_filter_label method

PropertyGraphExtractNode._extract_and_filter_label (lines 250-304) is nearly identical to PropertyGraphExtract._extract_and_filter_label (lines 130-184), differing only slightly in log formatting.

Extract the method into a shared utility function:

+def extract_and_filter_label(schema, text, necessary_keys, logger=log) -> List[Dict[str, Any]]:
+    """从文本中提取并过滤标签的通用方法"""
+    json_match = re.search(r"({.*})", text, re.DOTALL)
+    if not json_match:
+        logger.critical(
+            "Invalid property graph! No JSON object found, "
+            "please check the output format example in prompt."
+        )
+        return []
+    json_str = json_match.group(1).strip()
+    
+    items = []
+    try:
+        property_graph = json.loads(json_str)
+        # ... remaining logic unchanged
+    except json.JSONDecodeError:
+        logger.critical(
+            "Invalid property graph JSON! Please check the extracted JSON data carefully"
+        )
+    return items

 class PropertyGraphExtract:
     def _extract_and_filter_label(self, schema, text) -> List[Dict[str, Any]]:
-        # full method implementation
+        return extract_and_filter_label(schema, text, self.NECESSARY_ITEM_KEYS)

 class PropertyGraphExtractNode(GNode):
     def _extract_and_filter_label(self, schema, text) -> List[Dict[str, Any]]:
-        # full method implementation
+        return extract_and_filter_label(schema, text, self.NECESSARY_ITEM_KEYS)

34-37: TODO comment needs follow-up

The TODO notes that SCHEMA_EXAMPLE_PROMPT does not update with changes to prompt.extract_graph_prompt after system load, which may not be the intended behavior.

Confirm whether this behavior matches the design. If dynamic updates are needed, I can help implement a lazily-evaluated solution. Should I create an issue to track this?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a353eec and b31eca3.

📒 Files selected for processing (8)
  • hugegraph-llm/src/hugegraph_llm/flows/__init__.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/flows/build_vector_index.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/flows/common.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/flows/graph_extract.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/flows/scheduler.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py (6 hunks)
  • hugegraph-llm/src/hugegraph_llm/utils/graph_index_utils.py (9 hunks)
  • hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py (3 hunks)
✅ Files skipped from review due to trivial changes (1)
  • hugegraph-llm/src/hugegraph_llm/flows/__init__.py
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-07-31T12:32:32.542Z
Learnt from: Gfreely
PR: hugegraph/hugegraph-ai#43
File: hugegraph-llm/src/hugegraph_llm/utils/embedding_utils.py:76-83
Timestamp: 2025-07-31T12:32:32.542Z
Learning: Embedding model names in HugeGraph-AI follow strict naming conventions and typically only contain letters (a-z, A-Z), numbers (0-9), hyphens (-), and periods (.) for version numbers. They do not contain other extraneous characters that would require extensive filename sanitization beyond basic slash replacement.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py
📚 Learning: 2025-06-25T09:50:06.213Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:124-137
Timestamp: 2025-06-25T09:50:06.213Z
Learning: Language-specific prompt attributes (answer_prompt_CN, answer_prompt_EN, extract_graph_prompt_CN, extract_graph_prompt_EN, gremlin_generate_prompt_CN, gremlin_generate_prompt_EN, keywords_extract_prompt_CN, keywords_extract_prompt_EN, doc_input_text_CN, doc_input_text_EN) are defined in the PromptConfig class in hugegraph-llm/src/hugegraph_llm/config/prompt_config.py, which inherits from BasePromptConfig, making these attributes accessible in the parent class methods.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py
📚 Learning: 2025-06-25T09:45:10.751Z
Learnt from: day0n
PR: hugegraph/hugegraph-ai#16
File: hugegraph-llm/src/hugegraph_llm/config/models/base_prompt_config.py:100-116
Timestamp: 2025-06-25T09:45:10.751Z
Learning: In hugegraph-llm BasePromptConfig class, llm_settings is a runtime property that is loaded from config through dependency injection during object initialization, not a static class attribute. Static analysis tools may flag this as missing but it's intentional design.

Applied to files:

  • hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py
🧬 Code graph analysis (6)
hugegraph-llm/src/hugegraph_llm/flows/common.py (2)
hugegraph-llm/src/hugegraph_llm/state/ai_state.py (1)
  • WkFlowInput (21-35)
hugegraph-llm/src/hugegraph_llm/flows/graph_extract.py (3)
  • prepare (49-69)
  • build_flow (71-106)
  • post_deal (108-127)
hugegraph-llm/src/hugegraph_llm/flows/scheduler.py (3)
hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py (1)
  • build_vector_index (106-111)
hugegraph-llm/src/hugegraph_llm/flows/common.py (4)
  • BaseFlow (21-45)
  • build_flow (34-38)
  • post_deal (41-45)
  • prepare (27-31)
hugegraph-llm/src/hugegraph_llm/flows/graph_extract.py (4)
  • GraphExtractFlow (30-127)
  • build_flow (71-106)
  • post_deal (108-127)
  • prepare (49-69)
hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py (4)
hugegraph-llm/src/hugegraph_llm/flows/scheduler.py (3)
  • SchedulerSingleton (75-85)
  • get_instance (80-85)
  • schedule_flow (45-72)
hugegraph-llm/src/hugegraph_llm/utils/embedding_utils.py (2)
  • get_filename_prefix (76-83)
  • get_index_folder_name (86-91)
hugegraph-llm/src/hugegraph_llm/indices/vector_index.py (3)
  • VectorIndex (32-153)
  • from_index_file (40-80)
  • clean (136-153)
hugegraph-llm/src/hugegraph_llm/operators/kg_construction_task.py (1)
  • build_vector_index (92-94)
hugegraph-llm/src/hugegraph_llm/utils/graph_index_utils.py (3)
hugegraph-llm/src/hugegraph_llm/flows/scheduler.py (3)
  • SchedulerSingleton (75-85)
  • get_instance (80-85)
  • schedule_flow (45-72)
hugegraph-llm/src/hugegraph_llm/operators/kg_construction_task.py (5)
  • KgBuilder (39-113)
  • fetch_graph_data (58-60)
  • run (106-109)
  • chunk_split (62-69)
  • extract_info (71-78)
hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py (1)
  • read_documents (34-60)
hugegraph-llm/src/hugegraph_llm/flows/graph_extract.py (6)
hugegraph-llm/src/hugegraph_llm/flows/common.py (4)
  • BaseFlow (21-45)
  • prepare (27-31)
  • build_flow (34-38)
  • post_deal (41-45)
hugegraph-llm/src/hugegraph_llm/state/ai_state.py (2)
  • WkFlowInput (21-35)
  • WkFlowState (38-81)
hugegraph-llm/src/hugegraph_llm/operators/document_op/chunk_split.py (1)
  • ChunkSplitNode (33-86)
hugegraph-llm/src/hugegraph_llm/operators/hugegraph_op/schema_manager.py (1)
  • SchemaManagerNode (79-147)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/info_extract.py (1)
  • InfoExtractNode (218-355)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py (1)
  • PropertyGraphExtractNode (187-304)
hugegraph-llm/src/hugegraph_llm/operators/llm_op/property_graph_extract.py (5)
hugegraph-llm/src/hugegraph_llm/document/chunk_split.py (1)
  • ChunkSplitter (23-56)
hugegraph-llm/src/hugegraph_llm/models/llms/base.py (2)
  • BaseLLM (22-74)
  • generate (26-31)
hugegraph-llm/src/hugegraph_llm/operators/util.py (1)
  • init_context (19-22)
hugegraph-llm/src/hugegraph_llm/models/llms/init_llm.py (2)
  • get_chat_llm (25-46)
  • get_chat_llm (103-124)
hugegraph-llm/src/hugegraph_llm/state/ai_state.py (2)
  • WkFlowState (38-81)
  • WkFlowInput (21-35)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build (3.10)
  • GitHub Check: build (3.11)
🔇 Additional comments (2)
hugegraph-llm/src/hugegraph_llm/flows/common.py (1)

21-45: Abstract interface matches the scheduler contract, LGTM.

The interface methods match how the Scheduler uses them; no obvious issues.

hugegraph-llm/src/hugegraph_llm/flows/scheduler.py (1)

65-71: Confirm the pipeline resets wkflow_state before reuse (or explicitly calls WkFlowState.reset)

Verification results: WkFlowState.reset exists in hugegraph-llm/src/hugegraph_llm/state/ai_state.py; the flows (graph_extract.py, build_vector_index.py) create state at construction via pipeline.createGParam(WkFlowState(), "wkflow_state"); hugegraph-llm/src/hugegraph_llm/flows/scheduler.py calls pipeline.init() near the reuse path (line 54). It could not be confirmed whether pipeline.init() clears wkflow_state or node-internal caches. Confirm whether pipeline.init() triggers WkFlowState.reset; if not, reset explicitly on the reuse branch (call pipeline.init()/pipeline.reset(), call WkFlowState.reset, or swap in a fresh GParam instance) to avoid stale data leaking across requests.
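An explicit reset on the reuse branch can be sketched as follows (WkFlowState here is a minimal stand-in with only illustrative fields, not the project's full class):

```python
class WkFlowState:
    """Minimal stand-in (assumption) for the project's WkFlowState."""

    def __init__(self):
        self.reset()

    def reset(self):
        # clear everything a previous request may have written
        self.chunks = None
        self.vertices = []
        self.edges = []


def prepare_for_reuse(state: WkFlowState) -> WkFlowState:
    # Clear cross-request data explicitly instead of relying on pipeline.init()
    state.reset()
    return state
```

Calling this on the cached pipeline's state param before dispatch makes the reuse path safe regardless of what pipeline.init() does internally.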

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
hugegraph-llm/src/hugegraph_llm/operators/util.py (1)

16-16: Provide a tolerant import and compatibility stub for PyCGraph, so CI/lint does not fail and CStatus semantics stay aligned

A bare from PyCGraph import CStatus raises ImportError / Pylint E0611 in environments without the dependency, and later code uses CStatus(-1, "...") to signal errors. With the previously suggested minimal stub (ok: bool), -1 would be truthy and misread as success, masking errors. Suggestions:

  • Wrap the import in try/except and provide a stub that accepts both a bool and an int code (int < 0 means error).
  • Add # type: ignore[import] to silence type-checker warnings.

Apply the following fix (replacing the import):

-from PyCGraph import CStatus
+try:
+    from PyCGraph import CStatus  # type: ignore[import]
+except Exception:  # pragma: no cover
+    class CStatus:  # minimal fallback compatible with bool or int code
+        def __init__(self, code_or_ok=True, info: str = "") -> None:
+            try:
+                # Support both constructors: CStatus(True/False, msg) and CStatus(code:int, msg)
+                if isinstance(code_or_ok, bool):
+                    self._ok = code_or_ok
+                    self._code = 0 if code_or_ok else -1
+                else:
+                    code = int(code_or_ok)
+                    self._code = code
+                    self._ok = code >= 0
+            except Exception:
+                self._code = -1
+                self._ok = False
+            self._info = info
+        def isErr(self) -> bool:
+            return not self._ok
+        def getInfo(self) -> str:
+            return self._info

It is worth scanning the repo for any remaining CStatus(-1, ...) usages to keep the semantics consistent:

#!/bin/bash
# Find usages that signal errors with a negative code
rg -nP 'CStatus\(\s*-\d+\s*,'
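To pin down the semantics the stub must preserve, here is a standalone copy of the fallback class with its intended behavior made explicit: CStatus() is success, a negative int code or False is an error. The isErr()/getInfo() method names are assumed from how the real PyCGraph CStatus is used in this codebase.

```python
# Standalone copy of the fallback stub, with the intended semantics:
# success by default, error on a negative int code or an explicit False.

class CStatus:
    def __init__(self, code_or_ok=True, info: str = "") -> None:
        try:
            if isinstance(code_or_ok, bool):
                self._ok = code_or_ok
                self._code = 0 if code_or_ok else -1
            else:
                code = int(code_or_ok)
                self._code = code
                self._ok = code >= 0
        except Exception:
            self._code = -1
            self._ok = False
        self._info = info

    def isErr(self) -> bool:
        return not self._ok

    def getInfo(self) -> str:
        return self._info
```

With this shape, CStatus(-1, "msg") is correctly treated as an error, unlike the earlier bool-only stub.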
🧹 Nitpick comments (2)
hugegraph-llm/src/hugegraph_llm/operators/util.py (2)

19-27: Avoid half-initialized state and improve error observability: fetch and validate first, then assign, and name the missing keys

As written, if wkflow_state succeeds but wkflow_input fails, obj.context is left half-initialized when the error is returned, and the message does not say which key is missing. Suggestions:

  • Read into locals and validate; write back to obj only once both values are valid.
  • Include the missing parameter names in the error message to ease debugging.
  • Catch narrower exceptions (e.g. AttributeError) and preserve the original exception context.

Apply the following refactor (the return-value protocol is unchanged):

-def init_context(obj) -> CStatus:
-    try:
-        obj.context = obj.getGParamWithNoEmpty("wkflow_state")
-        obj.wk_input = obj.getGParamWithNoEmpty("wkflow_input")
-        if obj.context is None or obj.wk_input is None:
-            return CStatus(-1, "Required workflow parameters not found")
-        return CStatus()
-    except Exception as e:
-        return CStatus(-1, f"Failed to initialize context: {str(e)}")
+def init_context(obj) -> CStatus:
+    try:
+        context = obj.getGParamWithNoEmpty("wkflow_state")
+        wk_input = obj.getGParamWithNoEmpty("wkflow_input")
+    except AttributeError as e:
+        return CStatus(-1, f"Object missing getGParamWithNoEmpty(): {e}")
+    except Exception as e:
+        return CStatus(-1, f"Failed to fetch workflow parameters: {e}")
+
+    missing = []
+    if context is None:
+        missing.append("wkflow_state")
+    if wk_input is None:
+        missing.append("wkflow_input")
+    if missing:
+        return CStatus(-1, f"Required workflow parameters not found: {', '.join(missing)}")
+
+    obj.context = context
+    obj.wk_input = wk_input
+    return CStatus()

Does the getGParamWithNoEmpty contract guarantee that empty values (empty string / empty container) are never returned, only None? If not, the None check needs to be broadened into an emptiness check. Please confirm.
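If the contract does not rule out empty values, the None check could be broadened along these lines. The is_missing helper is a hypothetical illustration, not part of the existing codebase:

```python
# Hypothetical helper: treat None and empty strings/containers as "not provided",
# while keeping falsy-but-meaningful values like 0 and False as valid.

def is_missing(value) -> bool:
    if value is None:
        return True
    if isinstance(value, (str, list, dict, set, tuple)) and len(value) == 0:
        return True
    return False
```

Note the deliberate asymmetry with a plain `not value` check: 0 and False would pass here, which matters if numeric or boolean parameters are ever routed through the same path.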


26-27: Do not swallow all exceptions; at least log the traceback or preserve the exception chain

except Exception as e: followed by a freshly built message drops the stack trace. Use exception chaining or logging to keep the context. If the project uses a unified logging framework, call logger.exception(...) here.

A minimal change:

-    except Exception as e:
-        return CStatus(-1, f"Failed to initialize context: {str(e)}")
+    except Exception as e:
+        # Optional: logger.exception("init_context failed")
+        return CStatus(-1, f"Failed to initialize context: {e}")  # keep the original exception text
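Where a status object is not required and an exception can propagate, the chain-preserving pattern looks like this. InitError and fetch_param are illustrative names; only the stdlib logging module and `raise ... from` are real:

```python
# Sketch: translate an exception without losing its traceback.
import logging

logger = logging.getLogger("init_context")

class InitError(RuntimeError):
    pass

def fetch_param():
    # Stand-in for getGParamWithNoEmpty failing
    raise KeyError("wkflow_state")

def init_context_strict():
    try:
        return fetch_param()
    except Exception as e:
        logger.exception("init_context failed")  # records the full traceback
        raise InitError("Failed to initialize context") from e  # keeps the chain
```

The original KeyError stays reachable via `__cause__`, so callers and log aggregators see the real failure, not just the translated message.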
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b31eca3 and f5f6f8e.

📒 Files selected for processing (2)
  • hugegraph-llm/src/hugegraph_llm/flows/scheduler.py (1 hunks)
  • hugegraph-llm/src/hugegraph_llm/operators/util.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • hugegraph-llm/src/hugegraph_llm/flows/scheduler.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: build (3.11)
  • GitHub Check: build (3.10)
  • GitHub Check: build (3.10)
  • GitHub Check: build (3.11)
🔇 Additional comments (1)
hugegraph-llm/src/hugegraph_llm/operators/util.py (1)

1-14: License header consistency looks good

The Apache-2.0 header is well-formed and matches the repository style; no changes needed.

@imbajin imbajin requested a review from Copilot September 16, 2025 17:11
Copilot AI left a comment

Pull Request Overview

This PR refactors the Scheduler class to introduce a dynamic workflow scheduling system with pipeline pooling. It transitions from the traditional KgBuilder pattern to a more modular, node-based architecture using the PyCGraph framework.

  • Introduces a pipeline pool architecture for managing reusable workflows (build_vector_index and graph_extract)
  • Adds new node-based operators for chunking, schema management, LLM operations, and vector indexing with proper state management
  • Replaces direct KgBuilder usage with centralized SchedulerSingleton for workflow execution

Reviewed Changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.

Show a summary per file
| File | Description |
| --- | --- |
| hugegraph-llm/src/hugegraph_llm/utils/vector_index_utils.py | Replaces KgBuilder with SchedulerSingleton for vector index building |
| hugegraph-llm/src/hugegraph_llm/utils/graph_index_utils.py | Adds scheduler integration and introduces original extract_graph function |
| hugegraph-llm/src/hugegraph_llm/flows/scheduler.py | Core scheduler implementation with pipeline pooling and flow management |
| hugegraph-llm/src/hugegraph_llm/state/ai_state.py | Workflow state and input parameter classes for node-based processing |
| hugegraph-llm/src/hugegraph_llm/operators/ | Multiple node implementations for modular workflow operations |
| hugegraph-llm/pyproject.toml | Adds pycgraph dependency for Linux systems |

)
break
- graph["vertices"] = vertices_dict.values()
+ graph["vertices"] = list(vertices_dict.values())
Copilot AI Sep 16, 2025

This line overwrites the existing vertices in the graph instead of appending to them. This could cause data loss if the graph already contains vertices from previous processing steps.

Suggested change
- graph["vertices"] = list(vertices_dict.values())
+ # Append new vertices to graph["vertices"] without overwriting existing ones
+ existing_ids = {v["id"] for v in graph.get("vertices", [])}
+ for v in vertices_dict.values():
+     if v["id"] not in existing_ids:
+         graph.setdefault("vertices", []).append(v)

Copilot uses AI. Check for mistakes.
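Both Copilot comments flag the same overwrite-instead-of-merge pattern, which can be factored into one helper. merge_vertices and the `{"id": ...}` dict shape are illustrative assumptions, not names from the PR:

```python
# Dedup-merge helper for the pattern flagged in both review comments:
# append only vertices whose id is not already present, in place.

def merge_vertices(existing: list, new_vertices) -> list:
    seen = {v["id"] for v in existing}
    for v in new_vertices:
        if v["id"] not in seen:
            existing.append(v)
            seen.add(v["id"])
    return existing
```

Using one helper for both graph["vertices"] and self.context.vertices keeps the dedup rule in a single place, so the node-based and dict-based code paths cannot drift apart.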
}
)
break
self.context.vertices = list(vertices_dict.values())
Copilot AI Sep 16, 2025

This line overwrites the existing vertices in the context instead of extending them. In the node-based implementation, this could lose vertices that were processed in previous chunks or iterations.

Suggested change
- self.context.vertices = list(vertices_dict.values())
+ # Merge new vertices into context.vertices, avoiding duplicates
+ existing_ids = {v["id"] for v in self.context.vertices}
+ for v in vertices_dict.values():
+     if v["id"] not in existing_ids:
+         self.context.vertices.append(v)
