
loader update#7

Open
kenssa4eedfd wants to merge 32 commits into master from loader-update

Conversation

@kenssa4eedfd
Collaborator

kenssa4eedfd commented Jan 16, 2026

…ader, and refactored the failure-handling logic for batch inserts.

Purpose of the PR

  • Adjusted several default parameters and descriptions in the Loader, and refactored the failure-handling logic for batch inserts.

Main Changes

Verifying these changes

  • Trivial rework / code cleanup without any test coverage. (No Need)
  • Already covered by existing tests, such as (please modify tests here).
  • Need tests and can be verified as follows:
    • xxx

Does this PR potentially affect the following parts?

  • Nope
  • Dependencies (add/update license info)
  • Modify configurations
  • The public API
  • Other effects (typed here)

Documentation Status

  • Doc - TODO
  • Doc - Done
  • Doc - No Need

Summary by CodeRabbit

Release Notes

New Features

  • Added --parser-threads (keeping --parallel-count as a compatible but deprecated alias); added batch-failure-fallback (enabled by default).
  • Connection pool defaults are now computed from the CPU count and can be auto-adjusted when the parse/batch thread counts change.

Bug Fixes

  • Improved batch-insert error handling: with fallback enabled, a failed batch falls back to single-record inserts; with it disabled, loading stops and the error is reported to the user.

Tests

  • Added/extended unit and functional tests covering concurrency, fallback, and connection pool auto-adjustment behavior.

Ops/Other

  • Removed the CI workflow that posted automatic PR comments; adjusted review rules in the project configuration file.


@github-actions

@codecov-ai-reviewer review

@coderabbitai

coderabbitai bot commented Jan 16, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Migrates the parse-thread configuration from parallelCount to parseThreads (keeping a compatible alias); adds connection pool default constants and auto-adjusts the pool after parsing based on the batch thread count; introduces an optional single-insert fallback path when a batch insert fails; adds/updates the related unit and functional tests; and removes one CI workflow file.

Changes

Cohort / File(s) — Change Summary

Thread configuration and connection pool management
hugegraph-loader/src/main/java/org/apache/hugegraph/loader/executor/LoadOptions.java
Added DEFAULT_MAX_CONNECTIONS and DEFAULT_MAX_CONNECTIONS_PER_ROUTE; replaced the public field parallelCount with parseThreads (accepting both --parallel-count and --parser-threads, logging a deprecation warning for the former); added the batchFailureFallback option; added a private method that auto-adjusts the connection pool after parsing, with related logging.
Core loader updates
hugegraph-loader/src/main/java/org/apache/hugegraph/loader/HugeGraphLoader.java
Replaced in-method uses of parallelCount for the parallel parse threads with parseThreads; removed the in-method default assignment for parallelCount; updated logging and thread pool creation to use parseThreads.
Batch insert failure handling
hugegraph-loader/src/main/java/org/apache/hugegraph/loader/task/TaskManager.java
Added a stopped-state check in submitBatch to return early and release the semaphore; in the batch-insert failure callback, either fall back to single inserts or mark an error and stop loading, depending on batchFailureFallback; ensured the semaphore is released and time statistics are recorded on all paths, with corresponding logs/user notifications.
Functional test adjustments
hugegraph-loader/src/test/java/org/apache/hugegraph/loader/test/functional/FileLoadTest.java, .../KafkaLoadTest.java
Added --parser-threads "1" to many test command lines (with FIXME comments) so tests parse with a fixed thread count for determinism.
LoadOptions unit tests
hugegraph-loader/src/test/java/org/apache/hugegraph/loader/test/unit/LoadOptionsTest.java
New unit tests (~213 lines) covering connection pool auto-adjustment, parse-thread defaults and validation, and handling of the deprecated --parallel-count parameter, with log assertions.
TaskManager unit tests
hugegraph-loader/src/test/java/org/apache/hugegraph/loader/test/unit/TaskManagerTest.java
New unit tests (~536 lines) simulating batch failure fallback (enabled/disabled), concurrent submission, semaphore behavior, and related metric and context-state assertions.
CI workflow
.github/workflows/auto-pr-review.yml
Deleted this GitHub Actions workflow file (removing the automatic PR comment job).
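The deprecated-alias handling summarized above (accepting both --parallel-count and --parser-threads while warning on the old name) can be sketched roughly as follows. This is an illustration only: the real LoadOptions uses JCommander's @Parameter with multiple names, while this hand-rolled parser and its field names are assumptions.

```java
// Illustrative sketch only: shows the alias-with-deprecation-warning idea,
// not the actual JCommander-based LoadOptions code.
public class ParserThreadsAlias {
    static int parseThreads = -1;          // -1 means "not set"
    static boolean deprecationWarned = false;

    static void parse(String[] args) {
        for (int i = 0; i + 1 < args.length; i++) {
            switch (args[i]) {
                case "--parser-threads":   // new, preferred name
                    parseThreads = Integer.parseInt(args[i + 1]);
                    break;
                case "--parallel-count":   // deprecated alias, still accepted
                    deprecationWarned = true;
                    System.err.println("WARN: --parallel-count is deprecated, " +
                                       "use --parser-threads instead");
                    parseThreads = Integer.parseInt(args[i + 1]);
                    break;
                default:
                    break;
            }
        }
    }

    public static void main(String[] args) {
        parse(new String[]{"--parallel-count", "4"});
        System.out.println(parseThreads);  // the alias still sets the new field
    }
}
```

Both flags write to the same field, so downstream code only ever sees parseThreads regardless of which spelling the user typed.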

Sequence Diagram

sequenceDiagram
    participant Client as Client/User
    participant TaskMgr as TaskManager
    participant BatchTask as BatchInsertTask
    participant GraphMgr as GraphManager
    participant SingleTask as SingleInsertTask

    Client->>TaskMgr: submitBatch(edges)
    TaskMgr->>TaskMgr: acquireBatchSemaphore
    alt context stopped
        TaskMgr->>TaskMgr: releaseBatchSemaphore
        TaskMgr-->>Client: return early
    else context running
        TaskMgr->>BatchTask: submit batch insert
        BatchTask->>GraphMgr: insertInBatch()
        alt success
            GraphMgr-->>BatchTask: success
            BatchTask->>TaskMgr: update metrics & releaseSemaphore
        else failure
            GraphMgr-->>BatchTask: error
            BatchTask->>TaskMgr: whenComplete(error)
            alt batchFailureFallback enabled
                TaskMgr->>TaskMgr: log warning "try single insert"
                TaskMgr->>SingleTask: submit single inserts
                SingleTask->>GraphMgr: insertOne() x N
                GraphMgr-->>SingleTask: results
                TaskMgr->>TaskMgr: releaseBatchSemaphore
            else batchFailureFallback disabled
                TaskMgr->>TaskMgr: mark error in context
                TaskMgr->>TaskMgr: stop loading & print error
                TaskMgr->>TaskMgr: releaseBatchSemaphore
            end
        end
    end
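The failure branch in the diagram above can be condensed into a small runnable sketch. The names (submitBatch, SINGLE_INSERTS, STOPPED) are simplified stand-ins, not the real TaskManager API: on batch failure, the loader either retries records one by one or marks the load as stopped.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of the fallback decision shown in the sequence diagram
// (assumed names, not the real TaskManager API).
public class FallbackSketch {
    static final AtomicInteger SINGLE_INSERTS = new AtomicInteger();
    static final AtomicBoolean STOPPED = new AtomicBoolean(false);

    static void submitBatch(boolean failBatch, boolean fallbackEnabled,
                            int batchSize) {
        CompletableFuture<Void> future = CompletableFuture
            .runAsync(() -> {
                if (failBatch) {
                    throw new RuntimeException("simulated batch failure");
                }
            })
            .whenComplete((r, e) -> {
                if (e != null) {
                    if (fallbackEnabled) {
                        for (int i = 0; i < batchSize; i++) {
                            SINGLE_INSERTS.incrementAndGet(); // retry one by one
                        }
                    } else {
                        STOPPED.set(true);                    // fail fast
                    }
                }
            });
        try {
            future.join(); // block so the effect is observable in this sketch
        } catch (CompletionException ignored) {
            // the failure was already handled inside whenComplete
        }
    }
}
```

With fallback enabled a failed 3-record batch yields 3 single inserts; with it disabled the same failure flips the stopped flag instead.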

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • Create auto-pr-review.yml #1 — touches the same path .github/workflows/auto-pr-review.yml; may have a direct file-level relationship or conflict with this PR's deletion of that workflow.

Poem

🐰 A little rabbit hops across the green,
Parse threads renamed, both quiet and keen,
When batches tumble, single inserts glean,
The pool adapts, the logs stay soft, unseen,
With tests on guard, the code smiles serene :)

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title Check — ❓ Inconclusive: The title "loader update" is vague and generic, lacking specificity about the actual changes made to the codebase. Consider a more descriptive title such as "Refactor batch insert failure handling and standardize parser thread parameters" or "Add parser-threads option and improve batch insert fallback behavior".

✅ Passed checks (1 passed)

  • Description Check — ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.



@gemini-code-assist

Summary of Changes

Hello @kenssa4eedfd, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the HugeGraph Loader by refining its configuration options for parallelism and connection management, making it more robust and user-friendly. It also significantly improves the failure handling mechanism for batch inserts, ensuring that the loading process stops gracefully and informs the user immediately upon encountering critical errors, rather than attempting less efficient fallback strategies.

Highlights

  • Loader Parallelism Configuration: The parallelCount option has been renamed to parallelThreads (with --parser-threads as an alias) and its default behavior has been updated. It now automatically determines the number of parallel read pipelines based on the number of available CPU cores if not explicitly set, and includes validation to ensure the value is at least 1.
  • Connection Pool Defaults and Descriptions: New constants DEFAULT_MAX_CONNECTIONS and DEFAULT_MAX_CONNECTIONS_PER_ROUTE have been introduced. The descriptions for batch-insert-threads, max-conn, and max-conn-per-route parameters have been enhanced to clarify how connection limits may be auto-adjusted based on the batch-insert-threads setting.
  • Refactored Batch Insert Failure Handling: The error handling logic for batch inserts has been significantly changed. Instead of attempting single inserts after a batch failure, the system will now immediately log an error, mark the loading context as stopped, shut down the batch and single insert services, and print a user-friendly error message, ensuring a more decisive failure response.
  • Early Exit for Stopped Loading Context: An early exit condition has been added to the submitBatch method. If the loading context has been stopped (e.g., due to a previous error), subsequent batch submissions will be prevented, improving system stability and preventing unnecessary work.
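The early-exit behavior in the last bullet can be sketched with a plain Semaphore. Names like submitBatch and stopped here are simplified stand-ins for the real TaskManager/LoadContext API; the point is only that the permit must be released on the early-return path too.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch of the early-exit pattern (simplified names, not the
// real TaskManager): acquire the batch semaphore, then bail out and release
// the permit immediately if loading has already been stopped.
public class EarlyExitSketch {
    final Semaphore batchSemaphore = new Semaphore(1);
    final AtomicBoolean stopped = new AtomicBoolean(false);
    int submitted = 0;

    boolean submitBatch() throws InterruptedException {
        batchSemaphore.acquire();
        if (stopped.get()) {
            batchSemaphore.release(); // never leak the permit
            return false;             // early return, no work submitted
        }
        try {
            submitted++;              // stand-in for the real batch insert
            return true;
        } finally {
            batchSemaphore.release();
        }
    }
}
```

Without the release inside the early-return branch, each rejected submission would permanently consume a permit and eventually deadlock later submitters.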



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates several loader parameters and refactors the failure handling for batch inserts. The parameter changes, such as renaming --parallel-count to --parser-threads and updating default values and descriptions, improve clarity and usability. The core change is in the batch insert failure logic, which now adopts a fail-fast strategy instead of falling back to single inserts. While this can be a valid approach, I've found a couple of issues in the new implementation within TaskManager.java. Specifically, there's a bug in metrics reporting where the 'in-flight' count is not correctly updated on failure, and the new shutdown logic is decentralized, which could be improved for better robustness and maintainability. My review includes a suggestion to address these points.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
hugegraph-loader/src/main/java/org/apache/hugegraph/loader/task/TaskManager.java (1)

141-156: Race condition between acquire() and runAsync(): executor shutdown can leak the semaphore and metrics

context.stopped() is only checked before acquire; if the executor is shut down while waiting, the subsequent runAsync() throws RejectedExecutionException synchronously, so the whenComplete callback is never registered and the release and metric cleanup never run.

Concrete scenario: when a batch insert task fails, the whenComplete callback calls batchService.shutdown(); any batch submitted after acquire returns but before runAsync executes hits the shut-down executor, the exception goes uncaught, and the semaphore permit and the plusFlighting count leak permanently.

Suggestion: add a second stopped check after acquire, and add a safety net for runAsync submission failures.

🔧 Suggested fix
+import java.util.concurrent.RejectedExecutionException;
...
         try {
             this.batchSemaphore.acquire();
         } catch (InterruptedException e) {
             throw new LoadException("Interrupted while waiting to submit %s " +
                                     "batch in batch mode", e, mapping.type());
         }
+        if (this.context.stopped()) {
+            this.batchSemaphore.release();
+            return;
+        }
         LoadSummary summary = this.context.summary();
         summary.metrics(struct).plusFlighting(batch.size());
 
         InsertTask task = new BatchInsertTask(this.context, struct,
                                               mapping, batch);
-        CompletableFuture.runAsync(task, this.batchService).whenComplete(
-            (r, e) -> {
-                if (e != null) {
-                    LOG.error("Batch insert {} error, interrupting import", mapping.type(), e);
-                    this.context.occurredError();
-                    this.context.stopLoading();
-                    this.batchService.shutdown();
-                    this.singleService.shutdown();
-                    Printer.printError("Batch insert %s failed, stop loading, Please check the logs",
-                                       mapping.type().string());
-                } else {
-                    summary.metrics(struct).minusFlighting(batch.size());
-                }
-
-                this.batchSemaphore.release();
-                long end = System.currentTimeMillis();
-                this.context.summary().addTimeRange(mapping.type(), start, end);
-            });
+        try {
+            CompletableFuture.runAsync(task, this.batchService).whenComplete(
+                (r, e) -> {
+                    if (e != null) {
+                        LOG.error("Batch insert {} error, interrupting import", mapping.type(), e);
+                        this.context.occurredError();
+                        this.context.stopLoading();
+                        this.batchService.shutdown();
+                        this.singleService.shutdown();
+                        Printer.printError("Batch insert %s failed, stop loading, Please check the logs",
+                                           mapping.type().string());
+                    } else {
+                        summary.metrics(struct).minusFlighting(batch.size());
+                    }
+
+                    this.batchSemaphore.release();
+                    long end = System.currentTimeMillis();
+                    this.context.summary().addTimeRange(mapping.type(), start, end);
+                });
+        } catch (RejectedExecutionException ex) {
+            summary.metrics(struct).minusFlighting(batch.size());
+            this.batchSemaphore.release();
+            LOG.warn("Batch insert skipped because executor is shutdown", ex);
+        }
🤖 Fix all issues with AI agents
In
`@hugegraph-loader/src/main/java/org/apache/hugegraph/loader/task/TaskManager.java`:
- Around line 158-168: The failure path in TaskManager's batch completion
handler currently skips releasing the flighting metric
(summary.metrics(struct).minusFlighting), causing inflated stats; update the
whenComplete handler for the batch future so that
summary.metrics(struct).minusFlighting(batch.size()) is invoked in all cases
(both when e != null and else), i.e., perform the minusFlighting call before or
alongside the error handling block that logs the error, calls
this.context.occurredError()/stopLoading(), and shuts down services, ensuring
the metric is always decremented regardless of success or failure.
🧹 Nitpick comments (1)
hugegraph-loader/src/main/java/org/apache/hugegraph/loader/executor/LoadOptions.java (1)

188-192: Consider adding a parameter-validation annotation for consistency and a better user experience.

parallelThreads currently lacks a validateWith attribute on its @Parameter annotation, unlike batchSize. Although the downstream code (HugeGraphLoader lines 669-673) already handles the null fallback and range validation, adding a validator at the parsing layer catches bad user input earlier and avoids runtime exceptions.

Suggested change, consistent with the existing PositiveValidator pattern:

Suggested improvement
     @Parameter(names = {"--parallel-count", "--parser-threads"}, arity = 1,
+               validateWith = {PositiveValidator.class},
                description = "The number of parallel read pipelines. " +
                              "Default: auto (min(struct count, cpu)). " +
                              "Must be >= 1")
     public Integer parallelThreads = null;
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9683b47 and d3fa860.

📒 Files selected for processing (3)
  • hugegraph-loader/src/main/java/org/apache/hugegraph/loader/HugeGraphLoader.java
  • hugegraph-loader/src/main/java/org/apache/hugegraph/loader/executor/LoadOptions.java
  • hugegraph-loader/src/main/java/org/apache/hugegraph/loader/task/TaskManager.java
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: hubble-ci (11, 3.11)
  • GitHub Check: check-dependency-license
  • GitHub Check: Analyze (javascript)
  • GitHub Check: loader-ci (11)
  • GitHub Check: Analyze (java)
🔇 Additional comments (6)
hugegraph-loader/src/main/java/org/apache/hugegraph/loader/HugeGraphLoader.java (2)

665-674: The logic is correct and reasonably implemented.

The auto-default logic and validation order are correct:

  1. Return early for an empty struct list, avoiding the min(0, availableProcessors) = 0 edge case
  2. Compute the default automatically when the value is null
  3. Validate last, so explicitly set user values must also satisfy the >= 1 constraint

678-685: Logging and thread pool creation correctly updated.

Both the log output and the thread pool creation now use the parallelThreads variable, consistent with the new option name.

hugegraph-loader/src/main/java/org/apache/hugegraph/loader/executor/LoadOptions.java (3)

48-49: Extracting the default connection-count constants is clean, OK.


159-165: The description updates match the new behavior, OK.


171-181: The auto-adjustment feature is not fully implemented, but there is no default-value detection sync issue.

Inspection shows that the auto-adjustment logic in HugeClientBuilder.configPool() triggers on the parameter being == 0, not on comparison with the constants. Since LoadOptions initializes maxConnections and maxConnectionsPerRoute to CPUS * 4 and CPUS * 2 (non-zero values), the auto-adjustment branch in configPool() is never taken.

Although the parameter descriptions say "If left as default...may be auto-adjusted", the current code contains no implementation that dynamically adjusts these two values based on batchInsertThreads.

hugegraph-loader/src/main/java/org/apache/hugegraph/loader/task/TaskManager.java (1)

29-29: The new Printer dependency is reasonable.
It is used for user-facing error messages, consistent with the error-handling path below.
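A small sketch of the CPU-derived pool defaults under discussion. The values CPUS * 4 and CPUS * 2 come from the review comment above, and the adjustment rule shown here is a hypothetical illustration of what "auto-adjusted based on batch-insert-threads" could mean, not the actual LoadOptions/HugeClientBuilder code:

```java
// Sketch of CPU-derived connection pool defaults. The multipliers follow
// the review comment above; the adjustment rule is a hypothetical example,
// not the real implementation.
public class PoolDefaults {
    static final int CPUS = Runtime.getRuntime().availableProcessors();
    static final int DEFAULT_MAX_CONNECTIONS = CPUS * 4;
    static final int DEFAULT_MAX_CONNECTIONS_PER_ROUTE = CPUS * 2;

    // Hypothetical auto-adjustment: ensure the pool can serve every
    // batch-insert thread, but only when the user left the default in place.
    static int adjustedMaxConnections(int batchInsertThreads,
                                      int currentMaxConnections) {
        if (currentMaxConnections == DEFAULT_MAX_CONNECTIONS
                && batchInsertThreads > currentMaxConnections) {
            return batchInsertThreads;
        }
        return currentMaxConnections;
    }
}
```

The "only when left at default" guard is exactly the kind of value-comparison heuristic the nitpick below criticizes: it cannot distinguish a user who explicitly set the default value from one who never set it at all.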


@kenssa4eedfd
Collaborator Author

@codecov-ai-reviewer review


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces several improvements to the HugeGraph loader. The parameter handling for parallel threads and HTTP connections has been refined for better flexibility and user clarity. A significant change is the refactoring of the batch insert failure logic to a fail-fast approach, which is a sound design choice to avoid long, partially failing loads. My review includes one suggestion to further improve the new failure handling logic by centralizing the shutdown process for better maintainability.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In
`@hugegraph-loader/src/main/java/org/apache/hugegraph/loader/executor/LoadOptions.java`:
- Around line 188-193: The parallelThreads field in LoadOptions lacks a
positive-value validator despite the description "Must be >= 1"; add the same
PositiveValidator used by other params to the `@Parameter` annotation for the
public Integer parallelThreads (e.g., include validator =
PositiveValidator.class in the annotation), and ensure the PositiveValidator
class is imported/available so invalid values (0 or negative) are rejected at
parse time.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In
`@hugegraph-loader/src/main/java/org/apache/hugegraph/loader/executor/LoadOptions.java`:
- Around line 171-181: Add positive-value validation to the HTTP connection
parameters to prevent zero/negative inputs: in class LoadOptions, update the
`@Parameter` annotations for the fields maxConnections and maxConnectionsPerRoute
to include the PositiveValidator (same validator used by other numeric options)
so that values <= 0 are rejected at CLI parsing time and avoid connection pool
initialization failures.

In
`@hugegraph-loader/src/main/java/org/apache/hugegraph/loader/HugeGraphLoader.java`:
- Around line 665-680: parallelThreads is declared as an Integer and may be
null, causing an NPE when autounboxed in the call to
ExecutorUtil.newFixedThreadPool; ensure a non-null primitive int is passed by
coalescing null to a safe default (e.g. Math.max(2, CPUS) or
context.options().defaultParallel) before calling newFixedThreadPool. Modify the
code around the use of this.context.options().parallelThreads (or change its
declaration) so you compute an int threads =
(this.context.options().parallelThreads != null ?
this.context.options().parallelThreads : <default>) and then call
ExecutorUtil.newFixedThreadPool(threads, "loader"); also consider validating
threads > 0 and logging a warning if a null/invalid value was replaced.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In
`@hugegraph-loader/src/main/java/org/apache/hugegraph/loader/executor/LoadOptions.java`:
- Around line 190-195: The parameter annotation for parallelThreads has a
mismatched default description ("Default: auto max(2, cpu)") versus the actual
default (public int parallelThreads = Math.max(2, CPUS/2)); update the
description in the `@Parameter` annotation to reflect the real default (e.g.,
"Default: max(2, cpu/2)") so the doc and the code match; locate the `@Parameter`
on the parallelThreads field in LoadOptions and edit only the description text
to match the Math.max(2, CPUS/2) semantics.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In
`@hugegraph-loader/src/main/java/org/apache/hugegraph/loader/task/TaskManager.java`:
- Around line 164-167: When a batch insert fails the code calls
this.context.stopLoading() and prints an error via Printer.printError("Batch
insert %s failed, stop loading...", mapping.type().string()) but does not set
the error flag; call this.context.occurredError() as well to mark the task
context as having errored. Update the failure branch (around the existing
this.context.stopLoading() / Printer.printError(...) block in TaskManager) to
invoke this.context.occurredError() so the loader state/reporting reflects the
error consistently.
- Around line 158-170: The flighting metric is being decremented twice when
batchFailureFallback is true (submitBatch calls plusFlighting(batch.size()) then
on batch failure calls submitInSingle, and both submitBatch's whenComplete and
submitInSingle's whenComplete call summary.metrics(...).minusFlighting), causing
incorrect counts; to fix, remove the minusFlighting call from submitInSingle's
whenComplete so only submitBatch balances the plusFlighting, leaving
submitBatch, submitInSingle, summary.metrics(...).minusFlighting, and the
batchFailureFallback path as the referenced places to edit; if submitInSingle
must also be used independently elsewhere, instead add a boolean flag parameter
to submitInSingle (e.g., skipFlightingDecrement) or document callers so that
only the original submitBatch caller performs minusFlighting in the fallback
flow.
🧹 Nitpick comments (1)
hugegraph-loader/src/main/java/org/apache/hugegraph/loader/executor/LoadOptions.java (1)

428-430: The auto-adjustment logic has an edge-case problem.

When batchInsertThreads is smaller than CPUS (for example, the user explicitly sets a small value), this logic is triggered but may make no adjustment, since the default connection counts are usually already sufficient. But when batchInsertThreads happens to equal CPUS, the adjustment logic is not triggered even if the user set that value explicitly.

Consider tracking whether the user explicitly set batchInsertThreads with a flag, rather than relying on value comparison.

♻️ Optional improvement

A batchInsertThreadsUserSet flag could be added to LoadOptions, or JCommander's mechanisms could be used in parseOptions to detect whether the parameter was explicitly set:

// Example: detect explicit setting after parsing.
// The current logic works in most cases, but if more precise control
// is needed, explicit-setting detection could be introduced in a
// future version.

Also applies to: 434-451


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
hugegraph-loader/src/main/java/org/apache/hugegraph/loader/task/TaskManager.java (1)

156-176: The success path is missing a minusFlighting call, so the metric inflates permanently.

When a batch insert succeeds (e == null), the whenComplete callback performs no minusFlighting call, yet plusFlighting is always invoked at line 152. The flighting metric therefore grows without bound and the statistics become badly distorted.

The three current paths:

  • Success: no minusFlighting
  • Failure + fallback: submitInSingle calls minusFlighting internally (line 194) ✓
  • Failure + no fallback: minusFlighting is called at line 165

🐛 Suggested fix
         CompletableFuture.runAsync(task, this.batchService).whenComplete(
             (r, e) -> {
+                boolean fallbackToSingle = false;
                 if (e != null) {
                     LOG.error("Batch insert {} error, interrupting import", mapping.type(), e);
                     if (this.options.batchFailureFallback) {
                         LOG.warn("Batch insert {} error, try single insert",
                                  mapping.type(), e);
                         this.submitInSingle(struct, mapping, batch);
+                        fallbackToSingle = true;
                     } else {
-                        summary.metrics(struct).minusFlighting(batch.size());
                         this.context.occurredError();
                         this.context.stopLoading();
                         Printer.printError("Batch insert %s failed, stop loading. Please check the logs",
                                            mapping.type().string());
                     }
                 }
+                // Only decrement when we did not fall back to single inserts
+                // (submitInSingle is responsible for the decrement on the fallback path)
+                if (!fallbackToSingle) {
+                    summary.metrics(struct).minusFlighting(batch.size());
+                }
 
                 this.batchSemaphore.release();
                 long end = System.currentTimeMillis();
                 this.context.summary().addTimeRange(mapping.type(), start, end);
             });
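The invariant the suggested fix enforces — every plusFlighting is eventually balanced by exactly one minusFlighting, on every outcome — can be checked in isolation with a toy counter. The names here (submitBatch, completeBatch, flighting) are assumptions modeling the review's description, not the real LoadSummary metrics API:

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy model of the flighting metric around one batch, mirroring the fixed
// logic: decrement on every path except the fallback, where submitInSingle
// (modeled as a separate decrement) is responsible for balancing.
public class FlightingModel {
    final AtomicLong flighting = new AtomicLong();

    // fixed completion behavior: balance plusFlighting on non-fallback paths
    void completeBatch(int batchSize, boolean failed, boolean fallback) {
        boolean fellBackToSingle = failed && fallback;
        if (!fellBackToSingle) {
            flighting.addAndGet(-batchSize); // minusFlighting
        }
        // on fallback, submitInSingle owns the decrement
    }

    void submitBatch(int batchSize, boolean failed, boolean fallback) {
        flighting.addAndGet(batchSize);      // plusFlighting on submit
        completeBatch(batchSize, failed, fallback);
        if (failed && fallback) {
            // stand-in for submitInSingle's internal minusFlighting
            flighting.addAndGet(-batchSize);
        }
    }
}
```

Running all four (failed, fallback) combinations through submitBatch should leave the counter at zero; the pre-fix behavior (skipping the decrement on success) would leave it positive.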


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In
`@hugegraph-loader/src/test/java/org/apache/hugegraph/loader/test/unit/TaskManagerTest.java`:
- Around line 503-507: The test currently swallows InterruptedException after
calling latch.await(5, TimeUnit.SECONDS) and does not restore the thread's
interrupted status; update the catch block that catches InterruptedException in
TaskManagerTest (the latch.await(...) call) to call
Thread.currentThread().interrupt() so the thread's interrupted flag is
preserved, while still allowing the test to proceed or exit as intended.
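The interrupt-restoration fix described above is a standard Java pattern; a minimal self-contained version (with a simplified helper name) looks like this:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Minimal demonstration of restoring the interrupt flag after catching
// InterruptedException, as the review suggests for TaskManagerTest.
public class AwaitWithInterrupt {
    static boolean awaitQuietly(CountDownLatch latch) {
        try {
            return latch.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            // Restore the flag so callers can still observe the interrupt;
            // catching InterruptedException clears it otherwise.
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Swallowing the exception without re-interrupting hides the interrupt from the rest of the thread, which can make shutdown logic or test harnesses hang.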
🧹 Nitpick comments (3)
hugegraph-loader/src/test/java/org/apache/hugegraph/loader/test/unit/TaskManagerTest.java (3)

268-269: Consider adding a comment explaining the range assertion.

The range assertion batchCalls >= 2 && batchCalls <= 3 may confuse future maintainers. Add a comment explaining why the expected value is a range rather than an exact count (concurrency race conditions).

♻️ Suggested change
             int batchCalls = FailingConcurrentGraphManager.BATCH_CALLS.get();
+            // Due to race conditions, the number of batch calls can be 2 or 3:
+            // - At least 2: first success call + second failure call that triggers stop
+            // - At most 3: one more call might slip through before stop is observed
             Assert.assertTrue(batchCalls >= 2 && batchCalls <= 3);

330-338: Thread.sleep(50) may make the test flaky.

Waiting a fixed amount of time to ensure the thread is blocked on the semaphore can fail in CI environments or on heavily loaded systems. Consider a more reliable synchronization mechanism, such as an observable synchronization point at the submitBatch entry, or increase the wait time and document it with a comment.

Please confirm this test's stability in CI. If intermittent failures occur, consider:

-            Thread.sleep(50);
+            // Allow sufficient time for the thread to block on semaphore acquisition.
+            // Increase if flaky on slow CI systems.
+            Thread.sleep(100);

462-467: Consider resetting static state in a unified @Before method.

The test doubles store their state in static fields, currently reset manually inside individual test methods. Adding a @Before method that resets everything in one place improves maintainability and prevents missed resets from causing cross-test interference.

♻️ Suggested change

Add an @Before method to the class:

@Before
public void resetTestDoubles() {
    FailingBatchGraphManager.BATCH_CALLS.set(0);
    FailingBatchGraphManager.SINGLE_CALLS.set(0);
    FailingConcurrentGraphManager.BATCH_CALLS.set(0);
    FailingConcurrentGraphManager.FIRST_STARTED = null;
    FailingConcurrentGraphManager.ALLOW_FIRST_FINISH = null;
    FailingConcurrentGraphManager.FAILURE_CALLED = null;
}

And add the import:

import org.junit.Before;

Also applies to: 511-514

@kenssa4eedfd
Collaborator Author

@codecov-ai-reviewer review


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @.github/workflows/loader-ci.yml:
- Around line 88-92: The workflow uploads test report artifacts but only
includes failsafe plain text files; add the missing XML failsafe reports so CI
collects full JUnit data by modifying the artifact path list (the path block in
.github/workflows/loader-ci.yml) to include **/target/failsafe-reports/*.xml
alongside the existing **/target/failsafe-reports/*.txt and the surefire globs
so both surefire and failsafe XML reports are uploaded for complete diagnostics.

Comment on lines +88 to +92
path: |
**/target/surefire-reports/*.txt
**/target/surefire-reports/*.xml
**/target/failsafe-reports/*.txt



⚠️ Potential issue | 🟡 Minor

Also upload the Failsafe XML reports for complete diagnostics.
Currently only failsafe-reports/*.txt is uploaded; without failsafe-reports/*.xml the standard JUnit report data is lost, leaving integration-test failure investigations incomplete.

🔧 Suggested patch
         path: |
           **/target/surefire-reports/*.txt
           **/target/surefire-reports/*.xml
           **/target/failsafe-reports/*.txt
+          **/target/failsafe-reports/*.xml
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
         path: |
           **/target/surefire-reports/*.txt
           **/target/surefire-reports/*.xml
           **/target/failsafe-reports/*.txt
+          **/target/failsafe-reports/*.xml
🤖 Prompt for AI Agents
In @.github/workflows/loader-ci.yml around lines 88 - 92, The workflow uploads
test report artifacts but only includes failsafe plain text files; add the
missing XML failsafe reports so CI collects full JUnit data by modifying the
artifact path list (the path block in .github/workflows/loader-ci.yml) to
include **/target/failsafe-reports/*.xml alongside the existing
**/target/failsafe-reports/*.txt and the surefire globs so both surefire and
failsafe XML reports are uploaded for complete diagnostics.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @.github/workflows/loader-ci.yml:
- Around line 77-81: Restore the disabled Maven test profiles in the CI workflow
by re-enabling the commented lines for the "unit", "file", "jdbc", and "kafka"
profiles (so the pipeline runs mvn test -P unit, -P file, -P jdbc, -P kafka in
addition to -P hdfs); if they were intentionally disabled only temporarily, add
a TODO comment in the workflow and create a matching issue referencing the
reason and owner to track fixing the failing tests rather than skipping them.
Ensure the CI runs all relevant profiles to validate changes to batch insert
failure handling and LoadOptions parameter changes.

Comment on lines +77 to +81
# mvn test -P unit -ntp
# mvn test -P file
mvn test -P hdfs
mvn test -P jdbc
mvn test -P kafka
# mvn test -P jdbc
# mvn test -P kafka

⚠️ Potential issue | 🟠 Major

Disabling tests in CI creates regression risk.

Four test profiles (unit, file, jdbc, kafka) are commented out, leaving only the hdfs tests. This sharply reduces CI coverage, especially since this PR refactors the batch-insert failure handling and changes LoadOptions parameters, which are exactly the changes the disabled tests should verify.

If the tests fail, fix them rather than skip them; if this is a temporary measure, add a TODO comment or open a tracking issue so the disabled state is not merged into master.

🔧 Suggested fix: restore the test runs
-          # mvn test -P unit -ntp
-          # mvn test -P file
+          mvn test -P unit -ntp
+          mvn test -P file
           mvn test -P hdfs
-          # mvn test -P jdbc
-          # mvn test -P kafka
+          mvn test -P jdbc
+          mvn test -P kafka
📝 Committable suggestion


Suggested change
# mvn test -P unit -ntp
# mvn test -P file
mvn test -P hdfs
mvn test -P jdbc
mvn test -P kafka
# mvn test -P jdbc
# mvn test -P kafka
mvn test -P unit -ntp
mvn test -P file
mvn test -P hdfs
mvn test -P jdbc
mvn test -P kafka
🤖 Prompt for AI Agents
In @.github/workflows/loader-ci.yml around lines 77 - 81, Restore the disabled
Maven test profiles in the CI workflow by re-enabling the commented lines for
the "unit", "file", "jdbc", and "kafka" profiles (so the pipeline runs mvn test
-P unit, -P file, -P jdbc, -P kafka in addition to -P hdfs); if they were
intentionally disabled only temporarily, add a TODO comment in the workflow and
create a matching issue referencing the reason and owner to track fixing the
failing tests rather than skipping them. Ensure the CI runs all relevant
profiles to validate changes to batch insert failure handling and LoadOptions
parameter changes.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In
`@hugegraph-loader/src/main/java/org/apache/hugegraph/loader/executor/LoadOptions.java`:
- Around line 429-432: The deprecation warning check in LoadOptions.java
currently uses Arrays.asList(args).contains("--parallel-count") which misses
forms like "--parallel-count=1"; update the condition (in the static check where
args are inspected) to detect both exact and assignment forms by using a
prefix/any-match test (e.g. any arg that equals "--parallel-count" or startsWith
"--parallel-count="). Keep the same LOG.warn message and ensure the check covers
both "--parallel-count" and "--parallel-count=...".
🧹 Nitpick comments (1)
hugegraph-loader/src/main/java/org/apache/hugegraph/loader/executor/LoadOptions.java (1)

437-453: Auto-adjustment can leave max-conn-per-route > max-conn.

When the user explicitly sets a small max-conn but leaves max-conn-per-route at its default, the current logic may raise the per-route limit above the total. Clamp or synchronize the two values after adjustment to keep the configuration consistent.

♻️ One optional fix (clamp per-route)
         if (maxConnPerRoute == DEFAULT_MAX_CONNECTIONS_PER_ROUTE && maxConnPerRoute < batchThreads * 2) {
             options.maxConnectionsPerRoute = batchThreads * 2;
             LOG.info("Auto adjusted max-conn-per-route to {} based on batch-insert-threads({})",
                      options.maxConnectionsPerRoute, batchThreads);
         }
+
+        if (options.maxConnectionsPerRoute > options.maxConnections) {
+            LOG.warn("max-conn-per-route({}) > max-conn({}), clamping to max-conn",
+                     options.maxConnectionsPerRoute, options.maxConnections);
+            options.maxConnectionsPerRoute = options.maxConnections;
+        }
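The adjust-then-clamp logic suggested above can be sketched as a standalone helper; the method and parameter names here are illustrative and are not the actual LoadOptions API:

```java
// Minimal sketch of the auto-adjust + clamp logic described above.
// Names are illustrative, not the real LoadOptions fields.
public class ConnClamp {

    // Returns {maxConn, maxConnPerRoute} after adjustment.
    static int[] adjust(int maxConn, int maxConnPerRoute, int batchThreads) {
        // Auto-raise per-route based on batch-insert threads
        if (maxConnPerRoute < batchThreads * 2) {
            maxConnPerRoute = batchThreads * 2;
        }
        // Clamp: per-route must never exceed the total pool size
        if (maxConnPerRoute > maxConn) {
            maxConnPerRoute = maxConn;
        }
        return new int[]{maxConn, maxConnPerRoute};
    }

    public static void main(String[] args) {
        // User sets a small total pool but leaves per-route at default
        int[] r = adjust(8, 4, 16);
        System.out.println(r[0] + " " + r[1]); // prints "8 8": clamped
    }
}
```

Without the final clamp, the same inputs would yield a per-route limit of 32 against a total pool of 8, which a pooling HTTP client would silently cap anyway, so making the clamp explicit keeps the logged configuration honest.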

Comment on lines +429 to +432
if (Arrays.asList(args).contains("--parallel-count")) {
LOG.warn("Parameter --parallel-count is deprecated, " +
"please use --parser-threads instead");
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's find and read the LoadOptions.java file
fd -n "LoadOptions.java" --type f

Repository: hugegraph/hugegraph-toolchain

Length of output: 303


🏁 Script executed:

# Let's check the structure of the repository
git ls-files | head -20

Repository: hugegraph/hugegraph-toolchain

Length of output: 676


🏁 Script executed:

# Search for the specific file mentioned in the review
find . -name "LoadOptions.java" -type f

Repository: hugegraph/hugegraph-toolchain

Length of output: 159


🏁 Script executed:

# Read the LoadOptions.java file to understand the argument parsing context
wc -l ./hugegraph-loader/src/main/java/org/apache/hugegraph/loader/executor/LoadOptions.java

Repository: hugegraph/hugegraph-toolchain

Length of output: 163


🏁 Script executed:

# Read around the target lines
sed -n '410,450p' ./hugegraph-loader/src/main/java/org/apache/hugegraph/loader/executor/LoadOptions.java

Repository: hugegraph/hugegraph-toolchain

Length of output: 2102


🏁 Script executed:

# Look for the main method and argument parsing logic
grep -n "public static void main\|String\[\] args" ./hugegraph-loader/src/main/java/org/apache/hugegraph/loader/executor/LoadOptions.java | head -20

Repository: hugegraph/hugegraph-toolchain

Length of output: 136


🏁 Script executed:

# Read the entire parseOptions method to understand how arguments are parsed
sed -n '376,432p' ./hugegraph-loader/src/main/java/org/apache/hugegraph/loader/executor/LoadOptions.java

Repository: hugegraph/hugegraph-toolchain

Length of output: 2755


🏁 Script executed:

# Search for how --parallel-count is actually parsed (look for property setting)
rg -n "parallel-count|parallelCount" ./hugegraph-loader/src/main/java/org/apache/hugegraph/loader/executor/LoadOptions.java

Repository: hugegraph/hugegraph-toolchain

Length of output: 393


🏁 Script executed:

# Look for the argument parsing framework being used
head -50 ./hugegraph-loader/src/main/java/org/apache/hugegraph/loader/executor/LoadOptions.java

Repository: hugegraph/hugegraph-toolchain

Length of output: 2122


🏁 Script executed:

# Search for other deprecated parameters to see how they're handled
rg -n "deprecated" -i ./hugegraph-loader/src/main/java/org/apache/hugegraph/loader/executor/LoadOptions.java

Repository: hugegraph/hugegraph-toolchain

Length of output: 245


🏁 Script executed:

# Look for tests of LoadOptions to see how arguments are actually passed
find . -path "*/test*" -name "*LoadOptions*" -type f

Repository: hugegraph/hugegraph-toolchain

Length of output: 164


🏁 Script executed:

# Search for how CLI arguments are invoked/used in tests
rg -l "parallel-count|parser-threads" --type java | head -20

Repository: hugegraph/hugegraph-toolchain

Length of output: 434


🏁 Script executed:

# Check if there are any integration tests or examples
find . -path "*/test*" -name "*.java" -type f | xargs grep -l "parallel-count\|parser-threads" 2>/dev/null

Repository: hugegraph/hugegraph-toolchain

Length of output: 355


🏁 Script executed:

# Read the LoadOptionsTest file to understand how arguments are tested
wc -l ./hugegraph-loader/src/test/java/org/apache/hugegraph/loader/test/unit/LoadOptionsTest.java

Repository: hugegraph/hugegraph-toolchain

Length of output: 168


🏁 Script executed:

# Search for tests related to parallel-count or deprecated parameters
rg -n "parallel-count|parser-threads|deprecated" ./hugegraph-loader/src/test/java/org/apache/hugegraph/loader/test/unit/LoadOptionsTest.java

Repository: hugegraph/hugegraph-toolchain

Length of output: 406


🏁 Script executed:

# Look at the entire test file
cat ./hugegraph-loader/src/test/java/org/apache/hugegraph/loader/test/unit/LoadOptionsTest.java

Repository: hugegraph/hugegraph-toolchain

Length of output: 7577


🏁 Script executed:

# Let's verify JCommander's behavior - search for documentation or check how JCommander handles arguments
rg -n "arity" ./hugegraph-loader/src/main/java/org/apache/hugegraph/loader/executor/LoadOptions.java | head -10

Repository: hugegraph/hugegraph-toolchain

Length of output: 734


🏁 Script executed:

# Check if there are any CLI usage examples or documentation
find . -name "README*" -o -name "*.md" | xargs grep -l "parallel-count\|parser-threads" 2>/dev/null

Repository: hugegraph/hugegraph-toolchain

Length of output: 57


🏁 Script executed:

# Let's check how JCommander is used elsewhere in the codebase with similar parameters
rg -n "@Parameter" ./hugegraph-loader/src/main/java/org/apache/hugegraph/loader/executor/LoadOptions.java | head -20

Repository: hugegraph/hugegraph-toolchain

Length of output: 1345


🏁 Script executed:

# Check if there are any examples showing --param=value format usage
rg -n "=.*\"" ./hugegraph-loader/src/test/java/org/apache/hugegraph/loader/test/unit/LoadOptionsTest.java

Repository: hugegraph/hugegraph-toolchain

Length of output: 444


🏁 Script executed:

# Look for main method in LoadOptions or tests that might use the CLI
grep -rn "public static void main" ./hugegraph-loader/src/main/java/org/apache/hugegraph/loader/executor/

Repository: hugegraph/hugegraph-toolchain

Length of output: 55


🏁 Script executed:

# Check JCommander documentation/behavior - look at how other parameters might be passed
rg -B5 -A5 "arity = 1" ./hugegraph-loader/src/main/java/org/apache/hugegraph/loader/executor/LoadOptions.java | head -30

Repository: hugegraph/hugegraph-toolchain

Length of output: 1513


🏁 Script executed:

# Let's also check if there's any documentation or comments about command-line usage
find . -name "*.md" -o -name "*.txt" | xargs grep -l "loader\|parallel-count\|parser-threads" 2>/dev/null | head -5

Repository: hugegraph/hugegraph-toolchain

Length of output: 301


🏁 Script executed:

# Search for JCommander usage documentation or examples in the codebase
rg -n "JCommander\|parse" ./hugegraph-loader/src/main/java/org/apache/hugegraph/loader/executor/LoadOptions.java | head -10

Repository: hugegraph/hugegraph-toolchain

Length of output: 55


🏁 Script executed:

# Let's create a test to verify JCommander's behavior with both formats
python3 << 'EOF'
# Check JCommander documentation/behavior
# JCommander typically supports both --param value and --param=value formats
# Let's verify by checking if there are any tests with = format

import subprocess
result = subprocess.run(['rg', '--type', 'java', '-n', r'--\w+\s*=', './hugegraph-loader/src/test/java/'], 
                       capture_output=True, text=True)
print("Test files with --param=value format:")
print(result.stdout if result.stdout else "(no matches found)")

# Also search in main code
result2 = subprocess.run(['rg', '--type', 'java', '-n', r'--\w+\s*=', './hugegraph-loader/src/main/java/'], 
                        capture_output=True, text=True)
print("\nMain code files with --param=value format:")
print(result2.stdout if result2.stdout else "(no matches found)")
EOF

Repository: hugegraph/hugegraph-toolchain

Length of output: 159


🏁 Script executed:

# Let's verify by looking at the actual test to see what formats are tested
sed -n '115,135p' ./hugegraph-loader/src/test/java/org/apache/hugegraph/loader/test/unit/LoadOptionsTest.java

Repository: hugegraph/hugegraph-toolchain

Length of output: 755


🏁 Script executed:

# Let's check the JCommander version and documentation about supported formats
rg -n "jcommander\|beust" ./hugegraph-loader/pom.xml

Repository: hugegraph/hugegraph-toolchain

Length of output: 55


🏁 Script executed:

# Let's examine if there's any indication of how arguments are passed in CLI
grep -rn "testDeprecatedParallelCountParameter\|parallel-count" ./hugegraph-loader/src/test/ | head -20

Repository: hugegraph/hugegraph-toolchain

Length of output: 378


🏁 Script executed:

# Let's verify JCommander behavior by checking the dependency version
find . -name "pom.xml" | xargs grep -A2 "jcommander" | head -20

Repository: hugegraph/hugegraph-toolchain

Length of output: 919


🏁 Script executed:

# Let's verify the suggested fix is syntactically correct by checking similar patterns in the codebase
rg -n "anyMatch\|startsWith" ./hugegraph-loader/src/main/java/org/apache/hugegraph/loader/executor/LoadOptions.java

Repository: hugegraph/hugegraph-toolchain

Length of output: 55


🏁 Script executed:

# Let's search for similar deprecation warning patterns in the codebase
rg -rn "deprecated.*LOG.warn\|LOG.warn.*deprecated" ./hugegraph-loader/src/main/java/

Repository: hugegraph/hugegraph-toolchain

Length of output: 55


🏁 Script executed:

# Check if there are other places where args array is checked with similar patterns
rg -rn "Arrays.asList(args)" ./hugegraph-loader/src/main/java/

Repository: hugegraph/hugegraph-toolchain

Length of output: 55


The deprecation warning does not fire for the --parallel-count=... form.

The check only matches arguments exactly equal to --parallel-count, but JCommander also accepts the --parallel-count=1 format. In that form the args element is "--parallel-count=1", so contains() never matches and the warning is silently dropped. Use a prefix match to cover both forms.

🛠️ Suggested fix
-        if (Arrays.asList(args).contains("--parallel-count")) {
+        if (Arrays.stream(args).anyMatch(arg ->
+                arg.equals("--parallel-count") || arg.startsWith("--parallel-count="))) {
             LOG.warn("Parameter --parallel-count is deprecated, " +
                      "please use --parser-threads instead");
         }
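As a self-contained illustration of the prefix match above (a sketch, not the Loader's actual code), the check can be written so both argument forms trigger the warning:

```java
import java.util.Arrays;

// Standalone sketch of a prefix-aware deprecated-flag check.
public class DeprecatedFlagCheck {

    // True if args contains "--flag" (space-separated value follows)
    // or "--flag=value" (assignment form).
    static boolean usesFlag(String[] args, String flag) {
        return Arrays.stream(args)
                     .anyMatch(a -> a.equals(flag) || a.startsWith(flag + "="));
    }

    public static void main(String[] args) {
        String[] spaced = {"--parallel-count", "4", "-f", "struct.json"};
        String[] assigned = {"--parallel-count=4"};
        String[] none = {"--parser-threads", "4"};
        System.out.println(usesFlag(spaced, "--parallel-count"));   // true
        System.out.println(usesFlag(assigned, "--parallel-count")); // true
        System.out.println(usesFlag(none, "--parallel-count"));     // false
    }
}
```

Appending "=" to the flag before the prefix test also avoids false positives on hypothetical longer flags that merely share the prefix.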
🤖 Prompt for AI Agents
In
`@hugegraph-loader/src/main/java/org/apache/hugegraph/loader/executor/LoadOptions.java`
around lines 429 - 432, The deprecation warning check in LoadOptions.java
currently uses Arrays.asList(args).contains("--parallel-count") which misses
forms like "--parallel-count=1"; update the condition (in the static check where
args are inspected) to detect both exact and assignment forms by using a
prefix/any-match test (e.g. any arg that equals "--parallel-count" or startsWith
"--parallel-count="). Keep the same LOG.warn message and ensure the check covers
both "--parallel-count" and "--parallel-count=...".

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In
`@hugegraph-loader/src/test/java/org/apache/hugegraph/loader/test/functional/FileLoadTest.java`:
- Around line 2575-2576: The struct.json used by the FileLoadTest is missing a
charset setting for the software vertex, causing GBK-encoded software records
(written in the test data) to be parsed as UTF-8 and fail; update the
struct.json entry for the software vertex's "input" to include "charset": "GBK"
so it matches the other test configs (see the incremental_mode_and_load_failure
and matched_encoding_charset test setups) and allows the test to load 7 vertices
as expected.
🧹 Nitpick comments (8)
hugegraph-loader/src/test/java/org/apache/hugegraph/loader/test/functional/FileLoadTest.java (8)

1200-1203: Track the FIXME with a formal issue.

The FIXME indicates a known NPE bug (triggered when parser-threads > 1), but a temporary workaround like this should have a corresponding issue tracking the fix. Create an issue in the repository and reference its number in the comment.

Would you like me to create an issue to track this NPE bug?


1337-1339: Duplicated FIXME comment; manage these in one place.

This FIXME is identical to the one in testMultiFilesHaveHeader. Replace the duplicates with a reference to a single issue so tracking and fixing stay centralized.


1635-1637: Same as above; reference a single issue number.


2067-2068: Missing the FIXME comment, inconsistent with the other test changes.

The other tests that add --parser-threads "1" include a FIXME explaining why, but it is missing here. Add the same comment for consistency.


2269-2270: Same as above; reference a single issue number.


3059-3060: Same as above; reference a single issue number.


3074-3075: Same as above; reference a single issue number.


1200-1203: Overall: these changes expose an architectural problem that needs a root-cause fix.

Multiple tests must set --parser-threads "1" to avoid an NPE, which points to a thread-safety problem in the multi-threaded parser. Adding per-test workarounds:

  1. Reduces test coverage (the multi-threaded path goes untested)
  2. Hides a potential production issue
  3. Accumulates technical debt

Suggestions:

  • Create a dedicated issue tracking the NPE when parser-threads > 1
  • Record all affected tests in that issue
  • Prioritize fixing the root cause rather than adding more workarounds

Comment on lines +2575 to +2576
"--test-mode", "true",
"--parser-threads", "1"

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, locate and examine the test file
fd -t f "FileLoadTest.java" | head -5

Repository: hugegraph/hugegraph-toolchain

Length of output: 165


🏁 Script executed:

# Once we find the file, let's examine the specific lines mentioned
fd -t f "FileLoadTest.java" -exec wc -l {} \;

Repository: hugegraph/hugegraph-toolchain

Length of output: 259


🏁 Script executed:

# Search for testSourceOrTargetPrimaryValueNull test method
rg -n "testSourceOrTargetPrimaryValueNull" --type java

Repository: hugegraph/hugegraph-toolchain

Length of output: 225


🏁 Script executed:

# Read the test method and surrounding code
cat -n hugegraph-loader/src/test/java/org/apache/hugegraph/loader/test/functional/FileLoadTest.java | sed -n '2550,2600p'

Repository: hugegraph/hugegraph-toolchain

Length of output: 2657


🏁 Script executed:

# Search for test data configuration files related to source_or_target_pk_value_null
find . -type d -name "*source_or_target_pk_value_null*" 2>/dev/null

Repository: hugegraph/hugegraph-toolchain

Length of output: 141


🏁 Script executed:

# Examine the struct.json configuration for this test
cat ./hugegraph-loader/src/test/resources/source_or_target_pk_value_null/struct.json

Repository: hugegraph/hugegraph-toolchain

Length of output: 829


🏁 Script executed:

# Also check the schema.groovy to understand vertex/edge definitions
cat ./hugegraph-loader/src/test/resources/source_or_target_pk_value_null/schema.groovy

Repository: hugegraph/hugegraph-toolchain

Length of output: 2063


🏁 Script executed:

# Check git history for changes to the struct.json file
git log --oneline -20 -- "**/source_or_target_pk_value_null/struct.json"

Repository: hugegraph/hugegraph-toolchain

Length of output: 103


🏁 Script executed:

# Check git diff to see recent changes to struct.json
git diff HEAD~5 -- "**/source_or_target_pk_value_null/struct.json" 2>/dev/null || echo "No recent changes"

Repository: hugegraph/hugegraph-toolchain

Length of output: 90


🏁 Script executed:

# Look for any charset or encoding configuration that might have been removed
rg -n "charset" ./hugegraph-loader/src/test/resources/source_or_target_pk_value_null/ -A 2 -B 2

Repository: hugegraph/hugegraph-toolchain

Length of output: 1245


🏁 Script executed:

# Check if there are other test data directories with similar patterns to compare
ls -la ./hugegraph-loader/src/test/resources/ | grep -E "source_|target_" | head -10

Repository: hugegraph/hugegraph-toolchain

Length of output: 402


🏁 Script executed:

# Check similar test configurations to see how they handle charset for software vertices
find ./hugegraph-loader/src/test/resources -name "struct.json" -exec grep -l "software" {} \; | head -5

Repository: hugegraph/hugegraph-toolchain

Length of output: 1089


🏁 Script executed:

# Check one of these files to see if they specify charset for software vertex
cat ./hugegraph-loader/src/test/resources/*/struct.json | grep -A 10 '"label": "software"' | head -30

Repository: hugegraph/hugegraph-toolchain

Length of output: 685


🏁 Script executed:

# Look for GBK encoding in test files to understand the pattern
rg -n "GBK" ./hugegraph-loader/src/test/java/org/apache/hugegraph/loader/test/functional/FileLoadTest.java | head -20

Repository: hugegraph/hugegraph-toolchain

Length of output: 801


🏁 Script executed:

# Check git history of the struct.json file more thoroughly
git log --oneline --all -- "**/source_or_target_pk_value_null/struct.json" | head -20

Repository: hugegraph/hugegraph-toolchain

Length of output: 214


🏁 Script executed:

# Get the full diff of changes to this specific file
git show 3706a3b4:"./hugegraph-loader/src/test/resources/source_or_target_pk_value_null/struct.json" 2>/dev/null | head -50

Repository: hugegraph/hugegraph-toolchain

Length of output: 829


🏁 Script executed:

# Compare with a working example that has charset GBK for software
cat ./hugegraph-loader/src/test/resources/customized_schema/struct.json | grep -A 10 '"label": "software"'

Repository: hugegraph/hugegraph-toolchain

Length of output: 289


🏁 Script executed:

# Check the commit that added unit tests to see what was in struct.json originally
git show 3706a3b4:"./hugegraph-loader/src/test/resources/source_or_target_pk_value_null/struct.json" 2>/dev/null

Repository: hugegraph/hugegraph-toolchain

Length of output: 829


🏁 Script executed:

# Check if there's a PR or commit that modified the charset configuration
git log -p -- "**/source_or_target_pk_value_null/struct.json" | grep -A 15 -B 5 "charset" | head -50

Repository: hugegraph/hugegraph-toolchain

Length of output: 55


🏁 Script executed:

# Check the parent commit to see if charset was there
git show 5579e18c:"./hugegraph-loader/src/test/resources/source_or_target_pk_value_null/struct.json" 2>/dev/null || echo "File may not exist in that commit"

Repository: hugegraph/hugegraph-toolchain

Length of output: 106


🏁 Script executed:

# Check the current HEAD to see if there's been any recent change to add charset for software
git log --oneline -1 -- "**/source_or_target_pk_value_null/struct.json"

Repository: hugegraph/hugegraph-toolchain

Length of output: 103


🏁 Script executed:

# Look at the actual test to understand better - check lines around 2580-2593 more carefully
cat -n ./hugegraph-loader/src/test/java/org/apache/hugegraph/loader/test/functional/FileLoadTest.java | sed -n '2578,2593p'

Repository: hugegraph/hugegraph-toolchain

Length of output: 942


🏁 Script executed:

# Search for other similar tests with GBK encoding to see if they have charset defined
rg -B 15 'ioUtil.write.*vertex_software.*GBK' ./hugegraph-loader/src/test/java/org/apache/hugegraph/loader/test/functional/FileLoadTest.java | grep -A 20 'testCustomizedSchema\|testBatchInsertThreads' | head -50

Repository: hugegraph/hugegraph-toolchain

Length of output: 797


🏁 Script executed:

# Check customized_schema struct.json to see the full configuration
cat ./hugegraph-loader/src/test/resources/customized_schema/struct.json | head -50

Repository: hugegraph/hugegraph-toolchain

Length of output: 1172


🏁 Script executed:

# Look for any recent changes or issues related to charset in struct.json files
git log --all -p -- "**struct.json" | grep -A 5 -B 5 "charset.*GBK" | head -80

Repository: hugegraph/hugegraph-toolchain

Length of output: 2120


🏁 Script executed:

# Check if there's a pattern - search how the test framework handles default charset
rg -n "defaultCharset\|UTF-8" ./hugegraph-loader/src/main/java/org/apache/hugegraph/loader/parser/*.java | head -20

Repository: hugegraph/hugegraph-toolchain

Length of output: 55


🏁 Script executed:

# Check the very latest state of the struct.json file to see if charset has been added for software
cat ./hugegraph-loader/src/test/resources/source_or_target_pk_value_null/struct.json | grep -A 8 '"label": "software"'

Repository: hugegraph/hugegraph-toolchain

Length of output: 259


🏁 Script executed:

# Let's trace back - check what the exact commit message says
git show --stat 3706a3b4 | head -30

Repository: hugegraph/hugegraph-toolchain

Length of output: 1707


🏁 Script executed:

# Check if there have been any uncommitted changes to this file
git status ./hugegraph-loader/src/test/resources/source_or_target_pk_value_null/struct.json

Repository: hugegraph/hugegraph-toolchain

Length of output: 138


🏁 Script executed:

# Look at the full test context - what test names mention parser-threads
rg -n "parser-threads.*1" ./hugegraph-loader/src/test/java/org/apache/hugegraph/loader/test/functional/FileLoadTest.java

Repository: hugegraph/hugegraph-toolchain

Length of output: 798


🏁 Script executed:

# Let's verify the exact expectation - look at the test assertion more carefully
cat -n ./hugegraph-loader/src/test/java/org/apache/hugegraph/loader/test/functional/FileLoadTest.java | sed -n '2550,2593p'

Repository: hugegraph/hugegraph-toolchain

Length of output: 2370


🏁 Script executed:

# Check if software vertex should have charset GBK by looking at similar passing tests
rg -A 8 '"label": "software"' ./hugegraph-loader/src/test/resources/customized_schema/struct.json

Repository: hugegraph/hugegraph-toolchain

Length of output: 259


🏁 Script executed:

# Compare: check if there are passing tests with the same pattern - write software.csv in GBK with charset GBK in config
rg -B 5 'ioUtil.write.*vertex_software.*GBK' ./hugegraph-loader/src/test/java/org/apache/hugegraph/loader/test/functional/FileLoadTest.java | grep -A 50 "testCustomizedSchema" | head -60

Repository: hugegraph/hugegraph-toolchain

Length of output: 55


🏁 Script executed:

# Let's check if this struct.json should have charset: GBK for software by looking at pattern
# Check what happens when we look at the exact lines where software vertex is written in GBK in the test
rg -B 20 "testCustomizedSchema" ./hugegraph-loader/src/test/java/org/apache/hugegraph/loader/test/functional/FileLoadTest.java | grep -A 25 "public void testCustomizedSchema" | head -35

Repository: hugegraph/hugegraph-toolchain

Length of output: 113


🏁 Script executed:

# Get the customized_schema struct.json to compare patterns
cat ./hugegraph-loader/src/test/resources/customized_schema/struct.json | grep -A 10 '"label": "software"' | head -15

Repository: hugegraph/hugegraph-toolchain

Length of output: 289


🏁 Script executed:

# Check the FIXME comments to understand the parser-threads issue better
rg -B 2 "FIXME.*parser-threads" ./hugegraph-loader/src/test/java/org/apache/hugegraph/loader/test/functional/FileLoadTest.java | head -20

Repository: hugegraph/hugegraph-toolchain

Length of output: 636


🏁 Script executed:

# Let's check if customized_schema has the same issue or is it configured differently
cat ./hugegraph-loader/src/test/resources/customized_schema/struct.json | python3 << 'EOF'
import json
import sys

data = json.load(sys.stdin)
for vertex in data.get('vertices', []):
    if vertex.get('label') == 'software':
        print(f"Software vertex config: {json.dumps(vertex, indent=2)}")
EOF

Repository: hugegraph/hugegraph-toolchain

Length of output: 159


🏁 Script executed:

# Now let's verify the test expectation - does testCustomizedSchema expect 7 vertices?
rg -A 20 "public void testCustomizedSchema" ./hugegraph-loader/src/test/java/org/apache/hugegraph/loader/test/functional/FileLoadTest.java | grep -E "Assert.assertEquals.*vertices"

Repository: hugegraph/hugegraph-toolchain

Length of output: 55


🏁 Script executed:

# Check if this is the initial commit that added these tests
git show --stat 3706a3b4 | grep -E "FileLoadTest|struct.json" | head -20

Repository: hugegraph/hugegraph-toolchain

Length of output: 1316


🏁 Script executed:

# Let's check if the test is supposed to expect an exception and then still load vertices
# This would help understand the intent
rg -B 5 -A 15 "AsyncThrowsAssert.assertThrows" ./hugegraph-loader/src/test/java/org/apache/hugegraph/loader/test/functional/FileLoadTest.java | grep -A 20 "testSourceOrTargetPrimaryValueNull" | head -30

Repository: hugegraph/hugegraph-toolchain

Length of output: 55


🏁 Script executed:

# Check if there are any comments in the test about why 7 vertices are expected after an exception
rg -B 30 "Assert.assertEquals\(7, vertices.size" ./hugegraph-loader/src/test/java/org/apache/hugegraph/loader/test/functional/FileLoadTest.java | tail -40

Repository: hugegraph/hugegraph-toolchain

Length of output: 1782


🏁 Script executed:

# Most importantly, let's verify: does the software vertex config need charset GBK?
# Check another test that writes software in GBK and passes
rg -B 30 "testBatchInsertThreads\|testMultipleSourcesAndTargets" ./hugegraph-loader/src/test/java/org/apache/hugegraph/loader/test/functional/FileLoadTest.java | grep -A 30 "ioUtil.write.*vertex_software.*GBK" | head -50

Repository: hugegraph/hugegraph-toolchain

Length of output: 55


🏁 Script executed:

# Let me check what the default charset behavior is when not specified
rg -n "defaultCharset\|Charset.defaultCharset" ./hugegraph-loader/src/main/java/org/apache/hugegraph/loader/ -A 2

Repository: hugegraph/hugegraph-toolchain

Length of output: 55


🏁 Script executed:

# Verify: does the struct.json missing charset config for software cause parsing to fail?
# Check similar tests that work - do they have charset specified for software when written in GBK?
find ./hugegraph-loader/src/test/resources -name "struct.json" -exec grep -l '"label": "software"' {} \; | while read f; do echo "=== $f ==="; cat "$f" | grep -A 6 '"label": "software"'; done | head -80

Repository: hugegraph/hugegraph-toolchain

Length of output: 2977


🏁 Script executed:

# The key question: is the struct.json missing "charset": "GBK" for software vertex?
diff <(cat ./hugegraph-loader/src/test/resources/customized_schema/struct.json | jq '.vertices[] | select(.label == "software")') \
     <(cat ./hugegraph-loader/src/test/resources/source_or_target_pk_value_null/struct.json | jq '.vertices[] | select(.label == "software")')

Repository: hugegraph/hugegraph-toolchain

Length of output: 160


🏁 Script executed:

# Confirm: check if incremental_mode_and_load_failure has the same test pattern with GBK charset
rg -B 10 "incremental_mode_and_load_failure" ./hugegraph-loader/src/test/java/org/apache/hugegraph/loader/test/functional/FileLoadTest.java | grep -A 30 "ioUtil.write.*vertex_software" | head -40

Repository: hugegraph/hugegraph-toolchain

Length of output: 1536


🏁 Script executed:

# Also check if matched_encoding_charset test has the same structure and if it passes
rg -B 10 "matched_encoding_charset" ./hugegraph-loader/src/test/java/org/apache/hugegraph/loader/test/functional/FileLoadTest.java | grep -A 30 "ioUtil.write.*vertex_software" | head -40

Repository: hugegraph/hugegraph-toolchain

Length of output: 833


struct.json is missing the charset configuration for the software vertex.

The test data writes the software vertex in GBK (line 2558), but struct.json does not specify a charset, so the file is parsed as UTF-8 by default and the software vertices fail to parse. Only 5 vertices (person) are loaded instead of the expected 7.

Add "charset": "GBK" to the software vertex's input config in struct.json, consistent with the configurations in incremental_mode_and_load_failure and matched_encoding_charset.

--parser-threads "1" is an unrelated workaround for the known NPE (see the FIXME in the code comments) and is not the cause of this test failure.
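The decoding failure described above can be reproduced in isolation: GBK-encoded bytes do not survive a UTF-8 decode. The sample string below is illustrative, not taken from the test fixtures:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Shows why reading GBK-encoded records with the default UTF-8 charset
// corrupts non-ASCII values such as Chinese software names.
public class CharsetMismatch {
    public static void main(String[] args) {
        String name = "中文"; // illustrative non-ASCII value
        byte[] gbkBytes = name.getBytes(Charset.forName("GBK"));

        String decodedAsGbk = new String(gbkBytes, Charset.forName("GBK"));
        String decodedAsUtf8 = new String(gbkBytes, StandardCharsets.UTF_8);

        System.out.println(name.equals(decodedAsGbk));  // true: round-trips
        System.out.println(name.equals(decodedAsUtf8)); // false: corrupted
    }
}
```

The GBK lead bytes are invalid UTF-8 sequences, so the UTF-8 decoder substitutes replacement characters and the record no longer matches its expected value, which is why the vertex rows are dropped rather than loaded.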

🤖 Prompt for AI Agents
In
`@hugegraph-loader/src/test/java/org/apache/hugegraph/loader/test/functional/FileLoadTest.java`
around lines 2575 - 2576, The struct.json used by the FileLoadTest is missing a
charset setting for the software vertex, causing GBK-encoded software records
(written in the test data) to be parsed as UTF-8 and fail; update the
struct.json entry for the software vertex's "input" to include "charset": "GBK"
so it matches the other test configs (see the incremental_mode_and_load_failure
and matched_encoding_charset test setups) and allows the test to load 7 vertices
as expected.

@github-actions

Due to the lack of activity, the current PR is marked as stale and will be closed after 180 days; any update will remove the stale label


Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants