Skip to content

[hotfix] Make cross-language e2e test order-insensitive and add request time#713

Merged
xintongsong merged 3 commits into
apache:mainfrom
rosemarYuan:fix/cross-language-test-order
May 31, 2026
Merged

[hotfix] Make cross-language e2e test order-insensitive and add request time#713
xintongsong merged 3 commits into
apache:mainfrom
rosemarYuan:fix/cross-language-test-order

Conversation

@rosemarYuan
Copy link
Copy Markdown
Contributor

@rosemarYuan rosemarYuan commented May 29, 2026

Summary

Fix flaky YamlCrossLanguageTest Python e2e test on CI.

Root cause

  1. StreamingFileSink emits multiple part files across checkpoints, and Path.iterdir() returns them in non-deterministic order — the original assertion indexed into actual_result[0] / actual_result[1] and broke whenever traversal order differed.
  2. On slow CI runners, the default 30s request timeout on the Python Ollama chat model connection can elapse before the model finishes generating.

Changes

  • Python yaml_cross_language_test.py — join all part-file outputs and search for expected substrings, instead of relying on positional index.
  • Java ChatModelCrossLanguageAgent.java — extend the request timeout on pythonChatModelConnection() to 240s, mirroring javaChatModelConnection().

Linked issue

N/A (CI flaky test fix).

Tests

Re-runs existing e2e yaml_cross_language_test.py. No new tests needed.

API

No public API changes.

Documentation

  • doc-not-needed

…timeout

- Python: check content presence in joined output instead of positional
  indexing, since StreamingFileSink + checkpoint may produce multiple
  part files whose iterdir() order is non-deterministic.
- Java: add requestTimeout=240 to pythonChatModelConnection to prevent
  timeouts on slow CI runners.
@github-actions github-actions Bot added doc-not-needed Your PR changes do not impact docs fixVersion/0.3.0 The feature or bug should be implemented/fixed in the 0.3.0 version. priority/major Default priority of the PR or issue. labels May 29, 2026
@rosemarYuan rosemarYuan marked this pull request as ready for review May 29, 2026 07:13
Copy link
Copy Markdown
Collaborator

@wzhero1 wzhero1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pythonChatModelConnection() resolves to the Python OllamaChatModelConnection, whose field is request_timeout (snake_case). The Java→Python kwargs path doesn't rename keys and pydantic v2 defaults to extra="ignore", so requestTimeout=240 is silently dropped — request_timeout stays at the default 30s.

- .addInitialArgument("requestTimeout", 240)
+ .addInitialArgument("request_timeout", 240)

The Python OllamaChatModelConnection uses snake_case field names.
Java→Python kwargs path doesn't rename keys, so requestTimeout was
silently dropped. Changed to request_timeout to match Python schema.
@github-actions github-actions Bot added doc-not-needed Your PR changes do not impact docs and removed doc-not-needed Your PR changes do not impact docs labels May 29, 2026
@rosemarYuan rosemarYuan changed the title fix(e2e): make cross-language test order-insensitive and add request … [hotfix]: make cross-language test order-insensitive and add request … May 29, 2026
@rosemarYuan rosemarYuan changed the title [hotfix]: make cross-language test order-insensitive and add request … [hotfix] make cross-language test order-insensitive and add request … May 29, 2026
@github-actions github-actions Bot added doc-not-needed Your PR changes do not impact docs and removed doc-not-needed Your PR changes do not impact docs labels May 29, 2026
@rosemarYuan rosemarYuan changed the title [hotfix] make cross-language test order-insensitive and add request … [hotfix] Make cross-language e2e test order-insensitive and add request time May 29, 2026
@github-actions github-actions Bot added doc-not-needed Your PR changes do not impact docs and removed doc-not-needed Your PR changes do not impact docs labels May 29, 2026
@rosemarYuan rosemarYuan force-pushed the fix/cross-language-test-order branch from 577806b to 224557c Compare May 29, 2026 11:09
@rosemarYuan
Copy link
Copy Markdown
Contributor Author

Update summary:

  • The output read-order issue was a real test bug and has been fixed by making the assertion order-insensitive.
  • The model request timeout has been increased to 240s, but timeout may still happen occasionally in CI due to slow runner/model responses.
  • The LLM/tool-calling instability is not fully resolved. The model may still hallucinate, such as repeating a calculation, missing a step, or returning output that does not match the expected schema.

assert "cat" in actual_result[1].lower(), (
f"creative answer missing 'cat': {actual_result[1]!r}"
)
# NOTE: We join all results and search without relying on order, because
Copy link
Copy Markdown
Contributor

@addu390 addu390 May 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there's a similar test in chat_model_cross_language_test.py that's worth fixing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Fixed in the same way — joined all results and made the assertion order-insensitive in chat_model_cross_language_test.py as well.

Copy link
Copy Markdown
Collaborator

@weiqingy weiqingy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for chasing this one down — the order-insensitive join is a clean fix for the non-deterministic iterdir() part-file ordering, and folding in both the request_timeout key correction and the sibling chat_model_cross_language_test.py fix in the same pass tied off the earlier threads nicely. One optional question inline.

assert "3" in actual_result[0]
assert "cat" in actual_result[1]
joined = "\n".join(actual_result).lower()
assert "3" in joined, f"math answer missing '3': {actual_result!r}"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that the check searches the whole joined, lowercased output, "3" is a single digit that's very likely to appear somewhere regardless of the math result — in the cat answer, an incidental number, or the qwen3 reasoning trace — so this assertion may pass even if the math path regressed. The yaml test's "22" is specific enough to avoid that, but "3" (the answer to "1+2") doesn't have an obvious more-specific token, and the join is exactly what makes it order-insensitive. Is accepting the weaker math signal the intended trade-off here for a flaky-test hotfix, or worth a stricter check?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for flagging this — the concern is valid. A stricter math assertion would re-introduce flakiness, and those failures would be model-capability noise rather than actual cross-language regressions. So I think accepting the weaker math signal is a reasonable trade-off for this hotfix.

The way I see it, this is two different problems:
(1) E2E cross-language behavioral consistency — the primary goal of this test. The order-insensitive join + lowercased check addresses this, and the current approach prioritizes it.
(2) Model output quality validation — a harder problem that a 1.7b model on unstable CI hardware is fundamentally ill-suited for. If we want to strengthen this later, some possible directions might be:

  • Upgrading the CI model to one with more reliable arithmetic capability;
  • Structuring and formalizing the prompt (e.g., explicit chain-of-thought with strict output formatting);
  • Adding a post-inference verification step to verify whether the model output meets the Prompt expectation before the assertion is run.
    These improvements to (2) are out of scope for this hotfix. Would love to hear your thoughts on whether this trade-off works for now, or if you'd prefer a different approach.

Copy link
Copy Markdown
Collaborator

@weiqingy weiqingy May 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed explanation — looks good as is. One out-of-scope idea for later: if the test harness can surface it — right now it only reads the file-sink text output, not tool-invocation events — asserting the add tool was actually invoked would be an order- and capability-independent signal, stronger than scanning the text for "3".

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, agreed. Given that the current harness only reads the file-sink text output, keeping the weak "3" check is reasonable for this hotfix and avoids turning it into another 1.7b model-capability flaky test.

For a follow-up, I agree that surfacing tool-invocation events would be a stronger signal. One nuance is that tool invocation and final-answer correctness are separate dimensions. From previous runs, we have seen several different behaviors: the model may answer directly without tools, call the tool with correct arguments, call the tool with hallucinated/wrong arguments, miss a later calculation step, emit a tool call as plain text instead of an actual tool call, get the correct tool result but still produce a wrong final answer, or return a response that does not match the expected schema. In conclusion, successfully calling a tool does not necessarily equate to outputting the correct answer.

So if the harness can expose tool events later, checking that add was invoked would be a stronger signal for the tool-calling path than scanning the text output alone. To make that check more meaningful, we may also want to validate the tool arguments, e.g. that add was invoked with the expected inputs, and keep final-output validation as a separate concern.

Copy link
Copy Markdown
Contributor

@xintongsong xintongsong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@xintongsong xintongsong merged commit 1da56c9 into apache:main May 31, 2026
49 of 50 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

doc-not-needed Your PR changes do not impact docs fixVersion/0.3.0 The feature or bug should be implemented/fixed in the 0.3.0 version. priority/major Default priority of the PR or issue.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants