[hotfix] Make cross-language e2e test order-insensitive and add request time by rosemarYuan · Pull Request #713 · apache/flink-agents

rosemarYuan · 2026-05-29T06:32:18Z

Summary

Fix flaky YamlCrossLanguageTest Python e2e test on CI.

Root cause

StreamingFileSink emits multiple part files across checkpoints, and Path.iterdir() returns them in non-deterministic order — the original assertion indexed into actual_result[0] / actual_result[1] and broke whenever traversal order differed.
On slow CI runners, the default 30s request timeout on the Python Ollama chat model connection can elapse before the model finishes generating.

Changes

Python yaml_cross_language_test.py — join all part-file outputs and search for expected substrings, instead of relying on positional index.
Java ChatModelCrossLanguageAgent.java — extend the request timeout on pythonChatModelConnection() to 240s, mirroring javaChatModelConnection().

Linked issue

N/A (CI flaky test fix).

Tests

Re-runs existing e2e yaml_cross_language_test.py. No new tests needed.

API

No public API changes.

Documentation

doc-not-needed

…timeout - Python: check content presence in joined output instead of positional indexing, since StreamingFileSink + checkpoint may produce multiple part files whose iterdir() order is non-deterministic. - Java: add requestTimeout=240 to pythonChatModelConnection to prevent timeouts on slow CI runners.

wzhero1

pythonChatModelConnection() resolves to the Python OllamaChatModelConnection, whose field is request_timeout (snake_case). The Java→Python kwargs path doesn't rename keys and pydantic v2 defaults to extra="ignore", so requestTimeout=240 is silently dropped — request_timeout stays at the default 30s.

- .addInitialArgument("requestTimeout", 240)
+ .addInitialArgument("request_timeout", 240)

The Python OllamaChatModelConnection uses snake_case field names. Java→Python kwargs path doesn't rename keys, so requestTimeout was silently dropped. Changed to request_timeout to match Python schema.

rosemarYuan · 2026-05-29T11:13:31Z

Update summary:

The output read-order issue was a real test bug and has been fixed by making the assertion order-insensitive.
The model request timeout has been increased to 240s, but timeout may still happen occasionally in CI due to slow runner/model responses.
The LLM/tool-calling instability is not fully resolved. The model may still hallucinate, such as repeating a calculation, missing a step, or returning output that does not match the expected schema.

addu390 · 2026-05-29T17:54:03Z

-    assert "cat" in actual_result[1].lower(), (
-        f"creative answer missing 'cat': {actual_result[1]!r}"
-    )
+    # NOTE: We join all results and search without relying on order, because


Looks like there's a similar test in chat_model_cross_language_test.py that's worth fixing.

Good catch! Fixed in the same way — joined all results and made the assertion order-insensitive in chat_model_cross_language_test.py as well.

…2e test

weiqingy

Thanks for chasing this one down — the order-insensitive join is a clean fix for the non-deterministic iterdir() part-file ordering, and folding in both the request_timeout key correction and the sibling chat_model_cross_language_test.py fix in the same pass tied off the earlier threads nicely. One optional question inline.

weiqingy · 2026-05-30T07:10:19Z

-    assert "3" in actual_result[0]
-    assert "cat" in actual_result[1]
+    joined = "\n".join(actual_result).lower()
+    assert "3" in joined, f"math answer missing '3': {actual_result!r}"


Now that the check searches the whole joined, lowercased output, "3" is a single digit that's very likely to appear somewhere regardless of the math result — in the cat answer, an incidental number, or the qwen3 reasoning trace — so this assertion may pass even if the math path regressed. The yaml test's "22" is specific enough to avoid that, but "3" (the answer to "1+2") doesn't have an obvious more-specific token, and the join is exactly what makes it order-insensitive. Is accepting the weaker math signal the intended trade-off here for a flaky-test hotfix, or worth a stricter check?

Thanks for flagging this — the concern is valid. A stricter math assertion would re-introduce flakiness, and those failures would be model-capability noise rather than actual cross-language regressions. So I think accepting the weaker math signal is a reasonable trade-off for this hotfix.

The way I see it, this is two different problems:
(1) E2E cross-language behavioral consistency — the primary goal of this test. The order-insensitive join + lowercased check addresses this, and the current approach prioritizes it.
(2) Model output quality validation — a harder problem that a 1.7b model on unstable CI hardware is fundamentally ill-suited for. If we want to strengthen this later, some possible directions might be:

Upgrading the CI model to one with more reliable arithmetic capability;

Structuring and formalizing the prompt (e.g., explicit chain-of-thought with strict output formatting);

Adding a post-inference verification step to verify whether the model output meets the Prompt expectation before the assertion is run.
These improvements to (2) are out of scope for this hotfix. Would love to hear your thoughts on whether this trade-off works for now, or if you'd prefer a different approach.

Thanks for the detailed explanation — looks good as is. One out-of-scope idea for later: if the test harness can surface it — right now it only reads the file-sink text output, not tool-invocation events — asserting the add tool was actually invoked would be an order- and capability-independent signal, stronger than scanning the text for "3".

Thanks, agreed. Given that the current harness only reads the file-sink text output, keeping the weak "3" check is reasonable for this hotfix and avoids turning it into another 1.7b model-capability flaky test.

For a follow-up, I agree that surfacing tool-invocation events would be a stronger signal. One nuance is that tool invocation and final-answer correctness are separate dimensions. From previous runs, we have seen several different behaviors: the model may answer directly without tools, call the tool with correct arguments, call the tool with hallucinated/wrong arguments, miss a later calculation step, emit a tool call as plain text instead of an actual tool call, get the correct tool result but still produce a wrong final answer, or return a response that does not match the expected schema. In conclusion, successfully calling a tool does not necessarily equate to outputting the correct answer.

So if the harness can expose tool events later, checking that add was invoked would be a stronger signal for the tool-calling path than scanning the text output alone. To make that check more meaningful, we may also want to validate the tool arguments, e.g. that add was invoked with the expected inputs, and keep final-output validation as a separate concern.

xintongsong

LGTM

github-actions Bot added doc-not-needed Your PR changes do not impact docs fixVersion/0.3.0 The feature or bug should be implemented/fixed in the 0.3.0 version. priority/major Default priority of the PR or issue. labels May 29, 2026

rosemarYuan marked this pull request as ready for review May 29, 2026 07:13

wzhero1 reviewed May 29, 2026

View reviewed changes

fix(java): use snake_case for pythonChatModelConnection request_timeout

224557c

The Python OllamaChatModelConnection uses snake_case field names. Java→Python kwargs path doesn't rename keys, so requestTimeout was silently dropped. Changed to request_timeout to match Python schema.

github-actions Bot added doc-not-needed Your PR changes do not impact docs and removed doc-not-needed Your PR changes do not impact docs labels May 29, 2026

rosemarYuan changed the title ~~fix(e2e): make cross-language test order-insensitive and add request …~~ [hotfix]: make cross-language test order-insensitive and add request … May 29, 2026

rosemarYuan changed the title ~~[hotfix]: make cross-language test order-insensitive and add request …~~ [hotfix] make cross-language test order-insensitive and add request … May 29, 2026

github-actions Bot added doc-not-needed Your PR changes do not impact docs and removed doc-not-needed Your PR changes do not impact docs labels May 29, 2026

rosemarYuan changed the title ~~[hotfix] make cross-language test order-insensitive and add request …~~ [hotfix] Make cross-language e2e test order-insensitive and add request time May 29, 2026

github-actions Bot added doc-not-needed Your PR changes do not impact docs and removed doc-not-needed Your PR changes do not impact docs labels May 29, 2026

rosemarYuan force-pushed the fix/cross-language-test-order branch from 577806b to 224557c Compare May 29, 2026 11:09

addu390 reviewed May 29, 2026

View reviewed changes

[hotfix] Fix order-sensitive assertion in cross-language chat model e…

d43f7f2

…2e test

weiqingy reviewed May 30, 2026

View reviewed changes

xintongsong approved these changes May 31, 2026

View reviewed changes

xintongsong merged commit 1da56c9 into apache:main May 31, 2026
49 of 50 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[hotfix] Make cross-language e2e test order-insensitive and add request time#713

[hotfix] Make cross-language e2e test order-insensitive and add request time#713
xintongsong merged 3 commits into
apache:mainfrom
rosemarYuan:fix/cross-language-test-order

rosemarYuan commented May 29, 2026 •

edited

Loading

Uh oh!

wzhero1 left a comment

Uh oh!

rosemarYuan commented May 29, 2026

Uh oh!

addu390 May 29, 2026 •

edited

Loading

Uh oh!

rosemarYuan May 30, 2026

Uh oh!

weiqingy left a comment

Uh oh!

weiqingy May 30, 2026

Uh oh!

rosemarYuan May 30, 2026

Uh oh!

weiqingy May 30, 2026 •

edited

Loading

Uh oh!

rosemarYuan May 31, 2026

Uh oh!

xintongsong left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

rosemarYuan commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause

Changes

Linked issue

Tests

API

Documentation

Uh oh!

wzhero1 left a comment

Choose a reason for hiding this comment

Uh oh!

rosemarYuan commented May 29, 2026

Uh oh!

addu390 May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rosemarYuan May 30, 2026

Choose a reason for hiding this comment

Uh oh!

weiqingy left a comment

Choose a reason for hiding this comment

Uh oh!

weiqingy May 30, 2026

Choose a reason for hiding this comment

Uh oh!

rosemarYuan May 30, 2026

Choose a reason for hiding this comment

Uh oh!

weiqingy May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rosemarYuan May 31, 2026

Choose a reason for hiding this comment

Uh oh!

xintongsong left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

rosemarYuan commented May 29, 2026 •

edited

Loading

addu390 May 29, 2026 •

edited

Loading

weiqingy May 30, 2026 •

edited

Loading