[hotfix] Make cross-language e2e test order-insensitive and add request time#713
Conversation
…timeout - Python: check content presence in joined output instead of positional indexing, since StreamingFileSink + checkpoint may produce multiple part files whose iterdir() order is non-deterministic. - Java: add requestTimeout=240 to pythonChatModelConnection to prevent timeouts on slow CI runners.
wzhero1
left a comment
There was a problem hiding this comment.
pythonChatModelConnection() resolves to the Python OllamaChatModelConnection, whose field is request_timeout (snake_case). The Java→Python kwargs path doesn't rename keys and pydantic v2 defaults to extra="ignore", so requestTimeout=240 is silently dropped — request_timeout stays at the default 30s.
- .addInitialArgument("requestTimeout", 240)
+ .addInitialArgument("request_timeout", 240)The Python OllamaChatModelConnection uses snake_case field names. Java→Python kwargs path doesn't rename keys, so requestTimeout was silently dropped. Changed to request_timeout to match Python schema.
577806b to
224557c
Compare
|
Update summary:
|
| assert "cat" in actual_result[1].lower(), ( | ||
| f"creative answer missing 'cat': {actual_result[1]!r}" | ||
| ) | ||
| # NOTE: We join all results and search without relying on order, because |
There was a problem hiding this comment.
Looks like there's a similar test in chat_model_cross_language_test.py that's worth fixing.
There was a problem hiding this comment.
Good catch! Fixed in the same way — joined all results and made the assertion order-insensitive in chat_model_cross_language_test.py as well.
weiqingy
left a comment
There was a problem hiding this comment.
Thanks for chasing this one down — the order-insensitive join is a clean fix for the non-deterministic iterdir() part-file ordering, and folding in both the request_timeout key correction and the sibling chat_model_cross_language_test.py fix in the same pass tied off the earlier threads nicely. One optional question inline.
| assert "3" in actual_result[0] | ||
| assert "cat" in actual_result[1] | ||
| joined = "\n".join(actual_result).lower() | ||
| assert "3" in joined, f"math answer missing '3': {actual_result!r}" |
There was a problem hiding this comment.
Now that the check searches the whole joined, lowercased output, "3" is a single digit that's very likely to appear somewhere regardless of the math result — in the cat answer, an incidental number, or the qwen3 reasoning trace — so this assertion may pass even if the math path regressed. The yaml test's "22" is specific enough to avoid that, but "3" (the answer to "1+2") doesn't have an obvious more-specific token, and the join is exactly what makes it order-insensitive. Is accepting the weaker math signal the intended trade-off here for a flaky-test hotfix, or worth a stricter check?
There was a problem hiding this comment.
Thanks for flagging this — the concern is valid. A stricter math assertion would re-introduce flakiness, and those failures would be model-capability noise rather than actual cross-language regressions. So I think accepting the weaker math signal is a reasonable trade-off for this hotfix.
The way I see it, this is two different problems:
(1) E2E cross-language behavioral consistency — the primary goal of this test. The order-insensitive join + lowercased check addresses this, and the current approach prioritizes it.
(2) Model output quality validation — a harder problem that a 1.7b model on unstable CI hardware is fundamentally ill-suited for. If we want to strengthen this later, some possible directions might be:
- Upgrading the CI model to one with more reliable arithmetic capability;
- Structuring and formalizing the prompt (e.g., explicit chain-of-thought with strict output formatting);
- Adding a post-inference verification step to verify whether the model output meets the Prompt expectation before the assertion is run.
These improvements to (2) are out of scope for this hotfix. Would love to hear your thoughts on whether this trade-off works for now, or if you'd prefer a different approach.
There was a problem hiding this comment.
Thanks for the detailed explanation — looks good as is. One out-of-scope idea for later: if the test harness can surface it — right now it only reads the file-sink text output, not tool-invocation events — asserting the add tool was actually invoked would be an order- and capability-independent signal, stronger than scanning the text for "3".
There was a problem hiding this comment.
Thanks, agreed. Given that the current harness only reads the file-sink text output, keeping the weak "3" check is reasonable for this hotfix and avoids turning it into another 1.7b model-capability flaky test.
For a follow-up, I agree that surfacing tool-invocation events would be a stronger signal. One nuance is that tool invocation and final-answer correctness are separate dimensions. From previous runs, we have seen several different behaviors: the model may answer directly without tools, call the tool with correct arguments, call the tool with hallucinated/wrong arguments, miss a later calculation step, emit a tool call as plain text instead of an actual tool call, get the correct tool result but still produce a wrong final answer, or return a response that does not match the expected schema. In conclusion, successfully calling a tool does not necessarily equate to outputting the correct answer.
So if the harness can expose tool events later, checking that add was invoked would be a stronger signal for the tool-calling path than scanning the text output alone. To make that check more meaningful, we may also want to validate the tool arguments, e.g. that add was invoked with the expected inputs, and keep final-output validation as a separate concern.
Summary
Fix flaky
YamlCrossLanguageTestPython e2e test on CI.Root cause
StreamingFileSinkemits multiple part files across checkpoints, andPath.iterdir()returns them in non-deterministic order — the original assertion indexed intoactual_result[0]/actual_result[1]and broke whenever traversal order differed.Changes
yaml_cross_language_test.py— join all part-file outputs and search for expected substrings, instead of relying on positional index.ChatModelCrossLanguageAgent.java— extend the request timeout onpythonChatModelConnection()to 240s, mirroringjavaChatModelConnection().Linked issue
N/A (CI flaky test fix).
Tests
Re-runs existing e2e
yaml_cross_language_test.py. No new tests needed.API
No public API changes.
Documentation
doc-not-needed