-
Notifications
You must be signed in to change notification settings - Fork 123
[hotfix] Make cross-language e2e test order-insensitive and add request time #713
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
9c3231b
224557c
d43f7f2
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -154,8 +154,10 @@ def test_yaml_cross_language_agent( | |
|
|
||
| # Math path went through the Java ``calculateBMI`` tool: | ||
| # 70 / (1.75 * 1.75) ≈ 22.86, so the final answer should mention 22. | ||
| assert "22" in actual_result[0], f"math answer missing '22': {actual_result[0]!r}" | ||
| # Creative path doesn't use any tool. | ||
| assert "cat" in actual_result[1].lower(), ( | ||
| f"creative answer missing 'cat': {actual_result[1]!r}" | ||
| ) | ||
| # NOTE: We join all results and search without relying on order, because | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Looks like there's a similar test in
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good catch! Fixed in the same way — joined all results and made the assertion order-insensitive in chat_model_cross_language_test.py as well. |
||
| # StreamingFileSink may produce multiple part files and iterdir() does not | ||
| # guarantee a deterministic traversal order across platforms. | ||
| joined = "\n".join(actual_result).lower() | ||
| assert "22" in joined, f"math answer missing '22': {actual_result!r}" | ||
| assert "cat" in joined, f"creative answer missing 'cat': {actual_result!r}" | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that the check searches the whole joined, lowercased output,
"3"is a single digit that's very likely to appear somewhere regardless of the math result — in the cat answer, an incidental number, or the qwen3 reasoning trace — so this assertion may pass even if the math path regressed. The yaml test's"22"is specific enough to avoid that, but"3"(the answer to "1+2") doesn't have an obvious more-specific token, and the join is exactly what makes it order-insensitive. Is accepting the weaker math signal the intended trade-off here for a flaky-test hotfix, or worth a stricter check?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for flagging this — the concern is valid. A stricter math assertion would re-introduce flakiness, and those failures would be model-capability noise rather than actual cross-language regressions. So I think accepting the weaker math signal is a reasonable trade-off for this hotfix.
The way I see it, this is two different problems:
(1) E2E cross-language behavioral consistency — the primary goal of this test. The order-insensitive join + lowercased check addresses this, and the current approach prioritizes it.
(2) Model output quality validation — a harder problem that a 1.7b model on unstable CI hardware is fundamentally ill-suited for. If we want to strengthen this later, some possible directions might be:
These improvements to (2) are out of scope for this hotfix. Would love to hear your thoughts on whether this trade-off works for now, or if you'd prefer a different approach.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed explanation — looks good as is. One out-of-scope idea for later: if the test harness can surface it — right now it only reads the file-sink text output, not tool-invocation events — asserting the
addtool was actually invoked would be an order- and capability-independent signal, stronger than scanning the text for"3".There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, agreed. Given that the current harness only reads the file-sink text output, keeping the weak
"3"check is reasonable for this hotfix and avoids turning it into another 1.7b model-capability flaky test.For a follow-up, I agree that surfacing tool-invocation events would be a stronger signal. One nuance is that tool invocation and final-answer correctness are separate dimensions. From previous runs, we have seen several different behaviors: the model may answer directly without tools, call the tool with correct arguments, call the tool with hallucinated/wrong arguments, miss a later calculation step, emit a tool call as plain text instead of an actual tool call, get the correct tool result but still produce a wrong final answer, or return a response that does not match the expected schema. In conclusion, successfully calling a tool does not necessarily equate to outputting the correct answer.
So if the harness can expose tool events later, checking that
addwas invoked would be a stronger signal for the tool-calling path than scanning the text output alone. To make that check more meaningful, we may also want to validate the tool arguments, e.g. thataddwas invoked with the expected inputs, and keep final-output validation as a separate concern.