Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -78,6 +78,7 @@ public static ResourceDescriptor pythonChatModelConnection() {
return ResourceDescriptor.Builder.newBuilder(
ResourceName.ChatModel.PYTHON_WRAPPER_CONNECTION)
.addInitialArgument("pythonClazz", ResourceName.ChatModel.Python.OLLAMA_CONNECTION)
.addInitialArgument("request_timeout", 240)
.build();
}

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -106,5 +106,6 @@ def test_java_chat_model_integration(
with file.open() as f:
actual_result.extend(f.readlines())

assert "3" in actual_result[0]
assert "cat" in actual_result[1]
joined = "\n".join(actual_result).lower()
assert "3" in joined, f"math answer missing '3': {actual_result!r}"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that the check searches the whole joined, lowercased output, "3" is a single digit that's very likely to appear somewhere regardless of the math result — in the cat answer, an incidental number, or the qwen3 reasoning trace — so this assertion may pass even if the math path regressed. The yaml test's "22" is specific enough to avoid that, but "3" (the answer to "1+2") doesn't have an obvious more-specific token, and the join is exactly what makes it order-insensitive. Is accepting the weaker math signal the intended trade-off here for a flaky-test hotfix, or worth a stricter check?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for flagging this — the concern is valid. A stricter math assertion would re-introduce flakiness, and those failures would be model-capability noise rather than actual cross-language regressions. So I think accepting the weaker math signal is a reasonable trade-off for this hotfix.

The way I see it, this is two different problems:
(1) E2E cross-language behavioral consistency — the primary goal of this test. The order-insensitive join + lowercased check addresses this, and the current approach prioritizes it.
(2) Model output quality validation — a harder problem that a 1.7b model on unstable CI hardware is fundamentally ill-suited for. If we want to strengthen this later, some possible directions might be:

  • Upgrading the CI model to one with more reliable arithmetic capability;
  • Structuring and formalizing the prompt (e.g., explicit chain-of-thought with strict output formatting);
  • Adding a post-inference verification step to verify whether the model output meets the Prompt expectation before the assertion is run.
    These improvements to (2) are out of scope for this hotfix. Would love to hear your thoughts on whether this trade-off works for now, or if you'd prefer a different approach.

Copy link
Copy Markdown
Collaborator

@weiqingy weiqingy May 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed explanation — looks good as is. One out-of-scope idea for later: if the test harness can surface it — right now it only reads the file-sink text output, not tool-invocation events — asserting the add tool was actually invoked would be an order- and capability-independent signal, stronger than scanning the text for "3".

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, agreed. Given that the current harness only reads the file-sink text output, keeping the weak "3" check is reasonable for this hotfix and avoids turning it into another 1.7b model-capability flaky test.

For a follow-up, I agree that surfacing tool-invocation events would be a stronger signal. One nuance is that tool invocation and final-answer correctness are separate dimensions. From previous runs, we have seen several different behaviors: the model may answer directly without tools, call the tool with correct arguments, call the tool with hallucinated/wrong arguments, miss a later calculation step, emit a tool call as plain text instead of an actual tool call, get the correct tool result but still produce a wrong final answer, or return a response that does not match the expected schema. In conclusion, successfully calling a tool does not necessarily equate to outputting the correct answer.

So if the harness can expose tool events later, checking that add was invoked would be a stronger signal for the tool-calling path than scanning the text output alone. To make that check more meaningful, we may also want to validate the tool arguments, e.g. that add was invoked with the expected inputs, and keep final-output validation as a separate concern.

assert "cat" in joined, f"creative answer missing 'cat': {actual_result!r}"
Original file line number Diff line number Diff line change
Expand Up @@ -154,8 +154,10 @@ def test_yaml_cross_language_agent(

# Math path went through the Java ``calculateBMI`` tool:
# 70 / (1.75 * 1.75) ≈ 22.86, so the final answer should mention 22.
assert "22" in actual_result[0], f"math answer missing '22': {actual_result[0]!r}"
# Creative path doesn't use any tool.
assert "cat" in actual_result[1].lower(), (
f"creative answer missing 'cat': {actual_result[1]!r}"
)
# NOTE: We join all results and search without relying on order, because
Copy link
Copy Markdown
Contributor

@addu390 addu390 May 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there's a similar test in chat_model_cross_language_test.py that's worth fixing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Fixed in the same way — joined all results and made the assertion order-insensitive in chat_model_cross_language_test.py as well.

# StreamingFileSink may produce multiple part files and iterdir() does not
# guarantee a deterministic traversal order across platforms.
joined = "\n".join(actual_result).lower()
assert "22" in joined, f"math answer missing '22': {actual_result!r}"
assert "cat" in joined, f"creative answer missing 'cat': {actual_result!r}"
Loading