test: stateful multi-turn tool-call cassettes with context retention #77
Force-pushed from de781c1 to 9a6d54b.
and turn depths (3-5 turns) from a coherent SRE debugging story.

Usage:
    python3 record_pipeline_debug_cassettes.py <vllm_url> <model> <output_dir>
Why not use the existing recorder we have? https://github.com/vllm-project/agentic-api/blob/main/crates/agentic-core/tests/cassettes/record_cassette.py
It supports both vLLM and OpenAI. We should keep a single source of truth for the cassette recorder. If you need anything the current recorder does not support, you can add it there.
Agreed — removed the custom script entirely. All cassettes now use the existing record_cassette.py --mode responses with --tools and --tool-choice auto. Added a record.sh that documents the exact invocations for reproducibility.
}

fn default_completed_status() -> String {
    "completed".to_string()
You can use the MessageStatus enum in agentic-api/crates/agentic-core/src/types/event.rs (lines 55 to 85 in 2c32b0d).
Done — FunctionToolCall.status is now MessageStatus (imported from event.rs). Added a custom serde deserializer that maps null → MessageStatus::Completed since gpt-oss emits null for completed calls.
@ashwing thank you for the test support. Examples of simple multi-turn conversations: conversation 2: conversation 3:
Force-pushed from 9a6d54b to 06ded6c.
…nse_id

Replace stateless multi-turn cassettes with properly stateful ones recorded against gpt-oss-20b (vLLM 0.22) with VLLM_ENABLE_RESPONSES_API_STORE=1. Each turn chains via previous_response_id with store=true, exercising the accumulator on real server-side conversation state.

Cassettes recorded using the existing record_cassette.py (--mode responses):
- 3-turn non-streaming: get_job_status → get_error_logs → search_runbook
- 5-turn non-streaming: full investigation pipeline (5 different tools)
- 3-turn streaming: SSE events with previous_response_id chaining
- 3-turn branch: turn 3 diverges from turn 1 (not turn 2)

Also includes:
- FunctionToolCall.status changed from String to MessageStatus enum
- Custom deserializer handles gpt-oss `status: null` → Completed
- record.sh + pipeline_tools.json for reproducibility

Signed-off-by: Ashwin Giridharan <girida@amazon.com>
    D: serde::Deserializer<'de>,
{
    let opt: Option<MessageStatus> = Option::deserialize(deserializer)?;
    Ok(opt.unwrap_or(MessageStatus::Completed))
Important note for gpt-oss serving: if you served the model with upstream vLLM, make sure you passed the correct tool parser. The tool parser you used may not have been able to capture the status.
Confirmed — this isn't a tool parser config issue, it's how vLLM's Responses API serializes completed function calls (status field is always null in the output). The fix is a deserialize_with handler that defaults null to Completed. All 4 cassettes have "status": null in responses and the tests verify they deserialize to MessageStatus::Completed.
@ashwing this behavior occurs only when running gpt-oss on vLLM, right? If so, can you check gpt-oss on OpenAI and see whether the behavior is the same there, just to confirm that this is not a bug in upstream vLLM?
Confirmed — vLLM-specific. The OpenAI cassettes in this PR (recorded against gpt-4o) show "status": "completed" on function_call items. vLLM returns "status": null. The deserializer handles both: null defaults to Completed, string "completed" maps directly.
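The null-to-Completed mapping discussed here can be illustrated with a small Python sketch. It mirrors only the logic of the Rust deserializer, not its actual code: the helper name `parse_status` is hypothetical, and the enum members other than `COMPLETED` are assumptions about what `MessageStatus` might contain.

```python
from enum import Enum
from typing import Optional


class MessageStatus(Enum):
    # Only "completed" is confirmed by this thread; the other members are assumed.
    IN_PROGRESS = "in_progress"
    COMPLETED = "completed"
    INCOMPLETE = "incomplete"


def parse_status(raw: Optional[str]) -> MessageStatus:
    """Map a raw `status` value from a function_call item to MessageStatus.

    A missing/null status (as emitted by vLLM for gpt-oss) defaults to
    COMPLETED; a string status is mapped to the matching enum member.
    """
    if raw is None:
        return MessageStatus.COMPLETED
    return MessageStatus(raw)


vllm_status = parse_status(None)           # vLLM gpt-oss: "status": null
openai_status = parse_status("completed")  # OpenAI: "status": "completed"
```

Both inputs end up as the same enum value, which is why the same tests can run against cassettes from either backend.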
@ashwing I mean that if this behavior occurs only for a Harmony model served on vLLM, it might be a bug in upstream vLLM, and we should note it and open an issue. Otherwise, we should not fix an upstream bug by mapping to a default. Ordinarily, the behavior of the Responses API on vLLM should match the OpenAI Responses API. We need to confirm whether "status": null also occurs when serving non-Harmony models on vLLM.
For now it is fine to handle it this way, but if this turns out to be a vLLM bug we need to address it later. This needs confirmation.
@ashwing I think the tool parser you selected is wrong. Based on vLLM's recipe (https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#tool-use), it has to be openai.
I tested both --tool-call-parser openai (per the vLLM recipe) and hermes — same result with both. The earlier table only showed hermes but I had already validated both.
Did root-cause investigation. Confirmed bug in vLLM's Harmony codepath on current main — no existing issue or PR for it.
Test matrix
All tests use /v1/responses, store=true, non-streaming:
| Model | --tool-call-parser | vLLM | function_call.status |
|---|---|---|---|
| Gemma4 26B | functiongemma | v0.21.0 | "completed" |
| Hermes-3-Llama-3.1-8B | hermes | v0.22.0 | "completed" |
| gpt-oss-20b | openai | v0.22.0 | null |
| gpt-oss-20b | hermes | v0.22.0 | null |
| OpenAI gpt-4o (ground truth) | — | — | "completed" |
| OpenAI gpt-4o-mini (ground truth) | — | — | "completed" |
Parser doesn't matter — gpt-oss returns null with both openai and hermes. Non-Harmony models (Gemma4, Hermes-3) return "completed" correctly.
Root cause in vLLM source
Non-Harmony path (responses_parser.py:101), which correctly sets status:

    ResponseFunctionToolCall(
        id=f"fc_{random_uuid()}",
        call_id=f"call_{random_uuid()}",
        type="function_call",
        status="completed",  # ✓ present
        name=tool_call.function.name,
        arguments=tool_call.function.arguments,
    )

Harmony path (harmony.py:315), which omits status:

    ResponseFunctionToolCall(
        arguments=content.text,
        call_id=f"call_{random_id}",
        type="function_call",
        name=function_name,
        id=f"fc_{random_id}",
        # ← no status, defaults to None via Pydantic
    )

Since ResponseFunctionToolCall.status is Optional[...] = None in the OpenAI SDK, it silently serializes as null. One-line fix: add status="completed" to _parse_function_call.
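To make the failure mode concrete, here is a minimal Python sketch that uses a plain dataclass as a stand-in for the SDK's Pydantic model. The class below is hypothetical (it is not vLLM or OpenAI SDK code), and the `job_id` argument key is an assumption made for illustration. It shows how an optional field that is never set serializes as null, and how passing the status explicitly changes the output.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class FunctionToolCall:
    # Hypothetical stand-in for ResponseFunctionToolCall; field names follow
    # the snippets quoted above.
    arguments: str
    call_id: str
    name: str
    id: str
    type: str = "function_call"
    status: Optional[str] = None  # never set -> None -> JSON null


# Harmony-style construction: status is not passed at all.
harmony_style = FunctionToolCall(
    arguments='{"job_id": "job-382"}',  # argument key is an assumption
    call_id="call_1",
    name="get_job_status",
    id="fc_1",
)

# Non-Harmony-style construction: status is passed explicitly.
fixed = FunctionToolCall(
    arguments='{"job_id": "job-382"}',
    call_id="call_1",
    name="get_job_status",
    id="fc_1",
    status="completed",
)

harmony_json = json.dumps(asdict(harmony_style))
fixed_json = json.dumps(asdict(fixed))
print(harmony_json)
print(fixed_json)
```

The only difference between the two serialized objects is the `status` field, which matches the one-line nature of the proposed upstream fix.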
Current state
- Not fixed on vLLM main: verified on HEAD (f59db63, June 23). _parse_function_call is unchanged.
- No existing issue or PR: searched multiple queries, nothing filed.
- Will open an upstream vLLM issue for this.
Our deserializer handles both null and "completed" correctly in the meantime.
@ashwing thanks for double-checking. I understand that the default deserialization handles this case, but it is not an actual fix. We need to make sure the vLLM Responses API mirrors OpenAI; when it does not, as in this case, there is a bug in vLLM. That is worth noting rather than silently working around such issues in agentic-api. By the way, Harmony models like gpt-oss are not really part of the MVP either. It's good to have them in cassettes to know where we stand, but we should rely on other models served on vLLM for agentic-api verification.
Filed upstream: vllm-project/vllm#46940
Force-pushed from 06ded6c to d1b0bec.
Good catch: the original cassettes were indeed stateless (store=false, self-contained prompts). Re-recorded all 4 cassettes against gpt-oss-20b with store=true and previous_response_id chaining.

To verify retention isn't just structural, turns 2+ use ambiguous prompts ("pull the error logs for that job", "restart it with 64GB"); the model resolves them to job-382.

Coverage: non-streaming (3-turn + 5-turn), streaming (3-turn), and branching (turn 3 diverges from turn 1). OpenAI-recorded cassettes will follow in a subsequent push.
…erns

Record 6 new OpenAI (gpt-4o) cassettes structurally aligned with the existing vLLM cassettes. Add 2 new vLLM cassettes for parallel tool calls and tool-output-only turns. Re-record existing vLLM cassettes with function_call_output items in request inputs for realistic flow.

Patch record_cassette.py with --tool-outputs support to automatically inject function_call_output items between turns (required by OpenAI, aligned for vLLM).

New test coverage:
- OpenAI responses parse identically (status="completed" not null)
- Parallel tool calls (2 function_calls in single response)
- Tool-output-only turn (no user message, model continues)
- function_call_output items present in request inputs
- Context retention verified across both backends

Signed-off-by: Ashwin Giridharan <girida@amazon.com>

Add missing vLLM recordings (parallel, tool-output-only) and all 6 OpenAI recordings. Support ./record.sh [vllm|openai|all] to select backend. All commands now include --tool-outputs for realistic flow.

Signed-off-by: Ashwin Giridharan <girida@amazon.com>

Re-recorded all vLLM and OpenAI cassettes using the updated record.sh script to ensure reproducibility. Fixed record.sh path resolution and removed nonexistent --tool-output-only-turn flag (empty stdin line already achieves this). Final summary now shows actual file count.

Signed-off-by: Ashwin Giridharan <girida@amazon.com>
@ashwing thank you, Ashwin, for the adjustments.

/// Verifies that the cassette has proper `previous_response_id` chaining:
/// turn 1 has no `prev_id`, subsequent turns reference the prior response.
fn assert_stateful_chaining(cassette: &TurnCassette) {
@ashwing this is not an intuitive test. Chaining is the basis on which we intended the cassettes to be recorded, but previous_response_id chaining is not a concern for accumulator functionality.
You're right — the accumulator only processes response bodies, not request structure. I added assert_stateful_chaining originally as a recording integrity check to prove the cassettes actually captured multi-turn state (not just self-contained prompts that happen to mention job-382), but it doesn't belong in the accumulator test suite. The context retention assertions (t2_args.contains("job-382")) already prove the cassettes captured real stateful conversations without inspecting requests. Removed it along with the two tests that only called it in 8ca36bb.
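The content-based retention check described above can be sketched in Python as follows. This is an illustration rather than the Rust test in the PR: the helper name, the sample response body, and the `job_id` argument key are assumptions. Only the general Responses API shape (an `output` list containing `function_call` items with an `arguments` JSON string) is relied on.

```python
import json
from typing import List


def function_call_args(response_body: dict) -> List[str]:
    """Collect the raw `arguments` strings of all function_call output items."""
    return [
        item["arguments"]
        for item in response_body.get("output", [])
        if item.get("type") == "function_call"
    ]


# Hypothetical turn-2 response for the prompt "pull the error logs for that job".
# The prompt never names the job, so "job-382" can only come from retained context.
turn2_response = {
    "id": "resp_2",
    "output": [
        {
            "type": "function_call",
            "name": "get_error_logs",
            "call_id": "call_2",
            "arguments": json.dumps({"job_id": "job-382"}),
            "status": None,
        }
    ],
}

t2_args = function_call_args(turn2_response)
retained = any("job-382" in args for args in t2_args)
```

Because the check reads only the response body, it stays within the accumulator's scope and does not need to inspect `previous_response_id` or `store` in the request.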
@ashwing it would be great test coverage, perhaps in another PR, to use these cassettes to validate
The accumulator only processes response bodies; request structure (previous_response_id, store) is not its concern. Context retention is already proven by content assertions (e.g. "that job" resolving to job-382). Remove the helper and 2 tests that only validated recording integrity rather than accumulator behavior.

Signed-off-by: Ashwin Giridharan <girida@amazon.com>
Agreed, these tool-call cassettes slot directly into a
Summary
Adds 12 cassette recordings (6 vLLM + 6 OpenAI) that exercise the accumulator against real stateful multi-turn tool-call conversations using previous_response_id with store: true.

Why: The accumulator must correctly parse responses from multi-turn conversations where the server maintains state. These cassettes prove the accumulator handles both vLLM (gpt-oss-20b) and OpenAI (gpt-4o) response formats, including patterns like parallel tool calls and tool-output-only turns that only one backend reliably produces.

What's tested:
- 3-turn non-streaming
- 5-turn non-streaming
- 3-turn streaming
- Branching: turn 3's previous_response_id diverges (chains from turn 1, not turn 2)
- Parallel tool calls: 2 function_call items in single response
- Tool-output-only turn (no user message, model continues)

All cassettes include function_call_output items in request inputs between turns (matching the realistic Responses API flow where tool results are fed back).

Context retention proof: Ambiguous pronouns ("that job", "those errors", "restart it") in later turns can ONLY resolve correctly if previous_response_id preserves server-side conversation state. Tests assert the resolved arguments contain job-382.

Changes
- record_cassette.py: Added --tool-outputs FILE option that auto-injects function_call_output items into subsequent turn inputs (required by OpenAI, aligned for vLLM)
- pipeline_tool_outputs.json: Deterministic fake tool responses for reproducible recordings
- accumulator_cassette_test.rs: 7 new tests (32 total), covering OpenAI parsing, parallel calls, tool-output-only, and cross-backend parse verification

Scenario
All cassettes share the same SRE debugging scenario: investigating a failed ETL pipeline job (job-382, OOM in transform stage). Tools: get_job_status, get_error_logs, search_runbook, run_analysis, restart_job, web_search.

Test Plan
- cargo test --package agentic-core --test accumulator_cassette_test: 32 tests pass
- cargo clippy --workspace --all-targets -- -D warnings: clean
- cargo fmt -- --check: clean

Relates to #67 (stateful conversation support in executor)
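The `--tool-outputs` injection listed under Changes could look roughly like the sketch below. This is not the code in record_cassette.py: the function name is hypothetical, and it assumes the tool-outputs file maps each tool name to a fixed output string, which may differ from the real layout of pipeline_tool_outputs.json. The `function_call_output` item shape (`type`, `call_id`, `output`) follows the Responses API.

```python
from typing import Dict, List


def build_tool_output_items(
    response_body: dict, tool_outputs: Dict[str, str]
) -> List[dict]:
    """Build function_call_output items answering each function_call in a response.

    `tool_outputs` is assumed to map tool name -> canned output string.
    """
    items = []
    for item in response_body.get("output", []):
        if item.get("type") != "function_call":
            continue
        items.append(
            {
                "type": "function_call_output",
                "call_id": item["call_id"],  # must echo the call being answered
                "output": tool_outputs.get(item["name"], ""),
            }
        )
    return items


# Hypothetical previous turn containing two parallel tool calls.
previous_response = {
    "id": "resp_1",
    "output": [
        {"type": "function_call", "name": "get_job_status",
         "call_id": "call_a", "arguments": "{}"},
        {"type": "function_call", "name": "get_error_logs",
         "call_id": "call_b", "arguments": "{}"},
    ],
}
fake_outputs = {
    "get_job_status": '{"status": "failed"}',
    "get_error_logs": '{"error": "OOM in transform stage"}',
}

# For a tool-output-only turn, these items alone form the next request input.
next_input = build_tool_output_items(previous_response, fake_outputs)
```

Keeping the canned outputs in a file rather than calling real tools is what makes the recordings deterministic and repeatable across backends.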