Skip to content

test: stateful multi-turn tool-call cassettes with context retention#77

Merged
franciscojavierarceo merged 5 commits into
vllm-project:mainfrom
ashwing:test/multi-turn-tool-cassettes
Jun 25, 2026
Merged

test: stateful multi-turn tool-call cassettes with context retention#77
franciscojavierarceo merged 5 commits into
vllm-project:mainfrom
ashwing:test/multi-turn-tool-cassettes

Conversation

@ashwing

@ashwing ashwing commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Summary

Adds 12 cassette recordings (6 vLLM + 6 OpenAI) that exercise the accumulator against real stateful multi-turn tool-call conversations using previous_response_id with store: true.

Why: The accumulator must correctly parse responses from multi-turn conversations where the server maintains state. These cassettes prove the accumulator handles both vLLM (gpt-oss-20b) and OpenAI (gpt-4o) response formats, including patterns like parallel tool calls and tool-output-only turns that only one backend reliably produces.

What's tested:

Pattern vLLM OpenAI Key assertion
3-turn linear (non-streaming) "that job" resolves to job-382
5-turn linear (non-streaming) "restart it" resolves to job-382 + 64GB
3-turn streaming (SSE) Context retained in streaming mode
Branching (turn 3 from turn 1) previous_response_id diverges
Parallel tool calls ⚠️ 1 call ✅ 2 calls Multiple function_call items in single response
Tool-output-only turn No user message → model produces text

All cassettes include function_call_output items in request inputs between turns (matching the realistic Responses API flow where tool results are fed back).

Context retention proof: Ambiguous pronouns ("that job", "those errors", "restart it") in later turns can ONLY resolve correctly if previous_response_id preserves server-side conversation state. Tests assert the resolved arguments contain job-382.

Changes

  • record_cassette.py — Added --tool-outputs FILE option that auto-injects function_call_output items into subsequent turn inputs (required by OpenAI, aligned for vLLM)
  • 12 cassette YAML files — Recorded against live vLLM (localhost:8100) and OpenAI API (gpt-4o)
  • pipeline_tool_outputs.json — Deterministic fake tool responses for reproducible recordings
  • accumulator_cassette_test.rs — 7 new tests (32 total), covering OpenAI parsing, parallel calls, tool-output-only, and cross-backend parse verification

Scenario

All cassettes share the same SRE debugging scenario: investigating a failed ETL pipeline job (job-382, OOM in transform stage). Tools: get_job_status, get_error_logs, search_runbook, run_analysis, restart_job, web_search.

Test Plan

  • cargo test --package agentic-core --test accumulator_cassette_test — 32 tests pass
  • cargo clippy --workspace --all-targets -- -D warnings — clean
  • cargo fmt -- --check — clean

Relates to #67 (stateful conversation support in executor)

and turn depths (3-5 turns) from a coherent SRE debugging story.

Usage:
python3 record_pipeline_debug_cassettes.py <vllm_url> <model> <output_dir>

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not using the current recoder we have https://github.com/vllm-project/agentic-api/blob/main/crates/agentic-core/tests/cassettes/record_cassette.py
it supports both vllm and openai. we should keep one true source of cassettes recorder. if you need to add anything additional that current recoder does not support you can add in.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — removed the custom script entirely. All cassettes now use the existing record_cassette.py --mode responses with --tools and --tool-choice auto. Added a record.sh that documents the exact invocations for reproducibility.

Comment thread crates/agentic-core/src/types/io.rs Outdated
}

fn default_completed_status() -> String {
"completed".to_string()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use message status enum in

pub enum MessageStatus {
/// Message is being generated.
#[default]
InProgress,
/// Message generation completed.
Completed,
}
impl MessageStatus {
/// Returns the canonical wire string for this status.
#[must_use]
pub fn as_str(self) -> &'static str {
match self {
Self::InProgress => "in_progress",
Self::Completed => "completed",
}
}
}
impl FromStr for MessageStatus {
type Err = Infallible;
fn from_str(s: &str) -> Result<Self, Self::Err> {
Ok(match s {
"completed" => Self::Completed,
_ => Self::InProgress,
})
}
}

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — FunctionToolCall.status is now MessageStatus (imported from event.rs). Added a custom serde deserializer that maps nullMessageStatus::Completed since gpt-oss emits null for completed calls.

@maralbahari

maralbahari commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

@ashwing thank you for the test support.
I added two inline comments. I noticed you've intended to record stateful multi-turn conversation cassettes but the request and responses are actually stateless not multi-turn each turn in the yaml file is a standalone prompt not a turn based on previous_response_id, or conversation_id while setting store to true. the current recoded cassettes shows store: false. for stateful multi turn conversation/responses we can only rely on OpenAI and use their model. while for vLLM upstream it would be always be single turn stateless results which we already have them covered in https://github.com/vllm-project/agentic-api/tree/main/crates/agentic-core/tests/cassettes/tool_calls.
what is the current test coverage is missing is stateful conversation/responses with tool calls to be recorded from OpenAI to you can use https://github.com/vllm-project/agentic-api/blob/main/crates/agentic-core/tests/cassettes/record_cassette.py to record conversation session or responses on OpenAI with different setting of streaming, non streaming , different response_id divergence (branching from different response in a conversation) etc. the script already supports passing tool choices as well but if it needs additional usage that your case require you can modify the python recorder. and keep a bash script of recoded cases for reference.

examples of simple multi-turn conversation :
conversation 1:
turn 1: what is the time in Singapore and in New York (get_time)?
turn 2: what is the time difference between them? (follow up)

conversation 2:
turn 1: search online (web_search ) cost of living in Singapore and New York
turn 2: based on search result what is the average rent in Singapore vs New York
turn 3: calculate the difference of average rent costs between them in percentage in a python script.

conversation 3:
turn 1: translate "Hi , my name is Alice" to Italian
turn2 : now translate it to Spanish.

@ashwing ashwing force-pushed the test/multi-turn-tool-cassettes branch from 9a6d54b to 06ded6c Compare June 24, 2026 06:11
@ashwing ashwing marked this pull request as draft June 24, 2026 06:13
…nse_id

Replace stateless multi-turn cassettes with properly stateful ones recorded
against gpt-oss-20b (vLLM 0.22) with VLLM_ENABLE_RESPONSES_API_STORE=1.
Each turn chains via previous_response_id with store=true, exercising the
accumulator on real server-side conversation state.

Cassettes recorded using the existing record_cassette.py (--mode responses):
- 3-turn non-streaming: get_job_status → get_error_logs → search_runbook
- 5-turn non-streaming: full investigation pipeline (5 different tools)
- 3-turn streaming: SSE events with previous_response_id chaining
- 3-turn branch: turn 3 diverges from turn 1 (not turn 2)

Also includes:
- FunctionToolCall.status changed from String to MessageStatus enum
- Custom deserializer handles gpt-oss `status: null` → Completed
- record.sh + pipeline_tools.json for reproducibility

Signed-off-by: Ashwin Giridharan <girida@amazon.com>
D: serde::Deserializer<'de>,
{
let opt: Option<MessageStatus> = Option::deserialize(deserializer)?;
Ok(opt.unwrap_or(MessageStatus::Completed))

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

important note for gpt-oss serving if you used vLLM upstream serving you might make sure that you have passed correct tool parser. maybe the tool parser that you served the model on vLLM was not able to catch current status.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed — this isn't a tool parser config issue, it's how vLLM's Responses API serializes completed function calls (status field is always null in the output). The fix is a deserialize_with handler that defaults null to Completed. All 4 cassettes have "status": null in responses and the tests verify they deserialize to MessageStatus::Completed.

@maralbahari maralbahari Jun 25, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ashwing this behavior only when running gpt-oss on vLLM right? if so can you check running gpt-oss on OpenAI and see the behavior is the same there just to confirm that this is not a bug from vLLM upstream.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed — vLLM-specific. The OpenAI cassettes in this PR (recorded against gpt-4o) show "status": "completed" on function_call items. vLLM returns "status": null. The deserializer handles both: null defaults to Completed, string "completed" maps directly.

@maralbahari maralbahari Jun 25, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ashwing I mean if such behavior is only on a harmony model being served on vLLM it might be bug on vLLM upstream we need to note about to open issue. otherwise we should not fix bug from upstream by mapping into a default. ordinarily the behaviors of responses on vLLM should match OpenAI Responses API. would need to confirm if "status": null also occurs for serving other non harmony models on vLLM.
for now it should be fine to handle this way but later if there is a bug from vLLM we need to address that. would need to confirm this.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ashwing I think the tool parser you selected is wrong. https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#tool-use based on this vllm's recipe it has to be openai

@ashwing ashwing Jun 25, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested both --tool-call-parser openai (per the vLLM recipe) and hermes — same result with both. The earlier table only showed hermes but I had already validated both.

Did root-cause investigation. Confirmed bug in vLLM's Harmony codepath on current main — no existing issue or PR for it.

Test matrix

All tests use /v1/responses, store=true, non-streaming:

Model --tool-call-parser vLLM function_call.status
Gemma4 26B functiongemma v0.21.0 "completed"
Hermes-3-Llama-3.1-8B hermes v0.22.0 "completed"
gpt-oss-20b openai v0.22.0 null
gpt-oss-20b hermes v0.22.0 null
OpenAI gpt-4o (ground truth) "completed"
OpenAI gpt-4o-mini (ground truth) "completed"

Parser doesn't matter — gpt-oss returns null with both openai and hermes. Non-Harmony models (Gemma4, Hermes-3) return "completed" correctly.

Root cause in vLLM source

Non-Harmony path (responses_parser.py:101) — correctly sets status:

ResponseFunctionToolCall(
    id=f"fc_{random_uuid()}",
    call_id=f"call_{random_uuid()}",
    type="function_call",
    status="completed",  # ✓ present
    name=tool_call.function.name,
    arguments=tool_call.function.arguments,
)

Harmony path (harmony.py:315) — omits status:

ResponseFunctionToolCall(
    arguments=content.text,
    call_id=f"call_{random_id}",
    type="function_call",
    name=function_name,
    id=f"fc_{random_id}",
    # ← no status, defaults to None via Pydantic
)

Since ResponseFunctionToolCall.status is Optional[...] = None in the OpenAI SDK, it silently serializes as null. One-line fix: add status="completed" to _parse_function_call.

Current state

  • Not fixed on vLLM main — verified on HEAD (f59db63, June 23). _parse_function_call is unchanged.
  • No existing issue or PR — searched multiple queries, nothing filed.
  • Will open an upstream vLLM issue for this.

Our deserializer handles both null and "completed" correctly in the meantime.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ashwing thanks for double checking. I do understand the default deserialization is handling the case but is not an actual fix. we need to make sure vLLM Responses API is mirroring OpenAI if not in cases like this means there is a bug on vLLM. worth taking note instead of us silently resolving issues on agentic-api this way. BTW the harmony models like gpt-oss are not really part of MVP either.It's good to have them in cassettes to know where we stand but we should rely on other models to serve on vLLM for agentic-api verification.

@tjtanaa tjtanaa Jun 25, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ashwing please file an issue to vLLM upstream.

CC @bbrowning @chaunceyjiang

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed upstream: vllm-project/vllm#46940

@ashwing ashwing force-pushed the test/multi-turn-tool-cassettes branch from 06ded6c to d1b0bec Compare June 24, 2026 17:47
@ashwing

ashwing commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator Author

Good catch — the original cassettes were indeed stateless (store=false, self-contained prompts). Re-recorded all 4 cassettes against gpt-oss-20b with VLLM_ENABLE_RESPONSES_API_STORE=1 — this enables vLLM's in-memory store so previous_response_id actually chains server-side state.

To verify retention isn't just structural, turns 2+ use ambiguous prompts ("pull the error logs for that job", "restart it with 64GB") — the model resolves to job-382 only because it has prior context via previous_response_id. Tests assert this behaviorally.

Coverage: non-streaming (3-turn + 5-turn), streaming (3-turn), and branching (turn 3 diverges from turn 1). OpenAI-recorded cassettes will follow in a subsequent push.

@ashwing ashwing changed the title test: add multi-turn pipeline-debug cassettes for accumulator validation test: add multi-turn cassettes for accumulator validation Jun 24, 2026
…erns

Record 6 new OpenAI (gpt-4o) cassettes structurally aligned with the
existing vLLM cassettes. Add 2 new vLLM cassettes for parallel tool
calls and tool-output-only turns. Re-record existing vLLM cassettes
with function_call_output items in request inputs for realistic flow.

Patch record_cassette.py with --tool-outputs support to automatically
inject function_call_output items between turns (required by OpenAI,
aligned for vLLM).

New test coverage:
- OpenAI responses parse identically (status="completed" not null)
- Parallel tool calls (2 function_calls in single response)
- Tool-output-only turn (no user message, model continues)
- function_call_output items present in request inputs
- Context retention verified across both backends

Signed-off-by: Ashwin Giridharan <girida@amazon.com>
@ashwing ashwing changed the title test: add multi-turn cassettes for accumulator validation test: stateful multi-turn tool-call cassettes with context retention Jun 24, 2026
ashwing added 2 commits June 24, 2026 13:26
Add missing vLLM recordings (parallel, tool-output-only) and all 6
OpenAI recordings. Support ./record.sh [vllm|openai|all] to select
backend. All commands now include --tool-outputs for realistic flow.

Signed-off-by: Ashwin Giridharan <girida@amazon.com>
Re-recorded all vLLM and OpenAI cassettes using the updated record.sh
script to ensure reproducibility. Fixed record.sh path resolution and
removed nonexistent --tool-output-only-turn flag (empty stdin line
already achieves this). Final summary now shows actual file count.

Signed-off-by: Ashwin Giridharan <girida@amazon.com>
@ashwing ashwing marked this pull request as ready for review June 24, 2026 20:35
@maralbahari

Copy link
Copy Markdown
Collaborator

Good catch — the original cassettes were indeed stateless (store=false, self-contained prompts). Re-recorded all 4 cassettes against gpt-oss-20b with VLLM_ENABLE_RESPONSES_API_STORE=1 — this enables vLLM's in-memory store so previous_response_id actually chains server-side state.

To verify retention isn't just structural, turns 2+ use ambiguous prompts ("pull the error logs for that job", "restart it with 64GB") — the model resolves to job-382 only because it has prior context via previous_response_id. Tests assert this behaviorally.

Coverage: non-streaming (3-turn + 5-turn), streaming (3-turn), and branching (turn 3 diverges from turn 1). OpenAI-recorded cassettes will follow in a subsequent push.

@ashwing thank you ashwin for the adjustments.


/// Verifies that the cassette has proper `previous_response_id` chaining:
/// turn 1 has no `prev_id`, subsequent turns reference the prior response.
fn assert_stateful_chaining(cassette: &TurnCassette) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ashwing this is not an intuitive test. this is what we intended to record the cassettes based off. the previous id chaining is not a concern for accumulator functionality

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right — the accumulator only processes response bodies, not request structure. I added assert_stateful_chaining originally as a recording integrity check to prove the cassettes actually captured multi-turn state (not just self-contained prompts that happen to mention job-382), but it doesn't belong in the accumulator test suite. The context retention assertions (t2_args.contains("job-382")) already prove the cassettes captured real stateful conversations without inspecting requests. Removed it along with the two tests that only called it in 8ca36bb.

@maralbahari

Copy link
Copy Markdown
Collaborator

@ashwing it would be great test coverage perhaps in another PR to use these cassettes to validate execution like https://github.com/vllm-project/agentic-api/blob/main/crates/agentic-core/tests/stateful_responses_integration.rs the test cases here are based on the text only scenarios without tool calling that targets to test everything from agentic-core/executor/ which also verifies the correctness of rehydration and persistence in agentic-core/storage.

The accumulator only processes response bodies — request structure
(previous_response_id, store) is not its concern. Context retention
is already proven by content assertions (e.g. "that job" resolving
to job-382). Remove the helper and 2 tests that only validated
recording integrity rather than accumulator behavior.

Signed-off-by: Ashwin Giridharan <girida@amazon.com>
@ashwing

ashwing commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator Author

Agreed — these tool-call cassettes slot directly into a stateful_responses_integration.rs extension that exercises the full executor loop (dispatch → store → rehydrate → continue). Will target that as a follow-up once the executor handles function_call output items.

@maralbahari maralbahari left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@franciscojavierarceo franciscojavierarceo merged commit aff5e0d into vllm-project:main Jun 25, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants