test: stateful multi-turn tool-call cassettes with context retention by ashwing · Pull Request #77 · vllm-project/agentic-api

ashwing · 2026-06-23T23:22:42Z

Summary

Adds 12 cassette recordings (6 vLLM + 6 OpenAI) that exercise the accumulator against real stateful multi-turn tool-call conversations using previous_response_id with store: true.

Why: The accumulator must correctly parse responses from multi-turn conversations where the server maintains state. These cassettes prove the accumulator handles both vLLM (gpt-oss-20b) and OpenAI (gpt-4o) response formats, including patterns like parallel tool calls and tool-output-only turns that only one backend reliably produces.

What's tested:

Pattern	vLLM	OpenAI	Key assertion
3-turn linear (non-streaming)	✅	✅	"that job" resolves to job-382
5-turn linear (non-streaming)	✅	✅	"restart it" resolves to job-382 + 64GB
3-turn streaming (SSE)	✅	✅	Context retained in streaming mode
Branching (turn 3 from turn 1)	✅	✅	`previous_response_id` diverges
Parallel tool calls	⚠️ 1 call	✅ 2 calls	Multiple `function_call` items in single response
Tool-output-only turn	✅	✅	No user message → model produces text

All cassettes include function_call_output items in request inputs between turns (matching the realistic Responses API flow where tool results are fed back).

Context retention proof: Ambiguous pronouns ("that job", "those errors", "restart it") in later turns can ONLY resolve correctly if previous_response_id preserves server-side conversation state. Tests assert the resolved arguments contain job-382.

Changes

record_cassette.py — Added --tool-outputs FILE option that auto-injects function_call_output items into subsequent turn inputs (required by OpenAI, aligned for vLLM)
12 cassette YAML files — Recorded against live vLLM (localhost:8100) and OpenAI API (gpt-4o)
pipeline_tool_outputs.json — Deterministic fake tool responses for reproducible recordings
accumulator_cassette_test.rs — 7 new tests (32 total), covering OpenAI parsing, parallel calls, tool-output-only, and cross-backend parse verification

Scenario

All cassettes share the same SRE debugging scenario: investigating a failed ETL pipeline job (job-382, OOM in transform stage). Tools: get_job_status, get_error_logs, search_runbook, run_analysis, restart_job, web_search.

Test Plan

cargo test --package agentic-core --test accumulator_cassette_test — 32 tests pass
cargo clippy --workspace --all-targets -- -D warnings — clean
cargo fmt -- --check — clean

Relates to #67 (stateful conversation support in executor)

maralbahari · 2026-06-24T02:37:10Z

+and turn depths (3-5 turns) from a coherent SRE debugging story.
+
+Usage:
+    python3 record_pipeline_debug_cassettes.py <vllm_url> <model> <output_dir>


why not using the current recoder we have https://github.com/vllm-project/agentic-api/blob/main/crates/agentic-core/tests/cassettes/record_cassette.py
it supports both vllm and openai. we should keep one true source of cassettes recorder. if you need to add anything additional that current recoder does not support you can add in.

Agreed — removed the custom script entirely. All cassettes now use the existing record_cassette.py --mode responses with --tools and --tool-choice auto. Added a record.sh that documents the exact invocations for reproducibility.

maralbahari · 2026-06-24T02:46:34Z

 }

+fn default_completed_status() -> String {
+    "completed".to_string()


you can use message status enum in

agentic-api/crates/agentic-core/src/types/event.rs

Lines 55 to 85 in 2c32b0d

pub enum MessageStatus {

/// Message is being generated.

#[default]

InProgress,

/// Message generation completed.

Completed,

}

impl MessageStatus {

/// Returns the canonical wire string for this status.

#[must_use]

pub fn as_str(self) -> &'static str {

match self {

Self::InProgress => "in_progress",

Self::Completed => "completed",

}

}

}

impl FromStr for MessageStatus {

type Err = Infallible;

fn from_str(s: &str) -> Result<Self, Self::Err> {

Ok(match s {

"completed" => Self::Completed,

_ => Self::InProgress,

})

}

}

Done — FunctionToolCall.status is now MessageStatus (imported from event.rs). Added a custom serde deserializer that maps null → MessageStatus::Completed since gpt-oss emits null for completed calls.

maralbahari · 2026-06-24T03:03:05Z

@ashwing thank you for the test support.
I added two inline comments. I noticed you've intended to record stateful multi-turn conversation cassettes but the request and responses are actually stateless not multi-turn each turn in the yaml file is a standalone prompt not a turn based on previous_response_id, or conversation_id while setting store to true. the current recoded cassettes shows store: false. for stateful multi turn conversation/responses we can only rely on OpenAI and use their model. while for vLLM upstream it would be always be single turn stateless results which we already have them covered in https://github.com/vllm-project/agentic-api/tree/main/crates/agentic-core/tests/cassettes/tool_calls.
what is the current test coverage is missing is stateful conversation/responses with tool calls to be recorded from OpenAI to you can use https://github.com/vllm-project/agentic-api/blob/main/crates/agentic-core/tests/cassettes/record_cassette.py to record conversation session or responses on OpenAI with different setting of streaming, non streaming , different response_id divergence (branching from different response in a conversation) etc. the script already supports passing tool choices as well but if it needs additional usage that your case require you can modify the python recorder. and keep a bash script of recoded cases for reference.

examples of simple multi-turn conversation :
conversation 1:
turn 1: what is the time in Singapore and in New York (get_time)?
turn 2: what is the time difference between them? (follow up)

conversation 2:
turn 1: search online (web_search ) cost of living in Singapore and New York
turn 2: based on search result what is the average rent in Singapore vs New York
turn 3: calculate the difference of average rent costs between them in percentage in a python script.

conversation 3:
turn 1: translate "Hi , my name is Alice" to Italian
turn2 : now translate it to Spanish.

…nse_id Replace stateless multi-turn cassettes with properly stateful ones recorded against gpt-oss-20b (vLLM 0.22) with VLLM_ENABLE_RESPONSES_API_STORE=1. Each turn chains via previous_response_id with store=true, exercising the accumulator on real server-side conversation state. Cassettes recorded using the existing record_cassette.py (--mode responses): - 3-turn non-streaming: get_job_status → get_error_logs → search_runbook - 5-turn non-streaming: full investigation pipeline (5 different tools) - 3-turn streaming: SSE events with previous_response_id chaining - 3-turn branch: turn 3 diverges from turn 1 (not turn 2) Also includes: - FunctionToolCall.status changed from String to MessageStatus enum - Custom deserializer handles gpt-oss `status: null` → Completed - record.sh + pipeline_tools.json for reproducibility Signed-off-by: Ashwin Giridharan <girida@amazon.com>

maralbahari · 2026-06-24T07:42:31Z

+    D: serde::Deserializer<'de>,
+{
+    let opt: Option<MessageStatus> = Option::deserialize(deserializer)?;
+    Ok(opt.unwrap_or(MessageStatus::Completed))


important note for gpt-oss serving if you used vLLM upstream serving you might make sure that you have passed correct tool parser. maybe the tool parser that you served the model on vLLM was not able to catch current status.

Confirmed — this isn't a tool parser config issue, it's how vLLM's Responses API serializes completed function calls (status field is always null in the output). The fix is a deserialize_with handler that defaults null to Completed. All 4 cassettes have "status": null in responses and the tests verify they deserialize to MessageStatus::Completed.

@ashwing this behavior only when running gpt-oss on vLLM right? if so can you check running gpt-oss on OpenAI and see the behavior is the same there just to confirm that this is not a bug from vLLM upstream.

Confirmed — vLLM-specific. The OpenAI cassettes in this PR (recorded against gpt-4o) show "status": "completed" on function_call items. vLLM returns "status": null. The deserializer handles both: null defaults to Completed, string "completed" maps directly.

@ashwing I mean if such behavior is only on a harmony model being served on vLLM it might be bug on vLLM upstream we need to note about to open issue. otherwise we should not fix bug from upstream by mapping into a default. ordinarily the behaviors of responses on vLLM should match OpenAI Responses API. would need to confirm if "status": null also occurs for serving other non harmony models on vLLM.
for now it should be fine to handle this way but later if there is a bug from vLLM we need to address that. would need to confirm this.

@ashwing I think the tool parser you selected is wrong. https://docs.vllm.ai/projects/recipes/en/latest/OpenAI/GPT-OSS.html#tool-use based on this vllm's recipe it has to be openai

I tested both --tool-call-parser openai (per the vLLM recipe) and hermes — same result with both. The earlier table only showed hermes but I had already validated both.

Did root-cause investigation. Confirmed bug in vLLM's Harmony codepath on current main — no existing issue or PR for it.

Test matrix

All tests use /v1/responses, store=true, non-streaming:

Model --tool-call-parser vLLM function_call.status

Gemma4 26B functiongemma v0.21.0 "completed"

Hermes-3-Llama-3.1-8B hermes v0.22.0 "completed"

gpt-oss-20b openai v0.22.0 null

gpt-oss-20b hermes v0.22.0 null

OpenAI gpt-4o (ground truth) — — "completed"

OpenAI gpt-4o-mini (ground truth) — — "completed"

Parser doesn't matter — gpt-oss returns null with both openai and hermes. Non-Harmony models (Gemma4, Hermes-3) return "completed" correctly.

Root cause in vLLM source

Non-Harmony path (responses_parser.py:101) — correctly sets status:

ResponseFunctionToolCall( id=f"fc_{random_uuid()}", call_id=f"call_{random_uuid()}", type="function_call", status="completed", # ✓ present name=tool_call.function.name, arguments=tool_call.function.arguments, )

Harmony path (harmony.py:315) — omits status:

ResponseFunctionToolCall( arguments=content.text, call_id=f"call_{random_id}", type="function_call", name=function_name, id=f"fc_{random_id}", # ← no status, defaults to None via Pydantic )

Since ResponseFunctionToolCall.status is Optional[...] = None in the OpenAI SDK, it silently serializes as null. One-line fix: add status="completed" to _parse_function_call.

Current state

Not fixed on vLLM main — verified on HEAD (f59db63, June 23). _parse_function_call is unchanged.

No existing issue or PR — searched multiple queries, nothing filed.

Will open an upstream vLLM issue for this.

Our deserializer handles both null and "completed" correctly in the meantime.

@ashwing thanks for double checking. I do understand the default deserialization is handling the case but is not an actual fix. we need to make sure vLLM Responses API is mirroring OpenAI if not in cases like this means there is a bug on vLLM. worth taking note instead of us silently resolving issues on agentic-api this way. BTW the harmony models like gpt-oss are not really part of MVP either.It's good to have them in cassettes to know where we stand but we should rely on other models to serve on vLLM for agentic-api verification.

@ashwing please file an issue to vLLM upstream.

CC @bbrowning @chaunceyjiang

Filed upstream: vllm-project/vllm#46940

ashwing · 2026-06-24T17:52:12Z

Good catch — the original cassettes were indeed stateless (store=false, self-contained prompts). Re-recorded all 4 cassettes against gpt-oss-20b with VLLM_ENABLE_RESPONSES_API_STORE=1 — this enables vLLM's in-memory store so previous_response_id actually chains server-side state.

To verify retention isn't just structural, turns 2+ use ambiguous prompts ("pull the error logs for that job", "restart it with 64GB") — the model resolves to job-382 only because it has prior context via previous_response_id. Tests assert this behaviorally.

Coverage: non-streaming (3-turn + 5-turn), streaming (3-turn), and branching (turn 3 diverges from turn 1). OpenAI-recorded cassettes will follow in a subsequent push.

…erns Record 6 new OpenAI (gpt-4o) cassettes structurally aligned with the existing vLLM cassettes. Add 2 new vLLM cassettes for parallel tool calls and tool-output-only turns. Re-record existing vLLM cassettes with function_call_output items in request inputs for realistic flow. Patch record_cassette.py with --tool-outputs support to automatically inject function_call_output items between turns (required by OpenAI, aligned for vLLM). New test coverage: - OpenAI responses parse identically (status="completed" not null) - Parallel tool calls (2 function_calls in single response) - Tool-output-only turn (no user message, model continues) - function_call_output items present in request inputs - Context retention verified across both backends Signed-off-by: Ashwin Giridharan <girida@amazon.com>

Add missing vLLM recordings (parallel, tool-output-only) and all 6 OpenAI recordings. Support ./record.sh [vllm|openai|all] to select backend. All commands now include --tool-outputs for realistic flow. Signed-off-by: Ashwin Giridharan <girida@amazon.com>

Re-recorded all vLLM and OpenAI cassettes using the updated record.sh script to ensure reproducibility. Fixed record.sh path resolution and removed nonexistent --tool-output-only-turn flag (empty stdin line already achieves this). Final summary now shows actual file count. Signed-off-by: Ashwin Giridharan <girida@amazon.com>

maralbahari · 2026-06-25T02:58:05Z

Good catch — the original cassettes were indeed stateless (store=false, self-contained prompts). Re-recorded all 4 cassettes against gpt-oss-20b with VLLM_ENABLE_RESPONSES_API_STORE=1 — this enables vLLM's in-memory store so previous_response_id actually chains server-side state.

To verify retention isn't just structural, turns 2+ use ambiguous prompts ("pull the error logs for that job", "restart it with 64GB") — the model resolves to job-382 only because it has prior context via previous_response_id. Tests assert this behaviorally.

Coverage: non-streaming (3-turn + 5-turn), streaming (3-turn), and branching (turn 3 diverges from turn 1). OpenAI-recorded cassettes will follow in a subsequent push.

@ashwing thank you ashwin for the adjustments.

maralbahari · 2026-06-25T03:36:19Z

+
+/// Verifies that the cassette has proper `previous_response_id` chaining:
+/// turn 1 has no `prev_id`, subsequent turns reference the prior response.
+fn assert_stateful_chaining(cassette: &TurnCassette) {


@ashwing this is not an intuitive test. this is what we intended to record the cassettes based off. the previous id chaining is not a concern for accumulator functionality

You're right — the accumulator only processes response bodies, not request structure. I added assert_stateful_chaining originally as a recording integrity check to prove the cassettes actually captured multi-turn state (not just self-contained prompts that happen to mention job-382), but it doesn't belong in the accumulator test suite. The context retention assertions (t2_args.contains("job-382")) already prove the cassettes captured real stateful conversations without inspecting requests. Removed it along with the two tests that only called it in 8ca36bb.

maralbahari · 2026-06-25T03:44:41Z

@ashwing it would be great test coverage perhaps in another PR to use these cassettes to validate execution like https://github.com/vllm-project/agentic-api/blob/main/crates/agentic-core/tests/stateful_responses_integration.rs the test cases here are based on the text only scenarios without tool calling that targets to test everything from agentic-core/executor/ which also verifies the correctness of rehydration and persistence in agentic-core/storage.

The accumulator only processes response bodies — request structure (previous_response_id, store) is not its concern. Context retention is already proven by content assertions (e.g. "that job" resolving to job-382). Remove the helper and 2 tests that only validated recording integrity rather than accumulator behavior. Signed-off-by: Ashwin Giridharan <girida@amazon.com>

ashwing · 2026-06-25T05:33:11Z

Agreed — these tool-call cassettes slot directly into a stateful_responses_integration.rs extension that exercises the full executor loop (dispatch → store → rehydrate → continue). Will target that as a follow-up once the executor handles function_call output items.

maralbahari

LGTM

ashwing requested review from bbrowning, franciscojavierarceo, jiahuei, leseb, maralbahari, noobHappylife, qandrew and tjtanaa as code owners June 23, 2026 23:22

ashwing force-pushed the test/multi-turn-tool-cassettes branch from de781c1 to 9a6d54b Compare June 23, 2026 23:38

maralbahari reviewed Jun 24, 2026

View reviewed changes

ashwing force-pushed the test/multi-turn-tool-cassettes branch from 9a6d54b to 06ded6c Compare June 24, 2026 06:11

ashwing marked this pull request as draft June 24, 2026 06:13

maralbahari reviewed Jun 24, 2026

View reviewed changes

ashwing force-pushed the test/multi-turn-tool-cassettes branch from 06ded6c to d1b0bec Compare June 24, 2026 17:47

ashwing changed the title ~~test: add multi-turn pipeline-debug cassettes for accumulator validation~~ test: add multi-turn cassettes for accumulator validation Jun 24, 2026

ashwing changed the title ~~test: add multi-turn cassettes for accumulator validation~~ test: stateful multi-turn tool-call cassettes with context retention Jun 24, 2026

ashwing added 2 commits June 24, 2026 13:26

ashwing marked this pull request as ready for review June 24, 2026 20:35

maralbahari reviewed Jun 25, 2026

View reviewed changes

ashwing mentioned this pull request Jun 25, 2026

test: extend stateful_responses_integration with tool-call scenarios #78

Open

maralbahari approved these changes Jun 25, 2026

View reviewed changes

ashwing mentioned this pull request Jun 25, 2026

docs: Codex integration design #68

Merged

franciscojavierarceo merged commit aff5e0d into vllm-project:main Jun 25, 2026
3 checks passed

	pub enum MessageStatus {
	/// Message is being generated.
	#[default]
	InProgress,

	/// Message generation completed.
	Completed,
	}

	impl MessageStatus {
	/// Returns the canonical wire string for this status.
	#[must_use]
	pub fn as_str(self) -> &'static str {
	match self {
	Self::InProgress => "in_progress",
	Self::Completed => "completed",
	}
	}
	}

	impl FromStr for MessageStatus {
	type Err = Infallible;

	fn from_str(s: &str) -> Result<Self, Self::Err> {
	Ok(match s {
	"completed" => Self::Completed,
	_ => Self::InProgress,
	})
	}
	}

Model	`--tool-call-parser`	vLLM	`function_call.status`
Gemma4 26B	functiongemma	v0.21.0	`"completed"`
Hermes-3-Llama-3.1-8B	hermes	v0.22.0	`"completed"`
gpt-oss-20b	openai	v0.22.0	`null`
gpt-oss-20b	hermes	v0.22.0	`null`
OpenAI gpt-4o (ground truth)	—	—	`"completed"`
OpenAI gpt-4o-mini (ground truth)	—	—	`"completed"`

Uh oh!

Conversation

ashwing commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Scenario

Test Plan

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

maralbahari commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

maralbahari Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

maralbahari Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ashwing Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Test matrix

Root cause in vLLM source

Current state

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tjtanaa Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ashwing commented Jun 24, 2026

Uh oh!

maralbahari commented Jun 25, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

maralbahari commented Jun 25, 2026

Uh oh!

ashwing commented Jun 25, 2026

Uh oh!

maralbahari left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ashwing commented Jun 23, 2026 •

edited

Loading

maralbahari commented Jun 24, 2026 •

edited

Loading

maralbahari Jun 25, 2026 •

edited

Loading

maralbahari Jun 25, 2026 •

edited

Loading

ashwing Jun 25, 2026 •

edited

Loading

tjtanaa Jun 25, 2026 •

edited

Loading