ReActV2: Production-ready agent module with native FC, semantic history, and forced submit #19

isaacbmiller wants to merge 25 commits into main
Conversation
… tests

- Fix submit tool to return kwargs dict instead of 'Completed.' string
- Build _build_submit_tool(signature) helper with output field args
- Remove debug print(dspy.inspect_history())
- Fix History frozen=True to allow mutation (messages list append)
- Implement History.add_message() with structured ACTION events
- Fix circular import in history.py (removed dspy.predict.predict import)
- Add error handling: AdapterParseError, ValueError, None tool_calls, unknown tool names
- Add forced submit fallback when max_iters exhausts
- Support per-call max_iters override via forward(**kwargs)
- Export ReActV2 from dspy/__init__.py and dspy/predict/__init__.py
- Add 9 focused unit tests in tests/predict/test_reactv2.py
… truncation

- Implement append_request/append_action/append_final helpers in History
- Add has_open_episode() to track open episodes
- Add truncate_oldest_actions() with chars/4 heuristic, no tiktoken (see sketch below)
- Add make_truncate_oldest_actions() factory function
- Wire REQUEST/FINAL events into ReActV2 forward loop
- Remove tiktoken dependency (estimate_tokens, summarize_if_needed)
- 5 new tests for history events, episode tracking, truncation, compaction
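A minimal sketch of the chars/4 truncation heuristic, assuming a simple list-of-dicts event store (function names, field names, and the token budget here are illustrative, not the PR's actual API):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token, avoiding a tiktoken dependency.
    return len(text) // 4

def truncate_oldest_actions(events: list[dict], max_tokens: int = 8000) -> list[dict]:
    """Drop the oldest ACTION events until the estimated token budget fits."""
    events = list(events)  # don't mutate the caller's list
    while sum(estimate_tokens(str(e)) for e in events) > max_tokens:
        idx = next((i for i, e in enumerate(events) if e.get("event") == "action"), None)
        if idx is None:
            break  # only REQUEST/FINAL events remain; nothing safe to drop
        events.pop(idx)
    return events
```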
…itization, provider FC fallback
- ChatAdapter: natural language guidance for native FC path (no [[ ## completed ## ]] markers)
- ChatAdapter.parse: handle native FC text as free-form reasoning for single str output field
- base.py: tag processed signature with __dspy_native_fc__ when native FC active
- tool.py: normalize OpenAI {type:'function', function:{name, arguments}} format in ToolCalls
- tool.py: sanitize tool names to match OpenAI ^[a-zA-Z0-9_-]+$ pattern (both sketched below)
- lm.py: provider-based fallback for supports_function_calling (openai, anthropic, etc.)
- 7 new tests: native/non-native format, ToolCalls normalization, sanitization, FC fallback, GEPA
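A sketch of the normalization and sanitization described above, assuming OpenAI-style tool call dicts (helper names are illustrative):

```python
import json
import re

def sanitize_tool_name(name: str) -> str:
    # OpenAI requires tool names to match ^[a-zA-Z0-9_-]+$
    return re.sub(r"[^a-zA-Z0-9_-]", "_", name)

def normalize_tool_call(raw: dict) -> dict:
    """Flatten an OpenAI {type: 'function', function: {...}} call to {name, args}."""
    if raw.get("type") == "function" and "function" in raw:
        fn = raw["function"]
        args = fn.get("arguments") or "{}"
        if isinstance(args, str):
            args = json.loads(args)  # OpenAI sends arguments as a JSON string
        return {"name": sanitize_tool_name(fn["name"]), "args": args}
    return raw  # already in the flat format
```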
… loop

- BrowseComp: v2 recall 0.139 vs v1 0.168 (within noise, 0 crashes both)
- Tau-banking: both v1/v2 score 0.0 (gpt-5-nano too weak, 0 crashes)
- Compaction: qwen3-32b 2/2 completed with truncation, no overflow
- inspect_history: native + non-native outputs captured for gpt-5-nano
- LOC: +464/-279 (net +185, well under +1000 budget)
- Fix: pass tools=list(self.tools.values()) to predict calls
BrowseComp n=30: v2 recall (0.150) >= v1 recall (0.148), 0 crashes, 120s timeout enforced. Tau-banking with groq/openai/gpt-oss-120b: both v1 and v2 achieve 0.200 avg reward.
…alEvent)

Replace plain dicts tagged with __dspy_history_event__ strings with pydantic models using a discriminated union on the 'event' field. Update all append methods, isinstance checks, and test assertions accordingly.
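A minimal sketch of the discriminated-union shape (field sets are abbreviated and illustrative; the real event models carry more fields):

```python
from typing import Annotated, Literal, Union
from pydantic import BaseModel, Field, TypeAdapter

class RequestEvent(BaseModel):
    event: Literal["request"] = "request"
    inputs: dict

class ActionEvent(BaseModel):
    event: Literal["action"] = "action"
    thought: str

class FinalEvent(BaseModel):
    event: Literal["final"] = "final"
    outputs: dict

# Pydantic dispatches on the 'event' field when validating raw dicts.
HistoryEvent = Annotated[Union[RequestEvent, ActionEvent, FinalEvent],
                         Field(discriminator="event")]

event = TypeAdapter(HistoryEvent).validate_python({"event": "action", "thought": "search first"})
assert isinstance(event, ActionEvent)
```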
Tool now extends Parameter, enabling named_parameters() discovery. GEPA seed candidate includes tool descs, build_program applies optimized descs, and ReActV2._rebuild_instructions() syncs both text-mode and native FC paths.
When GEPA selects a tool desc component (e.g. tools['add']) for reflective mutation, make_reflective_dataset() no longer asserts it must be a predictor. Instead, it falls back to the first predictor's traces, which contain the relevant signal about how the tool was used by the parent predictor.
- Display tool_calls on assistant messages in conversation history
- Show tool_call_id on tool role messages
- Handle content=None for native FC assistant messages
Don't fall back to JSONAdapter when lm_kwargs contains 'tools', since JSONAdapter sets response_format: json_object, which conflicts with native function calling on providers like Groq.
- When max_iters exhausts with native FC active, set tool_choice to mechanically force the submit tool call, since reasoning models ignore text-only directives (sketched below)
- Add try/except fallback: if the provider rejects tool_choice, retry without it (graceful degradation for Cohere, Mistral, etc.)
- Bypass self.react in _forced_submit to control message ordering directly -- the directive must be the LAST user message
- Extract reasoning from model_extra when content is None (Groq reasoning models return content=None with reasoning in extras)
- Fix open-episode detection: check has_open_episode before format_conversation_history deletes history from inputs
- Fix tests to exercise the submit-within-loop path (DummyLM can't produce the raw OpenAI tool_call format needed by _forced_submit)

Benchmark: native FC + tool_choice on BrowseComp n=20 (gpt-5-nano): v2 recall 0.240 vs v1 0.142 (+69%), submitted 20/20 vs 2/20
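In outline (call shapes here are illustrative, not the module's actual code), the forcing-plus-fallback logic is:

```python
def call_with_forced_submit(lm, messages, tools):
    """Force the submit tool mechanically; fall back if the provider refuses."""
    forced = {"type": "function", "function": {"name": "submit"}}
    try:
        # tool_choice makes providers that honor it emit exactly one submit call
        return lm(messages=messages, tools=tools, tool_choice=forced)
    except Exception:
        # Providers like Cohere or Mistral may reject tool_choice; degrade gracefully
        return lm(messages=messages, tools=tools)
```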
…llback
Bug 1 - History rendering leaked native FC format into the non-native path:
- ActionEvents were serialized with tool_calls JSON (OpenAI format) even in text mode, causing 'Missing tool_calls[0].id' API errors
- Fix: render as plain text 'Thought: ... / Action: tool(args)'
- Also fix the native path: pre-generate stable deterministic IDs for tool calls that lack them (hash-based, not object id)

Bug 2 - Non-native _forced_submit returned empty predictions:
- The LM returns text in non-native mode, but _forced_submit only handled native FC dict responses with a tool_calls key
- Fix: add a text-parsing fallback using adapter.parse() to extract the submit tool call from text, plus last-resort extraction of output fields directly from the response

Before: 3/20 crashes, 4/20 submitted, 0.150 recall
After: 0/20 crashes, 14/20 submitted, 0.190 recall
Like v1's self.extract, adds a dedicated LM call that reads the agent's trajectory and produces output fields directly. Fires only as a last resort in _forced_submit after all submit attempts fail.

- Add self.extract (ChainOfThought) in __init__ with trajectory input (see sketch below)
- Add _render_history_as_text() to convert History events to text
- In _forced_submit, after native FC and text parsing both fail, render history and call self.extract to recover the answer
- Append FinalEvent so history looks like submit was called cleanly

Submit rate: 14/20 -> 19/20 in non-native mode. Native FC path unchanged (submit already works 100%).
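In spirit (signature fields here are illustrative), the tier-2 extractor is an ordinary ChainOfThought over the rendered trajectory:

```python
import dspy

class ExtractFromTrajectory(dspy.Signature):
    """Read the agent's trajectory and produce the final answer directly."""
    trajectory: str = dspy.InputField()
    answer: str = dspy.OutputField()

extract = dspy.ChainOfThought(ExtractFromTrajectory)
# prediction = extract(trajectory=_render_history_as_text(history))
```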
Runtime state rename: 'final' -> 'output' better describes the event's purpose (storing output field values, not signaling finality).

- FinalEvent class -> OutputEvent, discriminator 'final' -> 'output'
- append_final() -> append_output()
- Updated all imports, isinstance checks, and tests
Greptile Summary

This PR introduces ReActV2, a ground-up rewrite of the ReAct agent module with native function calling, semantic history management, and forced-submit termination.
Confidence Score: 3/5

The new non-native FC history formatting appends two consecutive user messages after each tool step, which causes hard API failures on Anthropic and other strict alternating-turn providers. The core ReActV2 loop, forced-submit tiers, and ToolCalls normalizer are solid and well-tested. The main risk is in Adapter.format(): when a non-native FC call has an open episode, format_conversation_history closes the last ActionEvent with a user-role observation message, and then format() immediately appends user_message_output_requirements as a second consecutive user message. Anthropic's API returns a 400 for this message ordering. Any user running ReActV2 in text/non-native FC mode against Anthropic will see API failures after the first tool step.

Important Files Changed

dspy/adapters/base.py — the format() method's has_open_episode branch produces consecutive user messages in non-native FC mode.
Sequence Diagram

```mermaid
sequenceDiagram
participant Caller
participant ReActV2
participant History
participant Adapter
participant LM
participant Tool
Caller->>ReActV2: forward(**input_args)
ReActV2->>History: append_input(input_args)
loop Each iteration (max_iters)
ReActV2->>Adapter: format(signature, history, tools)
Adapter->>LM: chat/FC request
LM-->>Adapter: next_thought + tool_calls
Adapter-->>ReActV2: Prediction(next_thought, tool_calls)
loop Each tool call
ReActV2->>Tool: __call__(**args)
Tool-->>ReActV2: result / error
end
ReActV2->>History: append_action(thought, tool_calls, observations)
alt submit tool called successfully
ReActV2->>History: append_output(result)
ReActV2-->>Caller: Prediction(history, **result)
end
end
ReActV2->>ReActV2: _forced_submit()
note over ReActV2: Tier 1 - react with tool_choice=submit
note over ReActV2: Tier 2 - ChainOfThought extract
ReActV2-->>Caller: Prediction(history, termination_reason)
```
Force-pushed from 644d972 to 5acc1a8
- Extract _build_instructions() (single source of truth for the instruction string)
- Replace ~80-line shadow pipeline in _forced_submit with 2-tier: submit_predict + extract
- Add termination_reason to all Prediction returns
- Remove dead compact_if_needed() call (compaction is the caller's responsibility)
- Create Observation pydantic model replacing tuple[Any, bool] (see sketch below)
- Type the tool_calls field in ActionEvent with a proper ToolCalls import
- Move runtime imports to top level; remove json_repair dependency

Net -43 lines. 20/20 tests pass.
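The Observation model is roughly (a sketch; the actual field names may differ):

```python
from typing import Any
from pydantic import BaseModel

class Observation(BaseModel):
    # Replaces the old tuple[Any, bool] of (result, is_error)
    result: Any
    is_error: bool = False
```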
_forced_submit tier 1 now calls self.react (same pipeline as the main loop) with tool_choice temporarily forced to submit. No duplicate Predict module.
- Thread-safe _forced_submit: pass config kwarg instead of mutating shared state
- Add logging to all except blocks in _forced_submit (debug level)
- Handle ContextWindowExceededError with retry-after-compaction
- Map output field types properly in _build_submit_tool (not all strings)
- Sync extract signature in _rebuild_instructions for GEPA
- Revert _convert_chat_request_to_responses_request regression in lm.py
- Stop mutating the inputs dict in format_conversation_history
- Deterministic tool call IDs, hashlib.md5 instead of hash() (sketched below)
- Fix double 'is' typo in chat_adapter docstring
- Add test for extract fallback path (tier 2)

21/21 tests pass.
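A sketch of the deterministic-ID scheme (the exact payload and prefix are illustrative):

```python
import hashlib
import json

def deterministic_call_id(name: str, arguments: dict) -> str:
    # md5 over a canonical payload is stable across runs, unlike Python's hash()
    payload = json.dumps({"name": name, "args": arguments}, sort_keys=True)
    return "call_" + hashlib.md5(payload.encode("utf-8")).hexdigest()[:24]
```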
```diff
 for msg in request.pop("messages"):
     content_blocks = []
     c = msg.get("content")
     if isinstance(c, str):
         content_blocks.append({"type": "input_text", "text": c})
     elif isinstance(c, list):
         # Convert each content item from Chat API format to Responses API format
         for item in c:
             content_blocks.append(_convert_content_item_to_responses_format(item))
-    request["input"] = [{"role": msg.get("role", "user"), "content": content_blocks}]
+    input_items.append({"role": msg.get("role", "user"), "content": content_blocks})
```
Responses API converter silently drops native FC messages
When native function calling is active, the conversation history contains {"role": "assistant", "tool_calls": [...]} messages (no content key) and {"role": "tool", "content": "...", "tool_call_id": "..."} messages. The converter only inspects msg.get("content"), so for assistant FC messages content_blocks is always empty and the entire tool_calls array is silently dropped. For tool role messages, tool_call_id is discarded. Any caller that uses native FC against a Responses API backend (e.g. use_responses_api=True on a reasoning model) will send a malformed request where the model has no memory of what tools it previously called.
```python
for msg in request.pop("messages"):
    content_blocks = []
    c = msg.get("content")
    # assistant FC messages have no "content" → content_blocks stays []
    # tool_calls array is never read
    # tool role messages lose tool_call_id
    input_items.append({"role": msg.get("role", "user"), "content": content_blocks})
```

The Responses API equivalent for an assistant FC turn is a function_call output item, and for the tool result it is a function_call_output input item. Both require the call_id. The converter needs to handle these shapes before native FC and the Responses API can be used together safely.
…e artifacts

- Add LegacyEvent wrapper + model_validator so History(messages=[dict]) still works
- Guard load_state against missing keys (Tool now extends Parameter)
- Remove history.json, benchmark_results.md, cmpnd-sdk submodule
- Restore uv.lock to main
```python
def _build_submit_tool(signature: type["Signature"]) -> Tool:
    outputs = ", ".join([f"`{k}`" for k in signature.output_fields.keys()])
    output_args = {}
    output_arg_types = {}
    for k, v in signature.output_fields.items():
        annotation = v.annotation if hasattr(v, "annotation") else str
        json_type = _ANNOTATION_TO_JSON_TYPE.get(annotation, "string")
        output_args[k] = {"type": json_type}
        output_arg_types[k] = annotation

    return Tool(
        func=lambda **kwargs: kwargs,
        name="submit",
        desc=f"Call this tool to end the task and return your final answer. Takes: {outputs}.",
        args=output_args,
        arg_types=output_arg_types,
    )
```
Submit tool schema maps non-primitive output types to "string", breaking agent termination
_ANNOTATION_TO_JSON_TYPE only covers bare Python primitives (str, int, float, bool, list). For any real generic alias such as list[str], list[MyModel], or a Pydantic model, .get(annotation, "string") falls back to "string" and the submit schema is {"type": "string"}.
When the model later calls submit(answers=["a", "b"]), Tool._validate_and_parse_args runs jsonschema.validate(instance=["a", "b"], schema={"type": "string"}), which raises ValidationError. That is caught and re-raised as ValueError, which forward() catches in the tool-execution except block and records as an error observation (is_error=True). Because the loop checks if tool_call.name == "submit" and not obs.is_error, the submit is silently swallowed and the agent eventually exhausts max_iters without ever terminating — for any signature whose output field is not a bare primitive.
A minimal fix is to use get_origin to produce a richer schema for generic aliases:

```python
from typing import get_origin

def _annotation_to_json_schema(annotation) -> dict:
    origin = get_origin(annotation)
    if origin is list:
        return {"type": "array"}
    return {"type": _ANNOTATION_TO_JSON_TYPE.get(annotation, "string")}
```
ReActV2
A ground-up rewrite of the ReAct agent module with better tool calling, history management, and optimization support.
Key Features
Out of scope:
Maybe measure:
Benchmarks
BrowseComp n=50 (gpt-5-nano): v2 native FC vs v2 non-native FC vs v1
Tau-Banking (gpt-oss-120b)
TBD
Compaction (qwen3-32b)
On BrowseComp, max_iters > x previously failed with long context. Now we can run N max_iters and see a Y increase in recall.
Prompt Caching (Haiku 4.5)
Whereas before we got 0% Anthropic cache utilization, we now get up to X% by including the field:
...in our LM init.