
ReActV2: Production-ready agent module with native FC, semantic history, and forced submit#19

Open
isaacbmiller wants to merge 25 commits into main from isaac/react-v2

Conversation


@isaacbmiller isaacbmiller commented Apr 21, 2026

ReActV2

A ground-up rewrite of the ReAct agent module with better tool calling, history management, and optimization support.

Key Features

  • Better small model support: X model goes from Y% correct to Z% correct. (Mostly enabled for models that favor native tool calling, e.g. GPT-oss-20b.)
  • When you include a tool, GEPA will now optimize its description and include it as a native tool call.
  • The ReAct trajectory is no longer a dict; it's a structured object.
  • Compaction now lets you increase max_iters without running into context issues. You can even optimize the compaction step.
  • Parallel tool calls.
  • More ergonomic conversation structure for telling the model when to use tools.
  • Chat history continuation.
  • submit function with forced submit.
  • Prompt caching now works for Anthropic because the message blocks don't change.

Out of scope:

  • Optimizing in a multi-(user-)turn setting (probably not)
  • Restarting a trajectory from the middle of a step

Maybe measure:

  • Long threads with lots of tools
  • Short threads with few tools (6-8 turns)
  • Tau Banking

Benchmarks

BrowseComp n=50 (gpt-5-nano): v1 vs v2 native FC vs v2 non-native FC

Metric                     v1 ReAct (non-native only)   v2 native FC   v2 non-native FC
Avg recall                 x                            x              x
Submitted                  x                            x              x
Crashes                    x                            x              x
Parsing failures           x                            x              x
Fallbacks to JSONAdapter   x                            x              x

Tau-Banking (gpt-oss-120b)

TBD

Compaction (qwen3-32b)

On BrowseComp, max_iters > x previously failed with long context. Now we can run N max_iters and see a Y increase in recall.

Prompt Caching (Haiku 4.5)

Whereas before we got 0% Anthropic cache utilization, we now get up to X% by including the field ... in our LM init.

isaacbmiller and others added 19 commits April 3, 2026 14:35
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
… tests

- Fix submit tool to return kwargs dict instead of 'Completed.' string
- Build _build_submit_tool(signature) helper with output field args
- Remove debug print(dspy.inspect_history())
- Fix History frozen=True to allow mutation (messages list append)
- Implement History.add_message() with structured ACTION events
- Fix circular import in history.py (removed dspy.predict.predict import)
- Add error handling: AdapterParseError, ValueError, None tool_calls, unknown tool names
- Add forced submit fallback when max_iters exhausts
- Support per-call max_iters override via forward(**kwargs)
- Export ReActV2 from dspy/__init__.py and dspy/predict/__init__.py
- Add 9 focused unit tests in tests/predict/test_reactv2.py

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
… truncation

- Implement append_request/append_action/append_final helpers in History
- Add has_open_episode() to track open episodes
- Add truncate_oldest_actions() with chars/4 heuristic (no tiktoken)
- Add make_truncate_oldest_actions() factory function
- Wire REQUEST/FINAL events into ReActV2 forward loop
- Remove tiktoken dependency (estimate_tokens, summarize_if_needed)
- 5 new tests for history events, episode tracking, truncation, compaction
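The chars/4 estimate and oldest-first truncation above can be sketched as follows; `estimate_tokens` and the list-of-strings interface are illustrative simplifications, not the actual `History` API:

```python
def estimate_tokens(text: str) -> int:
    # PR heuristic: roughly 4 characters per token, avoiding a tiktoken dependency.
    return len(text) // 4

def truncate_oldest_actions(actions: list[str], max_tokens: int) -> list[str]:
    # Drop the oldest action records first until the estimated total fits the budget.
    actions = list(actions)
    while actions and sum(estimate_tokens(a) for a in actions) > max_tokens:
        actions.pop(0)
    return actions
```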

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
…itization, provider FC fallback

- ChatAdapter: natural language guidance for native FC path (no [[ ## completed ## ]] markers)
- ChatAdapter.parse: handle native FC text as free-form reasoning for single str output field
- base.py: tag processed signature with __dspy_native_fc__ when native FC active
- tool.py: normalize OpenAI {type:'function', function:{name, arguments}} format in ToolCalls
- tool.py: sanitize tool names to match OpenAI ^[a-zA-Z0-9_-]+$ pattern
- lm.py: provider-based fallback for supports_function_calling (openai, anthropic, etc.)
- 7 new tests: native/non-native format, ToolCalls normalization, sanitization, FC fallback, GEPA
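The sanitization step has to map arbitrary tool names onto OpenAI's `^[a-zA-Z0-9_-]+$` pattern; a minimal sketch (the function name is illustrative, not the actual tool.py helper):

```python
import re

_TOOL_NAME_OK = re.compile(r"^[a-zA-Z0-9_-]+$")

def sanitize_tool_name(name: str) -> str:
    # Leave valid names untouched; replace every disallowed character
    # (dots, spaces, etc.) with an underscore.
    if _TOOL_NAME_OK.match(name):
        return name
    return re.sub(r"[^a-zA-Z0-9_-]", "_", name)
```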

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
… loop

- BrowseComp: v2 recall 0.139 vs v1 0.168 (within noise, 0 crashes both)
- Tau-banking: both v1/v2 score 0.0 (gpt-5-nano too weak, 0 crashes)
- Compaction: qwen3-32b 2/2 completed with truncation, no overflow
- inspect_history: native + non-native outputs captured for gpt-5-nano
- LOC: +464/-279 (net +185, well under +1000 budget)
- Fix: pass tools=list(self.tools.values()) to predict calls

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
BrowseComp n=30: v2 recall (0.150) >= v1 recall (0.148), 0 crashes, 120s timeout enforced.
Tau-banking with groq/openai/gpt-oss-120b: both v1 and v2 achieve 0.200 avg reward.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
…alEvent)

Replace plain dicts carrying __dspy_history_event__ string tags with pydantic
models using a discriminated union on the 'event' field. Update all
append methods, isinstance checks, and test assertions accordingly.
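A discriminated union on an `event` field looks roughly like this in pydantic v2; the two event models here are stripped-down stand-ins for the real ones in `history.py`:

```python
from typing import Annotated, Literal, Union
from pydantic import BaseModel, Field, TypeAdapter

class InputEvent(BaseModel):
    event: Literal["input"] = "input"
    inputs: dict

class ActionEvent(BaseModel):
    event: Literal["action"] = "action"
    thought: str = ""

# The "event" field selects which model validates the payload.
Event = Annotated[Union[InputEvent, ActionEvent], Field(discriminator="event")]

ev = TypeAdapter(Event).validate_python({"event": "action", "thought": "look it up"})
```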

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Tool now extends Parameter, enabling named_parameters() discovery.
GEPA seed candidate includes tool descs, build_program applies
optimized descs, and ReActV2._rebuild_instructions() syncs both
text-mode and native FC paths.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
When GEPA selects a tool desc component (e.g. tools['add']) for reflective
mutation, make_reflective_dataset() no longer asserts it must be a predictor.
Instead, it falls back to the first predictor's traces, which contain the
relevant signal about how the tool was used by the parent predictor.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
- Display tool_calls on assistant messages in conversation history
- Show tool_call_id on tool role messages
- Handle content=None for native FC assistant messages

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Don't fall back to JSONAdapter when lm_kwargs contains 'tools',
since JSONAdapter sets response_format: json_object which conflicts
with native function calling on providers like Groq.
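The guard described above reduces to a one-line predicate; the function name is illustrative, not the actual lm.py code:

```python
def should_fallback_to_json_adapter(lm_kwargs: dict) -> bool:
    # JSONAdapter sets response_format to json_object, which conflicts with
    # native function calling on providers like Groq -- so never fall back
    # when tools are being passed through.
    return "tools" not in lm_kwargs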

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
- When max_iters exhausts with native FC active, set tool_choice to
  mechanically force the submit tool call (reasoning models ignore
  text-only directives)
- Add try/except fallback: if provider rejects tool_choice, retry
  without it (graceful degradation for Cohere, Mistral, etc.)
- Bypass self.react in _forced_submit to control message ordering
  directly -- the directive must be the LAST user message
- Extract reasoning from model_extra when content is None (Groq
  reasoning models return content=None with reasoning in extras)
- Fix open-episode detection: check has_open_episode before
  format_conversation_history deletes history from inputs
- Fix tests to exercise submit-within-loop path (DummyLM can't
  produce raw OpenAI tool_call format needed by _forced_submit)

Benchmark: native FC + tool_choice on BrowseComp n=20 (gpt-5-nano)
  v2 recall 0.240 vs v1 0.142 (+69%), submitted 20/20 vs 2/20
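The forced-submit mechanics above (mechanically force `tool_choice`, then retry without it if the provider rejects it) can be sketched like this; `call_lm` is a hypothetical stand-in for the actual LM call:

```python
def forced_submit_call(call_lm, messages, tools):
    # Reasoning models ignore text-only directives, so force the submit tool
    # via tool_choice; if the provider rejects tool_choice (Cohere, Mistral,
    # etc.), degrade gracefully by retrying without it.
    forced = {"type": "function", "function": {"name": "submit"}}
    try:
        return call_lm(messages=messages, tools=tools, tool_choice=forced)
    except Exception:
        return call_lm(messages=messages, tools=tools)
```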

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
…llback

Bug 1 - History rendering leaked native FC format into non-native path:
  - ActionEvents were serialized with tool_calls JSON (OpenAI format)
    even in text mode, causing 'Missing tool_calls[0].id' API errors
  - Fix: render as plain text 'Thought: ... / Action: tool(args)'
  - Also fix native path: pre-generate stable deterministic IDs for
    tool calls that lack them (hash-based, not object id)

Bug 2 - Non-native _forced_submit returned empty predictions:
  - LM returns text in non-native mode but _forced_submit only
    handled native FC dict responses with tool_calls key
  - Fix: add text parsing fallback using adapter.parse() to extract
    submit tool call from text, plus last-resort extraction of output
    fields directly from the response

Before: 3/20 crashes, 4/20 submitted, 0.150 recall
After:  0/20 crashes, 14/20 submitted, 0.190 recall
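The stable-ID fix can be sketched as a content hash; the helper name is illustrative, though a later commit confirms the PR uses hashlib.md5 rather than Python's `hash()`:

```python
import hashlib
import json

def stable_tool_call_id(name: str, args: dict) -> str:
    # Derive the ID deterministically from the call contents, so re-rendering
    # the same history yields byte-identical messages -- which is also what
    # keeps Anthropic prompt-cache blocks stable across requests.
    payload = json.dumps({"name": name, "args": args}, sort_keys=True)
    return "call_" + hashlib.md5(payload.encode()).hexdigest()[:24]
```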

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Like v1's self.extract, adds a dedicated LM call that reads the
agent's trajectory and produces output fields directly. Fires only
as a last resort in _forced_submit after all submit attempts fail.

- Add self.extract (ChainOfThought) in __init__ with trajectory input
- Add _render_history_as_text() to convert History events to text
- In _forced_submit, after native FC and text parsing both fail,
  render history and call self.extract to recover the answer
- Append FinalEvent so history looks like submit was called cleanly

Submit rate: 14/20 -> 19/20 in non-native mode
Native FC path unchanged (submit already works 100%)

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Runtime state rename: 'final' -> 'output' better describes the
event's purpose (storing output field values, not signaling finality).

- FinalEvent class -> OutputEvent, discriminator 'final' -> 'output'
- append_final() -> append_output()
- Updated all imports, isinstance checks, and tests

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>

greptile-apps Bot commented Apr 27, 2026

Greptile Summary

This PR introduces ReActV2, a ground-up rewrite of the DSPy ReAct agent with native function calling, a structured semantic history model (InputEvent/ActionEvent/OutputEvent), pluggable compaction, two-tier forced submit, and GEPA support for optimizing tool descriptions.

  • New ReActV2 module (dspy/predict/reactv2.py): replaces flat-dict trajectories with typed history events, adds parallel tool calls, a submit tool for explicit termination, and a two-tier forced-submit fallback (forced react call → ChainOfThought extract).
  • Semantic history (dspy/adapters/types/history.py): History is now a Pydantic model with discriminated-union events, backward-compatible LegacyEvent wrapping, and a compact_fn hook invoked on ContextWindowExceededError.
  • Native FC adapter path (dspy/adapters/base.py, chat_adapter.py): use_native_function_calling=True strips ToolCalls/tools fields from the signature, tags it with __dspy_native_fc__, sends tool schemas to the LM, and takes a dedicated parse path; ToolCalls.model_validate now normalizes OpenAI and Responses API wire formats including JSON-string arguments.

Confidence Score: 3/5

The new non-native FC history formatting appends two consecutive user messages after each tool step, which causes hard API failures on Anthropic and other strict alternating-turn providers.

The core ReActV2 loop, forced-submit tiers, and ToolCalls normalizer are solid and well-tested. The main risk is in Adapter.format(): when a non-native FC call has an open episode, format_conversation_history closes the last ActionEvent with a user-role observation message, and then format() immediately appends user_message_output_requirements as a second consecutive user message. Anthropic's API returns a 400 for this message ordering. Any user running ReActV2 in text/non-native FC mode against Anthropic will see API failures after the first tool step.

dspy/adapters/base.py — the format() method's has_open_episode branch produces consecutive user messages in non-native FC mode.

Important Files Changed

  • dspy/adapters/base.py -- New history formatting and native FC preprocessing; consecutive user messages produced in non-native FC mode after each tool observation will break Anthropic and other strict alternating-turn providers.
  • dspy/predict/reactv2.py -- New ReActV2 module with semantic history, two-tier forced submit, and parallel tool calls; the submit tool lambda allows unknown args to silently propagate into the final Prediction.
  • dspy/adapters/chat_adapter.py -- Native FC formatting path added; misleading AdapterParseError message when zero str output fields are present in _parse_native_fc.
  • dspy/adapters/types/history.py -- New typed semantic event model (InputEvent, ActionEvent, OutputEvent, LegacyEvent) with backward-compatible coercion and pluggable compaction; logic is clean.
  • dspy/adapters/types/tool.py -- Adds a ToolCalls model validator that normalizes OpenAI/Responses-API wire formats, including proper JSON string decoding via json_repair; looks correct.
  • dspy/clients/lm.py -- Fixes a _convert_chat_request_to_responses_request loop bug where only the last message was emitted; native FC messages (tool_calls/tool roles) are still not converted for the Responses API path.
  • dspy/teleprompt/gepa/gepa.py -- Extends GEPA to optimize tool descriptions alongside predictor instructions; tool feedback always inherits from the first predictor, which is correct for ReActV2 but fragile for multi-predictor modules.
  • tests/predict/test_reactv2.py -- Comprehensive tests covering submit, history semantics, compaction, native FC format, tool normalization, GEPA integration, and forced-submit tiers; good coverage.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant ReActV2
    participant History
    participant Adapter
    participant LM
    participant Tool

    Caller->>ReActV2: forward(**input_args)
    ReActV2->>History: append_input(input_args)
    loop Each iteration (max_iters)
        ReActV2->>Adapter: format(signature, history, tools)
        Adapter->>LM: chat/FC request
        LM-->>Adapter: next_thought + tool_calls
        Adapter-->>ReActV2: Prediction(next_thought, tool_calls)
        loop Each tool call
            ReActV2->>Tool: __call__(**args)
            Tool-->>ReActV2: result / error
        end
        ReActV2->>History: append_action(thought, tool_calls, observations)
        alt submit tool called successfully
            ReActV2->>History: append_output(result)
            ReActV2-->>Caller: Prediction(history, **result)
        end
    end
    ReActV2->>ReActV2: _forced_submit()
    note over ReActV2: Tier 1 - react with tool_choice=submit
    note over ReActV2: Tier 2 - ChainOfThought extract
    ReActV2-->>Caller: Prediction(history, termination_reason)

Comments Outside Diff (1)

  1. dspy/adapters/base.py, line 285-296 (link)

    P1 Consecutive user messages in non-native FC mode breaks Anthropic (and strict alternating-turn) providers

    When has_open_episode is True and the conversation already contains at least one ActionEvent, format_conversation_history appends the last observation as a {"role": "user", ...} message. Immediately after, user_message_output_requirements is also appended as another {"role": "user", ...} message. Anthropic's API rejects consecutive user-role messages with a 400 error, and any provider that enforces strict turn alternation will also fail.

    Concrete failing path: user calls ReActV2.forward() with a non-native FC adapter (e.g. default ChatAdapter) targeting an Anthropic model, after at least one tool call has been recorded in history. The final messages list becomes […, assistant, user(obs), user(output_req)], which Anthropic rejects.

    Consider merging the observation and output-requirements into a single user message, or appending output_req as a suffix to the last observation message.
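One way to implement the suggested merge, as a standalone sketch (not the actual base.py code):

```python
def merge_consecutive_user_messages(messages: list[dict]) -> list[dict]:
    # Collapse adjacent user-role messages into one so strict
    # alternating-turn providers (e.g. Anthropic) accept the request.
    merged: list[dict] = []
    for msg in messages:
        if merged and msg.get("role") == "user" and merged[-1].get("role") == "user":
            merged[-1] = {**merged[-1],
                          "content": merged[-1]["content"] + "\n\n" + msg["content"]}
        else:
            merged.append(dict(msg))
    return merged
```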

Reviews (7). Last reviewed commit: "Clean up ReActV2 branch".

Comment thread .factory/init.sh Outdated
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Comment thread dspy/adapters/types/tool.py
isaacbmiller and others added 2 commits April 27, 2026 14:51
- Extract _build_instructions() (single source of truth for instruction string)
- Replace ~80-line shadow pipeline in _forced_submit with 2-tier: submit_predict + extract
- Add termination_reason to all Prediction returns
- Remove dead compact_if_needed() call (compaction is caller responsibility)
- Create Observation pydantic model replacing tuple[Any, bool]
- Type tool_calls field in ActionEvent with proper ToolCalls import
- Move runtime imports to top-level; remove json_repair dependency

Net -43 lines. 20/20 tests pass.
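The `Observation` model that replaces `tuple[Any, bool]` presumably looks something like this minimal sketch (field names assumed from the error-handling description, not copied from the PR):

```python
from typing import Any
from pydantic import BaseModel

class Observation(BaseModel):
    # Named fields make the error flag self-documenting at call sites,
    # unlike positional access into a (result, is_error) tuple.
    result: Any = None
    is_error: bool = False
```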

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
_forced_submit tier 1 now calls self.react (same pipeline as main loop)
with tool_choice temporarily forced to submit. No duplicate Predict module.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Comment thread dspy/clients/lm.py Outdated
- Thread-safe _forced_submit: pass config kwarg instead of mutating shared state
- Add logging to all except blocks in _forced_submit (debug level)
- Handle ContextWindowExceededError with retry-after-compaction
- Map output field types properly in _build_submit_tool (not all strings)
- Sync extract signature in _rebuild_instructions for GEPA
- Revert _convert_chat_request_to_responses_request regression in lm.py
- Stop mutating inputs dict in format_conversation_history
- Deterministic tool call IDs (hashlib.md5 instead of hash())
- Fix double 'is' typo in chat_adapter docstring
- Add test for extract fallback path (tier 2)

21/21 tests pass.

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Comment thread dspy/clients/lm.py
Comment on lines 524 to +533
  for msg in request.pop("messages"):
      content_blocks = []
      c = msg.get("content")
      if isinstance(c, str):
          content_blocks.append({"type": "input_text", "text": c})
      elif isinstance(c, list):
          # Convert each content item from Chat API format to Responses API format
          for item in c:
              content_blocks.append(_convert_content_item_to_responses_format(item))
-     request["input"] = [{"role": msg.get("role", "user"), "content": content_blocks}]
+     input_items.append({"role": msg.get("role", "user"), "content": content_blocks})

P1 Responses API converter silently drops native FC messages

When native function calling is active, the conversation history contains {"role": "assistant", "tool_calls": [...]} messages (no content key) and {"role": "tool", "content": "...", "tool_call_id": "..."} messages. The converter only inspects msg.get("content"), so for assistant FC messages content_blocks is always empty and the entire tool_calls array is silently dropped. For tool role messages, tool_call_id is discarded. Any caller that uses native FC against a Responses API backend (e.g. use_responses_api=True on a reasoning model) will send a malformed request where the model has no memory of what tools it previously called.

for msg in request.pop("messages"):
    content_blocks = []
    c = msg.get("content")
    # assistant FC messages have no "content" → content_blocks stays []
    # tool_calls array is never read
    # tool role messages lose tool_call_id
    input_items.append({"role": msg.get("role", "user"), "content": content_blocks})

The Responses API equivalent for an assistant FC turn is a function_call output item, and for the tool result it is a function_call_output input item. Both require the call_id. The converter needs to handle these shapes before native FC and Responses API can be used together safely.
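A sketch of the missing cases, assuming the `function_call` / `function_call_output` item shapes described above (this is not the actual converter code):

```python
def convert_message_for_responses_api(msg: dict) -> list[dict]:
    # Assistant FC turn -> one function_call output item per tool call,
    # preserving the call id that the Chat API converter currently drops.
    if msg.get("role") == "assistant" and msg.get("tool_calls"):
        return [
            {
                "type": "function_call",
                "call_id": tc["id"],
                "name": tc["function"]["name"],
                "arguments": tc["function"]["arguments"],
            }
            for tc in msg["tool_calls"]
        ]
    # Tool result -> function_call_output, keyed by the same call id.
    if msg.get("role") == "tool":
        return [{
            "type": "function_call_output",
            "call_id": msg["tool_call_id"],
            "output": msg.get("content") or "",
        }]
    # Plain text message -> ordinary input_text content block.
    return [{
        "role": msg.get("role", "user"),
        "content": [{"type": "input_text", "text": msg.get("content") or ""}],
    }]
```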

…e artifacts

- Add LegacyEvent wrapper + model_validator so History(messages=[dict]) still works
- Guard load_state against missing keys (Tool now extends Parameter)
- Remove history.json, benchmark_results.md, cmpnd-sdk submodule
- Restore uv.lock to main

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Comment thread dspy/predict/reactv2.py
Comment on lines +26 to +42
def _build_submit_tool(signature: type["Signature"]) -> Tool:
    outputs = ", ".join([f"`{k}`" for k in signature.output_fields.keys()])
    output_args = {}
    output_arg_types = {}
    for k, v in signature.output_fields.items():
        annotation = v.annotation if hasattr(v, "annotation") else str
        json_type = _ANNOTATION_TO_JSON_TYPE.get(annotation, "string")
        output_args[k] = {"type": json_type}
        output_arg_types[k] = annotation

    return Tool(
        func=lambda **kwargs: kwargs,
        name="submit",
        desc=f"Call this tool to end the task and return your final answer. Takes: {outputs}.",
        args=output_args,
        arg_types=output_arg_types,
    )

P1 Submit tool schema maps non-primitive output types to "string", breaking agent termination

_ANNOTATION_TO_JSON_TYPE only covers bare Python primitives (str, int, float, bool, list). For any real generic alias such as list[str], list[MyModel], or a Pydantic model, .get(annotation, "string") falls back to "string" and the submit schema is {"type": "string"}.

When the model later calls submit(answers=["a", "b"]), Tool._validate_and_parse_args runs jsonschema.validate(instance=["a", "b"], schema={"type": "string"}), which raises ValidationError. That is caught and re-raised as ValueError, which forward() catches in the tool-execution except block and records as an error observation (is_error=True). Because the loop checks if tool_call.name == "submit" and not obs.is_error, the submit is silently swallowed and the agent eventually exhausts max_iters without ever terminating — for any signature whose output field is not a bare primitive.

A minimal fix is to use get_origin to produce a richer schema for generic aliases:

from typing import get_origin

def _annotation_to_json_schema(annotation) -> dict:
    origin = get_origin(annotation)
    if origin is list:
        return {"type": "array"}
    return {"type": _ANNOTATION_TO_JSON_TYPE.get(annotation, "string")}

Comment thread dspy/adapters/types/tool.py
Comment thread dspy/clients/lm.py Outdated
Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>