
feat(chat): show and persist tool calls in assistant messages#322

Open
zaira-bibi wants to merge 1 commit into main from zaira/enhance-tool-call-ui

Conversation

@zaira-bibi
Contributor

What

Show tool call names while the LLM is executing tools, collapse them to a persistent "X operations completed" summary once generation finishes, and save the names in message metadata so the summary survives a page refresh.

Changes

  • feat(chat): collect tool_end events during streaming and save tool call names in message metadata alongside rag_sections
  • feat(api): add tool_calls field to MessageResponse and expose via conversation history endpoint
  • feat(frontend): display live tool names during streaming, then collapse to a dotted-underline summary with a hover tooltip listing each call
  • feat(frontend): restore toolCalls from API response when loading conversation history
  • test(chat): add tests for _parse_tool_calls, metadata persistence, done payload, and MessageResponse.tool_calls

How to Test

  1. Open a conversation that uses tools (e.g. ask a question that triggers RAG or an OpenedX tool)
  2. During generation, verify tool names appear above the response bubble with a pulsing indicator
  3. After generation completes, verify the list collapses to "X operations completed" with a dotted underline
  4. Hover over "X operations completed" — confirm a tooltip lists each tool call in order
  5. Refresh the page — confirm "X operations completed" and the tooltip are still present

Notes

No migration required — tool_calls is stored in the existing model_metadata JSON column alongside rag_sections.
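For reference, here is a minimal sketch of the persisted shape, assuming entries are plain {"name": ...} dicts as noted in the review below; the example values and the rag_sections structure are illustrative only.

import json

# Hypothetical example values; the real lists are accumulated during streaming.
rag_sections = [{"title": "Getting started", "url": "https://example.com/docs"}]
completed_tool_calls = [{"name": "search_web"}]

# tool_calls sits next to rag_sections in the same model_metadata JSON column,
# so no schema migration is needed.
model_metadata = json.dumps({
    "rag_sections": rag_sections,
    "tool_calls": completed_tool_calls,
})

On history load, the same key is read back and the frontend restores each entry with { status: "done" }.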

Tool call progress was not displayed during streaming and was lost on
page refresh. Now shows live tool names while executing, collapses to
"X operations completed" on finish, and persists by saving call names
in message metadata so the summary survives a reload.
zaira-bibi self-assigned this May 13, 2026

claude Bot commented May 13, 2026

Claude finished @zaira-bibi's task in 3m 6s


PR Review: feat(chat): show and persist tool calls in assistant messages

  • Gather context — read all changed files
  • Check for previously addressed review comments (none found)
  • Security review — no issues found
  • Correctness and logic review
  • Performance analysis
  • React anti-patterns check (frontend)
  • Test coverage validation
  • Post inline comments for specific issues
  • Post final summary

Overview

This PR adds tool call visibility to the chat UI: tool names display during LLM execution with a pulsing indicator, collapse to an "X operations completed" summary after, and persist in model_metadata for page-refresh survival. The implementation spans backend SSE events, metadata persistence, schema updates, frontend state management, and rendering.

Overall this is well-structured work with good test coverage and a clean protocol design. A few issues below.


Findings

High Priority

1. Final response no longer streams token-by-token (app/llm/providers.py:345-346)

The refactor of _stream_message_with_tools replaced astream() with ainvoke() for ALL rounds, including the final response. This means users get the entire final response dumped as a single {"type": "token"} event instead of seeing it build up progressively. For long answers after tool use, this is a noticeable UX regression.

Recommendation: Use ainvoke for intermediate tool-calling rounds (good — silences chatter), but switch to astream via StreamingCallbackHandler for the final round. See inline comment for a concrete approach.

Medium Priority

2. tool_calls from done payload not consumed by frontend (useChatStream.ts)

The backend sends tool_calls in the done SSE message (routes.py:780), but readStream doesn't extract it — unlike rag_sections which IS extracted and reconciled via doneRagSections. The live-streamed tool calls survive via ...msg spread in handleSend, but if a tool_end SSE event is dropped during streaming, there's no reconciliation path from the authoritative done payload.

Fix: Extract doneToolCalls from the done message (like doneRagSections), and apply them in handleSend's finalization:

// In readStream, inside parsed.done block:
const toolCalls = (parsed.message as Record<string, unknown> | undefined)?.tool_calls;
if (Array.isArray(toolCalls) && toolCalls.length > 0) {
  doneToolCalls = toolCalls as { name: string }[];
}

// In handleSend finalization:
...(doneToolCalls && {
  toolCalls: doneToolCalls.map((t) => ({ ...t, status: "done" as const })),
}),

Low Priority

3. Near-duplicate parsing functions (app/core_plugins/chat/routes.py:129-137)

_parse_tool_calls is nearly identical to _parse_rag_sections, differing only in the key name. A shared _parse_metadata_list(model_metadata, key) helper would reduce duplication. See inline comment.


What's Done Well

  • Clean event protocol — tool_start/tool_end events are well-designed and consistent with the existing status event pattern
  • Good test coverage — 7 new tests covering _parse_tool_calls edge cases, metadata persistence, done payload, and schema validation
  • Graceful history restoration — useConversation.ts maps persisted tool calls to { status: "done" } correctly
  • UI details — singular/plural handling ("1 operation" vs "N operations"), dotted-underline tooltip, slicing to last 8 entries during live progress
  • No migration required — leveraging the existing model_metadata JSON column is the right call

Assessment

The feature logic is sound and well-tested. The two substantive issues are (1) the streaming regression for tool-enabled conversations and (2) the missing done-payload reconciliation for tool calls. Item 1 is the more impactful one from a user-experience perspective. Neither is a blocking correctness bug — the feature works — but both would improve reliability and UX.

Comment thread app/llm/providers.py
Comment on lines +345 to +346
llm = self._create_llm(streaming=False)
llm_with_tools = llm.bind_tools(tools)

High: Final response no longer streams token-by-token

Switching from astream() to ainvoke() here means the final LLM response (after all tool calls) is returned as a single blob and emitted as one giant {"type": "token", "content": <entire-response>} event (line 390). For long responses this is a noticeable UX regression — users see nothing until the full response is generated, then it appears all at once.

The old code streamed every round, which had the opposite problem (intermediate chatter leaked to the user). A middle ground would be to use ainvoke for intermediate tool-calling rounds but switch back to astream for the final round:

# After the while-loop detects no tool calls, stream the final response:
callback = StreamingCallbackHandler()
llm_streaming = self._create_llm(streaming=True, callbacks=[callback])
llm_streaming_with_tools = llm_streaming.bind_tools(tools)
task = asyncio.create_task(llm_streaming_with_tools.ainvoke(langchain_messages))
async for token in callback.aiter():
    yield {"type": "token", "content": token}
await task

This preserves the nice tool-start/tool-end events while keeping token-by-token streaming for the user-facing final response.


Comment on lines +129 to +137
def _parse_tool_calls(model_metadata: str | None) -> list[dict[str, Any]] | None:
    if not model_metadata:
        return None
    try:
        meta = json.loads(model_metadata)
        calls = meta.get("tool_calls")
        return calls if isinstance(calls, list) else None
    except (json.JSONDecodeError, AttributeError):
        return None

Low: Near-duplicate of _parse_rag_sections

This function is nearly identical to _parse_rag_sections above (lines 118-126), differing only in the dict key. Consider a shared helper to reduce duplication:

def _parse_metadata_list(model_metadata: str | None, key: str) -> list[dict[str, Any]] | None:
    if not model_metadata:
        return None
    try:
        meta = json.loads(model_metadata)
        value = meta.get(key)
        return value if isinstance(value, list) else None
    except (json.JSONDecodeError, AttributeError):
        return None

Then: _parse_metadata_list(msg.model_metadata, "rag_sections") and _parse_metadata_list(msg.model_metadata, "tool_calls").

Comment thread app/llm/providers.py
Comment on lines +340 to +345
async def _stream_message_with_tools(
    self, messages: list[dict[str, Any]], tools: list[Any]
) -> AsyncIterator[dict[str, Any]]:
    """Tool-loop streaming: silences intermediate LLM chatter, emits tool events,
    and yields only the final response as token events."""
    llm = self._create_llm(streaming=False)
Contributor

Switching from astream + streaming=True to ainvoke + streaming=False looks aimed at silencing the LLM's intermediate chatter between tool calls, which is a real improvement. But it also drops streaming for the final answer — that reply now arrives as a single event after the model has fully finished generating (see the yield {"type": "token", "content": content} near the bottom of the function). So when a tool is involved, the bubble will sit on the thinking dots and then the whole response will pop in at once, instead of streaming in token-by-token like it does today for non-tool chats. Was the intent to suppress only the intermediate text, or also the final-response streaming? Worth calling this out explicitly either way, since it's a noticeable UX shift for any chat that uses tools.

Comment on lines +129 to +137
def _parse_tool_calls(model_metadata: str | None) -> list[dict[str, Any]] | None:
    if not model_metadata:
        return None
    try:
        meta = json.loads(model_metadata)
        calls = meta.get("tool_calls")
        return calls if isinstance(calls, list) else None
    except (json.JSONDecodeError, AttributeError):
        return None
Contributor

The exception-handling section of the project's CLAUDE.md asks for every caught exception to be logged with enough context to diagnose the failure. Here, if the metadata JSON stored on a message is ever corrupted (bad upgrade, manual DB edit, etc.), this helper quietly returns None and a message that did have tool calls will look like it had none — with no log trail at all to point at the bad row. The neighboring _parse_rag_sections has the same gap, so this is consistent with what's already there, but worth surfacing in case you want to address both together.
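If that gap is addressed, a minimal sketch of a logged variant, folded into the shared-helper idea from the earlier comment (logger name, message wording, and truncation length are illustrative):

import json
import logging
from typing import Any

logger = logging.getLogger(__name__)


def _parse_metadata_list(model_metadata: str | None, key: str) -> list[dict[str, Any]] | None:
    """Parse a list stored under `key` in a message's model_metadata JSON."""
    if not model_metadata:
        return None
    try:
        meta = json.loads(model_metadata)
    except json.JSONDecodeError:
        # Leave a trail pointing at the bad row instead of failing silently.
        logger.warning("Malformed model_metadata while reading %r: %.200r", key, model_metadata)
        return None
    if not isinstance(meta, dict):
        logger.warning("model_metadata is not a JSON object while reading %r", key)
        return None
    value = meta.get(key)
    return value if isinstance(value, list) else None

Callers that have the message id available could log it at the call site for row-level context.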

# --- Phase 2: LLM streaming ---
full_response = ""
conversation_id: int = conversation.id # type: ignore[assignment]
completed_tool_calls: list[dict[str, str]] = []
Contributor

Small inconsistency to flag: this is annotated list[dict[str, str]], but the matching schema field on MessageResponse is list[dict[str, Any]] | None. They agree today because every entry is just a {"name": ...} dict, but if a future event ever carries a non-string value (numeric tool id, structured args, etc.), the two types will silently disagree and mypy won't catch it on this side.
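One low-cost way to keep them aligned would be to reuse the schema's element type on the accumulator; a sketch (import shown only for completeness):

from typing import Any

# Matches MessageResponse.tool_calls (list[dict[str, Any]] | None) so mypy
# flags any future divergence between the streamed entries and the schema.
completed_tool_calls: list[dict[str, Any]] = []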

Comment on lines +157 to +166
// Mark the first running entry for this tool name as done
let marked = false;
const updated = existing.map((t) => {
  if (!marked && t.name === toolName && t.status === "running") {
    marked = true;
    return { ...t, status: "done" as const };
  }
  return t;
});
return { ...msg, toolCalls: updated };
Contributor

This pairs tool_start and tool_end events by tool name alone. Today the backend yields them in strict sequential order so it works fine, but if the model is ever allowed to call the same tool in parallel (e.g., two search_web calls firing at once), there's no way for this code to tell which tool_end belongs to which running entry — whichever running card the loop hits first will flip to "done", even if it isn't the one that actually finished. Not a today-bug, but worth being aware of as tool-use grows.

Comment on lines +38 to +42
const toolCalls = message.toolCalls ?? [];
const hasRunningTools = toolCalls.some((t) => t.status === "running");
const isThinking = message.isTyping && !displayText && !hasRunningTools;
// While tools are running and no response text yet, suppress the card entirely
const showCard = !hasRunningTools || !!displayText || message.isError || !message.isTyping;
Contributor

Between toolCalls, hasRunningTools, isThinking, showCard, and the inline tool-call JSX further down, what's actually visible inside the assistant bubble now depends on four or five interacting booleans. It reads okay today if you follow it line-by-line, but every future state added here (error while tools are still running, tools-only response with no text, etc.) will have to be reasoned about against all the others. Worth keeping an eye on before this grows further.
