feat(chat): show and persist tool calls in assistant messages #322
zaira-bibi wants to merge 1 commit into
Conversation
Tool call progress was not displayed during streaming and was lost on page refresh. Now shows live tool names while executing, collapses to "X operations completed" on finish, and persists by saving call names in message metadata so the summary survives a reload.
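For orientation, the streaming protocol discussed throughout this PR consists of events like the following (a sketch in Python literals; the event types token/tool_start/tool_end/done and the done payload's message.tool_calls field appear in the code under review, but the exact ordering shown here is illustrative):

events = [
    {"type": "tool_start", "name": "search_web"},
    {"type": "tool_end", "name": "search_web"},
    {"type": "token", "content": "Here is what I found..."},
    {"type": "done", "message": {"tool_calls": [{"name": "search_web"}]}},
]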
Claude finished @zaira-bibi's task in 3m 6s
PR Review: feat(chat): show and persist tool calls in assistant messages
Overview

This PR adds tool call visibility to the chat UI: tool names display during LLM execution with a pulsing indicator, collapse to an "X operations completed" summary after, and persist in message metadata so the summary survives a page refresh. Overall this is well-structured work with good test coverage and a clean protocol design. A few issues below.

Findings

High Priority

1. Final response no longer streams token-by-token

The refactor of _stream_message_with_tools switches the tool loop from astream() to ainvoke(), so the final LLM response arrives as a single token event instead of streaming incrementally. Recommendation: use ainvoke() for intermediate tool-calling rounds but restore token streaming for the final round (detailed in the inline comment below).

Medium Priority

2. The backend sends the final tool call list in the done payload, but the frontend never reads it, so the rendered summary relies solely on the tool_end events seen mid-stream. Fix: extract tool_calls from the done payload and use it when finalizing the message:

// In readStream, inside the parsed.done block
// (doneToolCalls is assumed declared earlier, e.g. let doneToolCalls: { name: string }[] | undefined):
const toolCalls = (parsed.message as Record<string, unknown> | undefined)?.tool_calls;
if (Array.isArray(toolCalls) && toolCalls.length > 0) {
  doneToolCalls = toolCalls as { name: string }[];
}

// In handleSend finalization:
...(doneToolCalls && {
  toolCalls: doneToolCalls.map((t) => ({ ...t, status: "done" as const })),
}),

Low Priority

3. Near-duplicate parsing functions: _parse_tool_calls is nearly identical to _parse_rag_sections; consider a shared helper (detailed in the inline comment below).
What's Done Well

Clean protocol design for the tool progress events, solid test coverage, and persistence that reuses the existing model_metadata column rather than adding a migration.

Assessment

The feature logic is sound and well-tested. The two substantive issues are (1) the streaming regression for tool-enabled conversations and (2) the missing done-payload reconciliation for tool calls. Item 1 is the more impactful of the two from a user-experience perspective. Neither is a blocking correctness bug — the feature works — but fixing both would improve reliability and UX.
llm = self._create_llm(streaming=False)
llm_with_tools = llm.bind_tools(tools)
High: Final response no longer streams token-by-token
Switching from astream() to ainvoke() here means the final LLM response (after all tool calls) is returned as a single blob and emitted as one giant {"type": "token", "content": <entire-response>} event (line 390). For long responses this is a noticeable UX regression — users see nothing until the full response is generated, then it appears all at once.
The old code streamed every round, which had the opposite problem (intermediate chatter leaked to the user). A middle ground would be to use ainvoke for intermediate tool-calling rounds but switch back to token streaming for the final round (the snippet below uses a streaming callback with ainvoke, which has the same effect as astream):
# After the while-loop detects no tool calls, stream the final response:
callback = StreamingCallbackHandler()
llm_streaming = self._create_llm(streaming=True, callbacks=[callback])
llm_streaming_with_tools = llm_streaming.bind_tools(tools)
task = asyncio.create_task(llm_streaming_with_tools.ainvoke(langchain_messages))
async for token in callback.aiter():
yield {"type": "token", "content": token}
await task

This preserves the nice tool-start/tool-end events while keeping token-by-token streaming for the user-facing final response.
def _parse_tool_calls(model_metadata: str | None) -> list[dict[str, Any]] | None:
    if not model_metadata:
        return None
    try:
        meta = json.loads(model_metadata)
        calls = meta.get("tool_calls")
        return calls if isinstance(calls, list) else None
    except (json.JSONDecodeError, AttributeError):
        return None
Low: Near-duplicate of _parse_rag_sections
This function is nearly identical to _parse_rag_sections above (lines 118-126), differing only in the dict key. Consider a shared helper to reduce duplication:
def _parse_metadata_list(model_metadata: str | None, key: str) -> list[dict[str, Any]] | None:
    if not model_metadata:
        return None
    try:
        meta = json.loads(model_metadata)
        value = meta.get(key)
        return value if isinstance(value, list) else None
    except (json.JSONDecodeError, AttributeError):
        return None

Then: _parse_metadata_list(msg.model_metadata, "rag_sections") and _parse_metadata_list(msg.model_metadata, "tool_calls").
async def _stream_message_with_tools(
    self, messages: list[dict[str, Any]], tools: list[Any]
) -> AsyncIterator[dict[str, Any]]:
    """Tool-loop streaming: silences intermediate LLM chatter, emits tool events,
    and yields only the final response as token events."""
    llm = self._create_llm(streaming=False)
Switching from astream + streaming=True to ainvoke + streaming=False looks aimed at silencing the LLM's intermediate chatter between tool calls, which is a real improvement. But it also drops streaming for the final answer — that reply now arrives as a single event after the model has fully finished generating (see the yield {"type": "token", "content": content} near the bottom of the function). So when a tool is involved, the bubble will sit on the thinking dots and then the whole response will pop in at once, instead of streaming in token-by-token like it does today for non-tool chats. Was the intent to suppress only the intermediate text, or also the final-response streaming? Worth calling this out explicitly either way, since it's a noticeable UX shift for any chat that uses tools.
def _parse_tool_calls(model_metadata: str | None) -> list[dict[str, Any]] | None:
    if not model_metadata:
        return None
    try:
        meta = json.loads(model_metadata)
        calls = meta.get("tool_calls")
        return calls if isinstance(calls, list) else None
    except (json.JSONDecodeError, AttributeError):
        return None
The exception-handling section of the project's CLAUDE.md asks for every caught exception to be logged with enough context to diagnose the failure. Here, if the metadata JSON stored on a message is ever corrupted (bad upgrade, manual DB edit, etc.), this helper quietly returns None and a message that did have tool calls will look like it had none — with no log trail at all to point at the bad row. The neighboring _parse_rag_sections has the same gap, so this is consistent with what's already there, but worth surfacing in case you want to address both together.
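A minimal sketch of the logged variant, assuming a module-level logger (the logger setup and the 200-character truncation are choices made here, not part of the PR):

import json
import logging
from typing import Any

logger = logging.getLogger(__name__)

def _parse_tool_calls(model_metadata: str | None) -> list[dict[str, Any]] | None:
    if not model_metadata:
        return None
    try:
        meta = json.loads(model_metadata)
        calls = meta.get("tool_calls")
        return calls if isinstance(calls, list) else None
    except (json.JSONDecodeError, AttributeError):
        # Per CLAUDE.md: leave a trail with enough context to find the bad row.
        logger.warning(
            "Could not parse tool_calls from model_metadata (first 200 chars: %r)",
            model_metadata[:200],
        )
        return None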
# --- Phase 2: LLM streaming ---
full_response = ""
conversation_id: int = conversation.id  # type: ignore[assignment]
completed_tool_calls: list[dict[str, str]] = []
Small inconsistency to flag: this is annotated list[dict[str, str]], but the matching schema field on MessageResponse is list[dict[str, Any]] | None. They agree today because every entry is just a {"name": ...} dict, but if a future event ever carries a non-string value (numeric tool id, structured args, etc.), the two types will silently disagree and mypy won't catch it on this side.
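A small sketch of one way to keep the two sides in lockstep (the ToolCallEntry name is invented here for illustration):

from typing import TypedDict

class ToolCallEntry(TypedDict):
    # Shape of one persisted tool call; today it carries only the tool name.
    name: str

completed_tool_calls: list[ToolCallEntry] = []
# ...and on the schema side: tool_calls: list[ToolCallEntry] | None = None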
// Mark the first running entry for this tool name as done
let marked = false;
const updated = existing.map((t) => {
  if (!marked && t.name === toolName && t.status === "running") {
    marked = true;
    return { ...t, status: "done" as const };
  }
  return t;
});
return { ...msg, toolCalls: updated };
This pairs tool_start and tool_end events by tool name alone. Today the backend yields them in strict sequential order so it works fine, but if the model is ever allowed to call the same tool in parallel (e.g., two search_web calls firing at once), there's no way for this code to tell which tool_end belongs to which running entry — whichever running card the loop hits first will flip to "done", even if it isn't the one that actually finished. Not a today-bug, but worth being aware of as tool-use grows.
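If that ever becomes a concern, the cheapest fix is backend-side: tag both events with a per-call id so the frontend can match on it instead of the name. A sketch, assuming a LangChain-style tool loop (the ai_message and tools_by_name names, and the call_id event field, are illustrative and not part of this PR's protocol):

import uuid

for tool_call in ai_message.tool_calls:  # LangChain tool-call dicts carry name/args/id
    call_id = tool_call.get("id") or uuid.uuid4().hex
    yield {"type": "tool_start", "name": tool_call["name"], "call_id": call_id}
    result = await tools_by_name[tool_call["name"]].ainvoke(tool_call["args"])
    yield {"type": "tool_end", "name": tool_call["name"], "call_id": call_id}

The frontend would then flip exactly the entry whose call_id matches, with no ambiguity between parallel calls of the same tool.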
const toolCalls = message.toolCalls ?? [];
const hasRunningTools = toolCalls.some((t) => t.status === "running");
const isThinking = message.isTyping && !displayText && !hasRunningTools;
// While tools are running and no response text yet, suppress the card entirely
const showCard = !hasRunningTools || !!displayText || message.isError || !message.isTyping;
Between toolCalls, hasRunningTools, isThinking, showCard, and the inline tool-call JSX further down, what's actually visible inside the assistant bubble now depends on four or five interacting booleans. It reads okay today if you follow it line-by-line, but every future state added here (error while tools are still running, tools-only response with no text, etc.) will have to be reasoned about against all the others. Worth keeping an eye on before this grows further.
What
Show tool call names during LLM execution, collapse to a persistent "X operations completed" summary after, and save them in message metadata so the summary survives page refresh.
Changes
- feat(chat): collect tool_end events during streaming and save tool call names in message metadata alongside rag_sections
- feat(api): add tool_calls field to MessageResponse and expose via conversation history endpoint
- feat(frontend): display live tool names during streaming, then collapse to a dotted-underline summary with a hover tooltip listing each call
- feat(frontend): restore toolCalls from API response when loading conversation history
- test(chat): add tests for _parse_tool_calls, metadata persistence, done payload, and MessageResponse.tool_calls

How to Test
Notes
No migration required — tool_calls is stored in the existing model_metadata JSON column alongside rag_sections.
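For reference, a sketch of what that column might hold after a reply that ran the same tool twice (the rag_sections entry shape is an assumption; this PR only establishes that tool_calls entries are {"name": ...} dicts):

import json

model_metadata = json.dumps(
    {
        "rag_sections": [{"title": "Getting started"}],  # assumed shape, pre-existing field
        "tool_calls": [{"name": "search_web"}, {"name": "search_web"}],
    }
)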