…rupt() works over AG-UI (#92) * fix(copilotkit): compile evaluator_graph with a checkpointer for interrupt() The evaluator agent calls ``interrupt()`` in ``confirm_judge_node`` for human-in-the-loop judge confirmation. A checkpointer is mandatory for that to work -- without one, a ``Command(resume=...)`` call produces zero events, ``ag-ui-langgraph`` never emits ``RUN_FINISHED``, and the CopilotKit frontend blocks all subsequent messages with "Cannot send 'RUN_STARTED' while a run is still active". Changes: - ``evaluator_agent.py``: compile with ``InMemorySaver`` and a commented Postgres swap block. Convert ``JudgeInfo`` / ``TraceInfo`` / ``EvaluationInfo`` from ``@dataclass`` to ``pydantic.BaseModel`` so LangGraph's default ``JsonPlusSerializer`` can persist state across the pause boundary (dataclasses raise ``TypeError: Type is not msgpack serializable``). - ``samples/copilotkit/README.md``: add full FastAPI backend wiring with ``add_langgraph_fastapi_endpoint``, Next.js frontend wiring with ``LangGraphHttpAgent``, a checkpointer options matrix (InMemory / SQLite / Postgres / Redis / LangGraph Platform) with per-option migration snippets, a version-compatibility table pinning the versions the bug reporter used, and a troubleshooting section mapping the observed frontend errors back to the backend cause. - ``docs/samples-guide.md``: cross-reference the checkpointer requirement. - ``tests/test_samples_e2e.py``: add ``test_copilotkit_evaluator_interrupt_resume`` that imports the real ``langgraph`` (not ``MagicMock``), asserts the compiled graph has a non-None checkpointer, and drives a full ``astream -> interrupt -> Command(resume=...) -> astream`` cycle with a patched Stratix client. Confirmed this test fails on the pre-fix code and passes on the fix. Also extended the existing mock-modules dicts so the import-smoke tests include ``langgraph.checkpoint.memory``. 
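The insistence on the real ``langgraph`` in the new test can be illustrated with the standard library alone. The sketch below is an assumption-laden toy, not the repository's test code: ``fake_graph_module`` is a hypothetical stand-in for a mocked ``langgraph.graph``, and it only shows that a ``MagicMock`` accepts a ``compile()`` call with no checkpointer and still makes a "checkpointer is set" check succeed.

```python
from unittest.mock import MagicMock

# Hypothetical stand-in for a test that replaces ``langgraph.graph`` with a mock.
fake_graph_module = MagicMock()

# "Compile" without passing a checkpointer; a MagicMock accepts any call.
compiled = fake_graph_module.StateGraph(dict).compile()

# Every attribute of a MagicMock is itself a MagicMock, so this check is
# vacuous: it succeeds even though no checkpointer was ever supplied.
checkpointer_looks_set = compiled.checkpointer is not None
```

A test built on such a mock cannot tell a graph compiled with a checkpointer from one compiled without, so the regression test has to build and run the real graph.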
The existing tests missed this because they mock ``langgraph``, ``langgraph.graph``, and ``langgraph.types`` with ``MagicMock()`` and then only call ``main()`` (which prints usage). They never build or execute the graph, so they cannot observe the missing checkpointer. * fix(copilotkit): register DTOs with serde, add AG-UI wire e2e test Follow-ups to the interrupt/checkpointer fix, addressing the open items flagged in the prior commit: 1. Deserialize warning resolved. The Pydantic DTOs (``JudgeInfo`` / ``TraceInfo`` / ``EvaluationInfo``) are now registered on ``JsonPlusSerializer(allowed_msgpack_modules=...)`` via a custom serde passed to ``InMemorySaver``. Verified the sample passes with ``LANGGRAPH_STRICT_MSGPACK=true``, so it survives LangGraph's planned tightening of checkpoint deserialization. 2. End-to-end AG-UI wire validation. New integration test ``test_copilotkit_evaluator_agui_wire`` wires the evaluator graph into a FastAPI app through ``ag_ui_langgraph.add_langgraph_fastapi_endpoint``, drives the full user flow in-process via ``httpx.ASGITransport``, and asserts: - Phase 1 (initial run, hits ``interrupt()``): emits RUN_STARTED and RUN_FINISHED on the SSE stream. - Phase 2 (resume with user confirmation, same ``threadId``): emits RUN_STARTED and RUN_FINISHED. - Phase 3 (follow-up message after resume): not blocked -- RUN_STARTED and RUN_FINISHED fire again. This is the exact symptom the reporter hit, tested through the exact protocol path. Gated on ``pytest.importorskip`` for the heavy deps so the test skips cleanly when they are absent. Side benefit: running the same scenario against pre-fix code produces ``ValueError: No checkpointer set`` directly from ``graph.aget_state()``, giving operators a much louder error than the silent "stream ends without RUN_FINISHED" path. 3. README backend-wiring snippet corrected. 
The actual ``add_langgraph_fastapi_endpoint`` signature takes an ``agent=LangGraphAgent(...)`` wrapper, not a bare ``graph=`` kwarg -- the example in the previous commit would have failed at import. Also expanded the SSE-protocol explanation to match what the new e2e test observes on the wire. 4. Investigator graph annotated. ``investigator_graph`` does not call ``interrupt()`` so it does not need a checkpointer, but without an explicit note future contributors adding a HITL step would silently regress. Added a short comment at the ``.compile()`` call pointing at the evaluator pattern. * fix(copilotkit): resolve error/field collision + browser e2e harness Follow-ups addressing the remaining open items from the previous two commits: 1. Rename ``error`` node to ``handle_error`` (evaluator + investigator). The old name collided with the ``error`` field on the state dataclass. LangGraph 1.x accepts the collision; earlier versions reject it with "'error' is already being used as a state key". Renaming the node (and the conditional-edge routing targets) keeps the routing token ``"error"`` purely an edge key and sidesteps the conflict on any LangGraph version the sample may be copied into. 2. Guard the ``allowed_msgpack_modules`` kwarg behind try/except so the sample still imports cleanly on langgraph<1.0 (where the kwarg does not exist and the strict-msgpack warning is not emitted either). Verified the sample now imports on both langgraph 0.2.56 and 1.1.9. 3. Ruff-clean the changed files (import sort I001 fixes on the new test additions; unrelated warnings in pre-existing ``main()`` / ``error_node`` code are out of scope per "only change what was asked"). 4. New ``samples/copilotkit/tests/browser/`` harness: - ``backend/server.py`` -- FastAPI app that patches ``layerlens.Stratix`` before importing the evaluator module and mounts ``evaluator_graph`` via ``add_langgraph_fastapi_endpoint(..., path="/evaluator")``. 
   - ``frontend/`` -- Next.js 16.2.4 app pinned to the reporter's exact CopilotKit versions (``@copilotkit/react-core``, ``@copilotkit/react-ui``, ``@copilotkit/runtime`` all at 1.56.3), with the CopilotKit runtime wired to ``LangGraphHttpAgent`` against the FastAPI backend.
   - ``frontend/tests/interrupt-resume.spec.ts`` -- Playwright spec that drives CopilotChat through the three-turn scenario the reporter hit ("evaluate" -> "ok" -> "thanks") and asserts the exact string "Cannot send 'RUN_STARTED' while a run is still active" appears in neither the visible DOM nor the browser console.

Known limitation documented in the harness README: CopilotChat 1.56's textarea reports as aria-hidden / non-"visible" under Playwright strict actionability checks in **headless** Chromium, and multiple input-driving patterns (``fill``, ``keyboard.type + Enter``, ``pressSequentially``, DOM-setter + bubbled input event) failed to reliably enable the Send button headlessly. The harness works with ``--headed`` for human verification and is structurally complete. The authoritative regression coverage for the fix is the Python test suite (``test_copilotkit_evaluator_interrupt_resume`` and ``test_copilotkit_evaluator_agui_wire``); the browser harness is corroborating / demo value, not gate-keeping.

* fix(copilotkit): mirror CopilotKit's sample + workaround for ag-ui-langgraph runId bug

DevRel surfaced that the backend fix in the previous commits got the Python side working but the frontend still locked up with "Cannot send 'RUN_STARTED' while a run is still active. ... INCOMPLETE_STREAM" on the second message. Raw SSE capture confirmed the root cause:

    RUN_STARTED   runId = "r1_aca59ad1"        (client-supplied)
    RUN_FINISHED  runId = "019dc049-14ba-..."  (LangGraph's internal chain UUID)

This is an upstream bug in ag-ui-langgraph (ag-ui-protocol/ag-ui#1582): ``_handle_stream_events`` overwrites ``self.active_run['id']`` with every LangGraph event's internal ``run_id``, so RUN_FINISHED emits LangGraph's UUID instead of the client-supplied ``input.run_id``. ``@copilotkit/runtime`` tracks active runs by client runId and raises RUN_ERROR/INCOMPLETE_STREAM. Verified the bug is present in both ag-ui-langgraph 0.0.22 (CopilotKit's officially-pinned version) and 0.0.34 (the reporter's version), and also in ``copilotkit.LangGraphAGUIAgent`` which inherits the broken method.

Changes in this commit, all aligned with CopilotKit's own ``examples/integrations/langgraph-fastapi`` reference sample:

1. ``evaluator_agent.py``:
   - State class converted from ``@dataclass`` to a ``TypedDict`` inheriting from ``copilotkit.CopilotKitState``. This gives us ``MessagesState``'s ``add_messages`` reducer for free (nodes return NEW messages; they are appended, not replaced) and the ``copilotkit`` field the frontend injects.
   - All node functions updated from ``state.X`` / ``state.messages + [m]`` to ``state.get('X')`` / ``{'messages': [m]}``.
   - HITL interrupt now uses ``copilotkit.langgraph.copilotkit_interrupt`` (wraps ``interrupt()`` with ``__copilotkit_messages__`` so the prompt renders as a real AIMessage in the chat UI). The bare ``langgraph.types.interrupt(prompt)`` emitted a CUSTOM event the UI ignored -- why the reporter said "the agent stops and never reaches the human-in-the-loop confirmation step."
   - New ``RunIdPreservingAgent`` subclass (lazy factory ``_build_langgraph_agui_agent``) overrides ``_dispatch_event`` to restore ``input.run_id`` on RUN_FINISHED / RUN_ERROR terminal events. Clearly commented with a "remove when upstream ships" TODO pointing at the ag-ui-protocol/ag-ui issue.

2.
``samples/copilotkit/README.md``:
   - Version matrix re-pinned to CopilotKit's exact tested set (``copilotkit==0.1.74``, ``langchain==1.0.1``, ``langgraph==1.0.1``, ``ag-ui-langgraph==0.0.22``, ``@copilotkit/*==1.56.3``, Python ``>=3.10,<3.13``).
   - Upstream-bug callout explaining the runId workaround.
   - Backend wiring snippet updated to show the factory import and the ``LangGraphAGUIAgent`` path for non-interrupt graphs (investigator).

3. ``tests/test_samples_e2e.py``:
   - ``test_copilotkit_evaluator_interrupt_resume`` now sends ``Command(resume=[HumanMessage(content='ok')])`` rather than ``Command(resume='ok')``, matching ``copilotkit_interrupt``'s expected resume payload shape.
   - ``test_copilotkit_evaluator_agui_wire`` rewritten. The previous version had a blind spot: it only asserted RUN_FINISHED was PRESENT, not that ``RUN_STARTED.runId == RUN_FINISHED.runId == input.run_id``. Now it uses the ``RunIdPreservingAgent`` factory and asserts runId continuity end-to-end. Without the workaround this test would catch the upstream bug immediately.
   - Mock-module dict extended with ``copilotkit.langgraph`` for the import-smoke test.

4. ``samples/copilotkit/tests/browser/``:
   - ``backend/requirements.txt`` re-pinned to CopilotKit's set.
   - ``backend/server.py`` switched from raw ``ag_ui_langgraph.LangGraphAgent`` to the sample's factory so the browser harness also benefits from the runId workaround.

All 9 copilotkit tests pass in the pinned venv. Empirical verification scripts in /tmp/ (not committed) show raw SSE with matching runIds end-to-end.

* fix(copilotkit): also suppress duplicate RUN_STARTED on re-entry interrupt path

DevRel's diagnostic bundle (ag-ui-langgraph==0.0.34, copilotkit==0.1.87, @ag-ui/client==0.0.52 transitively) confirmed commit 542002b did not fix the browser symptom. Raw SSE from the Network tab showed:

    RUN_STARTED runId=d0b9d6c5-...
    [graph reaches step="confirm_judge" -- interrupt IS being hit]
    RUN_ERROR {code: "INCOMPLETE_STREAM", message: "Cannot send 'RUN_STARTED' while a run is still active. The previous run must be finished with 'RUN_FINISHED' before starting a new run."}

Same error text as before, different root cause. A second bug in ag-ui-langgraph: when a request arrives on a thread whose graph is already paused at ``interrupt()`` and the request does NOT carry ``forwardedProps.command.resume``, the ``has_active_interrupts`` branch of ``prepare_stream`` (agent.py:491) emits a second ``RunStartedEvent`` to ``events_to_dispatch`` -- after ``_handle_stream_events`` (line 209) already emitted one at the top of the stream. The server's own AG-UI encoder validator catches the duplicate and converts it into a ``RUN_ERROR`` with the exact "Cannot send 'RUN_STARTED'..." message, terminating the stream before ``RUN_FINISHED`` can be dispatched. On ``@ag-ui/client@0.0.52`` (the newer protocol-state validator, which enforces within-stream start/finish invariants rather than the runId correlation the previous version used) this is what lands as INCOMPLETE_STREAM in the browser.

Extended the sample's workaround subclass to filter at the agent boundary rather than override ``_dispatch_event`` (which expects to return an Event, not None/""). The filter:

1. Drops any RUN_STARTED after the first within a single stream -- fixes the duplicate-emission bug on the ``has_active_interrupts`` path.
2. Restamps ``input.run_id`` on RUN_FINISHED / RUN_ERROR -- preserves the existing ag-ui-protocol/ag-ui#1582 fix for older clients that correlate by runId.

Verified on both pin matrices:
- copilotkit==0.1.74 / ag-ui-langgraph==0.0.22 (CopilotKit's own reference sample pins): all tests pass.
- copilotkit==0.1.87 / ag-ui-langgraph==0.0.34 (DevRel / reporter): all tests pass.
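The two filter rules described above can be sketched in isolation. This is a minimal, standard-library-only illustration under stated assumptions, not the sample's actual subclass: events are modelled as plain dicts with ``type`` and ``runId`` keys, and ``filter_run_events`` is a hypothetical name chosen for the example.

```python
from typing import Any, Dict, Iterable, Iterator

# Assumption: events are plain dicts here; the real stream carries AG-UI event objects.
Event = Dict[str, Any]

TERMINAL_TYPES = {"RUN_FINISHED", "RUN_ERROR"}


def filter_run_events(events: Iterable[Event], client_run_id: str) -> Iterator[Event]:
    """Drop duplicate RUN_STARTED events and restamp the client runId on terminal events."""
    seen_run_started = False
    for event in events:
        event_type = event.get("type")
        if event_type == "RUN_STARTED":
            if seen_run_started:
                continue  # rule 1: only the first RUN_STARTED in a stream survives
            seen_run_started = True
        elif event_type in TERMINAL_TYPES:
            # rule 2: copy the event (leave the input untouched) and restore the client runId
            event = {**event, "runId": client_run_id}
        yield event


# Hypothetical stream showing both bugs: a duplicate start and a foreign terminal runId.
raw = [
    {"type": "RUN_STARTED", "runId": "r1"},
    {"type": "RUN_STARTED", "runId": "r1"},
    {"type": "STEP_STARTED", "stepName": "confirm_judge"},
    {"type": "RUN_FINISHED", "runId": "internal-uuid"},
]
cleaned = list(filter_run_events(raw, client_run_id="r1"))
```

Writing the filter as a generator over the outgoing stream keeps it independent of how individual events are dispatched, which mirrors the choice to filter at the agent boundary rather than inside ``_dispatch_event``.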
Tightened ``test_copilotkit_evaluator_agui_wire`` accordingly: - asserts exactly one RUN_STARTED per stream (catches bug b) - asserts no RUN_ERROR - asserts RUN_STARTED.runId == RUN_FINISHED.runId == input.run_id (catches bug a, ag-ui-protocol/ag-ui#1582) Without either half of the workaround the test fails with a precise message pointing at which bug regressed. Follow-up: file the duplicate-RUN_STARTED bug upstream as a separate issue on ag-ui-protocol/ag-ui. * fix(copilotkit): re-pin to latest published versions (DevRel env) and ship lockfile Replaces the earlier pinning to CopilotKit's reference-sample versions (copilotkit==0.1.74 / ag-ui-langgraph==0.0.22) with the current published set customers actually install: copilotkit==0.1.87 langchain==1.2.15 langchain-core==1.3.0 langgraph==1.1.9 ag-ui-langgraph==0.0.34 Frontend transitive ``@ag-ui/client==0.0.52`` now matches what ``@copilotkit/react-core==1.56.3`` actually pulls in (DevRel's environment per their diagnostic bundle). Changes: - ``samples/copilotkit/tests/browser/backend/requirements.txt`` -- pins updated to the latest set above. - ``samples/copilotkit/tests/browser/backend/requirements.lock`` -- NEW, committed pip-freeze of the verified environment. ``pip install -r requirements.lock`` now gives byte-identical transitive deps. - ``samples/copilotkit/README.md`` -- version matrix and install snippets updated to the latest set; upstream-bug callout now lists both issues (``ag-ui-protocol/ag-ui#1582`` runId overwrite, ``ag-ui-protocol/ag-ui#1584`` duplicate RUN_STARTED). - ``samples/copilotkit/agents/evaluator_agent.py`` -- renamed the factory from ``_build_langgraph_agui_agent`` to the public ``build_agui_agent``; added a ``_version_guard_ag_ui_langgraph`` helper that emits a ``RuntimeWarning`` when the installed version is outside the tested range ``[0.0.22, 0.0.34]`` so silent behavior drift does not hide a regression. 
A backwards-compatible alias keeps the old private name importable for internal tests during the rename window. - ``samples/copilotkit/tests/browser/backend/server.py`` and ``tests/test_samples_e2e.py`` -- call sites updated to the public name. Verified end-to-end against the latest version matrix: - pytest -k copilotkit: 9 passed, 2 skipped (live-only). - Manual HTTP drive against a running backend with the reporter's exact flow (turn 1 initial -> interrupt, turn 2 re-entry on paused graph): both turns emit exactly one RUN_STARTED and one RUN_FINISHED, both with matching client runIds, no RUN_ERROR / INCOMPLETE_STREAM. * fix(copilotkit): wire useLangGraphInterrupt on frontend; drop backend resume heuristic DevRel confirmed the Apr-24 push resolved the turn-1 INCOMPLETE_STREAM (backend now emits a clean RUN_STARTED -> STEP_* -> RUN_FINISHED for the initial interrupt turn). Remaining gap: when the user replies to the interrupt, plain ``<CopilotChat>`` sends the reply as an ordinary new chat message, not as ``forwardedProps.command.resume`` -- so the graph stayed paused and the same error returned on the follow-up. Correct fix is on the frontend, not the backend: ``@copilotkit/react-core@1.56.3`` ships ``useLangGraphInterrupt``, the hook specifically designed for this case. It renders a UI when the graph pauses at ``interrupt()`` and calls ``resolve(...)`` with the user's answer -- which the runtime forwards as the proper ``command.resume`` payload. This is the supported AG-UI protocol path: the frontend must explicitly signal a resume rather than a new turn. Changes: - ``samples/copilotkit/tests/browser/frontend/app/page.tsx``: wires ``useLangGraphInterrupt`` with a dedicated prompt widget (``data-testid`` stable for automation), and a "Start evaluation" test-hook button that uses ``useCopilotChat().appendMessage`` to kick off the graph without having to type into CopilotChat's textarea (which Playwright can't reliably drive on 1.56.3 + React 19). 
The ``resolve([{role:"user", content}])`` shape matches what ``copilotkit_interrupt`` expects server-side (``answer = response[-1].content``). - ``samples/copilotkit/tests/browser/frontend/app/globals.css``: styles for the interrupt widget and the test-hook start button. - ``samples/copilotkit/agents/evaluator_agent.py``: reverts the backend auto-resume heuristic I had shipped as a stopgap. It was overloading the protocol semantics ("any user message during active interrupt == resume answer") which is incorrect for anything beyond a simple sample -- breaks cancel flows and multi-interrupt scenarios. The backend now only does the two genuine protocol-bug workarounds (runId overwrite, duplicate RUN_STARTED). Resume belongs to the frontend. Test plan: - Python test suite: ``pytest -k copilotkit`` -- 9 passed / 2 skipped (live) on DevRel's exact version matrix. - Backend HTTP round-trip with a programmatic ``command.resume`` payload: both turns emit matched ``RUN_STARTED``/``RUN_FINISHED`` with client runId, no ``RUN_ERROR`` (verified on 2026-04-24). - Browser end-to-end: the hook wiring in page.tsx matches CopilotKit's own showcase pattern and the hook source I inspected. I could not self-verify the full browser round-trip because (a) Playwright cannot reliably drive CopilotChat's textarea on 1.56.3 + React 19 (tracked at CopilotKit/CopilotKit#4215), and (b) my attempted programmatic appendMessage test-hook did not trigger a runtime POST in my local venv for reasons I have not yet pinned down. **DevRel re-test in a real browser is the authoritative check for the frontend round-trip.** Follow-up (per "#2" in the user's plan): rewrite the evaluator HITL to use CopilotKit's current idiom (``useCopilotAction`` / ``useHumanInTheLoop`` -- frontend-defined tool + UI render + resolve) instead of backend ``interrupt()``. 
That's the pattern CopilotKit's active samples use; it avoids the ag-ui-langgraph interrupt path bugs entirely and is where customers should be pointed for new work. * feat(copilotkit): rearchitect evaluator as create_agent + frontend HITL tool Replaces the custom StateGraph + ``langgraph.types.interrupt()`` pattern with CopilotKit's current HITL idiom: ``langchain.agents.create_agent`` driving an LLM that calls backend tools, with the human-in-the-loop step wired as a **frontend** tool via ``useCopilotAction`` + ``renderAndWaitForResponse``. This matches what CopilotKit's active showcases (``hitl_in_chat_agent.py``, ``interrupt_agent.py``) use. Why the rearchitect: the ``interrupt()`` code path in ``ag-ui-langgraph`` has two protocol-level bugs (tracked upstream as ``ag-ui-protocol/ag-ui#1582`` and ``#1584``) that the previous revision worked around by subclassing ``LangGraphAGUIAgent`` and reaching into private internals. That ships, but it's not the pattern CopilotKit themselves exercise, and the workaround is fragile across upstream bumps. Moving off the ``interrupt()`` path sidesteps both bugs by construction and aligns with CopilotKit's active direction. Design (three-role review): - **AI engineer**: LLM drives. Backend tools (``list_judges``, ``list_recent_traces``, ``run_trace_evaluation``, ``get_evaluation_result``) are thin wrappers over the LayerLens SDK. A tight system prompt guides the flow. ``confirm_judge`` is a frontend tool declared via ``useCopilotAction``; ``CopilotKitMiddleware()`` bridges it into the agent's toolbelt so the LLM can "call" it like any other tool. - **Designer**: HITL renders as a card list -- each judge shows name, id, and evaluation goal, with a ``Select <Name>`` button. Keyboard accessible, visible focus states, compact "Judge selected." state after the user chooses. ``data-testid`` attributes throughout for deterministic automation. - **SDK engineer**: ~160 LoC for the evaluator (down from ~560). No private-API reach. 
No workaround subclass. No checkpointer needed (``create_agent`` owns state). Lockfile updated for ``langchain-openai``. Frontend pins unchanged.

The old ``build_agui_agent`` factory, ``build_graph`` with a custom ``StateGraph``, ``EvaluatorState`` TypedDict, all node functions, the msgpack DTO allowlist, and the version-guard helpers are all gone -- replaced by one ``build_graph(model=...)`` that returns the compiled ``create_agent`` graph.

Tests:
- ``tests/test_samples_e2e.py`` rewritten. ``test_copilotkit_evaluator_interrupt_resume`` and ``test_copilotkit_evaluator_agui_wire`` (both specific to the old ``interrupt()`` architecture) replaced by ``test_copilotkit_evaluator_tools``, which exercises each backend tool against a patched Stratix client and verifies the system prompt references ``confirm_judge``.
- Import-smoke test mock list extended for ``langchain.agents`` / ``langchain.tools`` / ``langchain_core.tools`` / ``langchain_openai``.
- ``pytest -k copilotkit``: 8 passed, 2 skipped (live).

Frontend:
- ``page.tsx``: ``useCopilotAction("confirm_judge", ...)`` with a rich judge-card list; ``useLangGraphInterrupt`` removed.
- ``globals.css``: styles for ``judge-picker`` / ``judge-card`` / complete / empty states.
- ``Evaluate my traces`` quick-action button retained for direct user triggering and automation.

Backend server:
- ``samples/copilotkit/tests/browser/backend/server.py`` swaps ``build_agui_agent(...)`` for plain ``LangGraphAGUIAgent(...)`` -- no workaround needed on this code path.

README:
- Full rewrite around the new architecture. Version matrix unchanged. The two upstream ``ag-ui-langgraph`` bugs are preserved in the "informational" section for customers building their own ``interrupt()``-based graphs.

Per user direction: no backwards compatibility for the old sample (no customer has it). The workaround subclass is removed, not deprecated.
* test(copilotkit): add live-LLM e2e + OpenAI-compatible endpoint support

The previous commit's tests verified the new architecture against mocks; this one verifies it against a real LLM through the actual AG-UI FastAPI endpoint.

New test ``test_copilotkit_evaluator_live_llm``:
- Loads credentials from a gitignored ``.env`` (or real env vars in CI), with OpenRouter convenience: if only ``OPENROUTER_API_KEY`` is set, the loader auto-points ``OPENAI_BASE_URL`` at OpenRouter.
- Builds a FastAPI app with the patched Stratix client + the real evaluator graph (real LLM, no fake model).
- POSTs an AG-UI ``RunAgentInput`` whose ``tools`` array declares the ``confirm_judge`` frontend tool, exactly as the browser would.
- Asserts: tool sequence is ``list_recent_traces`` -> ``list_judges`` -> ``confirm_judge``; agent halts at ``confirm_judge`` (never calls ``run_trace_evaluation``); single ``RUN_STARTED`` + ``RUN_FINISHED`` with matching client ``runId``; no ``RUN_ERROR``.
- Marked ``@pytest.mark.live`` and ``pytest.skip``s when no key is available, so the default ``pytest`` run is unaffected.

Verified locally: passes against ``openrouter:openai/gpt-4o-mini``.

Other changes in this commit:
- ``evaluator_agent.py``:
  - ``_default_model()`` honours ``OPENAI_API_KEY``, ``OPENAI_BASE_URL``, and ``OPENAI_MODEL`` so any OpenAI-compatible endpoint works (OpenAI, Ollama, LM Studio, OpenRouter, vLLM, ...). For non-compatible providers, customers pass any LangChain ``BaseChatModel`` to ``build_graph(model=...)``.
  - ``create_agent`` now compiles with ``InMemorySaver``. ``ag-ui-langgraph``'s ``add_langgraph_fastapi_endpoint`` calls ``graph.aget_state(config)`` on every request, which fails with ``ValueError("No checkpointer set")`` if the graph wasn't compiled with one -- regardless of whether ``interrupt()`` is used.
  - ``build_agui_agent`` reintroduced as a *minimal* runId-only workaround for ``ag-ui-protocol/ag-ui#1582``.
Bug #1584 (duplicate RUN_STARTED) is unreachable on this code path because the evaluator never calls ``langgraph.types.interrupt()``, so we only need the runId fix. Live test confirms the workaround restores runId continuity end-to-end. - ``samples/copilotkit/tests/browser/backend/server.py``: switched back to ``build_agui_agent(...)`` so the runId workaround is active in the harness backend. The earlier "no workaround needed" claim was wrong; @ag-ui/client@0.0.52 doesn't enforce runId continuity but older clients did and future strict ones likely will. - ``tests/.env.example``: documents the supported env vars (OPENAI, OpenRouter convenience, LayerLens). Real ``tests/.env`` is gitignored. - ``samples/copilotkit/README.md``: documents the live-test setup and links the .env.example. Also documents the ``OPENAI_API_KEY``/``OPENAI_BASE_URL``/``OPENAI_MODEL`` env-var triplet for OpenAI-compatible providers (Ollama, LM Studio, OpenRouter). * fix(copilotkit harness): unblock React hydration on Next 16 dev server DevRel hit a "page renders but every button is dead, textarea won't accept input" failure mode while running the harness locally. Diagnosis took several iterations because there was no client-side error: - Backend was healthy; ``/healthz`` returned 200. - ``/api/copilotkit`` was up; an ``info`` JSON-RPC probe listed the evaluator agent. - Direct POSTs to the backend at :8123 streamed real LLM events. - The page HTML had every expected ``data-testid``. - Browser console showed only one repeating warning: ``WebSocket connection to 'ws://127.0.0.1:3000/_next/webpack-hmr' failed: Error during WebSocket handshake`` Root cause: Next 16 enforces a cross-origin allowlist for dev resources (including the webpack-hmr WebSocket). When the user serves on ``127.0.0.1`` but the allowlist is implicit ``localhost``, HMR fails to connect and Next leaves React in a half-hydrated state. 
The page renders from the server but client React never wires up event handlers or controlled-input state -- so buttons and textareas are visually present but inert. No error is surfaced beyond the WebSocket warning. Fix: - Add ``allowedDevOrigins: ["127.0.0.1", "localhost"]`` to ``samples/copilotkit/tests/browser/frontend/next.config.js``. Both origins are the supported way to load the harness; without this, whichever the user picks tends to break. Also, to make this kind of failure self-diagnosing rather than requiring DevTools-paste skills: - New ``samples/copilotkit/tests/browser/frontend/public/diag.html`` -- a static page (no React) that runs three probes on load and renders results inline: runtime ``info`` reachability, an ``agent/run`` round-trip through ``/api/copilotkit``, and a direct ``/healthz`` ping against the backend. Visit ``http://127.0.0.1:3000/diag.html`` to see green/red labels for each. This bypasses the React app entirely, so it stays useful even when hydration is broken. - New "Run diagnostic" button on the harness page (next to "Evaluate my traces") that runs the same probes plus a couple of React-only checks (textarea state, isLoading, intercepted ``appendMessage`` POST body) and renders the report directly on the page. Useful for users who can't (or don't want to) paste JS into DevTools console. Verified locally: after the cache + allowedDevOrigins fix, both buttons fire, ``appendMessage`` POSTs to ``/api/copilotkit`` and gets back a real ``RUN_STARTED`` SSE stream end-to-end. * fix(harness picker): resilient render while LLM streams tool args CopilotKit's ``renderAndWaitForResponse`` re-renders the action UI progressively as the LLM streams the tool-call JSON, so for the first render tick or two ``judge.id`` (and sometimes ``judge.name``) can be undefined even though the surrounding React state is stable. That tripped two issues in our judge picker: 1. 
``key={judge.id}`` warned "Each child in a list should have a unique key prop" when id was undefined. 2. The Select button was clickable with an undefined id, which would ``respond({ id: undefined, name: undefined })`` and break the resume. Fix: - Fall back to ``pending-{index}`` for the React key while id is pending. Quiet warning + stable row identity. - Mark each row "ready" only when both id and name are present and ``respond`` is non-null. Disable the Select button and show "Loading..." until ready. The button text and ``data-testid`` follow the ready state so automated tests don't grab a half-loaded row by accident. - Hide the dim id-pill (``judge-card-id``) while id is pending so the card doesn't flash an empty grey box. * feat(harness): per-tool render cards + Tailwind 4 + production EvaluationCard DevRel asked: "where is the tool indicator I should see?" CopilotChat only renders user/assistant text and frontend HITL widgets by default; backend tool calls fire invisibly. Surface them with the ``useCopilotAction`` + ``available: "remote"`` + ``render`` pattern -- the same pattern CopilotKit's ``tool_rendering_agent.py`` showcase uses. Changes: - All four backend tools (``list_recent_traces``, ``list_judges``, ``run_trace_evaluation``, ``get_evaluation_result``) now render inline cards with a pulsing-dot "Running" status pill, transitioning to a green "Done" pill when the tool resolves. Each card has a stable ``data-testid`` for automated tests. - ``get_evaluation_result`` (the final result) renders the polished ``EvaluationCard`` from ``samples/copilotkit/components/`` -- the production-grade SDK card with the score donut and pass-rate ring. Imported via a tsconfig path alias (``@layerlens/copilotkit-cards``) so the harness can reuse the upstream SDK components without copying or duplicating them. - ``confirm_judge`` HITL picker restyled with matching Tailwind tokens to keep the visual language consistent across all tool cards. 
- Tailwind 4 added (``@tailwindcss/postcss``, ``tailwindcss``) + ``postcss.config.mjs`` + ``@import "tailwindcss"`` in ``globals.css``. Inline custom CSS removed in favour of Tailwind utilities, matching CopilotKit's own showcase samples. - ``html className="dark"`` + ``color-scheme: dark`` so the SDK reference cards (which key off the ``.dark`` ancestor) render in dark mode by default. - ``<CopilotKit showDevConsole={false}>`` -- DevRel reported the default web-inspector "kite" obscured the harness header; suppressed for the sample. - ``tsconfig.json`` includes ``../../../components/**/*`` so Next's bundler picks up the SDK card sources, and adds the ``@layerlens/copilotkit-cards`` path alias. The pattern (frontend ``useCopilotAction`` for backend tools with ``available: "remote"``) is what customers should copy. The harness demonstrates it in two flavours: lightweight inline cards (for the first three tools) and full SDK-component composition (for the result). Both styles are valid; teams pick based on visual weight they want. * copilotkit sample: ship customer-facing app with shadcn-aligned cards Reshaped the CopilotKit sample so it reads as a commercial-grade SDK demo rather than a test fixture, and brought the visual language into line with CopilotKit's own samples (research-canvas, travel, banking, with-shadcn-ui). Structure - Move sample out of `samples/copilotkit/tests/browser/{backend,frontend}` to `samples/copilotkit/app/{backend,frontend}` so customers see "the app" rather than "a test harness". Update README + path constants. - Add `app/frontend/.gitignore` for `.next/`, `node_modules/`, and Playwright artefacts. Backend (`app/backend/server.py`, `agents/evaluator_agent.py`) - Real LayerLens only: missing `LAYERLENS_STRATIX_API_KEY` is a hard startup error. No fake-fixture path, no `MagicMock`, no env-var flag — fixtures only ever existed for an earlier Playwright fixture and conflicted with the SDK posture in CLAUDE.md. 
- Agent built with `create_agent` + `CopilotKitMiddleware`, real `@tool` impls returning `Command(update={...})` so each tool emits state into `state.{traces,judges,evaluations,results}`. Async tools call `copilotkit_emit_state` so the canvas updates live during a run.
- New `GET /evaluations/{id}` endpoint for out-of-band polling: the agent kicks off evaluations, ends in seconds, and the frontend folds completed verdicts into the canvas as each evaluation resolves on LayerLens. Fixes the 30s-evaluation-vs-LLM-polling-loop hallucination.
- `LangGraphAGUIAgent` constructor gets `config={"recursion_limit": 200}` so a 5-trace fan-out doesn't trip the default 25-hop limit (tested via `with_config` first; that path is dropped by ag-ui's internal config merge).
- System prompt rewritten: strict tool order; `confirm_judge` takes no args (frontend reads candidates from `state.judges` to avoid the `tool_argument_parse_failed: Unterminated string in JSON` we hit when streaming 38 judges through tool args); evaluations capped at 5 traces; pending != failed; final summary template branches on whether anything completed.

SDK card library (`samples/copilotkit/components/`)
- Rewritten on top of shadcn/ui primitives. Cards now compose `Card`, `CardHeader`, `CardContent`, `CardFooter`, `Badge`, `Button`, `Separator`, `Progress` from `@/components/ui/*`. Status pills use the `bg-{color}-50 text-{color}-600 dark:bg-{color}-900/20` pattern CopilotKit's banking sample uses, not custom ring/shadow chrome.
- Stock shadcn neutral OKLCH palette (`baseColor: neutral`). Brand accent `#6766FC` applied via Tailwind class strings on CTAs/links -- same approach research-canvas takes for its accent. No edits to `--primary` / shadcn theme variables.
- Score bars solid (`bg-green-500` / `bg-red-500` / `bg-amber-500`) not gradients. Sparklines color-coded by pass-rate threshold.
- `dashboardBaseUrl` is now strictly opt-in across `TraceCard` and `EvaluationCard`: the "Trace Explorer →" / "Agent Graph →" / "View in Dashboard →" footers only render when a real URL is configured via `NEXT_PUBLIC_LAYERLENS_DASHBOARD_URL`. Stops 404s on routes that aren't deployed yet.

Frontend (`app/frontend/`)
- shadcn primitives installed via `npx shadcn@latest add card button badge progress separator`. Deps: `radix-ui`, `class-variance-authority`, `clsx`, `tailwind-merge`, `tw-animate-css`. Tailwind 4 + React 19. `components.json` aliases `ui` to the SDK card library.
- New `globals.css` with shadcn neutral tokens (`--background`, `--card`, `--muted-foreground`, etc.), `@theme inline` mapping for Tailwind 4, and a `--copilot-kit-*` bridge so `<CopilotChat>` reads the same neutral tokens as the canvas. Brand accent set on `--copilot-kit-secondary-color`. Drops the previous "force dark" CSS.
- Layout split-pane, **light by default** to match every official CopilotKit sample. New `theme-toggle.tsx` segmented control (Light / System / Dark) persists to `localStorage` and reacts to OS-level theme changes when set to System.
- `useCoAgent({ name: "evaluator" })` reads live agent state. New out-of-band poller (`useEffect` against `/evaluations/{id}` every 5 s) folds verdicts that arrive after the agent run ends into the canvas. `state.results` (agent) and `polledResults` (frontend) are merged via `useMemo` so MetricStrip / EvaluationCard / JudgeVerdictCard all see one consistent results array.
- Picker: `JudgePicker` is its own component subscribed to `useCoAgent` so it re-renders when `state.judges` populates after the LLM streams out the tool call. `confirm_judge` uses `available: "remote"` + `renderAndWaitForResponse` per the canonical research-canvas HITL pattern.
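
The merge of `state.results` and `polledResults` mentioned in the Frontend notes above could look roughly like the sketch below. It is not the sample's actual code: the `ResultRecord` shape, the `evaluation_id` key, and the rule that a polled record replaces an agent record with the same id are all assumptions made for illustration.

```typescript
// Hypothetical record shape; the field names are assumptions, not the
// sample's real type.
interface ResultRecord {
  evaluation_id: string;
  passed: boolean;
  score: number;
  reasoning: string;
}

// Merge results from agent state with results fetched by the poller.
// Assumption: when both sides hold the same evaluation id, the polled
// record wins because it arrived later.
function mergeResults(
  agentResults: ResultRecord[],
  polledResults: ResultRecord[],
): ResultRecord[] {
  const byId = new Map<string, ResultRecord>();
  for (const r of agentResults) {
    byId.set(r.evaluation_id, r);
  }
  for (const r of polledResults) {
    // Overwrites the agent record but keeps its original position.
    byId.set(r.evaluation_id, r);
  }
  return Array.from(byId.values());
}
```

Inside a component, a pure function like this can be wrapped in `useMemo` with both arrays as dependencies, so every card reads the same merged array and it is only recomputed when one of the inputs changes.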
Cleanup
- Strip every dev artefact: agent's `[tool] X INVOKED` prints, the page's debug-state `<pre>`, the `console.log("[evaluator state]"…)` effect, the "Run diagnostic" button + panel + state, and the `probe_e2e.py` SSE diagnostic script. Header is now just the title, theme toggle, and the primary CTA.

* copilotkit sample: chat tokens, severity pills, trace status, markdown reasoning

Polish pass after first review:
- Chat token bridge fixed. Re-read CopilotKit's ``react-ui/colors.css`` semantics: ``primary-color`` is the user-bubble + interactive accent, ``secondary-color`` is the assistant message background, not a brand slot. Earlier mapping made the assistant greeting render as solid indigo and clip out of view in light mode. Now mapped onto shadcn tokens semantically: ``primary → --primary``, ``contrast → --primary-foreground``, ``secondary → --card``, ``secondary-contrast → --card-foreground``. Brand accent ``#6766FC`` stays only on actual CTA buttons via Tailwind class strings.
- ``JudgePicker`` "selected" pill now uses light + dark variants (``bg-green-50 text-green-700 dark:bg-green-900/20 dark:text-green-300``) instead of dark-mode-only emerald that disappeared on a light page.
- ``JudgeVerdictCard`` redesign:
  * Pass / Fail / Error are now solid-filled badges (``bg-green-600``, ``bg-red-600``, ``bg-amber-600`` with white text), readable at a glance instead of subtle ghost pills.
  * Severity rendered as a colored pill with a triangle alert glyph, not a dot. Severity is a status (impact-of-failure level), not a trend, so an "alert" shape is correct; chevrons would imply direction. Hide the severity chip when verdict=pass AND severity=low -- nothing meaningful to flag.
  * Reasoning rendered through a tiny inline ``MarkdownLite`` that handles paragraph breaks, line breaks, ``**bold**``, and ``*italic*`` -- the cases LayerLens API actually emits.
    No ``react-markdown`` dep (the SDK card library lives outside the Next app's node_modules so it can't resolve packages there); no raw HTML injection. Fixes the wall-of-text rendering of judge reasoning.
- Tailwind 4 ``@source`` directive added to ``globals.css`` so it scans ``samples/copilotkit/components/**/*.{ts,tsx}``. Without this, classes used inside the SDK card library (``bg-amber-500``, ``bg-green-600``, etc.) get tree-shaken out of the generated CSS and pills silently flatten to plain text.
- ``TraceCardProps.status`` made optional. The LayerLens ``traces.get_many`` API doesn't expose per-trace lifecycle today, so the sample no longer hardcodes ``status="ok"`` -- that was rendering a misleading green pill on every trace regardless of reality. The status pill is hidden when the prop is omitted; restore it once the API surfaces real status.

* copilotkit sample: render pending verdicts as skeleton cards

When the agent kicks off N evaluations and only K complete on the first poll, the remaining (N - K) used to disappear from the ``Verdicts`` grid even though the run-summary card still counted them -- verdict count would say "5", grid would show 4, and the trailing pending one looked like it had been lost.

Add ``PendingVerdictCard``: same shadcn ``Card`` chrome as ``JudgeVerdictCard``, with a "Running" pill, a pulsing skeleton bar for the score, and copy explaining real LayerLens evaluations can take a minute or two. Render one per evaluation that doesn't have a matching entry in ``state.results`` yet.

Side effects:
- ``Verdicts`` section count now reflects total evaluations (not just completed) so the grid count matches what's actually rendered.
- Section now renders even when ``results.length === 0`` as long as there are evaluations in flight (previously fell through to a textual placeholder).
- Run summary picks the judge name from the first pending evaluation if no result has come back yet.
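
The "one pending card per evaluation without a matching result" rule above can be sketched as a small selector. This is an illustration only: the `evaluation_id` and `judge_name` field names are hypothetical, and the real component may derive the list differently.

```typescript
// Hypothetical shapes; only the id is needed to match the two lists.
interface EvaluationRef {
  evaluation_id: string;
  judge_name: string;
}

interface FinishedRef {
  evaluation_id: string;
}

// Evaluations that were started but have no result yet. Each of these
// would render as a pending skeleton card.
function pendingEvaluations(
  evaluations: EvaluationRef[],
  results: FinishedRef[],
): EvaluationRef[] {
  const finished = new Set(results.map((r) => r.evaluation_id));
  return evaluations.filter((e) => !finished.has(e.evaluation_id));
}

// The section should render whenever anything is finished or in flight,
// and its count is the total number of evaluations.
function verdictSectionCount(evaluations: EvaluationRef[]): number {
  return evaluations.length;
}
```

Counting from the evaluations list rather than from the results list is what keeps the header count and the number of rendered cards (finished plus pending) in agreement.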
The polling loop is unchanged -- it keeps polling ``/evaluations/{id}`` every 5 s and replaces a pending card with the real ``JudgeVerdictCard`` the moment LayerLens returns a verdict.

* copilotkit sample: extract MarkdownLite, render judge goals

The judge ``evaluation_goal`` field LayerLens returns is markdown-formatted (paragraph breaks, ``**bold**`` headers, numbered lists). Both the in-chat picker and the canvas's "Available judges" card were rendering it through plain ``<p>{text}</p>`` so each judge collapsed into one indented wall of text -- same problem the verdict card's reasoning had before.

Pull the inline markdown renderer that previously lived inside ``JudgeVerdictCard.tsx`` into its own ``markdown-lite.tsx`` module, re-export it from the SDK card library's ``index.ts``, and use it in:
- JudgeVerdictCard reasoning (already)
- JudgePicker goal description (chat-side)
- JudgesCard goal description (canvas-side)

Output is the same as before for the verdict card; the picker and the canvas judges card now show structured goal text. Still no ``react-markdown`` dependency -- the SDK card library has to stay resolvable without the Next.js app's node_modules in scope, so we keep the small built-in renderer instead.

* copilotkit sample: rewrite README to match the shipped architecture

The README still described the previous incarnation of the sample -- the create_agent + frontend HITL design from before the canvas / out-of-band-polling rewrite. Rewrite top-to-bottom to reflect what actually ships:
- New layout section showing ``samples/copilotkit/{agents,components,app}`` with the SDK card library and the customer-facing app side-by-side.
- Architecture diagram updated for the canvas + chat split-pane, ``useCoAgent`` driving state-driven cards, and the ``GET /evaluations/{id}`` polling endpoint that the frontend hits every 5s for in-flight verdicts.
- Step-by-step "How the demo flows" walkthrough so a customer can read the README and predict what each click will do.
- "Why this pattern" updated to highlight the canvas + frontend polling + ``copilotkit_emit_state`` triad. Old text framed the choice as ``create_agent`` vs ``interrupt()``; new text frames it as the research-canvas pattern.
- Tools section updated for the async + ``Command(update={...})`` return shape and the no-arg ``confirm_judge`` (frontend reads candidates from ``state.judges``).
- Frontend section adds: shadcn/ui foundation, ``components.json``, light-default theme + ``ThemeToggle``, ``--copilot-kit-*`` token bridge, brand accent ``#6766FC``, the SDK card matrix (5 cards + ``MarkdownLite``).
- Backend section adds: ``recursion_limit: 200`` config, the ``GET /evaluations/{id}`` polling handler, and the "no fake fixture" guardrail.

Drive-by: ``ruff format`` brought ``evaluator_agent.py`` and ``server.py`` in line with the project's ruff style. (The repo's ``[tool.ruff]`` ``exclude = ["samples"]`` would skip these on discovery, but reformatting locally keeps them tidy and avoids contributors re-doing it.)

* ci: replace walrus + reformat test_samples_e2e for ruff py37 target

Fixes both red CI checks on PR #92:
- ``Check Lint`` was failing because tests/test_samples_e2e.py used the walrus operator (``:=``) at line 1446 and ruff's ``[tool.ruff].target-version`` is pinned to ``py37``. Replace with a regular assignment + boolean check -- same semantics, py37 compatible. The package's runtime support (``Python >=3.10,<3.13``) doesn't dictate ruff's syntax target; bumping the ruff target is out of scope for this PR.
- ``Check Format`` was failing because the same file had pre-existing multi-line wrapping that ruff's auto-format collapses to single lines under the 120-char limit. Apply ``ruff format``.
- ``ruff check --fix`` also normalised one import block (I001).

CI's ``test (3.9..3.12)`` jobs cancelled out after the lint pre-step failed -- they should now actually run.
* copilotkit sample: drop OpenRouter-specific references

Per existing repo policy: the SDK sample and tests should not name a specific OpenAI-compatible provider. Configuring OpenRouter (or any other gateway) is the user's job in their own .env -- the docs and test code stay vendor-neutral.

Removes:
- OpenRouter row from ``_default_model``'s docstring table.
- OpenRouter mention in ``build_graph``'s docstring.
- ``OpenRouter, vLLM`` aside in the CLI ``main()`` print block.
- OpenRouter URL in ``samples/copilotkit/README.md`` env-var example. Replaced with a placeholder ``your-openai-compatible-host``.
- ``OPENROUTER_API_KEY`` auto-mapping in ``test_copilotkit_evaluator_live_llm`` (the test now expects ``OPENAI_API_KEY`` and lets the user set ``OPENAI_BASE_URL`` / ``OPENAI_MODEL`` themselves if pointing at a non-OpenAI endpoint).
- Skip-message reference to ``OPENROUTER_API_KEY``.

The sample still works against any OpenAI-compatible endpoint -- the generic env vars (``OPENAI_API_KEY`` / ``OPENAI_BASE_URL`` / ``OPENAI_MODEL``) carry the configuration. The user's own gitignored ``.env`` is where provider-specific URLs (OpenRouter, Ollama, LM Studio, …) live.

* ci: wrap server.py uvicorn boot in main() + extend test mocks

Three test failures from the previous CI run, all addressed here:

1. ``tests/test_samples.py::test_sample_has_main[copilotkit/app/backend/server.py]`` expects every sample's entry-point file to expose a ``main()`` function. ``server.py`` had a bare ``if __name__ == "__main__":`` block instead. Lift the uvicorn.run call into a ``main()`` and call it from the ``if __name__`` guard.
2. ``test_copilotkit_agent_import[evaluator_agent]`` and
3. ``test_copilotkit_without_langchain[evaluator_agent]`` both stub the heavy deps via ``patch.dict("sys.modules", ...)`` so the agent module imports cleanly without langchain / copilotkit installed.
   The mock dict was missing the new submodules the agent now imports (``langgraph.prebuilt``, ``langchain.agents.middleware``, ``langchain_core.runnables``, ``langchain_core.tools.base``). Add them to both mock dicts.

Locally ``ruff check`` and ``ruff format --check`` are clean on all touched files.

* copilotkit sample: stop polling on any terminal eval state, not just success

Bug repro: same evaluation reliably stayed "Running" across multiple demo runs. Root cause was the polling filter on the frontend:

    const completed = updates.filter(
      (u) => u.status === "success" && typeof u.score === "number",
    );

This rejected any LayerLens response that wasn't a clean success with a numeric score -- including ``status: "failure"``, ``status: "error"``, ``status: "cancelled"``, and the ``status: "success"`` case where ``trace_evaluations.get_results`` returned ``score: null`` (which some judges legitimately do). The poller would then keep firing every 5s forever and the verdict card would sit in "Running" indefinitely.

Two-sided fix:

Backend (``GET /evaluations/{id}``):
- New ``done: bool`` field -- true for any of ``success | failure | error | cancelled | not_found``, false while the evaluation is still ``in_progress`` / ``pending`` / ``queued``.
- Always include ``passed`` / ``score`` / ``reasoning`` once ``done: true``, even for terminal failures and ``success``-without-score: defaults are ``passed: false``, ``score: 0.0``, and a ``reasoning`` string explaining the terminal state.
- ``try/except`` around ``trace_evaluations.get`` so a malformed / unauthorized id surfaces as ``status: "error", done: true`` instead of a 500 that the frontend retries forever.

Frontend (``page.tsx``):
- Polling filter is now ``u.done === true`` instead of ``status === "success" && typeof score === "number"``.
- ``ResultRecord`` type gains an optional ``done?: boolean`` field (the agent's own ``state.results`` entries don't carry it; only the ``/evaluations/{id}`` polling responses do).
Verified against a real eval id (clean success path → ``done: true``, score returned) and a deadbeef id (error path → ``done: true``, ``status: "error"``, no 500). The 5th-eval-stuck symptom is from the non-success terminal cases -- frontend now folds them into the canvas as a verdict card with the appropriate fail/error styling instead of spinning forever.

* Fix .gitignore

---------

Co-authored-by: m-peko <marinpeko5@gmail.com>
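
For readers following the polling fix in the last sample commit above, the terminal-state filter can be sketched as below. The `done`, `status`, `passed`, `score`, and `reasoning` fields come from the commit message; the `PollUpdate` type name, the `evaluation_id` field, and both helper functions are hypothetical and not taken from ``page.tsx``.

```typescript
// Hypothetical shape of one /evaluations/{id} polling response.
interface PollUpdate {
  evaluation_id: string;
  status: string;
  done: boolean;
  passed?: boolean;
  score?: number | null;
  reasoning?: string;
}

// Any terminal state counts as completed, not only a clean success
// with a numeric score.
function completedUpdates(updates: PollUpdate[]): PollUpdate[] {
  return updates.filter((u) => u.done === true);
}

// Ids that still need another polling round. When this list is empty
// the poller has nothing left to ask for.
function idsStillRunning(updates: PollUpdate[]): string[] {
  return updates.filter((u) => !u.done).map((u) => u.evaluation_id);
}
```

Keying the filter on `done` leaves the decision about which statuses are terminal to the backend, so the frontend does not need to track the status vocabulary itself.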