Release by m-peko · Pull Request #140 · LayerLens/stratix-python

m-peko · 2026-04-27T10:38:05Z

No description provided.

…rupt() works over AG-UI (#92)

* fix(copilotkit): compile evaluator_graph with a checkpointer for interrupt()

The evaluator agent calls ``interrupt()`` in ``confirm_judge_node`` for human-in-the-loop judge confirmation. A checkpointer is mandatory for that to work -- without one, a ``Command(resume=...)`` call produces zero events, ``ag-ui-langgraph`` never emits ``RUN_FINISHED``, and the CopilotKit frontend blocks all subsequent messages with "Cannot send 'RUN_STARTED' while a run is still active".

Changes:

- ``evaluator_agent.py``: compile with ``InMemorySaver`` and a commented Postgres swap block. Convert ``JudgeInfo`` / ``TraceInfo`` / ``EvaluationInfo`` from ``@dataclass`` to ``pydantic.BaseModel`` so LangGraph's default ``JsonPlusSerializer`` can persist state across the pause boundary (dataclasses raise ``TypeError: Type is not msgpack serializable``).
- ``samples/copilotkit/README.md``: add full FastAPI backend wiring with ``add_langgraph_fastapi_endpoint``, Next.js frontend wiring with ``LangGraphHttpAgent``, a checkpointer options matrix (InMemory / SQLite / Postgres / Redis / LangGraph Platform) with per-option migration snippets, a version-compatibility table pinning the versions the bug reporter used, and a troubleshooting section mapping the observed frontend errors back to the backend cause.
- ``docs/samples-guide.md``: cross-reference the checkpointer requirement.
- ``tests/test_samples_e2e.py``: add ``test_copilotkit_evaluator_interrupt_resume`` that imports the real ``langgraph`` (not ``MagicMock``), asserts the compiled graph has a non-None checkpointer, and drives a full ``astream -> interrupt -> Command(resume=...) -> astream`` cycle with a patched Stratix client. Confirmed this test fails on the pre-fix code and passes on the fix. Also extended the existing mock-modules dicts so the import-smoke tests include ``langgraph.checkpoint.memory``.
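A minimal sketch of why the new test imports the real ``langgraph`` rather than a ``MagicMock``: a mocked module answers every attribute lookup with another mock, so a "non-None checkpointer" assertion can never fail against it. The names ``compile_sample_graph``, ``StubStateGraph``, and ``StubCompiledGraph`` are hypothetical stand-ins written for this illustration; they are not part of the sample or of LangGraph.

```python
from unittest.mock import MagicMock


def compile_sample_graph(state_graph_cls, checkpointer=None):
    """Build and compile a one-node graph with whatever graph class is supplied."""
    builder = state_graph_cls(dict)
    builder.add_node("confirm_judge", lambda state: state)
    return builder.compile(checkpointer=checkpointer)


class StubCompiledGraph:
    """Hypothetical compiled graph that simply stores the checkpointer it was given."""

    def __init__(self, checkpointer):
        self.checkpointer = checkpointer


class StubStateGraph:
    """Hypothetical stand-in for a graph builder; records what compile() receives."""

    def __init__(self, state_schema):
        self.nodes = {}

    def add_node(self, name, fn):
        self.nodes[name] = fn

    def compile(self, checkpointer=None):
        return StubCompiledGraph(checkpointer)


# With a MagicMock in place of the graph class, every call and attribute
# access yields another mock, so the omitted checkpointer is invisible.
mocked_graph = compile_sample_graph(MagicMock())
mock_hides_omission = mocked_graph.checkpointer is not None

# With an object that really stores its arguments, the omission is observable.
stub_graph = compile_sample_graph(StubStateGraph)
stub_exposes_omission = stub_graph.checkpointer is None
```

Both flags evaluate to ``True``: the mock-based build looks healthy even though no checkpointer was passed, which is the blind spot a test against the real library closes.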
The existing tests missed this because they mock ``langgraph``, ``langgraph.graph``, and ``langgraph.types`` with ``MagicMock()`` and then only call ``main()`` (which prints usage). They never build or execute the graph, so they cannot observe the missing checkpointer. * fix(copilotkit): register DTOs with serde, add AG-UI wire e2e test Follow-ups to the interrupt/checkpointer fix, addressing the open items flagged in the prior commit: 1. Deserialize warning resolved. The Pydantic DTOs (``JudgeInfo`` / ``TraceInfo`` / ``EvaluationInfo``) are now registered on ``JsonPlusSerializer(allowed_msgpack_modules=...)`` via a custom serde passed to ``InMemorySaver``. Verified the sample passes with ``LANGGRAPH_STRICT_MSGPACK=true``, so it survives LangGraph's planned tightening of checkpoint deserialization. 2. End-to-end AG-UI wire validation. New integration test ``test_copilotkit_evaluator_agui_wire`` wires the evaluator graph into a FastAPI app through ``ag_ui_langgraph.add_langgraph_fastapi_endpoint``, drives the full user flow in-process via ``httpx.ASGITransport``, and asserts: - Phase 1 (initial run, hits ``interrupt()``): emits RUN_STARTED and RUN_FINISHED on the SSE stream. - Phase 2 (resume with user confirmation, same ``threadId``): emits RUN_STARTED and RUN_FINISHED. - Phase 3 (follow-up message after resume): not blocked -- RUN_STARTED and RUN_FINISHED fire again. This is the exact symptom the reporter hit, tested through the exact protocol path. Gated on ``pytest.importorskip`` for the heavy deps so the test skips cleanly when they are absent. Side benefit: running the same scenario against pre-fix code produces ``ValueError: No checkpointer set`` directly from ``graph.aget_state()``, giving operators a much louder error than the silent "stream ends without RUN_FINISHED" path. 3. README backend-wiring snippet corrected. 
The actual ``add_langgraph_fastapi_endpoint`` signature takes an ``agent=LangGraphAgent(...)`` wrapper, not a bare ``graph=`` kwarg -- the example in the previous commit would have failed at import. Also expanded the SSE-protocol explanation to match what the new e2e test observes on the wire. 4. Investigator graph annotated. ``investigator_graph`` does not call ``interrupt()`` so it does not need a checkpointer, but without an explicit note future contributors adding a HITL step would silently regress. Added a short comment at the ``.compile()`` call pointing at the evaluator pattern. * fix(copilotkit): resolve error/field collision + browser e2e harness Follow-ups addressing the remaining open items from the previous two commits: 1. Rename ``error`` node to ``handle_error`` (evaluator + investigator). The old name collided with the ``error`` field on the state dataclass. LangGraph 1.x accepts the collision; earlier versions reject it with "'error' is already being used as a state key". Renaming the node (and the conditional-edge routing targets) keeps the routing token ``"error"`` purely an edge key and sidesteps the conflict on any LangGraph version the sample may be copied into. 2. Guard the ``allowed_msgpack_modules`` kwarg behind try/except so the sample still imports cleanly on langgraph<1.0 (where the kwarg does not exist and the strict-msgpack warning is not emitted either). Verified the sample now imports on both langgraph 0.2.56 and 1.1.9. 3. Ruff-clean the changed files (import sort I001 fixes on the new test additions; unrelated warnings in pre-existing ``main()`` / ``error_node`` code are out of scope per "only change what was asked"). 4. New ``samples/copilotkit/tests/browser/`` harness: - ``backend/server.py`` -- FastAPI app that patches ``layerlens.Stratix`` before importing the evaluator module and mounts ``evaluator_graph`` via ``add_langgraph_fastapi_endpoint(..., path="/evaluator")``. 
- ``frontend/`` -- Next.js 16.2.4 app pinned to the reporter's exact CopilotKit versions (``@copilotkit/react-core``, ``@copilotkit/react-ui``, ``@copilotkit/runtime`` all at 1.56.3), with the CopilotKit runtime wired to ``LangGraphHttpAgent`` against the FastAPI backend. - ``frontend/tests/interrupt-resume.spec.ts`` -- Playwright spec that drives CopilotChat through the three-turn scenario the reporter hit ("evaluate" -> "ok" -> "thanks") and asserts the exact string "Cannot send 'RUN_STARTED' while a run is still active" appears in neither the visible DOM nor the browser console. Known limitation documented in the harness README: CopilotChat 1.56's textarea reports as aria-hidden / non-"visible" under Playwright strict actionability checks in **headless** Chromium, and multiple input-driving patterns (``fill``, ``keyboard.type + Enter``, ``pressSequentially``, DOM-setter + bubbled input event) failed to reliably enable the Send button headlessly. The harness works with ``--headed`` for human verification and is structurally complete. The authoritative regression coverage for the fix is the Python test suite (``test_copilotkit_evaluator_interrupt_resume`` and ``test_copilotkit_evaluator_agui_wire``); the browser harness is corroborating / demo value, not gate-keeping. * fix(copilotkit): mirror CopilotKit's sample + workaround for ag-ui-langgraph runId bug DevRel surfaced that the backend fix in the previous commits got the Python side working but the frontend still locked up with "Cannot send 'RUN_STARTED' while a run is still active. ... INCOMPLETE_STREAM" on the second message. Raw SSE capture confirmed the root cause: RUN_STARTED runId = "r1_aca59ad1" (client-supplied) RUN_FINISHED runId = "019dc049-14ba-..." 
(LangGraph's internal chain UUID) This is an upstream bug in ag-ui-langgraph (ag-ui-protocol/ag-ui#1582): ``_handle_stream_events`` overwrites ``self.active_run['id']`` with every LangGraph event's internal ``run_id``, so RUN_FINISHED emits LangGraph's UUID instead of the client-supplied ``input.run_id``. ``@copilotkit/runtime`` tracks active runs by client runId and raises RUN_ERROR/INCOMPLETE_STREAM. Verified the bug is present in both ag-ui-langgraph 0.0.22 (CopilotKit's officially-pinned version) and 0.0.34 (the reporter's version), and also in ``copilotkit.LangGraphAGUIAgent`` which inherits the broken method. Changes in this commit, all aligned with CopilotKit's own ``examples/integrations/langgraph-fastapi`` reference sample: 1. ``evaluator_agent.py``: - State class converted from ``@dataclass`` to a ``TypedDict`` inheriting from ``copilotkit.CopilotKitState``. This gives us ``MessagesState``'s ``add_messages`` reducer for free (nodes return NEW messages; they are appended, not replaced) and the ``copilotkit`` field the frontend injects. - All node functions updated from ``state.X`` / ``state.messages + [m]`` to ``state.get('X')`` / ``{'messages': [m]}``. - HITL interrupt now uses ``copilotkit.langgraph.copilotkit_interrupt`` (wraps ``interrupt()`` with ``__copilotkit_messages__`` so the prompt renders as a real AIMessage in the chat UI). The bare ``langgraph.types.interrupt(prompt)`` emitted a CUSTOM event the UI ignored -- why the reporter said "the agent stops and never reaches the human-in-the-loop confirmation step." - New ``RunIdPreservingAgent`` subclass (lazy factory ``_build_langgraph_agui_agent``) overrides ``_dispatch_event`` to restore ``input.run_id`` on RUN_FINISHED / RUN_ERROR terminal events. Clearly commented with a "remove when upstream ships" TODO pointing at the ag-ui-protocol/ag-ui issue. 2. 
``samples/copilotkit/README.md``: - Version matrix re-pinned to CopilotKit's exact tested set (``copilotkit==0.1.74``, ``langchain==1.0.1``, ``langgraph==1.0.1``, ``ag-ui-langgraph==0.0.22``, ``@copilotkit/*==1.56.3``, Python ``>=3.10,<3.13``). - Upstream-bug callout explaining the runId workaround. - Backend wiring snippet updated to show the factory import and the ``LangGraphAGUIAgent`` path for non-interrupt graphs (investigator). 3. ``tests/test_samples_e2e.py``: - ``test_copilotkit_evaluator_interrupt_resume`` now sends ``Command(resume=[HumanMessage(content='ok')])`` rather than ``Command(resume='ok')``, matching ``copilotkit_interrupt``'s expected resume payload shape. - ``test_copilotkit_evaluator_agui_wire`` rewritten. The previous version had a blind spot: it only asserted RUN_FINISHED was PRESENT, not that ``RUN_STARTED.runId == RUN_FINISHED.runId == input.run_id``. Now it uses the ``RunIdPreservingAgent`` factory and asserts runId continuity end-to-end. Without the workaround this test would catch the upstream bug immediately. - Mock-module dict extended with ``copilotkit.langgraph`` for the import-smoke test. 4. ``samples/copilotkit/tests/browser/``: - ``backend/requirements.txt`` re-pinned to CopilotKit's set. - ``backend/server.py`` switched from raw ``ag_ui_langgraph.LangGraphAgent`` to the sample's factory so the browser harness also benefits from the runId workaround. All 9 copilotkit tests pass in the pinned venv. Empirical verification scripts in /tmp/ (not committed) show raw SSE with matching runIds end-to-end. * fix(copilotkit): also suppress duplicate RUN_STARTED on re-entry interrupt path DevRel's diagnostic bundle (ag-ui-langgraph==0.0.34, copilotkit==0.1.87, @ag-ui/client==0.0.52 transitively) confirmed commit 542002b did not fix the browser symptom. Raw SSE from the Network tab showed: RUN_STARTED runId=d0b9d6c5-... 
[graph reaches step="confirm_judge" -- interrupt IS being hit]
RUN_ERROR {code: "INCOMPLETE_STREAM", message: "Cannot send 'RUN_STARTED' while a run is still active. The previous run must be finished with 'RUN_FINISHED' before starting a new run."}

Same error text as before, different root cause. A second bug in ag-ui-langgraph: when a request arrives on a thread whose graph is already paused at ``interrupt()`` and the request does NOT carry ``forwardedProps.command.resume``, the ``has_active_interrupts`` branch of ``prepare_stream`` (agent.py:491) emits a second ``RunStartedEvent`` to ``events_to_dispatch`` -- after ``_handle_stream_events`` (line 209) already emitted one at the top of the stream. The server's own AG-UI encoder validator catches the duplicate and converts it into a ``RUN_ERROR`` with the exact "Cannot send 'RUN_STARTED'..." message, terminating the stream before ``RUN_FINISHED`` can be dispatched. On ``@ag-ui/client@0.0.52`` (the newer protocol-state validator, which enforces within-stream start/finish invariants rather than the runId correlation the previous version used) this is what lands as INCOMPLETE_STREAM in the browser.

Extended the sample's workaround subclass to filter at the agent boundary rather than override ``_dispatch_event`` (which expects to return an Event, not None/""). The filter:

1. Drops any RUN_STARTED after the first within a single stream -- fixes the duplicate-emission bug on the ``has_active_interrupts`` path.
2. Restamps ``input.run_id`` on RUN_FINISHED / RUN_ERROR -- preserves the existing ag-ui-protocol/ag-ui#1582 fix for older clients that correlate by runId.

Verified on both pin matrices:

- copilotkit==0.1.74 / ag-ui-langgraph==0.0.22 (CopilotKit's own reference sample pins): all tests pass.
- copilotkit==0.1.87 / ag-ui-langgraph==0.0.34 (DevRel / reporter): all tests pass.
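The two filter rules above can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not the sample's subclass: events are modelled as plain dicts with ``type`` and ``runId`` keys rather than AG-UI event objects, and ``filter_run_events`` is a hypothetical name.

```python
from typing import Iterable, Iterator

# Terminal event types whose runId is restamped with the client-supplied value.
TERMINAL_TYPES = {"RUN_FINISHED", "RUN_ERROR"}


def filter_run_events(events: Iterable[dict], client_run_id: str) -> Iterator[dict]:
    """Drop duplicate RUN_STARTED events and restamp runId on terminal events."""
    seen_run_started = False
    for event in events:
        event_type = event.get("type")
        if event_type == "RUN_STARTED":
            if seen_run_started:
                continue  # rule 1: only the first RUN_STARTED per stream survives
            seen_run_started = True
        elif event_type in TERMINAL_TYPES:
            # rule 2: copy the event so the caller's original is not mutated
            event = {**event, "runId": client_run_id}
        yield event


# A stream shaped like the failure described above: a duplicate RUN_STARTED
# and a RUN_FINISHED carrying an internal id instead of the client's.
raw_stream = [
    {"type": "RUN_STARTED", "runId": "client-run-1"},
    {"type": "STEP_STARTED", "stepName": "confirm_judge"},
    {"type": "RUN_STARTED", "runId": "client-run-1"},
    {"type": "RUN_FINISHED", "runId": "internal-uuid"},
]
filtered = list(filter_run_events(raw_stream, "client-run-1"))
```

After filtering, the stream holds exactly one ``RUN_STARTED`` and its ``RUN_FINISHED`` carries ``client-run-1``, which are the same invariants the wire test checks.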
Tightened ``test_copilotkit_evaluator_agui_wire`` accordingly: - asserts exactly one RUN_STARTED per stream (catches bug b) - asserts no RUN_ERROR - asserts RUN_STARTED.runId == RUN_FINISHED.runId == input.run_id (catches bug a, ag-ui-protocol/ag-ui#1582) Without either half of the workaround the test fails with a precise message pointing at which bug regressed. Follow-up: file the duplicate-RUN_STARTED bug upstream as a separate issue on ag-ui-protocol/ag-ui. * fix(copilotkit): re-pin to latest published versions (DevRel env) and ship lockfile Replaces the earlier pinning to CopilotKit's reference-sample versions (copilotkit==0.1.74 / ag-ui-langgraph==0.0.22) with the current published set customers actually install: copilotkit==0.1.87 langchain==1.2.15 langchain-core==1.3.0 langgraph==1.1.9 ag-ui-langgraph==0.0.34 Frontend transitive ``@ag-ui/client==0.0.52`` now matches what ``@copilotkit/react-core==1.56.3`` actually pulls in (DevRel's environment per their diagnostic bundle). Changes: - ``samples/copilotkit/tests/browser/backend/requirements.txt`` -- pins updated to the latest set above. - ``samples/copilotkit/tests/browser/backend/requirements.lock`` -- NEW, committed pip-freeze of the verified environment. ``pip install -r requirements.lock`` now gives byte-identical transitive deps. - ``samples/copilotkit/README.md`` -- version matrix and install snippets updated to the latest set; upstream-bug callout now lists both issues (``ag-ui-protocol/ag-ui#1582`` runId overwrite, ``ag-ui-protocol/ag-ui#1584`` duplicate RUN_STARTED). - ``samples/copilotkit/agents/evaluator_agent.py`` -- renamed the factory from ``_build_langgraph_agui_agent`` to the public ``build_agui_agent``; added a ``_version_guard_ag_ui_langgraph`` helper that emits a ``RuntimeWarning`` when the installed version is outside the tested range ``[0.0.22, 0.0.34]`` so silent behavior drift does not hide a regression. 
A backwards-compatible alias keeps the old private name importable for internal tests during the rename window. - ``samples/copilotkit/tests/browser/backend/server.py`` and ``tests/test_samples_e2e.py`` -- call sites updated to the public name. Verified end-to-end against the latest version matrix: - pytest -k copilotkit: 9 passed, 2 skipped (live-only). - Manual HTTP drive against a running backend with the reporter's exact flow (turn 1 initial -> interrupt, turn 2 re-entry on paused graph): both turns emit exactly one RUN_STARTED and one RUN_FINISHED, both with matching client runIds, no RUN_ERROR / INCOMPLETE_STREAM. * fix(copilotkit): wire useLangGraphInterrupt on frontend; drop backend resume heuristic DevRel confirmed the Apr-24 push resolved the turn-1 INCOMPLETE_STREAM (backend now emits a clean RUN_STARTED -> STEP_* -> RUN_FINISHED for the initial interrupt turn). Remaining gap: when the user replies to the interrupt, plain ``<CopilotChat>`` sends the reply as an ordinary new chat message, not as ``forwardedProps.command.resume`` -- so the graph stayed paused and the same error returned on the follow-up. Correct fix is on the frontend, not the backend: ``@copilotkit/react-core@1.56.3`` ships ``useLangGraphInterrupt``, the hook specifically designed for this case. It renders a UI when the graph pauses at ``interrupt()`` and calls ``resolve(...)`` with the user's answer -- which the runtime forwards as the proper ``command.resume`` payload. This is the supported AG-UI protocol path: the frontend must explicitly signal a resume rather than a new turn. Changes: - ``samples/copilotkit/tests/browser/frontend/app/page.tsx``: wires ``useLangGraphInterrupt`` with a dedicated prompt widget (``data-testid`` stable for automation), and a "Start evaluation" test-hook button that uses ``useCopilotChat().appendMessage`` to kick off the graph without having to type into CopilotChat's textarea (which Playwright can't reliably drive on 1.56.3 + React 19). 
The ``resolve([{role:"user", content}])`` shape matches what ``copilotkit_interrupt`` expects server-side (``answer = response[-1].content``). - ``samples/copilotkit/tests/browser/frontend/app/globals.css``: styles for the interrupt widget and the test-hook start button. - ``samples/copilotkit/agents/evaluator_agent.py``: reverts the backend auto-resume heuristic I had shipped as a stopgap. It was overloading the protocol semantics ("any user message during active interrupt == resume answer") which is incorrect for anything beyond a simple sample -- breaks cancel flows and multi-interrupt scenarios. The backend now only does the two genuine protocol-bug workarounds (runId overwrite, duplicate RUN_STARTED). Resume belongs to the frontend. Test plan: - Python test suite: ``pytest -k copilotkit`` -- 9 passed / 2 skipped (live) on DevRel's exact version matrix. - Backend HTTP round-trip with a programmatic ``command.resume`` payload: both turns emit matched ``RUN_STARTED``/``RUN_FINISHED`` with client runId, no ``RUN_ERROR`` (verified on 2026-04-24). - Browser end-to-end: the hook wiring in page.tsx matches CopilotKit's own showcase pattern and the hook source I inspected. I could not self-verify the full browser round-trip because (a) Playwright cannot reliably drive CopilotChat's textarea on 1.56.3 + React 19 (tracked at CopilotKit/CopilotKit#4215), and (b) my attempted programmatic appendMessage test-hook did not trigger a runtime POST in my local venv for reasons I have not yet pinned down. **DevRel re-test in a real browser is the authoritative check for the frontend round-trip.** Follow-up (per "#2" in the user's plan): rewrite the evaluator HITL to use CopilotKit's current idiom (``useCopilotAction`` / ``useHumanInTheLoop`` -- frontend-defined tool + UI render + resolve) instead of backend ``interrupt()``. 
That's the pattern CopilotKit's active samples use; it avoids the ag-ui-langgraph interrupt path bugs entirely and is where customers should be pointed for new work. * feat(copilotkit): rearchitect evaluator as create_agent + frontend HITL tool Replaces the custom StateGraph + ``langgraph.types.interrupt()`` pattern with CopilotKit's current HITL idiom: ``langchain.agents.create_agent`` driving an LLM that calls backend tools, with the human-in-the-loop step wired as a **frontend** tool via ``useCopilotAction`` + ``renderAndWaitForResponse``. This matches what CopilotKit's active showcases (``hitl_in_chat_agent.py``, ``interrupt_agent.py``) use. Why the rearchitect: the ``interrupt()`` code path in ``ag-ui-langgraph`` has two protocol-level bugs (tracked upstream as ``ag-ui-protocol/ag-ui#1582`` and ``#1584``) that the previous revision worked around by subclassing ``LangGraphAGUIAgent`` and reaching into private internals. That ships, but it's not the pattern CopilotKit themselves exercise, and the workaround is fragile across upstream bumps. Moving off the ``interrupt()`` path sidesteps both bugs by construction and aligns with CopilotKit's active direction. Design (three-role review): - **AI engineer**: LLM drives. Backend tools (``list_judges``, ``list_recent_traces``, ``run_trace_evaluation``, ``get_evaluation_result``) are thin wrappers over the LayerLens SDK. A tight system prompt guides the flow. ``confirm_judge`` is a frontend tool declared via ``useCopilotAction``; ``CopilotKitMiddleware()`` bridges it into the agent's toolbelt so the LLM can "call" it like any other tool. - **Designer**: HITL renders as a card list -- each judge shows name, id, and evaluation goal, with a ``Select <Name>`` button. Keyboard accessible, visible focus states, compact "Judge selected." state after the user chooses. ``data-testid`` attributes throughout for deterministic automation. - **SDK engineer**: ~160 LoC for the evaluator (down from ~560). No private-API reach. 
No workaround subclass. No checkpointer needed (``create_agent`` owns state). Lockfile updated for ``langchain-openai``. Frontend pins unchanged.

The old ``build_agui_agent`` factory, ``build_graph`` with a custom ``StateGraph``, ``EvaluatorState`` TypedDict, all node functions, the msgpack DTO allowlist, and the version-guard helpers are all gone -- replaced by one ``build_graph(model=...)`` that returns the compiled ``create_agent`` graph.

Tests:

- ``tests/test_samples_e2e.py`` rewritten. ``test_copilotkit_evaluator_interrupt_resume`` and ``test_copilotkit_evaluator_agui_wire`` (both specific to the old ``interrupt()`` architecture) replaced by ``test_copilotkit_evaluator_tools``, which exercises each backend tool against a patched Stratix client and verifies the system prompt references ``confirm_judge``.
- Import-smoke test mock list extended for ``langchain.agents`` / ``langchain.tools`` / ``langchain_core.tools`` / ``langchain_openai``.
- ``pytest -k copilotkit``: 8 passed, 2 skipped (live).

Frontend:

- ``page.tsx``: ``useCopilotAction("confirm_judge", ...)`` with a rich judge-card list; ``useLangGraphInterrupt`` removed.
- ``globals.css``: styles for ``judge-picker`` / ``judge-card`` / complete / empty states.
- ``Evaluate my traces`` quick-action button retained for direct user triggering and automation.

Backend server:

- ``samples/copilotkit/tests/browser/backend/server.py`` swaps ``build_agui_agent(...)`` for plain ``LangGraphAGUIAgent(...)`` -- no workaround needed on this code path.

README:

- Full rewrite around the new architecture. Version matrix unchanged. The two upstream ``ag-ui-langgraph`` bugs are preserved in the "informational" section for customers building their own ``interrupt()``-based graphs.

Per user direction: no backwards compatibility for the old sample (no customer has it). The workaround subclass is removed, not deprecated.
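As a rough sketch of the testing style this commit describes (a thin tool wrapper exercised against a patched client, plus a check that the system prompt mentions ``confirm_judge``), the following uses only the standard library. Everything here is an assumption made for illustration: ``fetch_judges``, the judge attribute names, and the prompt text are placeholders and do not reflect the LayerLens SDK's actual API or the sample's real prompt.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock

# Placeholder prompt; the sample's real system prompt is not reproduced here.
SYSTEM_PROMPT = (
    "List recent traces, list judges, then call confirm_judge and wait "
    "for the user's choice before running any evaluation."
)


def list_judges(client) -> list[dict]:
    """Thin wrapper: turn client objects into JSON-friendly dicts for the LLM."""
    return [
        {"id": judge.id, "name": judge.name, "goal": judge.evaluation_goal}
        for judge in client.fetch_judges()  # placeholder method name
    ]


# Patched client: no network access, fully deterministic return value.
patched_client = MagicMock()
patched_client.fetch_judges.return_value = [
    SimpleNamespace(
        id="judge-1",
        name="Accuracy",
        evaluation_goal="Check factual accuracy",
    )
]

judges = list_judges(patched_client)
prompt_mentions_hitl_tool = "confirm_judge" in SYSTEM_PROMPT
```

Keeping the wrapper free of framework imports is a design choice of this sketch: it lets the tool logic be tested without an LLM or a running agent, which is roughly what a mock-based tools test needs.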
* test(copilotkit): add live-LLM e2e + OpenAI-compatible endpoint support

The previous commit's tests verified the new architecture against mocks; this one verifies it against a real LLM through the actual AG-UI FastAPI endpoint.

New test ``test_copilotkit_evaluator_live_llm``:

- Loads credentials from a gitignored ``.env`` (or real env vars in CI), with OpenRouter convenience: if only ``OPENROUTER_API_KEY`` is set, the loader auto-points ``OPENAI_BASE_URL`` at OpenRouter.
- Builds a FastAPI app with the patched Stratix client + the real evaluator graph (real LLM, no fake model).
- POSTs an AG-UI ``RunAgentInput`` whose ``tools`` array declares the ``confirm_judge`` frontend tool, exactly as the browser would.
- Asserts: tool sequence is ``list_recent_traces`` -> ``list_judges`` -> ``confirm_judge``; agent halts at ``confirm_judge`` (never calls ``run_trace_evaluation``); single ``RUN_STARTED`` + ``RUN_FINISHED`` with matching client ``runId``; no ``RUN_ERROR``.
- Marked ``@pytest.mark.live`` and ``pytest.skip``s when no key is available, so the default ``pytest`` run is unaffected.

Verified locally: passes against ``openrouter:openai/gpt-4o-mini``.

Other changes in this commit:

- ``evaluator_agent.py``:
  - ``_default_model()`` honours ``OPENAI_API_KEY``, ``OPENAI_BASE_URL``, and ``OPENAI_MODEL`` so any OpenAI-compatible endpoint works (OpenAI, Ollama, LM Studio, OpenRouter, vLLM, ...). For non-compatible providers, customers pass any LangChain ``BaseChatModel`` to ``build_graph(model=...)``.
  - ``create_agent`` now compiles with ``InMemorySaver``. ``ag-ui-langgraph``'s ``add_langgraph_fastapi_endpoint`` calls ``graph.aget_state(config)`` on every request, which fails with ``ValueError("No checkpointer set")`` if the graph wasn't compiled with one -- regardless of whether ``interrupt()`` is used.
  - ``build_agui_agent`` reintroduced as a *minimal* runId-only workaround for ``ag-ui-protocol/ag-ui#1582``.
Bug #1584 (duplicate RUN_STARTED) is unreachable on this code path because the evaluator never calls ``langgraph.types.interrupt()``, so we only need the runId fix. Live test confirms the workaround restores runId continuity end-to-end. - ``samples/copilotkit/tests/browser/backend/server.py``: switched back to ``build_agui_agent(...)`` so the runId workaround is active in the harness backend. The earlier "no workaround needed" claim was wrong; @ag-ui/client@0.0.52 doesn't enforce runId continuity but older clients did and future strict ones likely will. - ``tests/.env.example``: documents the supported env vars (OPENAI, OpenRouter convenience, LayerLens). Real ``tests/.env`` is gitignored. - ``samples/copilotkit/README.md``: documents the live-test setup and links the .env.example. Also documents the ``OPENAI_API_KEY``/``OPENAI_BASE_URL``/``OPENAI_MODEL`` env-var triplet for OpenAI-compatible providers (Ollama, LM Studio, OpenRouter). * fix(copilotkit harness): unblock React hydration on Next 16 dev server DevRel hit a "page renders but every button is dead, textarea won't accept input" failure mode while running the harness locally. Diagnosis took several iterations because there was no client-side error: - Backend was healthy; ``/healthz`` returned 200. - ``/api/copilotkit`` was up; an ``info`` JSON-RPC probe listed the evaluator agent. - Direct POSTs to the backend at :8123 streamed real LLM events. - The page HTML had every expected ``data-testid``. - Browser console showed only one repeating warning: ``WebSocket connection to 'ws://127.0.0.1:3000/_next/webpack-hmr' failed: Error during WebSocket handshake`` Root cause: Next 16 enforces a cross-origin allowlist for dev resources (including the webpack-hmr WebSocket). When the user serves on ``127.0.0.1`` but the allowlist is implicit ``localhost``, HMR fails to connect and Next leaves React in a half-hydrated state. 
The page renders from the server but client React never wires up event handlers or controlled-input state -- so buttons and textareas are visually present but inert. No error is surfaced beyond the WebSocket warning. Fix: - Add ``allowedDevOrigins: ["127.0.0.1", "localhost"]`` to ``samples/copilotkit/tests/browser/frontend/next.config.js``. Both origins are the supported way to load the harness; without this, whichever the user picks tends to break. Also, to make this kind of failure self-diagnosing rather than requiring DevTools-paste skills: - New ``samples/copilotkit/tests/browser/frontend/public/diag.html`` -- a static page (no React) that runs three probes on load and renders results inline: runtime ``info`` reachability, an ``agent/run`` round-trip through ``/api/copilotkit``, and a direct ``/healthz`` ping against the backend. Visit ``http://127.0.0.1:3000/diag.html`` to see green/red labels for each. This bypasses the React app entirely, so it stays useful even when hydration is broken. - New "Run diagnostic" button on the harness page (next to "Evaluate my traces") that runs the same probes plus a couple of React-only checks (textarea state, isLoading, intercepted ``appendMessage`` POST body) and renders the report directly on the page. Useful for users who can't (or don't want to) paste JS into DevTools console. Verified locally: after the cache + allowedDevOrigins fix, both buttons fire, ``appendMessage`` POSTs to ``/api/copilotkit`` and gets back a real ``RUN_STARTED`` SSE stream end-to-end. * fix(harness picker): resilient render while LLM streams tool args CopilotKit's ``renderAndWaitForResponse`` re-renders the action UI progressively as the LLM streams the tool-call JSON, so for the first render tick or two ``judge.id`` (and sometimes ``judge.name``) can be undefined even though the surrounding React state is stable. That tripped two issues in our judge picker: 1. 
``key={judge.id}`` warned "Each child in a list should have a unique key prop" when id was undefined. 2. The Select button was clickable with an undefined id, which would ``respond({ id: undefined, name: undefined })`` and break the resume. Fix: - Fall back to ``pending-{index}`` for the React key while id is pending. Quiet warning + stable row identity. - Mark each row "ready" only when both id and name are present and ``respond`` is non-null. Disable the Select button and show "Loading..." until ready. The button text and ``data-testid`` follow the ready state so automated tests don't grab a half-loaded row by accident. - Hide the dim id-pill (``judge-card-id``) while id is pending so the card doesn't flash an empty grey box. * feat(harness): per-tool render cards + Tailwind 4 + production EvaluationCard DevRel asked: "where is the tool indicator I should see?" CopilotChat only renders user/assistant text and frontend HITL widgets by default; backend tool calls fire invisibly. Surface them with the ``useCopilotAction`` + ``available: "remote"`` + ``render`` pattern -- the same pattern CopilotKit's ``tool_rendering_agent.py`` showcase uses. Changes: - All four backend tools (``list_recent_traces``, ``list_judges``, ``run_trace_evaluation``, ``get_evaluation_result``) now render inline cards with a pulsing-dot "Running" status pill, transitioning to a green "Done" pill when the tool resolves. Each card has a stable ``data-testid`` for automated tests. - ``get_evaluation_result`` (the final result) renders the polished ``EvaluationCard`` from ``samples/copilotkit/components/`` -- the production-grade SDK card with the score donut and pass-rate ring. Imported via a tsconfig path alias (``@layerlens/copilotkit-cards``) so the harness can reuse the upstream SDK components without copying or duplicating them. - ``confirm_judge`` HITL picker restyled with matching Tailwind tokens to keep the visual language consistent across all tool cards. 
- Tailwind 4 added (``@tailwindcss/postcss``, ``tailwindcss``) + ``postcss.config.mjs`` + ``@import "tailwindcss"`` in ``globals.css``. Inline custom CSS removed in favour of Tailwind utilities, matching CopilotKit's own showcase samples. - ``html className="dark"`` + ``color-scheme: dark`` so the SDK reference cards (which key off the ``.dark`` ancestor) render in dark mode by default. - ``<CopilotKit showDevConsole={false}>`` -- DevRel reported the default web-inspector "kite" obscured the harness header; suppressed for the sample. - ``tsconfig.json`` includes ``../../../components/**/*`` so Next's bundler picks up the SDK card sources, and adds the ``@layerlens/copilotkit-cards`` path alias. The pattern (frontend ``useCopilotAction`` for backend tools with ``available: "remote"``) is what customers should copy. The harness demonstrates it in two flavours: lightweight inline cards (for the first three tools) and full SDK-component composition (for the result). Both styles are valid; teams pick based on visual weight they want. * copilotkit sample: ship customer-facing app with shadcn-aligned cards Reshaped the CopilotKit sample so it reads as a commercial-grade SDK demo rather than a test fixture, and brought the visual language into line with CopilotKit's own samples (research-canvas, travel, banking, with-shadcn-ui). Structure - Move sample out of `samples/copilotkit/tests/browser/{backend,frontend}` to `samples/copilotkit/app/{backend,frontend}` so customers see "the app" rather than "a test harness". Update README + path constants. - Add `app/frontend/.gitignore` for `.next/`, `node_modules/`, and Playwright artefacts. Backend (`app/backend/server.py`, `agents/evaluator_agent.py`) - Real LayerLens only: missing `LAYERLENS_STRATIX_API_KEY` is a hard startup error. No fake-fixture path, no `MagicMock`, no env-var flag — fixtures only ever existed for an earlier Playwright fixture and conflicted with the SDK posture in CLAUDE.md. 
- Agent built with `create_agent` + `CopilotKitMiddleware`, real `@tool` impls returning `Command(update={...})` so each tool emits state into `state.{traces,judges,evaluations,results}`. Async tools call `copilotkit_emit_state` so the canvas updates live during a run. - New `GET /evaluations/{id}` endpoint for out-of-band polling: the agent kicks off evaluations, ends in seconds, and the frontend folds completed verdicts into the canvas as each evaluation resolves on LayerLens. Fixes the 30s-evaluation-vs-LLM-polling-loop hallucination. - `LangGraphAGUIAgent` constructor gets `config={"recursion_limit": 200}` so a 5-trace fan-out doesn't trip the default 25-hop limit (tested via `with_config` first; that path is dropped by ag-ui's internal config merge). - System prompt rewritten: strict tool order; `confirm_judge` takes no args (frontend reads candidates from `state.judges` to avoid the `tool_argument_parse_failed: Unterminated string in JSON` we hit when streaming 38 judges through tool args); evaluations capped at 5 traces; pending != failed; final summary template branches on whether anything completed. SDK card library (`samples/copilotkit/components/`) - Rewritten on top of shadcn/ui primitives. Cards now compose `Card`, `CardHeader`, `CardContent`, `CardFooter`, `Badge`, `Button`, `Separator`, `Progress` from `@/components/ui/*`. Status pills use the `bg-{color}-50 text-{color}-600 dark:bg-{color}-900/20` pattern CopilotKit's banking sample uses, not custom ring/shadow chrome. - Stock shadcn neutral OKLCH palette (`baseColor: neutral`). Brand accent `#6766FC` applied via Tailwind class strings on CTAs/links — same approach research-canvas takes for its accent. No edits to `--primary` / shadcn theme variables. - Score bars solid (`bg-green-500` / `bg-red-500` / `bg-amber-500`) not gradients. Sparklines color-coded by pass-rate threshold. 
- `dashboardBaseUrl` is now strictly opt-in across `TraceCard` and `EvaluationCard`: the "Trace Explorer →" / "Agent Graph →" / "View in Dashboard →" footers only render when a real URL is configured via `NEXT_PUBLIC_LAYERLENS_DASHBOARD_URL`. Stops 404s on routes that aren't deployed yet. Frontend (`app/frontend/`) - shadcn primitives installed via `npx shadcn@latest add card button badge progress separator`. Deps: `radix-ui`, `class-variance- authority`, `clsx`, `tailwind-merge`, `tw-animate-css`. Tailwind 4 + React 19. `components.json` aliases `ui` to the SDK card library. - New `globals.css` with shadcn neutral tokens (`--background`, `--card`, `--muted-foreground`, etc.), `@theme inline` mapping for Tailwind 4, and a `--copilot-kit-*` bridge so `<CopilotChat>` reads the same neutral tokens as the canvas. Brand accent set on `--copilot-kit-secondary-color`. Drops the previous "force dark" CSS. - Layout split-pane, **light by default** to match every official CopilotKit sample. New `theme-toggle.tsx` segmented control (Light / System / Dark) persists to `localStorage` and reacts to OS-level theme changes when set to System. - `useCoAgent({ name: "evaluator" })` reads live agent state. New out-of-band poller (`useEffect` against `/evaluations/{id}` every 5 s) folds verdicts that arrive after the agent run ends into the canvas. `state.results` (agent) and `polledResults` (frontend) are merged via `useMemo` so MetricStrip / EvaluationCard / JudgeVerdict- Card all see one consistent results array. - Picker: `JudgePicker` is its own component subscribed to `useCoAgent` so it re-renders when `state.judges` populates after the LLM streams out the tool call. `confirm_judge` uses `available: "remote"` + `renderAndWaitForResponse` per the canonical research-canvas HITL pattern. 
Cleanup - Strip every dev artefact: agent's `[tool] X INVOKED` prints, the page's debug-state `<pre>`, the `console.log("[evaluator state]"…)` effect, the "Run diagnostic" button + panel + state, and the `probe_e2e.py` SSE diagnostic script. Header is now just the title, theme toggle, and the primary CTA. * copilotkit sample: chat tokens, severity pills, trace status, markdown reasoning Polish pass after first review: - Chat token bridge fixed. Re-read CopilotKit's ``react-ui/colors.css`` semantics: ``primary-color`` is the user-bubble + interactive accent, ``secondary-color`` is the assistant message background, not a brand slot. Earlier mapping made the assistant greeting render as solid indigo and clip out of view in light mode. Now mapped onto shadcn tokens semantically: ``primary → --primary``, ``contrast → --primary- foreground``, ``secondary → --card``, ``secondary-contrast → --card-foreground``. Brand accent ``#6766FC`` stays only on actual CTA buttons via Tailwind class strings. - ``JudgePicker`` "selected" pill now uses light + dark variants (``bg-green-50 text-green-700 dark:bg-green-900/20 dark:text-green-300``) instead of dark-mode-only emerald that disappeared on a light page. - ``JudgeVerdictCard`` redesign: * Pass / Fail / Error are now solid-filled badges (``bg-green-600``, ``bg-red-600``, ``bg-amber-600`` with white text), readable at a glance instead of subtle ghost pills. * Severity rendered as a colored pill with a triangle alert glyph, not a dot. Severity is a status (impact-of-failure level), not a trend, so an "alert" shape is correct; chevrons would imply direction. Hide the severity chip when verdict=pass AND severity=low — nothing meaningful to flag. * Reasoning rendered through a tiny inline ``MarkdownLite`` that handles paragraph breaks, line breaks, ``**bold**``, and ``*italic*`` — the cases LayerLens API actually emits. 
No ``react-markdown`` dep (the SDK card library lives outside the Next app's node_modules so it can't resolve packages there); no raw HTML injection. Fixes the wall-of-text rendering of judge reasoning. - Tailwind 4 ``@source`` directive added to ``globals.css`` so it scans ``samples/copilotkit/components/**/*.{ts,tsx}``. Without this, classes used inside the SDK card library (``bg-amber-500``, ``bg-green-600``, etc.) get tree-shaken out of the generated CSS and pills silently flatten to plain text. - ``TraceCardProps.status`` made optional. The LayerLens ``traces.get_many`` API doesn't expose per-trace lifecycle today, so the sample no longer hardcodes ``status="ok"`` — that was rendering a misleading green pill on every trace regardless of reality. The status pill is hidden when the prop is omitted; restore it once the API surfaces real status. * copilotkit sample: render pending verdicts as skeleton cards When the agent kicks off N evaluations and only K complete on the first poll, the remaining (N - K) used to disappear from the ``Verdicts`` grid even though the run-summary card still counted them — verdict count would say "5", grid would show 4, and the trailing pending one looked like it had been lost. Add ``PendingVerdictCard``: same shadcn ``Card`` chrome as ``JudgeVerdictCard``, with a "Running" pill, a pulsing skeleton bar for the score, and copy explaining real LayerLens evaluations can take a minute or two. Render one per evaluation that doesn't have a matching entry in ``state.results`` yet. Side effects: - ``Verdicts`` section count now reflects total evaluations (not just completed) so the grid count matches what's actually rendered. - Section now renders even when ``results.length === 0`` as long as there are evaluations in flight (previously fell through to a textual placeholder). - Run summary picks the judge name from the first pending evaluation if no result has come back yet. 
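The pending-card selection described above (one skeleton card per evaluation with no matching entry in ``state.results``) can be sketched as follows. The sample does this in the React frontend; this Python version is only an illustration, and the field names ``id`` and ``evaluation_id`` are assumptions, not the sample's actual schema.

```python
from typing import Any, Dict, List


def pending_evaluations(
    evaluations: List[Dict[str, Any]],
    results: List[Dict[str, Any]],
) -> List[Dict[str, Any]]:
    """Return the evaluations that have no matching result yet.

    Assumes each evaluation has an ``id`` and each result carries the
    ``evaluation_id`` it belongs to (hypothetical field names).
    """
    finished_ids = {r["evaluation_id"] for r in results}
    return [e for e in evaluations if e["id"] not in finished_ids]


def verdict_count(evaluations: List[Dict[str, Any]]) -> int:
    # The section count reflects every evaluation, not only completed ones,
    # so it matches the number of cards (real + pending) in the grid.
    return len(evaluations)
```

With five evaluations and four results, this yields one pending entry while the section count stays at five, which is the mismatch the commit fixes.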
The polling loop is unchanged — it keeps polling ``/evaluations/{id}`` every 5 s and replaces a pending card with the real ``JudgeVerdictCard`` the moment LayerLens returns a verdict. * copilotkit sample: extract MarkdownLite, render judge goals The judge ``evaluation_goal`` field LayerLens returns is markdown- formatted (paragraph breaks, ``**bold**`` headers, numbered lists). Both the in-chat picker and the canvas's "Available judges" card were rendering it through plain ``<p>{text}</p>`` so each judge collapsed into one indented wall of text — same problem the verdict card's reasoning had before. Pull the inline markdown renderer that previously lived inside ``JudgeVerdictCard.tsx`` into its own ``markdown-lite.tsx`` module, re-export it from the SDK card library's ``index.ts``, and use it in: - JudgeVerdictCard reasoning (already) - JudgePicker goal description (chat-side) - JudgesCard goal description (canvas-side) Output is the same as before for the verdict card; the picker and the canvas judges card now show structured goal text. Still no ``react-markdown`` dependency — the SDK card library has to stay resolvable without the Next.js app's node_modules in scope, so we keep the small built-in renderer instead. * copilotkit sample: rewrite README to match the shipped architecture The README still described the previous incarnation of the sample — the create_agent + frontend HITL design from before the canvas / out-of-band-polling rewrite. Rewrite top-to-bottom to reflect what actually ships: - New layout section showing ``samples/copilotkit/{agents,components,app}`` with the SDK card library and the customer-facing app side-by-side. - Architecture diagram updated for the canvas + chat split-pane, ``useCoAgent`` driving state-driven cards, and the ``GET /evaluations/{id}`` polling endpoint that the frontend hits every 5s for in-flight verdicts. - Step-by-step "How the demo flows" walkthrough so a customer can read the README and predict what each click will do. 
- "Why this pattern" updated to highlight the canvas + frontend polling + ``copilotkit_emit_state`` triad. Old text framed the choice as ``create_agent`` vs ``interrupt()``; new text frames it as the research-canvas pattern.
- Tools section updated for the async + ``Command(update={...})`` return shape and the no-arg ``confirm_judge`` (frontend reads candidates from ``state.judges``).
- Frontend section adds: shadcn/ui foundation, ``components.json``, light-default theme + ``ThemeToggle``, ``--copilot-kit-*`` token bridge, brand accent ``#6766FC``, the SDK card matrix (5 cards + ``MarkdownLite``).
- Backend section adds: ``recursion_limit: 200`` config, the ``GET /evaluations/{id}`` polling handler, and the "no fake fixture" guardrail.

Drive-by: ``ruff format`` brought ``evaluator_agent.py`` and ``server.py`` in line with the project's ruff style. (The repo's ``[tool.ruff]`` ``exclude = ["samples"]`` would skip these on discovery, but reformatting locally keeps them tidy and avoids contributors re-doing it.)

* ci: replace walrus + reformat test_samples_e2e for ruff py37 target

Fixes both red CI checks on PR #92:

- ``Check Lint`` was failing because tests/test_samples_e2e.py used the walrus operator (``:=``) at line 1446 and ruff's ``[tool.ruff].target-version`` is pinned to ``py37``. Replace with a regular assignment + boolean check: same semantics, py37 compatible. The package's runtime support (``Python >=3.10,<3.13``) doesn't dictate ruff's syntax target; bumping the ruff target is out of scope for this PR.
- ``Check Format`` was failing because the same file had pre-existing multi-line wrapping that ruff's auto-format collapses to single lines under the 120-char limit. Apply ``ruff format``.
- ``ruff check --fix`` also normalised one import block (I001).

CI's ``test (3.9..3.12)`` jobs cancelled out after the lint pre-step failed; they should now actually run.
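For readers unfamiliar with the walrus rewrite, a minimal sketch of the pattern follows. The regex and function name are hypothetical and are not taken from ``tests/test_samples_e2e.py``; only the shape of the change (assignment expression replaced by a plain assignment plus a check) mirrors the commit.

```python
import re
from typing import Optional

# Hypothetical pattern, for illustration only.
_ID_PATTERN = re.compile(r"id=(\w+)")


def extract_id(line: str) -> Optional[str]:
    # Walrus form (Python 3.8+), which a py37 syntax target rejects:
    #     if (match := _ID_PATTERN.search(line)) is not None:
    #         return match.group(1)
    #
    # Equivalent py37-compatible form: assign first, then test.
    match = _ID_PATTERN.search(line)
    if match is not None:
        return match.group(1)
    return None
```

The two forms behave identically; the only difference is that the assignment becomes its own statement.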
* copilotkit sample: drop OpenRouter-specific references

Per existing repo policy: the SDK sample and tests should not name a specific OpenAI-compatible provider. Configuring OpenRouter (or any other gateway) is the user's job in their own .env; the docs and test code stay vendor-neutral.

Removes:

- OpenRouter row from ``_default_model``'s docstring table.
- OpenRouter mention in ``build_graph``'s docstring.
- ``OpenRouter, vLLM`` aside in the CLI ``main()`` print block.
- OpenRouter URL in ``samples/copilotkit/README.md`` env-var example. Replaced with a placeholder ``your-openai-compatible-host``.
- ``OPENROUTER_API_KEY`` auto-mapping in ``test_copilotkit_evaluator_live_llm`` (the test now expects ``OPENAI_API_KEY`` and lets the user set ``OPENAI_BASE_URL`` / ``OPENAI_MODEL`` themselves if pointing at a non-OpenAI endpoint).
- Skip-message reference to ``OPENROUTER_API_KEY``.

The sample still works against any OpenAI-compatible endpoint: the generic env vars (``OPENAI_API_KEY`` / ``OPENAI_BASE_URL`` / ``OPENAI_MODEL``) carry the configuration. The user's own gitignored ``.env`` is where provider-specific URLs (OpenRouter, Ollama, LM Studio, …) live.

* ci: wrap server.py uvicorn boot in main() + extend test mocks

Three test failures from the previous CI run, all addressed here:

1. ``tests/test_samples.py::test_sample_has_main[copilotkit/app/backend/server.py]`` expects every sample's entry-point file to expose a ``main()`` function. ``server.py`` had a bare ``if __name__ == "__main__":`` block instead. Lift the uvicorn.run call into a ``main()`` and call it from the ``if __name__`` guard.
2. ``test_copilotkit_agent_import[evaluator_agent]`` and
3. ``test_copilotkit_without_langchain[evaluator_agent]`` both stub the heavy deps via ``patch.dict("sys.modules", ...)`` so the agent module imports cleanly without langchain / copilotkit installed. The mock dict was missing the new submodules the agent now imports (``langgraph.prebuilt``, ``langchain.agents.middleware``, ``langchain_core.runnables``, ``langchain_core.tools.base``). Add them to both mock dicts.

Locally ``ruff check`` and ``ruff format --check`` are clean on all touched files.

* copilotkit sample: stop polling on any terminal eval state, not just success

Bug repro: same evaluation reliably stayed "Running" across multiple demo runs. Root cause was the polling filter on the frontend:

    const completed = updates.filter(
      (u) => u.status === "success" && typeof u.score === "number",
    );

This rejected any LayerLens response that wasn't a clean success with a numeric score, including ``status: "failure"``, ``status: "error"``, ``status: "cancelled"``, and the ``status: "success"`` case where ``trace_evaluations.get_results`` returned ``score: null`` (which some judges legitimately do). The poller would then keep firing every 5s forever and the verdict card would sit in "Running" indefinitely.

Two-sided fix:

Backend (``GET /evaluations/{id}``):

- New ``done: bool`` field: true for any of ``success | failure | error | cancelled | not_found``, false while the evaluation is still ``in_progress`` / ``pending`` / ``queued``.
- Always include ``passed`` / ``score`` / ``reasoning`` once ``done: true``, even for terminal failures and ``success``-without-score: defaults are ``passed: false``, ``score: 0.0``, and a ``reasoning`` string explaining the terminal state.
- ``try/except`` around ``trace_evaluations.get`` so a malformed / unauthorized id surfaces as ``status: "error", done: true`` instead of a 500 that the frontend retries forever.

Frontend (``page.tsx``):

- Polling filter is now ``u.done === true`` instead of ``status === "success" && typeof score === "number"``.
- ``ResultRecord`` type gains an optional ``done?: boolean`` field (the agent's own ``state.results`` entries don't carry it; only the ``/evaluations/{id}`` polling responses do).
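The terminal-state contract described above can be sketched in a few lines. This is a minimal illustration, not the code in ``server.py``: the function names and the default ``reasoning`` wording are assumptions, while the status sets and the ``passed`` / ``score`` defaults follow the commit message.

```python
from typing import Any, Dict, List, Optional

# Statuses the commit message lists as terminal.
TERMINAL_STATUSES = {"success", "failure", "error", "cancelled", "not_found"}


def build_poll_response(
    status: str,
    passed: Optional[bool] = None,
    score: Optional[float] = None,
    reasoning: Optional[str] = None,
) -> Dict[str, Any]:
    """Shape a polling response (hypothetical helper, for illustration)."""
    done = status in TERMINAL_STATUSES
    response: Dict[str, Any] = {"status": status, "done": done}
    if done:
        # Always fill the verdict fields once the evaluation is terminal,
        # even when the upstream API returned no score.
        response["passed"] = bool(passed) if passed is not None else False
        response["score"] = float(score) if score is not None else 0.0
        response["reasoning"] = reasoning or (
            "Evaluation ended with status '%s'." % status
        )
    return response


def completed(updates: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    # Python mirror of the new frontend filter: keep anything marked done.
    return [u for u in updates if u.get("done") is True]
```

Keying the filter on ``done`` rather than on ``status`` plus ``score`` means a new terminal status only has to be added on the backend for polling to stop.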
Verified against a real eval id (clean success path → ``done: true``, score returned) and a deadbeef id (error path → ``done: true``, ``status: "error"``, no 500). The 5th-eval-stuck symptom comes from the non-success terminal cases; the frontend now folds them into the canvas as a verdict card with the appropriate fail/error styling instead of spinning forever.

* Fix .gitignore

---------

Co-authored-by: m-peko <marinpeko5@gmail.com>

mmercuri and others added 2 commits April 27, 2026 12:18

Update docs (#139)

08b0945

m-peko merged commit 9d5089b into release Apr 27, 2026
9 checks passed

m-peko merged 2 commits into release from main
