Skip to content

Release#140

Merged
m-peko merged 2 commits into
releasefrom
main
Apr 27, 2026
Merged

Release#140
m-peko merged 2 commits into
releasefrom
main

Conversation

@m-peko

@m-peko m-peko commented Apr 27, 2026

Copy link
Copy Markdown
Collaborator

No description provided.

mmercuri and others added 2 commits April 27, 2026 12:18
…rupt() works over AG-UI (#92)

* fix(copilotkit): compile evaluator_graph with a checkpointer for interrupt()

The evaluator agent calls ``interrupt()`` in ``confirm_judge_node`` for
human-in-the-loop judge confirmation. A checkpointer is mandatory for that
to work -- without one, a ``Command(resume=...)`` call produces zero events,
``ag-ui-langgraph`` never emits ``RUN_FINISHED``, and the CopilotKit
frontend blocks all subsequent messages with "Cannot send 'RUN_STARTED'
while a run is still active".

Changes:
- ``evaluator_agent.py``: compile with ``InMemorySaver`` and a commented
  Postgres swap block. Convert ``JudgeInfo`` / ``TraceInfo`` /
  ``EvaluationInfo`` from ``@dataclass`` to ``pydantic.BaseModel`` so
  LangGraph's default ``JsonPlusSerializer`` can persist state across the
  pause boundary (dataclasses raise ``TypeError: Type is not msgpack
  serializable``).
- ``samples/copilotkit/README.md``: add full FastAPI backend wiring with
  ``add_langgraph_fastapi_endpoint``, Next.js frontend wiring with
  ``LangGraphHttpAgent``, a checkpointer options matrix (InMemory / SQLite
  / Postgres / Redis / LangGraph Platform) with per-option migration
  snippets, a version-compatibility table pinning the versions the bug
  reporter used, and a troubleshooting section mapping the observed
  frontend errors back to the backend cause.
- ``docs/samples-guide.md``: cross-reference the checkpointer requirement.
- ``tests/test_samples_e2e.py``: add
  ``test_copilotkit_evaluator_interrupt_resume`` that imports the real
  ``langgraph`` (not ``MagicMock``), asserts the compiled graph has a
  non-None checkpointer, and drives a full
  ``astream -> interrupt -> Command(resume=...) -> astream`` cycle with
  a patched Stratix client. Confirmed this test fails on the pre-fix
  code and passes on the fix. Also extended the existing mock-modules
  dicts so the import-smoke tests include ``langgraph.checkpoint.memory``.

The existing tests missed this because they mock ``langgraph``,
``langgraph.graph``, and ``langgraph.types`` with ``MagicMock()`` and then
only call ``main()`` (which prints usage). They never build or execute the
graph, so they cannot observe the missing checkpointer.

* fix(copilotkit): register DTOs with serde, add AG-UI wire e2e test

Follow-ups to the interrupt/checkpointer fix, addressing the open items
flagged in the prior commit:

1. Deserialize warning resolved. The Pydantic DTOs
   (``JudgeInfo`` / ``TraceInfo`` / ``EvaluationInfo``) are now registered
   on ``JsonPlusSerializer(allowed_msgpack_modules=...)`` via a custom
   serde passed to ``InMemorySaver``. Verified the sample passes with
   ``LANGGRAPH_STRICT_MSGPACK=true``, so it survives LangGraph's planned
   tightening of checkpoint deserialization.

2. End-to-end AG-UI wire validation. New integration test
   ``test_copilotkit_evaluator_agui_wire`` wires the evaluator graph into
   a FastAPI app through ``ag_ui_langgraph.add_langgraph_fastapi_endpoint``,
   drives the full user flow in-process via ``httpx.ASGITransport``, and
   asserts:
   - Phase 1 (initial run, hits ``interrupt()``): emits RUN_STARTED and
     RUN_FINISHED on the SSE stream.
   - Phase 2 (resume with user confirmation, same ``threadId``): emits
     RUN_STARTED and RUN_FINISHED.
   - Phase 3 (follow-up message after resume): not blocked -- RUN_STARTED
     and RUN_FINISHED fire again.
   This is the exact symptom the reporter hit, tested through the exact
   protocol path. Gated on ``pytest.importorskip`` for the heavy deps so
   the test skips cleanly when they are absent.

   Side benefit: running the same scenario against pre-fix code produces
   ``ValueError: No checkpointer set`` directly from
   ``graph.aget_state()``, giving operators a much louder error than the
   silent "stream ends without RUN_FINISHED" path.

3. README backend-wiring snippet corrected. The actual
   ``add_langgraph_fastapi_endpoint`` signature takes an
   ``agent=LangGraphAgent(...)`` wrapper, not a bare ``graph=`` kwarg --
   the example in the previous commit would have failed at import.
   Also expanded the SSE-protocol explanation to match what the new e2e
   test observes on the wire.

4. Investigator graph annotated. ``investigator_graph`` does not call
   ``interrupt()`` so it does not need a checkpointer, but without an
   explicit note future contributors adding a HITL step would silently
   regress. Added a short comment at the ``.compile()`` call pointing at
   the evaluator pattern.

* fix(copilotkit): resolve error/field collision + browser e2e harness

Follow-ups addressing the remaining open items from the previous two
commits:

1. Rename ``error`` node to ``handle_error`` (evaluator + investigator).
   The old name collided with the ``error`` field on the state
   dataclass. LangGraph 1.x accepts the collision; earlier versions
   reject it with "'error' is already being used as a state key".
   Renaming the node (and the conditional-edge routing targets) keeps
   the routing token ``"error"`` purely an edge key and sidesteps the
   conflict on any LangGraph version the sample may be copied into.

2. Guard the ``allowed_msgpack_modules`` kwarg behind try/except so the
   sample still imports cleanly on langgraph<1.0 (where the kwarg does
   not exist and the strict-msgpack warning is not emitted either).
   Verified the sample now imports on both langgraph 0.2.56 and 1.1.9.

3. Ruff-clean the changed files (import sort I001 fixes on the new
   test additions; unrelated warnings in pre-existing ``main()`` /
   ``error_node`` code are out of scope per "only change what was
   asked").

4. New ``samples/copilotkit/tests/browser/`` harness:

   - ``backend/server.py`` -- FastAPI app that patches
     ``layerlens.Stratix`` before importing the evaluator module and
     mounts ``evaluator_graph`` via
     ``add_langgraph_fastapi_endpoint(..., path="/evaluator")``.
   - ``frontend/`` -- Next.js 16.2.4 app pinned to the reporter's exact
     CopilotKit versions (``@copilotkit/react-core``,
     ``@copilotkit/react-ui``, ``@copilotkit/runtime`` all at 1.56.3),
     with the CopilotKit runtime wired to ``LangGraphHttpAgent`` against
     the FastAPI backend.
   - ``frontend/tests/interrupt-resume.spec.ts`` -- Playwright spec
     that drives CopilotChat through the three-turn scenario the
     reporter hit ("evaluate" -> "ok" -> "thanks") and asserts the
     exact string "Cannot send 'RUN_STARTED' while a run is still
     active" appears in neither the visible DOM nor the browser
     console.

   Known limitation documented in the harness README: CopilotChat
   1.56's textarea reports as aria-hidden / non-"visible" under
   Playwright strict actionability checks in **headless** Chromium,
   and multiple input-driving patterns (``fill``, ``keyboard.type +
   Enter``, ``pressSequentially``, DOM-setter + bubbled input event)
   failed to reliably enable the Send button headlessly. The harness
   works with ``--headed`` for human verification and is structurally
   complete. The authoritative regression coverage for the fix is the
   Python test suite (``test_copilotkit_evaluator_interrupt_resume``
   and ``test_copilotkit_evaluator_agui_wire``); the browser harness
   is corroborating / demo value, not gate-keeping.

* fix(copilotkit): mirror CopilotKit's sample + workaround for ag-ui-langgraph runId bug

DevRel surfaced that the backend fix in the previous commits got the Python
side working but the frontend still locked up with
"Cannot send 'RUN_STARTED' while a run is still active. ... INCOMPLETE_STREAM"
on the second message. Raw SSE capture confirmed the root cause:

  RUN_STARTED runId = "r1_aca59ad1" (client-supplied)
  RUN_FINISHED runId = "019dc049-14ba-..." (LangGraph's internal chain UUID)

This is an upstream bug in ag-ui-langgraph
(ag-ui-protocol/ag-ui#1582):
``_handle_stream_events`` overwrites ``self.active_run['id']`` with every
LangGraph event's internal ``run_id``, so RUN_FINISHED emits LangGraph's UUID
instead of the client-supplied ``input.run_id``. ``@copilotkit/runtime``
tracks active runs by client runId and raises RUN_ERROR/INCOMPLETE_STREAM.

Verified the bug is present in both ag-ui-langgraph 0.0.22 (CopilotKit's
officially-pinned version) and 0.0.34 (the reporter's version), and also
in ``copilotkit.LangGraphAGUIAgent`` which inherits the broken method.

Changes in this commit, all aligned with CopilotKit's own
``examples/integrations/langgraph-fastapi`` reference sample:

1. ``evaluator_agent.py``:
   - State class converted from ``@dataclass`` to a ``TypedDict`` inheriting
     from ``copilotkit.CopilotKitState``. This gives us ``MessagesState``'s
     ``add_messages`` reducer for free (nodes return NEW messages; they are
     appended, not replaced) and the ``copilotkit`` field the frontend
     injects.
   - All node functions updated from ``state.X`` / ``state.messages + [m]``
     to ``state.get('X')`` / ``{'messages': [m]}``.
   - HITL interrupt now uses ``copilotkit.langgraph.copilotkit_interrupt``
     (wraps ``interrupt()`` with ``__copilotkit_messages__`` so the prompt
     renders as a real AIMessage in the chat UI). The bare
     ``langgraph.types.interrupt(prompt)`` emitted a CUSTOM event the UI
     ignored -- why the reporter said "the agent stops and never reaches
     the human-in-the-loop confirmation step."
   - New ``RunIdPreservingAgent`` subclass (lazy factory
     ``_build_langgraph_agui_agent``) overrides ``_dispatch_event`` to
     restore ``input.run_id`` on RUN_FINISHED / RUN_ERROR terminal events.
     Clearly commented with a "remove when upstream ships" TODO pointing at
     the ag-ui-protocol/ag-ui issue.

2. ``samples/copilotkit/README.md``:
   - Version matrix re-pinned to CopilotKit's exact tested set
     (``copilotkit==0.1.74``, ``langchain==1.0.1``, ``langgraph==1.0.1``,
     ``ag-ui-langgraph==0.0.22``, ``@copilotkit/*==1.56.3``,
     Python ``>=3.10,<3.13``).
   - Upstream-bug callout explaining the runId workaround.
   - Backend wiring snippet updated to show the factory import and the
     ``LangGraphAGUIAgent`` path for non-interrupt graphs (investigator).

3. ``tests/test_samples_e2e.py``:
   - ``test_copilotkit_evaluator_interrupt_resume`` now sends
     ``Command(resume=[HumanMessage(content='ok')])`` rather than
     ``Command(resume='ok')``, matching ``copilotkit_interrupt``'s expected
     resume payload shape.
   - ``test_copilotkit_evaluator_agui_wire`` rewritten. The previous
     version had a blind spot: it only asserted RUN_FINISHED was PRESENT,
     not that ``RUN_STARTED.runId == RUN_FINISHED.runId == input.run_id``.
     Now it uses the ``RunIdPreservingAgent`` factory and asserts runId
     continuity end-to-end. Without the workaround this test would catch
     the upstream bug immediately.
   - Mock-module dict extended with ``copilotkit.langgraph`` for the
     import-smoke test.

4. ``samples/copilotkit/tests/browser/``:
   - ``backend/requirements.txt`` re-pinned to CopilotKit's set.
   - ``backend/server.py`` switched from raw
     ``ag_ui_langgraph.LangGraphAgent`` to the sample's factory so the
     browser harness also benefits from the runId workaround.

All 9 copilotkit tests pass in the pinned venv. Empirical verification
scripts in /tmp/ (not committed) show raw SSE with matching runIds
end-to-end.

* fix(copilotkit): also suppress duplicate RUN_STARTED on re-entry interrupt path

DevRel's diagnostic bundle (ag-ui-langgraph==0.0.34, copilotkit==0.1.87,
@ag-ui/client==0.0.52 transitively) confirmed commit 542002b did not fix
the browser symptom. Raw SSE from the Network tab showed:

  RUN_STARTED runId=d0b9d6c5-...
  [graph reaches step="confirm_judge" -- interrupt IS being hit]
  RUN_ERROR {code: "INCOMPLETE_STREAM",
             message: "Cannot send 'RUN_STARTED' while a run is still
                       active. The previous run must be finished with
                       'RUN_FINISHED' before starting a new run."}

Same error text as before, different root cause. A second bug in
ag-ui-langgraph: when a request arrives on a thread whose graph is
already paused at ``interrupt()`` and the request does NOT carry
``forwardedProps.command.resume``, the ``has_active_interrupts`` branch
of ``prepare_stream`` (agent.py:491) emits a second ``RunStartedEvent``
to ``events_to_dispatch`` -- after ``_handle_stream_events`` (line 209)
already emitted one at the top of the stream. The server's own AG-UI
encoder validator catches the duplicate and converts it into a
``RUN_ERROR`` with the exact "Cannot send 'RUN_STARTED'..." message,
terminating the stream before ``RUN_FINISHED`` can be dispatched. On
``@ag-ui/client@0.0.52`` (the newer protocol-state validator, which
enforces within-stream start/finish invariants rather than the runId
correlation the previous version used) this is what lands as
INCOMPLETE_STREAM in the browser.

Extended the sample's workaround subclass to filter at the agent
boundary rather than override ``_dispatch_event`` (which expects to
return an Event, not None/""). The filter:

  1. Drops any RUN_STARTED after the first within a single stream --
     fixes the duplicate-emission bug on the ``has_active_interrupts``
     path.
  2. Restamps ``input.run_id`` on RUN_FINISHED / RUN_ERROR -- preserves
     the existing ag-ui-protocol/ag-ui#1582 fix for older clients that
     correlate by runId.

Verified on both pin matrices:
  - copilotkit==0.1.74 / ag-ui-langgraph==0.0.22 (CopilotKit's own
    reference sample pins): all tests pass.
  - copilotkit==0.1.87 / ag-ui-langgraph==0.0.34 (DevRel / reporter):
    all tests pass.

Tightened ``test_copilotkit_evaluator_agui_wire`` accordingly:
  - asserts exactly one RUN_STARTED per stream (catches bug b)
  - asserts no RUN_ERROR
  - asserts RUN_STARTED.runId == RUN_FINISHED.runId == input.run_id
    (catches bug a, ag-ui-protocol/ag-ui#1582)

Without either half of the workaround the test fails with a precise
message pointing at which bug regressed.

Follow-up: file the duplicate-RUN_STARTED bug upstream as a separate
issue on ag-ui-protocol/ag-ui.

* fix(copilotkit): re-pin to latest published versions (DevRel env) and ship lockfile

Replaces the earlier pinning to CopilotKit's reference-sample versions
(copilotkit==0.1.74 / ag-ui-langgraph==0.0.22) with the current published
set customers actually install:

  copilotkit==0.1.87
  langchain==1.2.15
  langchain-core==1.3.0
  langgraph==1.1.9
  ag-ui-langgraph==0.0.34

Frontend transitive ``@ag-ui/client==0.0.52`` now matches what
``@copilotkit/react-core==1.56.3`` actually pulls in (DevRel's
environment per their diagnostic bundle).

Changes:

- ``samples/copilotkit/tests/browser/backend/requirements.txt`` -- pins
  updated to the latest set above.
- ``samples/copilotkit/tests/browser/backend/requirements.lock`` -- NEW,
  committed pip-freeze of the verified environment. ``pip install -r
  requirements.lock`` now gives byte-identical transitive deps.
- ``samples/copilotkit/README.md`` -- version matrix and install snippets
  updated to the latest set; upstream-bug callout now lists both issues
  (``ag-ui-protocol/ag-ui#1582`` runId overwrite,
  ``ag-ui-protocol/ag-ui#1584`` duplicate RUN_STARTED).
- ``samples/copilotkit/agents/evaluator_agent.py`` -- renamed the
  factory from ``_build_langgraph_agui_agent`` to the public
  ``build_agui_agent``; added a ``_version_guard_ag_ui_langgraph`` helper
  that emits a ``RuntimeWarning`` when the installed version is outside
  the tested range ``[0.0.22, 0.0.34]`` so silent behavior drift does
  not hide a regression. A backwards-compatible alias keeps the old
  private name importable for internal tests during the rename window.
- ``samples/copilotkit/tests/browser/backend/server.py`` and
  ``tests/test_samples_e2e.py`` -- call sites updated to the public
  name.

Verified end-to-end against the latest version matrix:

- pytest -k copilotkit: 9 passed, 2 skipped (live-only).
- Manual HTTP drive against a running backend with the reporter's
  exact flow (turn 1 initial -> interrupt, turn 2 re-entry on paused
  graph): both turns emit exactly one RUN_STARTED and one RUN_FINISHED,
  both with matching client runIds, no RUN_ERROR / INCOMPLETE_STREAM.

* fix(copilotkit): wire useLangGraphInterrupt on frontend; drop backend resume heuristic

DevRel confirmed the Apr-24 push resolved the turn-1 INCOMPLETE_STREAM
(backend now emits a clean RUN_STARTED -> STEP_* -> RUN_FINISHED for the
initial interrupt turn). Remaining gap: when the user replies to the
interrupt, plain ``<CopilotChat>`` sends the reply as an ordinary new
chat message, not as ``forwardedProps.command.resume`` -- so the graph
stayed paused and the same error returned on the follow-up.

Correct fix is on the frontend, not the backend:
``@copilotkit/react-core@1.56.3`` ships ``useLangGraphInterrupt``, the
hook specifically designed for this case. It renders a UI when the
graph pauses at ``interrupt()`` and calls ``resolve(...)`` with the
user's answer -- which the runtime forwards as the proper
``command.resume`` payload. This is the supported AG-UI protocol path:
the frontend must explicitly signal a resume rather than a new turn.

Changes:

- ``samples/copilotkit/tests/browser/frontend/app/page.tsx``: wires
  ``useLangGraphInterrupt`` with a dedicated prompt widget
  (``data-testid`` stable for automation), and a "Start evaluation"
  test-hook button that uses ``useCopilotChat().appendMessage`` to
  kick off the graph without having to type into CopilotChat's
  textarea (which Playwright can't reliably drive on 1.56.3 + React 19).
  The ``resolve([{role:"user", content}])`` shape matches what
  ``copilotkit_interrupt`` expects server-side
  (``answer = response[-1].content``).
- ``samples/copilotkit/tests/browser/frontend/app/globals.css``: styles
  for the interrupt widget and the test-hook start button.
- ``samples/copilotkit/agents/evaluator_agent.py``: reverts the
  backend auto-resume heuristic I had shipped as a stopgap. It was
  overloading the protocol semantics ("any user message during active
  interrupt == resume answer") which is incorrect for anything beyond
  a simple sample -- breaks cancel flows and multi-interrupt
  scenarios. The backend now only does the two genuine protocol-bug
  workarounds (runId overwrite, duplicate RUN_STARTED). Resume
  belongs to the frontend.

Test plan:

- Python test suite: ``pytest -k copilotkit`` -- 9 passed / 2 skipped
  (live) on DevRel's exact version matrix.
- Backend HTTP round-trip with a programmatic ``command.resume``
  payload: both turns emit matched ``RUN_STARTED``/``RUN_FINISHED``
  with client runId, no ``RUN_ERROR`` (verified on 2026-04-24).
- Browser end-to-end: the hook wiring in page.tsx matches CopilotKit's
  own showcase pattern and the hook source I inspected. I could not
  self-verify the full browser round-trip because (a) Playwright
  cannot reliably drive CopilotChat's textarea on 1.56.3 + React 19
  (tracked at CopilotKit/CopilotKit#4215), and (b) my attempted
  programmatic appendMessage test-hook did not trigger a runtime
  POST in my local venv for reasons I have not yet pinned down.
  **DevRel re-test in a real browser is the authoritative check for
  the frontend round-trip.**

Follow-up (per "#2" in the user's plan): rewrite the evaluator HITL
to use CopilotKit's current idiom (``useCopilotAction`` /
``useHumanInTheLoop`` -- frontend-defined tool + UI render + resolve)
instead of backend ``interrupt()``. That's the pattern CopilotKit's
active samples use; it avoids the ag-ui-langgraph interrupt path
bugs entirely and is where customers should be pointed for new work.

* feat(copilotkit): rearchitect evaluator as create_agent + frontend HITL tool

Replaces the custom StateGraph + ``langgraph.types.interrupt()`` pattern
with CopilotKit's current HITL idiom: ``langchain.agents.create_agent``
driving an LLM that calls backend tools, with the human-in-the-loop
step wired as a **frontend** tool via ``useCopilotAction`` +
``renderAndWaitForResponse``. This matches what CopilotKit's active
showcases (``hitl_in_chat_agent.py``, ``interrupt_agent.py``) use.

Why the rearchitect: the ``interrupt()`` code path in ``ag-ui-langgraph``
has two protocol-level bugs (tracked upstream as
``ag-ui-protocol/ag-ui#1582`` and ``#1584``) that the previous revision
worked around by subclassing ``LangGraphAGUIAgent`` and reaching into
private internals. That ships, but it's not the pattern CopilotKit
themselves exercise, and the workaround is fragile across upstream
bumps. Moving off the ``interrupt()`` path sidesteps both bugs by
construction and aligns with CopilotKit's active direction.

Design (three-role review):

- **AI engineer**: LLM drives. Backend tools (``list_judges``,
  ``list_recent_traces``, ``run_trace_evaluation``,
  ``get_evaluation_result``) are thin wrappers over the LayerLens SDK.
  A tight system prompt guides the flow. ``confirm_judge`` is a
  frontend tool declared via ``useCopilotAction``;
  ``CopilotKitMiddleware()`` bridges it into the agent's toolbelt so
  the LLM can "call" it like any other tool.
- **Designer**: HITL renders as a card list -- each judge shows name,
  id, and evaluation goal, with a ``Select <Name>`` button. Keyboard
  accessible, visible focus states, compact "Judge selected." state
  after the user chooses. ``data-testid`` attributes throughout for
  deterministic automation.
- **SDK engineer**: ~160 LoC for the evaluator (down from ~560). No
  private-API reach. No workaround subclass. No checkpointer needed
  (``create_agent`` owns state). Lockfile updated for
  ``langchain-openai``. Frontend pins unchanged. The old
  ``build_agui_agent`` factory, ``build_graph`` with a custom
  ``StateGraph``, ``EvaluatorState`` TypedDict, all node functions,
  the msgpack DTO allowlist, and the version-guard helpers are all
  gone -- replaced by one ``build_graph(model=...)`` that returns the
  compiled ``create_agent`` graph.

Tests:

- ``tests/test_samples_e2e.py`` rewritten. ``test_copilotkit_evaluator_
  interrupt_resume`` and ``test_copilotkit_evaluator_agui_wire`` (both
  specific to the old ``interrupt()`` architecture) replaced by
  ``test_copilotkit_evaluator_tools``, which exercises each backend
  tool against a patched Stratix client and verifies the system
  prompt references ``confirm_judge``.
- Import-smoke test mock list extended for ``langchain.agents`` /
  ``langchain.tools`` / ``langchain_core.tools`` / ``langchain_openai``.
- ``pytest -k copilotkit``: 8 passed, 2 skipped (live).

Frontend:

- ``page.tsx``: ``useCopilotAction("confirm_judge", ...)`` with a
  rich judge-card list; ``useLangGraphInterrupt`` removed.
- ``globals.css``: styles for ``judge-picker`` / ``judge-card`` /
  complete / empty states.
- ``Evaluate my traces`` quick-action button retained for direct
  user triggering and automation.

Backend server:

- ``samples/copilotkit/tests/browser/backend/server.py`` swaps
  ``build_agui_agent(...)`` for plain ``LangGraphAGUIAgent(...)`` --
  no workaround needed on this code path.

README:

- Full rewrite around the new architecture. Version matrix unchanged.
  The two upstream ``ag-ui-langgraph`` bugs are preserved in the
  "informational" section for customers building their own
  ``interrupt()``-based graphs.

Per user direction: no backwards compatibility for the old sample (no
customer has it). The workaround subclass is removed, not deprecated.

* test(copilotkit): add live-LLM e2e + OpenAI-compatible endpoint support

The previous commit's tests verified the new architecture against
mocks; this one verifies it against a real LLM through the actual AG-UI
FastAPI endpoint.

New test ``test_copilotkit_evaluator_live_llm``:

- Loads credentials from a gitignored ``.env`` (or real env vars in CI),
  with OpenRouter convenience: if only ``OPENROUTER_API_KEY`` is set,
  the loader auto-points ``OPENAI_BASE_URL`` at OpenRouter.
- Builds a FastAPI app with the patched Stratix client + the real
  evaluator graph (real LLM, no fake model).
- POSTs an AG-UI ``RunAgentInput`` whose ``tools`` array declares the
  ``confirm_judge`` frontend tool, exactly as the browser would.
- Asserts: tool sequence is ``list_recent_traces`` -> ``list_judges``
  -> ``confirm_judge``; agent halts at ``confirm_judge`` (never calls
  ``run_trace_evaluation``); single ``RUN_STARTED`` + ``RUN_FINISHED``
  with matching client ``runId``; no ``RUN_ERROR``.
- Marked ``@pytest.mark.live`` and ``pytest.skip``s when no key is
  available, so the default ``pytest`` run is unaffected.

Verified locally: passes against ``openrouter:openai/gpt-4o-mini``.

Other changes in this commit:

- ``evaluator_agent.py``:
  - ``_default_model()`` honours ``OPENAI_API_KEY``,
    ``OPENAI_BASE_URL``, and ``OPENAI_MODEL`` so any OpenAI-compatible
    endpoint works (OpenAI, Ollama, LM Studio, OpenRouter, vLLM, ...).
    For non-compatible providers, customers pass any LangChain
    ``BaseChatModel`` to ``build_graph(model=...)``.
  - ``create_agent`` now compiles with ``InMemorySaver``. ``ag-ui-
    langgraph``'s ``add_langgraph_fastapi_endpoint`` calls
    ``graph.aget_state(config)`` on every request, which fails with
    ``ValueError("No checkpointer set")`` if the graph wasn't compiled
    with one -- regardless of whether ``interrupt()`` is used.
  - ``build_agui_agent`` reintroduced as a *minimal* runId-only
    workaround for ``ag-ui-protocol/ag-ui#1582``. Bug #1584 (duplicate
    RUN_STARTED) is unreachable on this code path because the
    evaluator never calls ``langgraph.types.interrupt()``, so we only
    need the runId fix. Live test confirms the workaround restores
    runId continuity end-to-end.

- ``samples/copilotkit/tests/browser/backend/server.py``: switched back
  to ``build_agui_agent(...)`` so the runId workaround is active in
  the harness backend. The earlier "no workaround needed" claim was
  wrong; @ag-ui/client@0.0.52 doesn't enforce runId continuity but
  older clients did and future strict ones likely will.

- ``tests/.env.example``: documents the supported env vars (OPENAI,
  OpenRouter convenience, LayerLens). Real ``tests/.env`` is
  gitignored.

- ``samples/copilotkit/README.md``: documents the live-test setup and
  links the .env.example. Also documents the
  ``OPENAI_API_KEY``/``OPENAI_BASE_URL``/``OPENAI_MODEL`` env-var
  triplet for OpenAI-compatible providers (Ollama, LM Studio,
  OpenRouter).

* fix(copilotkit harness): unblock React hydration on Next 16 dev server

DevRel hit a "page renders but every button is dead, textarea won't
accept input" failure mode while running the harness locally. Diagnosis
took several iterations because there was no client-side error:

  - Backend was healthy; ``/healthz`` returned 200.
  - ``/api/copilotkit`` was up; an ``info`` JSON-RPC probe listed the
    evaluator agent.
  - Direct POSTs to the backend at :8123 streamed real LLM events.
  - The page HTML had every expected ``data-testid``.
  - Browser console showed only one repeating warning:
    ``WebSocket connection to 'ws://127.0.0.1:3000/_next/webpack-hmr'
    failed: Error during WebSocket handshake``

Root cause: Next 16 enforces a cross-origin allowlist for dev resources
(including the webpack-hmr WebSocket). When the user serves on
``127.0.0.1`` but the allowlist is implicit ``localhost``, HMR fails to
connect and Next leaves React in a half-hydrated state. The page
renders from the server but client React never wires up event handlers
or controlled-input state -- so buttons and textareas are visually
present but inert. No error is surfaced beyond the WebSocket warning.

Fix:
- Add ``allowedDevOrigins: ["127.0.0.1", "localhost"]`` to
  ``samples/copilotkit/tests/browser/frontend/next.config.js``. Both
  origins are the supported way to load the harness; without this,
  whichever the user picks tends to break.

Also, to make this kind of failure self-diagnosing rather than
requiring DevTools-paste skills:

- New ``samples/copilotkit/tests/browser/frontend/public/diag.html``
  -- a static page (no React) that runs three probes on load and
  renders results inline: runtime ``info`` reachability, an
  ``agent/run`` round-trip through ``/api/copilotkit``, and a direct
  ``/healthz`` ping against the backend. Visit
  ``http://127.0.0.1:3000/diag.html`` to see green/red labels for
  each. This bypasses the React app entirely, so it stays useful even
  when hydration is broken.
- New "Run diagnostic" button on the harness page (next to "Evaluate
  my traces") that runs the same probes plus a couple of React-only
  checks (textarea state, isLoading, intercepted ``appendMessage`` POST
  body) and renders the report directly on the page. Useful for users
  who can't (or don't want to) paste JS into DevTools console.

Verified locally: after the cache + allowedDevOrigins fix, both
buttons fire, ``appendMessage`` POSTs to ``/api/copilotkit`` and gets
back a real ``RUN_STARTED`` SSE stream end-to-end.

* fix(harness picker): resilient render while LLM streams tool args

CopilotKit's ``renderAndWaitForResponse`` re-renders the action UI
progressively as the LLM streams the tool-call JSON, so for the first
render tick or two ``judge.id`` (and sometimes ``judge.name``) can be
undefined even though the surrounding React state is stable. That
tripped two issues in our judge picker:

1. ``key={judge.id}`` warned "Each child in a list should have a
   unique key prop" when id was undefined.
2. The Select button was clickable with an undefined id, which would
   ``respond({ id: undefined, name: undefined })`` and break the
   resume.

Fix:
- Fall back to ``pending-{index}`` for the React key while id is
  pending. Quiet warning + stable row identity.
- Mark each row "ready" only when both id and name are present and
  ``respond`` is non-null. Disable the Select button and show
  "Loading..." until ready. The button text and ``data-testid``
  follow the ready state so automated tests don't grab a half-loaded
  row by accident.
- Hide the dim id-pill (``judge-card-id``) while id is pending so the
  card doesn't flash an empty grey box.

* feat(harness): per-tool render cards + Tailwind 4 + production EvaluationCard

DevRel asked: "where is the tool indicator I should see?" CopilotChat
only renders user/assistant text and frontend HITL widgets by default;
backend tool calls fire invisibly. Surface them with the
``useCopilotAction`` + ``available: "remote"`` + ``render`` pattern --
the same pattern CopilotKit's ``tool_rendering_agent.py`` showcase
uses.

Changes:

- All four backend tools (``list_recent_traces``, ``list_judges``,
  ``run_trace_evaluation``, ``get_evaluation_result``) now render
  inline cards with a pulsing-dot "Running" status pill, transitioning
  to a green "Done" pill when the tool resolves. Each card has a
  stable ``data-testid`` for automated tests.
- ``get_evaluation_result`` (the final result) renders the polished
  ``EvaluationCard`` from ``samples/copilotkit/components/`` -- the
  production-grade SDK card with the score donut and pass-rate ring.
  Imported via a tsconfig path alias
  (``@layerlens/copilotkit-cards``) so the harness can reuse the
  upstream SDK components without copying or duplicating them.
- ``confirm_judge`` HITL picker restyled with matching Tailwind tokens
  to keep the visual language consistent across all tool cards.
- Tailwind 4 added (``@tailwindcss/postcss``, ``tailwindcss``) +
  ``postcss.config.mjs`` + ``@import "tailwindcss"`` in ``globals.css``.
  Inline custom CSS removed in favour of Tailwind utilities, matching
  CopilotKit's own showcase samples.
- ``html className="dark"`` + ``color-scheme: dark`` so the SDK
  reference cards (which key off the ``.dark`` ancestor) render in
  dark mode by default.
- ``<CopilotKit showDevConsole={false}>`` -- DevRel reported the
  default web-inspector "kite" obscured the harness header; suppressed
  for the sample.
- ``tsconfig.json`` includes ``../../../components/**/*`` so Next's
  bundler picks up the SDK card sources, and adds the
  ``@layerlens/copilotkit-cards`` path alias.

The pattern (frontend ``useCopilotAction`` for backend tools with
``available: "remote"``) is what customers should copy. The harness
demonstrates it in two flavours: lightweight inline cards (for the
first three tools) and full SDK-component composition (for the
result). Both styles are valid; teams pick based on visual weight
they want.

* copilotkit sample: ship customer-facing app with shadcn-aligned cards

Reshaped the CopilotKit sample so it reads as a commercial-grade SDK
demo rather than a test fixture, and brought the visual language into
line with CopilotKit's own samples (research-canvas, travel, banking,
with-shadcn-ui).

Structure
- Move sample out of `samples/copilotkit/tests/browser/{backend,frontend}`
  to `samples/copilotkit/app/{backend,frontend}` so customers see "the
  app" rather than "a test harness". Update README + path constants.
- Add `app/frontend/.gitignore` for `.next/`, `node_modules/`, and
  Playwright artefacts.

Backend (`app/backend/server.py`, `agents/evaluator_agent.py`)
- Real LayerLens only: missing `LAYERLENS_STRATIX_API_KEY` is a hard
  startup error. No fake-fixture path, no `MagicMock`, no env-var
  flag — fixtures only ever existed for an earlier Playwright fixture
  and conflicted with the SDK posture in CLAUDE.md.
- Agent built with `create_agent` + `CopilotKitMiddleware`, real `@tool`
  impls returning `Command(update={...})` so each tool emits state into
  `state.{traces,judges,evaluations,results}`. Async tools call
  `copilotkit_emit_state` so the canvas updates live during a run.
- New `GET /evaluations/{id}` endpoint for out-of-band polling: the
  agent kicks off evaluations, ends in seconds, and the frontend folds
  completed verdicts into the canvas as each evaluation resolves on
  LayerLens. Fixes the 30s-evaluation-vs-LLM-polling-loop hallucination.
- `LangGraphAGUIAgent` constructor gets `config={"recursion_limit":
  200}` so a 5-trace fan-out doesn't trip the default 25-hop limit
  (tested via `with_config` first; that path is dropped by ag-ui's
  internal config merge).
- System prompt rewritten: strict tool order; `confirm_judge` takes no
  args (frontend reads candidates from `state.judges` to avoid the
  `tool_argument_parse_failed: Unterminated string in JSON` we hit
  when streaming 38 judges through tool args); evaluations capped at
  5 traces; pending != failed; final summary template branches on
  whether anything completed.

SDK card library (`samples/copilotkit/components/`)
- Rewritten on top of shadcn/ui primitives. Cards now compose `Card`,
  `CardHeader`, `CardContent`, `CardFooter`, `Badge`, `Button`,
  `Separator`, `Progress` from `@/components/ui/*`. Status pills use
  the `bg-{color}-50 text-{color}-600 dark:bg-{color}-900/20` pattern
  CopilotKit's banking sample uses, not custom ring/shadow chrome.
- Stock shadcn neutral OKLCH palette (`baseColor: neutral`). Brand
  accent `#6766FC` applied via Tailwind class strings on CTAs/links —
  same approach research-canvas takes for its accent. No edits to
  `--primary` / shadcn theme variables.
- Score bars solid (`bg-green-500` / `bg-red-500` / `bg-amber-500`)
  not gradients. Sparklines color-coded by pass-rate threshold.
- `dashboardBaseUrl` is now strictly opt-in across `TraceCard` and
  `EvaluationCard`: the "Trace Explorer →" / "Agent Graph →" / "View
  in Dashboard →" footers only render when a real URL is configured
  via `NEXT_PUBLIC_LAYERLENS_DASHBOARD_URL`. Stops 404s on routes
  that aren't deployed yet.

Frontend (`app/frontend/`)
- shadcn primitives installed via `npx shadcn@latest add card button
  badge progress separator`. Deps: `radix-ui`, `class-variance-
  authority`, `clsx`, `tailwind-merge`, `tw-animate-css`. Tailwind 4 +
  React 19. `components.json` aliases `ui` to the SDK card library.
- New `globals.css` with shadcn neutral tokens (`--background`,
  `--card`, `--muted-foreground`, etc.), `@theme inline` mapping for
  Tailwind 4, and a `--copilot-kit-*` bridge so `<CopilotChat>` reads
  the same neutral tokens as the canvas. Brand accent set on
  `--copilot-kit-secondary-color`. Drops the previous "force dark"
  CSS.
- Layout split-pane, **light by default** to match every official
  CopilotKit sample. New `theme-toggle.tsx` segmented control
  (Light / System / Dark) persists to `localStorage` and reacts to
  OS-level theme changes when set to System.
- `useCoAgent({ name: "evaluator" })` reads live agent state. New
  out-of-band poller (`useEffect` against `/evaluations/{id}` every
  5 s) folds verdicts that arrive after the agent run ends into the
  canvas. `state.results` (agent) and `polledResults` (frontend) are
  merged via `useMemo` so MetricStrip / EvaluationCard / JudgeVerdict-
  Card all see one consistent results array.
- Picker: `JudgePicker` is its own component subscribed to `useCoAgent`
  so it re-renders when `state.judges` populates after the LLM streams
  out the tool call. `confirm_judge` uses `available: "remote"` +
  `renderAndWaitForResponse` per the canonical research-canvas HITL
  pattern.

Cleanup
- Strip every dev artefact: agent's `[tool] X INVOKED` prints, the
  page's debug-state `<pre>`, the `console.log("[evaluator state]"…)`
  effect, the "Run diagnostic" button + panel + state, and the
  `probe_e2e.py` SSE diagnostic script. Header is now just the title,
  theme toggle, and the primary CTA.

* copilotkit sample: chat tokens, severity pills, trace status, markdown reasoning

Polish pass after first review:

- Chat token bridge fixed. Re-read CopilotKit's ``react-ui/colors.css``
  semantics: ``primary-color`` is the user-bubble + interactive accent,
  ``secondary-color`` is the assistant message background, not a brand
  slot. Earlier mapping made the assistant greeting render as solid
  indigo and clip out of view in light mode. Now mapped onto shadcn
  tokens semantically: ``primary → --primary``, ``contrast → --primary-
  foreground``, ``secondary → --card``, ``secondary-contrast →
  --card-foreground``. Brand accent ``#6766FC`` stays only on actual
  CTA buttons via Tailwind class strings.

- ``JudgePicker`` "selected" pill now uses light + dark variants
  (``bg-green-50 text-green-700 dark:bg-green-900/20 dark:text-green-300``)
  instead of dark-mode-only emerald that disappeared on a light page.

- ``JudgeVerdictCard`` redesign:
  * Pass / Fail / Error are now solid-filled badges (``bg-green-600``,
    ``bg-red-600``, ``bg-amber-600`` with white text), readable at a
    glance instead of subtle ghost pills.
  * Severity rendered as a colored pill with a triangle alert glyph,
    not a dot. Severity is a status (impact-of-failure level), not a
    trend, so an "alert" shape is correct; chevrons would imply
    direction. Hide the severity chip when verdict=pass AND
    severity=low — nothing meaningful to flag.
  * Reasoning rendered through a tiny inline ``MarkdownLite`` that
    handles paragraph breaks, line breaks, ``**bold**``, and
    ``*italic*`` — the cases LayerLens API actually emits. No
    ``react-markdown`` dep (the SDK card library lives outside the
    Next app's node_modules so it can't resolve packages there); no
    raw HTML injection. Fixes the wall-of-text rendering of judge
    reasoning.

- Tailwind 4 ``@source`` directive added to ``globals.css`` so it
  scans ``samples/copilotkit/components/**/*.{ts,tsx}``. Without this,
  classes used inside the SDK card library (``bg-amber-500``,
  ``bg-green-600``, etc.) get tree-shaken out of the generated CSS
  and pills silently flatten to plain text.

- ``TraceCardProps.status`` made optional. The LayerLens
  ``traces.get_many`` API doesn't expose per-trace lifecycle today, so
  the sample no longer hardcodes ``status="ok"`` — that was rendering
  a misleading green pill on every trace regardless of reality. The
  status pill is hidden when the prop is omitted; restore it once the
  API surfaces real status.

* copilotkit sample: render pending verdicts as skeleton cards

When the agent kicks off N evaluations and only K complete on the
first poll, the remaining (N - K) used to disappear from the
``Verdicts`` grid even though the run-summary card still counted
them — verdict count would say "5", grid would show 4, and the
trailing pending one looked like it had been lost.

Add ``PendingVerdictCard``: same shadcn ``Card`` chrome as
``JudgeVerdictCard``, with a "Running" pill, a pulsing skeleton bar
for the score, and copy explaining real LayerLens evaluations can take
a minute or two. Render one per evaluation that doesn't have a
matching entry in ``state.results`` yet.

Side effects:
- ``Verdicts`` section count now reflects total evaluations (not just
  completed) so the grid count matches what's actually rendered.
- Section now renders even when ``results.length === 0`` as long as
  there are evaluations in flight (previously fell through to a
  textual placeholder).
- Run summary picks the judge name from the first pending evaluation
  if no result has come back yet.

The polling loop is unchanged — it keeps polling
``/evaluations/{id}`` every 5 s and replaces a pending card with the
real ``JudgeVerdictCard`` the moment LayerLens returns a verdict.

* copilotkit sample: extract MarkdownLite, render judge goals

The judge ``evaluation_goal`` field LayerLens returns is markdown-
formatted (paragraph breaks, ``**bold**`` headers, numbered lists).
Both the in-chat picker and the canvas's "Available judges" card
were rendering it through plain ``<p>{text}</p>`` so each judge
collapsed into one indented wall of text — same problem the verdict
card's reasoning had before.

Pull the inline markdown renderer that previously lived inside
``JudgeVerdictCard.tsx`` into its own ``markdown-lite.tsx`` module,
re-export it from the SDK card library's ``index.ts``, and use it in:

  - JudgeVerdictCard reasoning   (already)
  - JudgePicker goal description (chat-side)
  - JudgesCard goal description  (canvas-side)

Output is the same as before for the verdict card; the picker and the
canvas judges card now show structured goal text. Still no
``react-markdown`` dependency — the SDK card library has to stay
resolvable without the Next.js app's node_modules in scope, so we
keep the small built-in renderer instead.

* copilotkit sample: rewrite README to match the shipped architecture

The README still described the previous incarnation of the sample —
the create_agent + frontend HITL design from before the canvas /
out-of-band-polling rewrite. Rewrite top-to-bottom to reflect what
actually ships:

- New layout section showing ``samples/copilotkit/{agents,components,app}``
  with the SDK card library and the customer-facing app side-by-side.
- Architecture diagram updated for the canvas + chat split-pane,
  ``useCoAgent`` driving state-driven cards, and the
  ``GET /evaluations/{id}`` polling endpoint that the frontend hits
  every 5s for in-flight verdicts.
- Step-by-step "How the demo flows" walkthrough so a customer can
  read the README and predict what each click will do.
- "Why this pattern" updated to highlight the canvas + frontend
  polling + ``copilotkit_emit_state`` triad. Old text framed the
  choice as ``create_agent`` vs ``interrupt()``; new text frames it
  as the research-canvas pattern.
- Tools section updated for the async + ``Command(update={...})``
  return shape and the no-arg ``confirm_judge`` (frontend reads
  candidates from ``state.judges``).
- Frontend section adds: shadcn/ui foundation, ``components.json``,
  light-default theme + ``ThemeToggle``, ``--copilot-kit-*`` token
  bridge, brand accent ``#6766FC``, the SDK card matrix
  (5 cards + ``MarkdownLite``).
- Backend section adds: ``recursion_limit: 200`` config, the
  ``GET /evaluations/{id}`` polling handler, and the "no fake
  fixture" guardrail.

Drive-by: ``ruff format`` brought ``evaluator_agent.py`` and
``server.py`` in line with the project's ruff style. (The repo's
``[tool.ruff]`` ``exclude = ["samples"]`` would skip these on
discovery, but reformatting locally keeps them tidy and avoids
contributors re-doing it.)

* ci: replace walrus + reformat test_samples_e2e for ruff py37 target

Fixes both red CI checks on PR #92:

- ``Check Lint`` was failing because tests/test_samples_e2e.py used
  the walrus operator (``:=``) at line 1446 and ruff's
  ``[tool.ruff].target-version`` is pinned to ``py37``. Replace with
  a regular assignment + boolean check — same semantics, py37
  compatible. The package's runtime support (``Python >=3.10,<3.13``)
  doesn't dictate ruff's syntax target; bumping the ruff target is
  out of scope for this PR.

- ``Check Format`` was failing because the same file had pre-existing
  multi-line wrapping that ruff's auto-format collapses to single
  lines under the 120-char limit. Apply ``ruff format``.

- ``ruff check --fix`` also normalised one import block (I001).

CI's ``test (3.9..3.12)`` jobs cancelled out after the lint pre-step
failed — they should now actually run.

* copilotkit sample: drop OpenRouter-specific references

Per existing repo policy: the SDK sample and tests should not name a
specific OpenAI-compatible provider. Configuring OpenRouter (or any
other gateway) is the user's job in their own .env — the docs and
test code stay vendor-neutral.

Removes:
- OpenRouter row from ``_default_model``'s docstring table.
- OpenRouter mention in ``build_graph``'s docstring.
- ``OpenRouter, vLLM`` aside in the CLI ``main()`` print block.
- OpenRouter URL in ``samples/copilotkit/README.md`` env-var
  example. Replaced with a placeholder ``your-openai-compatible-host``.
- ``OPENROUTER_API_KEY`` auto-mapping in ``test_copilotkit_evaluator
  _live_llm`` (the test now expects ``OPENAI_API_KEY`` and lets the
  user set ``OPENAI_BASE_URL`` / ``OPENAI_MODEL`` themselves if
  pointing at a non-OpenAI endpoint).
- Skip-message reference to ``OPENROUTER_API_KEY``.

The sample still works against any OpenAI-compatible endpoint — the
generic env vars (``OPENAI_API_KEY`` / ``OPENAI_BASE_URL`` /
``OPENAI_MODEL``) carry the configuration. The user's own gitignored
``.env`` is where provider-specific URLs (OpenRouter, Ollama, LM
Studio, …) live.

* ci: wrap server.py uvicorn boot in main() + extend test mocks

Three test failures from the previous CI run, all addressed here:

1. ``tests/test_samples.py::test_sample_has_main[copilotkit/app/backend
   /server.py]`` expects every sample's entry-point file to expose a
   ``main()`` function. ``server.py`` had a bare ``if __name__ ==
   "__main__":`` block instead. Lift the uvicorn.run call into a
   ``main()`` and call it from the ``if __name__`` guard.

2. ``test_copilotkit_agent_import[evaluator_agent]`` and
3. ``test_copilotkit_without_langchain[evaluator_agent]`` both stub
   the heavy deps via ``patch.dict("sys.modules", ...)`` so the agent
   module imports cleanly without langchain / copilotkit installed.
   The mock dict was missing the new submodules the agent now imports
   (``langgraph.prebuilt``, ``langchain.agents.middleware``,
   ``langchain_core.runnables``, ``langchain_core.tools.base``).
   Add them to both mock dicts.

Locally ``ruff check`` and ``ruff format --check`` are clean on all
touched files.

* copilotkit sample: stop polling on any terminal eval state, not just success

Bug repro: same evaluation reliably stayed "Running" across multiple
demo runs. Root cause was the polling filter on the frontend:

    const completed = updates.filter(
      (u) => u.status === "success" && typeof u.score === "number",
    );

This rejected any LayerLens response that wasn't a clean success with
a numeric score — including ``status: "failure"``, ``status: "error"``,
``status: "cancelled"``, and the ``status: "success"`` case where
``trace_evaluations.get_results`` returned ``score: null`` (which
some judges legitimately do). The poller would then keep firing every
5s forever and the verdict card would sit in "Running" indefinitely.

Two-sided fix:

Backend (``GET /evaluations/{id}``):
- New ``done: bool`` field — true for any of
  ``success | failure | error | cancelled | not_found``, false while
  the evaluation is still ``in_progress`` / ``pending`` / ``queued``.
- Always include ``passed`` / ``score`` / ``reasoning`` once
  ``done: true``, even for terminal failures and ``success``-without-
  score: defaults are ``passed: false``, ``score: 0.0``, and a
  ``reasoning`` string explaining the terminal state.
- ``try/except`` around ``trace_evaluations.get`` so a malformed /
  unauthorized id surfaces as ``status: "error", done: true`` instead
  of a 500 that the frontend retries forever.

Frontend (``page.tsx``):
- Polling filter is now ``u.done === true`` instead of
  ``status === "success" && typeof score === "number"``.
- ``ResultRecord`` type gains an optional ``done?: boolean`` field
  (the agent's own ``state.results`` entries don't carry it; only the
  ``/evaluations/{id}`` polling responses do).

Verified against a real eval id (clean success path → ``done: true``,
score returned) and a deadbeef id (error path → ``done: true``,
``status: "error"``, no 500). The 5th-eval-stuck symptom is from the
non-success terminal cases — frontend now folds them into the canvas
as a verdict card with the appropriate fail/error styling instead of
spinning forever.

* Fix .gitignore

---------

Co-authored-by: m-peko <marinpeko5@gmail.com>
@m-peko m-peko merged commit 9d5089b into release Apr 27, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants