Qitos cybergym by ravenSanstete · Pull Request #23 · WhitzardAgent/qitos

ravenSanstete · 2026-06-26T11:47:03Z

Summary

Describe the change in 2-5 sentences. Focus on user-facing behavior and why this change matters.

What Changed

Why

What problem does this solve? What workflow, bug, or maintenance gap does it address?

Validation

pytest -q
relevant lint / type checks
examples or manual verification, if applicable

Commands run:

# paste commands here

QitOS Design Impact

If relevant, explain how this PR affects:

state design
prompt / parser / model protocol
tool surface / toolsets
memory / history / compaction
tracing / qita / observability

Docs and Changelog

docs updated
tutorials updated
CHANGELOG.md updated
not needed

Checklist

I scoped this PR to one coherent change.
I did not introduce unrelated cleanup.
I added or updated tests where behavior changed.
I preserved backward compatibility, or documented the break clearly.
I checked whether examples, docs, or qita flows need updates.

Add CyberGym benchmark integration

- Remove AI-sounding phrasing and unnecessary quotes across all docs (EN + ZH)
- Add term explanations at first occurrence (kernel, protocol, transport, parser, harness, preset, critic, artifact, trace, etc.)
- Make Chinese docs more natural instead of machine-translated
- Add multi-agent handoff design plan to plans/

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Replace English terms with "中文（英文术语）" format at first occurrence, then Chinese only
- Eliminate translationese and English-Chinese mixed sentences
- Unify terminology: parser→解析器, protocol→协议, harness→适配层, preset→预设, agent→智能体, runtime→运行时, prompt→提示词, benchmark→基准测试, trace→追踪记录, etc.
- Keep code blocks, commands, and API/class names in English

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Multi-agent core:
- Add DelegateTool (1:1 delegation+return), FanOutTool (1:N parallel+reduce), Decision.handoff() for linear agent transfer
- Add AgentSpec, AgentRegistry, ContextStrategy, HandoffContext, StateAdapter
- Add SharedMemory for cross-agent state sharing
- Add _HandoffRuntime with loop detection and max-handoff guard
- Add agent_id to StepRecord, trace serialization, and manifest metadata
- Add kit patterns: manager_worker, planner_executor, proposer_verifier
- Add examples/patterns/ for delegate, fanout, handoff

AsyncEngine + Streaming (Phase 1):
- Add AsyncEngine with arun() and arun_stream() methods
- Add EngineEvent, EngineEventType, EventStream for structured streaming
- Add AsyncOpenAICompatibleModel and AsyncOpenAIModel with async API calls
- Add SSE endpoint /api/stream/{run_id} to qita
- Add live stream button to qita run detail page

TUI + qita multi-agent adaptation:
- ClaudeStyleHook renders HANDOFF/DELEGATE/FANOUT events with agent labels
- qita board shows multi-agent badges, agent filters, agent switch indicators
- qita run view: step agent badges, handoff/delegate/fanout event rendering
- qita replay: multi-agent kind detection and specialized card rendering

Docs (EN + ZH):
- Update engine concept docs with AsyncEngine section
- Update API reference with AsyncEngine, EngineEvent, EventStream
- Update observability guide with SSE streaming section
- Add multi-agent patterns guide and tutorial
- Update CHANGELOG.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

merge and add more applications

- Remove examples/real/claude_code/ (now in qitos-zoo/qitos_coder/)
- Remove qitos/examples/pentagi/ (now in qitos-zoo/qitos_cyber/)
- Remove qitos/benchmark/pentagi_e2e/ (now in qitos-zoo/qitos_cyber/)
- Remove sandbox/ and root-level debug/test scripts
- Remove agent-specific e2e tests from core tests/
- Add qitos_zoo as git submodule (github.com/Qitor/qitos-zoo)
- Update core test imports to use qitos_zoo.qitos_coder
- Update tests/conftest.py to add qitos_zoo to sys.path
- Update CORE_BOUNDARY.md with e2e ownership and zero-duplication rules
- Fix engine: native_text_final decision and set_stop on final mode

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…esponses

When native tool calling is preferred and the model returns plain text without tool_calls, the json_decision_v1 parser would return wait(), causing agents to loop forever without producing a final result.

Now, when the model returns plain text under native tool calling:
1. Try the parser chain first (honour act/final if it works)
2. Override parser wait() with native_text_final Decision.final()

This fixes 6/12 previously failing e2e tests; all 12 now pass. Also updated test assertions: stop_reason "final" is now correctly returned alongside "success" for agents that produce final answers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
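The two-step fallback in this commit message can be sketched as below. This is a minimal illustration, not the QitOS implementation: the `Decision` dataclass, its `mode` field, and `resolve_plain_text` are hypothetical stand-ins for whatever the engine actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Decision:
    """Hypothetical stand-in for an engine decision: 'act', 'final', or 'wait'."""
    mode: str
    payload: Optional[str] = None


def resolve_plain_text(text: str, parser: Callable[[str], Decision]) -> Decision:
    """Resolve a plain-text reply received while native tool calling is preferred."""
    decision = parser(text)  # step 1: give the parser chain the first try
    if decision.mode in ("act", "final"):
        return decision  # honour whatever the parser understood
    # step 2: the parser answered wait(); treat the text as the final answer
    return Decision(mode="final", payload=text)


def always_wait(_: str) -> Decision:
    """A parser that never recognises the text, mimicking the looping case."""
    return Decision(mode="wait")


result = resolve_plain_text("The answer is 4.", always_wait)
```

With a parser that only ever returns `wait`, the reply is promoted to a final decision rather than starting another step.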

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…Kimi/GLM

Key changes enabling real LLM tool calling in DeepAudit pipeline:
- openai.py: Add default_request_kwargs to OpenAIModel/OpenAICompatibleModel; relocate chat_template_kwargs into extra_body for vLLM compatibility (passing it as top-level kwarg causes TypeError, silently dropping tools)
- _presets.py: Set native_tool_call_preferred=True for deepseek, kimi, glm; change glm primary_delivery to api_parameter with json_decision_v1 protocol
- _model_runtime.py: Add logging for tool schema delivery failures instead of silent exception swallowing; add warning on TypeError fallback
- harness/_adapters.py, harness/__init__.py: Forward default_request_kwargs through the build_model_for_preset chain

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
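The `chat_template_kwargs` relocation can be pictured as a small rewrite of the request kwargs before they reach the client. The helper name `relocate_chat_template_kwargs` and the `enable_thinking` key are illustrative assumptions; only `chat_template_kwargs` and `extra_body` come from the commit message.

```python
from typing import Any, Dict


def relocate_chat_template_kwargs(request_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the kwargs with chat_template_kwargs nested under extra_body."""
    kwargs = dict(request_kwargs)
    template_kwargs = kwargs.pop("chat_template_kwargs", None)
    if template_kwargs is not None:
        # Copy any existing extra_body so the caller's dict is not mutated.
        extra_body = dict(kwargs.get("extra_body") or {})
        extra_body["chat_template_kwargs"] = template_kwargs
        kwargs["extra_body"] = extra_body
    return kwargs


fixed = relocate_chat_template_kwargs(
    {"temperature": 0.2, "chat_template_kwargs": {"enable_thinking": True}}
)
```

After the rewrite no unknown top-level keyword remains, so the call that carries the tool schema should no longer fail with a TypeError.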

Chore/core governance zoo split

…cratch file

Remove 9 files from examples/real/ that are duplicated in qitos-zoo/experimental/, violating CORE_BOUNDARY.md "zero duplication" rule. Also delete live_test_parse.py scratch file with hardcoded path.

Duplicated files removed:
- _whitzard_memory.py, computer_use_agent.py, epub_reader_agent.py
- openai_cua_agent.py, skillhub_github_agent.py, swe_agent.py
- terminus_2.py, visual_inspect_agent.py, whitzard_agent.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add ToolCallLoopDetector to qitos/engine/__init__.py exports
- Update migration plan import to use public API

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… and more

Major additions:
- qitos/func/: Functional agent composition (agent, compose, infer, task)
- qitos/mcp/: MCP integration (bridge, filter, http, schema_convert, server, stdio)
- qitos/tracing/: Tracing system (config, console, json_processor, models, provider)
- qitos/checkpoint/: Enhanced checkpointing (durability, fork, sqlite_store, versioning)
- qitos/engine/stream/: Streaming lifecycle (lifecycle, messages, transformer)
- qitos/kit/interceptor/: Interceptors (cache, logging, retry)
- qitos/core/: New modules (channel, field_reducers, interceptor, tool_schema)
- qitos/engine/: New modules (approval, cancellation, interrupt, run_state)
- Export ToolCallLoopDetector as public API from qitos.engine
- New CI workflows: contribution-test.yml, zoo-test.yml
- ~870 additional tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The 'zoo' extra declared 'qitos' as a dependency, which is a circular self-reference. Removed 'zoo' extra and removed 'qitos' from 'all' extra. The correct dependency direction is qitos-zoo depends on qitos, not reverse. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

qitos setup.py's find_packages() was including the qitos_zoo/ git submodule directory, which conflicts with the standalone qitos-zoo package that uses the same qitos_zoo namespace. This caused import resolution to pick up the submodule copy instead of the standalone repo, breaking qitos-zoo imports. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add RetryPolicy to qitos.core.tool with exponential backoff
- Enhance qita CLI with step-by-step REPL and handoff commands
- Extend action_executor with metric tracking and recovery hooks
- Add 8 new e2e test files covering critic, delegation, hooks, tools
- Update function_tool_decorator with improved error handling

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
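For readers unfamiliar with exponential backoff, a retry policy of this kind might look like the sketch below. The field names, defaults, and `run` method are assumptions for illustration; the real `RetryPolicy` in qitos.core.tool may differ.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, TypeVar

T = TypeVar("T")


@dataclass
class RetryPolicy:
    """Hypothetical retry policy; max_attempts must be at least 1."""
    max_attempts: int = 3
    base_delay: float = 0.5
    factor: float = 2.0
    max_delay: float = 8.0

    def delay_for(self, attempt: int) -> float:
        """Delay after failed attempt number `attempt` (1-based), capped at max_delay."""
        return min(self.base_delay * self.factor ** (attempt - 1), self.max_delay)

    def run(self, fn: Callable[[], T], sleep: Callable[[float], None] = time.sleep) -> T:
        """Call fn until it succeeds or the attempts run out, sleeping in between."""
        for attempt in range(1, self.max_attempts + 1):
            try:
                return fn()
            except Exception:
                if attempt == self.max_attempts:
                    raise  # out of attempts: surface the last error
                sleep(self.delay_for(attempt))
        raise RuntimeError("max_attempts must be at least 1")


attempts: List[int] = []
sleeps: List[float] = []


def flaky() -> str:
    """Fail twice, then succeed, to exercise the backoff schedule."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("transient")
    return "ok"


outcome = RetryPolicy().run(flaky, sleep=sleeps.append)
```

Passing `sleep` as a parameter keeps the schedule testable without real waiting.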

@critic

…expansion

Clean up dev-process artifacts and add missing tutorials to make the repo professional for external users and contributors.

Root-level cleanup:
- Merge agent.md + agent_new.md into single AGENTS.md
- Delete DESIGN.md (off-topic Linear design system)
- Delete CORE_BOUNDARY.md (merged into ARCHITECTURE.md)
- Delete security_report.md (auto-generated scan artifact)
- Move DEVELOPMENT.md content to docs/contributing/development.mdx
- Move plans/ to docs/internal/plans/

Code cleanup:
- Add DeprecationWarning to qitos/debug/ and qitos/cache/
- Delete empty qitos/examples/ package
- Add per-adapter DeprecationWarning for 6 benchmark subpackages

New tutorials (EN + ZH):
- critic-system: CriticResult, @critic decorator, instruction/state patches
- hook-lifecycle: EngineHook callbacks, HookContext, custom hooks
- func-api: @function_tool decorator, type hint inference, ToolRegistry
- mcp-integration: MCPServerStdio/HTTP, ToolFilter, async lifecycle
- checkpoint-and-fork: CheckpointStore, fork_checkpoint, DurabilityMode

Docs updates:
- docs.json: add 5 tutorials + contributing page (EN + ZH)
- tutorials/index.mdx: add v0.4 framework deep-dives section
- concepts/glossary.mdx: add 15 new terms
- introduction.mdx: add Critics/hooks, Checkpoint, MCP capabilities

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…very

- Add FamilyPreset.override() for creating customized preset copies
- Add advisory fields (recommended_max_steps, max_tokens, retry_budget, temperature) to FamilyPreset with values for 5 gold presets
- Implement MaxTokensCriteria stop criterion (BUDGET_TOKENS)
- Pass total_tokens in runtime_info for stop criteria evaluation
- Consolidate context_registry to derive from FamilyPreset (remove duplicate patterns, keep explicit table for sub-variant precision)
- Add `qit bench list` and `qit bench presets` CLI commands
- Update model-family-matrix to all 10 presets with context window and recommended models columns
- Rewrite switch-model-families tutorial with override and CLI examples
- Update family-presets, add-a-family-preset, critics-and-stop-criteria docs (EN+ZH) with new features
- Add glossary entries for preset override, MaxTokensCriteria, advisory defaults

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
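A token-budget stop criterion of the kind named above could be as small as the following sketch. Only `MaxTokensCriteria`, `BUDGET_TOKENS`, `total_tokens`, and `runtime_info` appear in the commit message; the `should_stop` method name and its return convention are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class MaxTokensCriteria:
    """Hypothetical stop criterion: stop once the run has used max_tokens tokens."""
    max_tokens: int

    def should_stop(self, runtime_info: Dict[str, Any]) -> Optional[str]:
        """Return a stop reason when the budget is exhausted, else None."""
        if runtime_info.get("total_tokens", 0) >= self.max_tokens:
            return "BUDGET_TOKENS"
        return None


criterion = MaxTokensCriteria(max_tokens=1000)
under_budget = criterion.should_stop({"total_tokens": 400})
over_budget = criterion.should_stop({"total_tokens": 1200})
```

A criterion like this only works if the engine passes `total_tokens` in `runtime_info`, which the same commit adds.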

Add WandbTraceProcessor (TraceProcessor for W&B integration), recipes README and eval_config.yaml files for each benchmark, updated benchmarks-and-recipes guide with eval_config usage and W&B note, and W&B integration docs (EN + ZH). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add CriticTrace/HandoffTrace on EngineResult, EngineConfig export, ToolPermissionSpec export, MlflowTraceProcessor, qita cost/perf metrics in run viewer, and MLflow integration docs (EN + ZH). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add SelfRefineAgent + SelfRefineCritic (generate→critique→refine loop)
- Add ReflexionAgent + ReflexionCritic (act→reflect→retry with memory)
- Add `qit new` subcommand with cookiecutter-based scaffolding
- Add `qit list-templates` to list scaffold and method templates
- Add cookiecutter extra to setup.py
- Add method-templates guide (EN + ZH) with quickstart and customization
- Update glossary (EN + ZH) with method template terms
- Update docs.json navigation for both languages

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add LATSAgent + LATSCritic (MCTS with UCB1 scoring and reflection)
- Add MoAOrchestrator + MoACritic (parallel proposals → aggregation)
- Add MagenticOneOrchestrator + ProgressCritic (dual-ledger with stall detection)
- Add template assets (paper.md, config.yaml, agent.py) for all 3 methods
- Add eval_config.yaml for all 3 methods
- Update CLI _METHOD_TEMPLATES with lats, moa, magentic_one
- Update method-templates guide (EN + ZH) with all 3 method sections
- Update glossary (EN + ZH) with LATS, MoA, Magentic-One terms
- 60 new tests (23 LATS + 17 MoA + 20 Magentic-One)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
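The UCB1 scoring mentioned for LATS is the standard bandit formula: mean value plus an exploration bonus that shrinks as a node is visited more often. The sketch below shows the textbook form only. How LATSAgent stores node statistics or picks the constant `c` is not stated in this PR, so those details are assumptions.

```python
import math
from typing import Dict, Tuple


def ucb1(total_value: float, visits: int, parent_visits: int, c: float = math.sqrt(2)) -> float:
    """Textbook UCB1 score; unvisited children score infinity so they are tried first."""
    if visits == 0:
        return math.inf
    exploitation = total_value / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration


# (total_value, visits) per child; the numbers are illustrative only.
children: Dict[str, Tuple[float, int]] = {"a": (2.0, 4), "b": (0.0, 0), "c": (1.5, 2)}
parent_visits = 6
best = max(children, key=lambda name: ucb1(*children[name], parent_visits))
```

Here the unvisited child `b` is selected, which is the behaviour that keeps the tree search from ignoring unexplored branches.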

Engine:
- Fix _action_runtime.py early-exit bug: blocked actions no longer kill sibling actions; collect-all-then-merge pattern
- Add submit_poc to _CONCURRENCY_SAFE_TOOLS for parallel submission

Harness:
- Add json_decision_multi_v1 protocol supporting native tool_calls
- GLM preset uses json_decision_multi_v1 with parallel_tool_calls=True
- Add recommended_request_kwargs to harness types
- Auto-merge preset kwargs in harness __init__

Render:
- Fix ClaudeStyleHook multi-action rendering with per-index dedup
- Add parallel action/observation visual indicators
- Extract _render_single_observation for multi-observation support
- Content renderer: action_summary/observation_summary support multi-action/multi-observation events
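The collect-all-then-merge pattern from the Engine fix can be illustrated as follows: a blocked action records its own result and the loop moves on, instead of returning early and dropping its siblings. The function and field names here are hypothetical, not taken from _action_runtime.py.

```python
from typing import Callable, Dict, List


def run_actions(
    actions: List[str],
    is_blocked: Callable[[str], bool],
    execute: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Run every action and collect one result per action, blocked or not."""
    results: List[Dict[str, str]] = []
    for name in actions:
        if is_blocked(name):
            # Record the block and keep going; an early return here would
            # silently discard the remaining sibling actions.
            results.append({"action": name, "status": "blocked"})
            continue
        results.append({"action": name, "status": "ok", "output": execute(name)})
    return results


results = run_actions(
    ["read_file", "rm_rf", "submit_poc"],
    is_blocked=lambda name: name == "rm_rf",
    execute=lambda name: f"{name} done",
)
statuses = [entry["status"] for entry in results]
```

The merged list preserves the original action order, so each observation can still be matched to its action by index.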

+    def test_delete(self):
+        mem = InMemorySharedMemory()
+        mem.write("key1", "value1")
+        assert mem.delete("key1") is True


+        mem.write("key1", "value1")
+        assert mem.delete("key1") is True
+        assert mem.read("key1") is None
+        assert mem.delete("key1") is False


+    def test_delete(self, tmp_path):
+        mem = FileSharedMemory(tmp_path / "shared.json")
+        mem.write("key1", "value1")
+        assert mem.delete("key1") is True


+        mem = InMemorySharedMemory()
+        ns = SharedMemoryNamespace(mem, "agent_a")
+        ns.write("key1", "value1")
+        assert ns.delete("key1") is True


+    def test_delete_nonexistent(self):
+        mem = InMemorySharedMemory()
+        ns = SharedMemoryNamespace(mem, "agent_a")
+        assert ns.delete("nope") is False


+from qitos.core.state import StateSchema
+
+
+@dataclass


+
+def create_snowl_agent(**kwargs: Any) -> Any:
+    """Create the {{cookiecutter.agent_name}} agent for Snowl evaluation."""
+    from .src.agent import {{cookiecutter.agent_name | pascalcase}}Agent


+
+from __future__ import annotations
+
+from {{cookiecutter.agent_name}}.src.agent import (


+from .backends import CacheBackend
+
+
+class CachedModel(Model):


+
+    # Create the agent and run it
+    agent = DeepSeekCoderAgent()
+    engine = Engine(agent=agent, model=model, history=history)


Previously, reasoning_content (from DeepSeek/QwQ/etc.) was only used as a fallback when content was empty. Now it is:
1. Extracted into ModelResponse.reasoning_content (new field)
2. Included in model_output event payload
3. Displayed by ContentFirstRenderer as the agent's "thought" text
4. Propagated through _hooks_impl to the TUI

Also: ToolResult.to_dict() promotes string output as "content" key so ContentFirstRenderer can find it.

Previously _extract_response_text() returned "" when tool_calls were present, discarding any content the model produced alongside its actions. Now content is always extracted regardless of tool_calls. ContentFirstRenderer.thought_text() shows both reasoning_content and raw_output (content text) when present, separated by "---". This means the operator sees the model's complete reasoning: API-level chain-of-thought + any non-tool-call text the model generated.

The agent's task instruction was never shown in the TUI, making it impossible to understand what the agent was asked to do without checking external logs. Now the task text is rendered as a cyan "TASK" line right after the RUN banner.

Previously build_tool_spec() set every parameter description to "". The :param annotations in docstrings were only visible as part of the top-level tool description, forcing the model to parse unstructured text to understand parameters.

Now:
- _parse_param_descriptions() extracts :param name: desc from docstrings
- Each parameter gets its own description in the schema
- _strip_param_docs() removes :param lines from the top-level description to avoid duplication

This means the model sees structured per-parameter descriptions in the tools API parameter, not a wall of text it has to parse itself.
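The docstring split described in this commit can be approximated with one regular expression. The sketch below combines both steps in a single hypothetical helper, `split_param_docs`, and handles only single-line `:param name: desc` entries; the real `_parse_param_descriptions()` and `_strip_param_docs()` may behave differently.

```python
import re
from typing import Dict, List, Tuple

_PARAM_RE = re.compile(r"^\s*:param\s+(\w+):\s*(.*)$")


def split_param_docs(docstring: str) -> Tuple[str, Dict[str, str]]:
    """Split a docstring into (description without :param lines, {name: description})."""
    params: Dict[str, str] = {}
    kept: List[str] = []
    for line in docstring.splitlines():
        match = _PARAM_RE.match(line)
        if match:
            params[match.group(1)] = match.group(2).strip()
        else:
            kept.append(line.strip())
    return "\n".join(kept).strip(), params


description, params = split_param_docs(
    """Read a file from the workspace.

    :param path: Path relative to the workspace root.
    :param limit: Maximum number of bytes to return.
    """
)
```

The description would then go into the tool's top-level `description`, and each entry of `params` into the matching property of the parameter schema.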

- _state_stats extracts chain nodes/gates, vulnerability analysis, current hypothesis, and path trace from agent state
- _render_chain_summary produces compact multi-line display with ✓/?/✗ badges for confirmed/inferred/refuted gates
- _hooks_impl renders chain summary with color-coded lines: cyan for chain overview and nodes, red for refuted gates, yellow for open gates, magenta for hypothesis, blue for analysis

- _model_runtime.py: _state_stats() extracts constraint_board and task_memory from state.metadata (primary) or builds from state fields (fallback). Removes old _render_chain_summary().
- _hooks_impl.py: renders Constraint Board and Task Memory sections with semantic color-coding. Uses exact same text the LLM sees in observation packet, ensuring TUI-LLM consistency.

_state_stats() was trying to extract chain/gate data from the observation object, but the observation doesn't carry the live CyberGymState — it only has serialised data. Now the live state object is passed from run_decide() → _run_llm_decide() → _state_stats(), so constraint_board and task_memory are properly extracted for TUI.

…t loop

Sync agent source (commit f63ccad) including:
- Vul-only trigger is PARTIAL success: agent keeps refining for precision
- stop_criteria: only stop on is_verified(), not vul_crashed() alone
- New vul_only_triggered gate + vul_crashed_partial verdict
- ASAN trace fallback from raw_output when vul_stderr is empty
- Patch-diff-guided refinement feedback
- Candidate escape hatches, targeted PoC registration, submit validation
- Removed .gitignore exclusion for cybergym/agent directory

…ion (v08)

…raw_output GLM-5.1 puts identical content in both reasoning_content and raw_output, causing the TUI to display the same thought twice separated by "---". Now skips raw_output when it duplicates reasoning_content.
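Combining the two thought sources with the "---" separator while skipping the duplicate case can be sketched as a single function. The signature is an assumption; the real `ContentFirstRenderer.thought_text()` presumably reads these values from an event payload.

```python
from typing import Optional


def thought_text(reasoning_content: Optional[str], raw_output: Optional[str]) -> str:
    """Join reasoning and content with '---', showing duplicated text only once."""
    reasoning = (reasoning_content or "").strip()
    content = (raw_output or "").strip()
    if reasoning and content and reasoning != content:
        return f"{reasoning}\n---\n{content}"
    # Either one side is empty or both are identical: show a single copy.
    return reasoning or content


both = thought_text("Check the header size.", "I will read parse.c next.")
duplicate = thought_text("Check the header size.", "Check the header size.")
```

This compares after stripping whitespace, so a trailing newline alone does not defeat the duplicate check; a fuzzier comparison would be needed for near-duplicates.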

Add log_file parameter to ClaudeStyleHook that creates a _TeeConsole proxy writing to both terminal and a plain-text log file. Each task in the batch runner now saves tui.log in its trace directory, preserving the STEP/finish/tool_calls/ctx_used format for offline review.
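The tee idea behind `_TeeConsole` is a writer that forwards every write to several streams. The class below is a generic sketch with a hypothetical name, `TeeWriter`; it does not reproduce how ClaudeStyleHook wires its console or strips styling for the plain-text log.

```python
import io
from typing import TextIO


class TeeWriter:
    """Minimal file-like object that duplicates writes to every wrapped stream."""

    def __init__(self, *streams: TextIO) -> None:
        self._streams = streams

    def write(self, text: str) -> int:
        for stream in self._streams:
            stream.write(text)
        return len(text)

    def flush(self) -> None:
        for stream in self._streams:
            stream.flush()


# StringIO objects stand in for the terminal and for tui.log.
terminal, log_file = io.StringIO(), io.StringIO()
tee = TeeWriter(terminal, log_file)
print("STEP 1", file=tee)
tee.flush()
```

In real use the second stream would be an open file in the task's trace directory, closed when the run ends.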

Brings the vendored agent up to date: constraint collection overhaul (3081def) plus two fixes. First, submit_poc results are keyed by (agent_id, poc) to stop cross-task and parallel-submit verdict leakage. Second, the previously missing CyberGymState.pending_reminders field, which was causing unrecoverable_error, is now present. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
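Keying results by (agent_id, poc) means two agents that submit the same PoC identifier can never read each other's verdict. A minimal sketch, assuming a hypothetical in-memory `VerdictStore` and string identifiers (the vendored agent's actual storage is not shown in this PR):

```python
from typing import Dict, Optional, Tuple


class VerdictStore:
    """Hypothetical store for submit_poc verdicts, keyed by (agent_id, poc)."""

    def __init__(self) -> None:
        self._verdicts: Dict[Tuple[str, str], str] = {}

    def record(self, agent_id: str, poc: str, verdict: str) -> None:
        self._verdicts[(agent_id, poc)] = verdict

    def lookup(self, agent_id: str, poc: str) -> Optional[str]:
        """Return the verdict for this agent's submission, or None if absent."""
        return self._verdicts.get((agent_id, poc))


store = VerdictStore()
store.record("task-a", "poc.bin", "vul_crashed")
store.record("task-b", "poc.bin", "no_crash")
verdict_a = store.lookup("task-a", "poc.bin")
verdict_b = store.lookup("task-b", "poc.bin")
```

Keying by the PoC identifier alone would let the second `record` call overwrite the first, which is the kind of leakage the fix describes.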

… handling

runner.py: CYBERGYM_USE_DOCKER_ENV gate (default off) runs each task's agent tools inside an ephemeral container via a same-path bind mount, so the agent cannot wander outside its task workspace while the host-side process (LLM/submit_poc) keeps working.

docker_env.py: _inner_path now uses absolute paths verbatim instead of double-nesting them under the workdir.

Verified: 4 concurrent containers, per-task grading correct, no leakage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
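With a same-path bind mount, a host path and the container path are identical, so an absolute path should pass through unchanged and only relative paths need the workdir prefix. The sketch below shows that intended behaviour with a hypothetical `inner_path` function; the real `_inner_path` in docker_env.py is not reproduced here.

```python
from pathlib import PurePosixPath


def inner_path(workdir: str, path: str) -> str:
    """Map a tool path to its in-container path under a same-path bind mount."""
    candidate = PurePosixPath(path)
    if candidate.is_absolute():
        # Already a full path: use it verbatim rather than nesting it again.
        return str(candidate)
    return str(PurePosixPath(workdir) / candidate)


absolute = inner_path("/work/task1", "/work/task1/poc.bin")
relative = inner_path("/work/task1", "out/poc.bin")
```

A naive string join such as `workdir + "/" + path` would turn the absolute case into a double-nested path, which matches the bug the commit describes.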

…e-feedback fix) Vendored sync of the fix-side leak fix: under CYBERGYM_VUL_ONLY_FEEDBACK (default on) the agent sees vul-side feedback only and stops on the first vul-side crash; the server still grades fix-side into its DB for scoring but the verdict is no longer read back to the agent. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Without this file, Python treats this directory as a namespace package when the parent (traj_analyzer/) is on sys.path, shadowing the pip-installed qitos package and causing ImportError for AgentModule.

…e shadowing" This reverts commit 64207ef.

Write exceptions to CYBERGYM_TASK_TRACE_DIR/step_error.log at every exception handler in engine.py (DECIDE, ACT, init_state, setup_toolsets, setup_env) and add sys.excepthook in run_local.py so no exception can be silently lost regardless of stderr redirection.
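A `sys.excepthook` that appends tracebacks to a log file before delegating to the previous hook could look like this. The `install_excepthook` helper and the temporary directory are illustrative assumptions; the commit itself writes under CYBERGYM_TASK_TRACE_DIR.

```python
import sys
import tempfile
import traceback
from pathlib import Path


def install_excepthook(log_path: Path) -> None:
    """Append uncaught exceptions to log_path, then call the previous hook."""
    previous = sys.excepthook

    def hook(exc_type, exc, tb):
        with log_path.open("a", encoding="utf-8") as handle:
            traceback.print_exception(exc_type, exc, tb, file=handle)
        previous(exc_type, exc, tb)  # keep the normal stderr output as well

    sys.excepthook = hook


# Demonstration: a temporary directory stands in for the trace directory.
log_path = Path(tempfile.mkdtemp()) / "step_error.log"
install_excepthook(log_path)
try:
    raise ValueError("boom")
except ValueError:
    sys.excepthook(*sys.exc_info())  # simulate an uncaught exception
logged = log_path.read_text(encoding="utf-8")
```

Because the file is written before the previous hook runs, the traceback survives even when stderr is redirected or discarded.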

…oard

_model_runtime._state_stats() now extracts sink_candidates and objective from state.metadata. _hooks_impl.py renders them with semantic coloring:
- Sink Candidates: bright_magenta for high conf, yellow for nudge
- Objective: green inline text

…dering

- Extract _tui_phase from metadata in _state_stats for reliable Phase Badge
- Add task_context and allowed_tools to _state_stats extraction
- Render Task Context section (vulnerability, bug type, strategy, input format)
- Render Allowed Tools section with checkpoint-blocked highlighting
- Color-code CHECKPOINT BLOCKED text in bold red

- Render Suggested Sinks section in TUI (bright_blue styling)
- Add [AUTO] tag styling in Sink Candidates rendering
- Extract _tui_suggested_sinks from metadata in _state_stats

Raise body truncation limits from 220/2000 to 50000 for all observation types (tool results, error content, syntax highlighting, terminal output). When truncation does occur (extremely rare at 50K), show total char count. This ensures researchers can see full tool call results in TUI logs.
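Truncation that reports the total length can be expressed in a few lines. The 50000 default comes from the commit message; the function name `truncate_body` and the wording of the suffix are assumptions.

```python
def truncate_body(text: str, limit: int = 50_000) -> str:
    """Return text unchanged if it fits, else the first `limit` chars plus a size note."""
    if len(text) <= limit:
        return text
    return f"{text[:limit]}\n... [truncated, {len(text)} chars total]"


short = truncate_body("abc", limit=5)
clipped = truncate_body("abcdefgh", limit=3)
```

Reporting the original length tells the reader of a TUI log how much output was cut, which a bare ellipsis does not.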

ravenSanstete and others added 30 commits April 16, 2026 08:36

Merge pull request #12 from bmz-q-q/feat/cybergym-qitos-integration

b4bacaa

Add CyberGym benchmark integration

Restore non-agent cybergym changes

0d0055a

Ignore local CyberGym agent sync

399709b

Add a GLM family preset

50e5fcf

feat: harden cybergym runtime transport and layout

158e9c9

chore: checkpoint cybergym qitos updates

2cca379

Bump version to 0.4.0

7ce8278

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

merge and add more applications

b250b63

Merge pull request #15 from Qitor/chore/core-governance-zoo-split

bd7ba2b

merge and add more applications

chore: clean qitos-zoo staging paths

7878f65

chore: update qitos_zoo submodule pointer

95b6edb

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge pull request #16 from Qitor/chore/core-governance-zoo-split

9bc51ae

Chore/core governance zoo split

Export ToolCallLoopDetector as public API from qitos.engine

4c10c07


bmz-q-q and others added 3 commits June 20, 2026 12:09

docs: add CyberGym lightweight upgrade design spec

9c383b0

feat: cybergym optimization v1 - first stable version

e82ef2f

github-code-quality Bot found potential problems Jun 26, 2026


ravenSanstete and others added 26 commits June 27, 2026 14:20

feat(tui): display task instruction at run start

707979f


feat(cybergym): sync latest agent with constraint discovery accelerat…

da0aa73

…ion (v08)

feat: qita CLI updates, workspace tool improvements, docs refresh

85a8099

fix: add __init__.py to repo root to prevent namespace package shadowing

64207ef


Revert "fix: add __init__.py to repo root to prevent namespace packag…

d4c816e

…e shadowing" This reverts commit 64207ef.

sync: cybergym agent V11 — active sink discovery optimizations

0e5df2b

diag: add traceback.print_exc() to engine exception handlers

147468f

fix: surface recovered runtime exceptions

39f423f

feat: TUI rendering for suggested sinks + auto-discovery tags

8400678


sync: agent v19 with no Point access + vuln patterns + auto-discovery

8204310


Qitos cybergym #23
ravenSanstete wants to merge 69 commits into main from qitos_cybergym

ravenSanstete commented Jun 26, 2026


3 participants

