Qitos cybergym#23
Open
ravenSanstete wants to merge 69 commits into
Add CyberGym benchmark integration
- Remove AI-sounding phrasing and unnecessary quotes across all docs (EN + ZH)
- Add term explanations at first occurrence (kernel, protocol, transport, parser, harness, preset, critic, artifact, trace, etc.)
- Make Chinese docs more natural instead of machine-translated
- Add multi-agent handoff design plan to plans/
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace English terms with "中文(英文术语)" format at first occurrence, then Chinese only
- Eliminate translationese and English-Chinese mixed sentences
- Unify terminology: parser→解析器, protocol→协议, harness→适配层, preset→预设, agent→智能体, runtime→运行时, prompt→提示词, benchmark→基准测试, trace→追踪记录, etc.
- Keep code blocks, commands, and API/class names in English
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Multi-agent core:
- Add DelegateTool (1:1 delegation+return), FanOutTool (1:N parallel+reduce),
Decision.handoff() for linear agent transfer
- Add AgentSpec, AgentRegistry, ContextStrategy, HandoffContext, StateAdapter
- Add SharedMemory for cross-agent state sharing
- Add _HandoffRuntime with loop detection and max-handoff guard
- Add agent_id to StepRecord, trace serialization, and manifest metadata
- Add kit patterns: manager_worker, planner_executor, proposer_verifier
- Add examples/patterns/ for delegate, fanout, handoff
AsyncEngine + Streaming (Phase 1):
- Add AsyncEngine with arun() and arun_stream() methods
- Add EngineEvent, EngineEventType, EventStream for structured streaming
- Add AsyncOpenAICompatibleModel and AsyncOpenAIModel with async API calls
- Add SSE endpoint /api/stream/{run_id} to qita
- Add live stream button to qita run detail page
TUI + qita multi-agent adaptation:
- ClaudeStyleHook renders HANDOFF/DELEGATE/FANOUT events with agent labels
- qita board shows multi-agent badges, agent filters, agent switch indicators
- qita run view: step agent badges, handoff/delegate/fanout event rendering
- qita replay: multi-agent kind detection and specialized card rendering
Docs (EN + ZH):
- Update engine concept docs with AsyncEngine section
- Update API reference with AsyncEngine, EngineEvent, EventStream
- Update observability guide with SSE streaming section
- Add multi-agent patterns guide and tutorial
- Update CHANGELOG.md
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
merge and add more applications
- Remove examples/real/claude_code/ (now in qitos-zoo/qitos_coder/)
- Remove qitos/examples/pentagi/ (now in qitos-zoo/qitos_cyber/)
- Remove qitos/benchmark/pentagi_e2e/ (now in qitos-zoo/qitos_cyber/)
- Remove sandbox/ and root-level debug/test scripts
- Remove agent-specific e2e tests from core tests/
- Add qitos_zoo as git submodule (github.com/Qitor/qitos-zoo)
- Update core test imports to use qitos_zoo.qitos_coder
- Update tests/conftest.py to add qitos_zoo to sys.path
- Update CORE_BOUNDARY.md with e2e ownership and zero-duplication rules
- Fix engine: native_text_final decision and set_stop on final mode
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…esponses
When native tool calling is preferred and the model returns plain text without tool_calls, the json_decision_v1 parser would return wait(), causing agents to loop forever without producing a final result.
Now, when the model returns plain text under native tool calling:
1. Try the parser chain first (honour act/final if it works)
2. Override parser wait() with native_text_final Decision.final()
This fixes 6/12 previously failing e2e tests; all 12 now pass.
Also updated test assertions: stop_reason "final" is now correctly returned alongside "success" for agents that produce final answers.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
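The two-step fallback described above can be sketched roughly as follows. This is an illustration only: `Decision` here is a hypothetical stand-in dataclass, and `decide_from_plain_text` and `always_wait` are made-up names, not the engine's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Decision:
    """Hypothetical stand-in for the engine's decision type."""
    mode: str                      # "act", "final", or "wait"
    payload: Optional[str] = None
    source: str = "parser"


def decide_from_plain_text(text: str, parse: Callable[[str], Decision]) -> Decision:
    """Resolve a plain-text reply that arrived without tool_calls."""
    decision = parse(text)                 # 1. try the parser chain first
    if decision.mode in ("act", "final"):
        return decision                    # honour a usable parse
    # 2. the parser returned wait(): treat the text itself as the final answer
    return Decision(mode="final", payload=text, source="native_text_final")


def always_wait(_text: str) -> Decision:
    """Parser stub that mimics the old looping behaviour."""
    return Decision(mode="wait")


result = decide_from_plain_text("The answer is 42.", always_wait)
```

With a parser that only ever returns wait, the agent now terminates with a final decision instead of looping.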
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…Kimi/GLM
Key changes enabling real LLM tool calling in DeepAudit pipeline:
- openai.py: Add default_request_kwargs to OpenAIModel/OpenAICompatibleModel; relocate chat_template_kwargs into extra_body for vLLM compatibility (passing it as top-level kwarg causes TypeError, silently dropping tools)
- _presets.py: Set native_tool_call_preferred=True for deepseek, kimi, glm; change glm primary_delivery to api_parameter with json_decision_v1 protocol
- _model_runtime.py: Add logging for tool schema delivery failures instead of silent exception swallowing; add warning on TypeError fallback
- harness/_adapters.py, harness/__init__.py: Forward default_request_kwargs through the build_model_for_preset chain
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
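A minimal sketch of the chat_template_kwargs relocation, assuming the request is assembled as a plain dict before it reaches the client. The helper name `relocate_chat_template_kwargs` and the `enable_thinking` key are illustrative, not taken from openai.py.

```python
from typing import Any, Dict


def relocate_chat_template_kwargs(request_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with chat_template_kwargs nested under extra_body.

    The OpenAI Python client rejects unknown top-level keyword arguments,
    while keys placed in extra_body are forwarded in the request body.
    """
    kwargs = dict(request_kwargs)                      # do not mutate the caller's dict
    template_kwargs = kwargs.pop("chat_template_kwargs", None)
    if template_kwargs is not None:
        extra_body = dict(kwargs.get("extra_body") or {})
        extra_body["chat_template_kwargs"] = template_kwargs
        kwargs["extra_body"] = extra_body
    return kwargs


original = {"temperature": 0.2, "chat_template_kwargs": {"enable_thinking": False}}
fixed = relocate_chat_template_kwargs(original)
```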
Chore/core governance zoo split
…cratch file
Remove 9 files from examples/real/ that are duplicated in qitos-zoo/experimental/, violating CORE_BOUNDARY.md "zero duplication" rule. Also delete live_test_parse.py scratch file with hardcoded path.
Duplicated files removed:
- _whitzard_memory.py, computer_use_agent.py, epub_reader_agent.py
- openai_cua_agent.py, skillhub_github_agent.py, swe_agent.py
- terminus_2.py, visual_inspect_agent.py, whitzard_agent.py
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add ToolCallLoopDetector to qitos/engine/__init__.py exports
- Update migration plan import to use public API
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… and more
Major additions:
- qitos/func/: Functional agent composition (agent, compose, infer, task)
- qitos/mcp/: MCP integration (bridge, filter, http, schema_convert, server, stdio)
- qitos/tracing/: Tracing system (config, console, json_processor, models, provider)
- qitos/checkpoint/: Enhanced checkpointing (durability, fork, sqlite_store, versioning)
- qitos/engine/stream/: Streaming lifecycle (lifecycle, messages, transformer)
- qitos/kit/interceptor/: Interceptors (cache, logging, retry)
- qitos/core/: New modules (channel, field_reducers, interceptor, tool_schema)
- qitos/engine/: New modules (approval, cancellation, interrupt, run_state)
- Export ToolCallLoopDetector as public API from qitos.engine
- New CI workflows: contribution-test.yml, zoo-test.yml
- ~870 additional tests
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The 'zoo' extra declared 'qitos' as a dependency, which is a circular self-reference. Removed 'zoo' extra and removed 'qitos' from 'all' extra. The correct dependency direction is qitos-zoo depends on qitos, not reverse. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
qitos setup.py's find_packages() was including the qitos_zoo/ git submodule directory, which conflicts with the standalone qitos-zoo package that uses the same qitos_zoo namespace. This caused import resolution to pick up the submodule copy instead of the standalone repo, breaking qitos-zoo imports. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add RetryPolicy to qitos.core.tool with exponential backoff
- Enhance qita CLI with step-by-step REPL and handoff commands
- Extend action_executor with metric tracking and recovery hooks
- Add 8 new e2e test files covering critic, delegation, hooks, tools
- Update function_tool_decorator with improved error handling
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
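For readers unfamiliar with the pattern, a retry policy with exponential backoff might look like the sketch below. The field names, defaults, and `call_with_retry` helper are assumptions for illustration and may not match the RetryPolicy in qitos.core.tool.

```python
import time
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical policy: the delay doubles per attempt, capped at max_delay."""
    max_attempts: int = 3
    base_delay: float = 0.5
    factor: float = 2.0
    max_delay: float = 8.0

    def delay_for(self, attempt: int) -> float:
        """Seconds to wait after failed attempt number `attempt` (1-based)."""
        return min(self.base_delay * self.factor ** (attempt - 1), self.max_delay)


def call_with_retry(fn: Callable[[], T], policy: RetryPolicy,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn until it succeeds or the attempt budget is exhausted."""
    last_error: Exception = RuntimeError("max_attempts must be >= 1")
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return fn()
        except Exception as exc:               # a real tool would narrow this
            last_error = exc
            if attempt < policy.max_attempts:
                sleep(policy.delay_for(attempt))
    raise last_error
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.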
…expansion
Clean up dev-process artifacts and add missing tutorials to make the repo professional for external users and contributors.
Root-level cleanup:
- Merge agent.md + agent_new.md into single AGENTS.md
- Delete DESIGN.md (off-topic Linear design system)
- Delete CORE_BOUNDARY.md (merged into ARCHITECTURE.md)
- Delete security_report.md (auto-generated scan artifact)
- Move DEVELOPMENT.md content to docs/contributing/development.mdx
- Move plans/ to docs/internal/plans/
Code cleanup:
- Add DeprecationWarning to qitos/debug/ and qitos/cache/
- Delete empty qitos/examples/ package
- Add per-adapter DeprecationWarning for 6 benchmark subpackages
New tutorials (EN + ZH):
- critic-system: CriticResult, @critic decorator, instruction/state patches
- hook-lifecycle: EngineHook callbacks, HookContext, custom hooks
- func-api: @function_tool decorator, type hint inference, ToolRegistry
- mcp-integration: MCPServerStdio/HTTP, ToolFilter, async lifecycle
- checkpoint-and-fork: CheckpointStore, fork_checkpoint, DurabilityMode
Docs updates:
- docs.json: add 5 tutorials + contributing page (EN + ZH)
- tutorials/index.mdx: add v0.4 framework deep-dives section
- concepts/glossary.mdx: add 15 new terms
- introduction.mdx: add Critics/hooks, Checkpoint, MCP capabilities
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…very
- Add FamilyPreset.override() for creating customized preset copies
- Add advisory fields (recommended_max_steps, max_tokens, retry_budget, temperature) to FamilyPreset with values for 5 gold presets
- Implement MaxTokensCriteria stop criterion (BUDGET_TOKENS)
- Pass total_tokens in runtime_info for stop criteria evaluation
- Consolidate context_registry to derive from FamilyPreset (remove duplicate patterns, keep explicit table for sub-variant precision)
- Add `qit bench list` and `qit bench presets` CLI commands
- Update model-family-matrix to all 10 presets with context window and recommended models columns
- Rewrite switch-model-families tutorial with override and CLI examples
- Update family-presets, add-a-family-preset, critics-and-stop-criteria docs (EN+ZH) with new features
- Add glossary entries for preset override, MaxTokensCriteria, advisory defaults
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
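As a rough illustration of how a preset override and a token-budget stop criterion could fit together, the sketch below uses simplified stand-in classes. The `should_stop` method name, its return shape, and the field subset are assumptions; only the class names, `total_tokens`, and the BUDGET_TOKENS reason come from the commit message.

```python
from dataclasses import dataclass, replace
from typing import Any, Dict, Optional, Tuple


@dataclass(frozen=True)
class FamilyPreset:
    """Simplified stand-in carrying two of the advisory fields."""
    name: str
    recommended_max_steps: Optional[int] = None
    max_tokens: Optional[int] = None

    def override(self, **changes: Any) -> "FamilyPreset":
        """Return a customized copy; the original preset stays untouched."""
        return replace(self, **changes)


@dataclass
class MaxTokensCriteria:
    """Stop once the run's cumulative token usage reaches the budget."""
    max_tokens: int

    def should_stop(self, runtime_info: Dict[str, Any]) -> Tuple[bool, Optional[str]]:
        used = runtime_info.get("total_tokens", 0)
        if used >= self.max_tokens:
            return True, "BUDGET_TOKENS"
        return False, None


base = FamilyPreset(name="glm", recommended_max_steps=30, max_tokens=100_000)
tight = base.override(max_tokens=20_000)
criteria = MaxTokensCriteria(max_tokens=tight.max_tokens)
```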
Add WandbTraceProcessor (TraceProcessor for W&B integration), recipes README and eval_config.yaml files for each benchmark, updated benchmarks-and-recipes guide with eval_config usage and W&B note, and W&B integration docs (EN + ZH). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add CriticTrace/HandoffTrace on EngineResult, EngineConfig export, ToolPermissionSpec export, MlflowTraceProcessor, qita cost/perf metrics in run viewer, and MLflow integration docs (EN + ZH). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add SelfRefineAgent + SelfRefineCritic (generate→critique→refine loop)
- Add ReflexionAgent + ReflexionCritic (act→reflect→retry with memory)
- Add `qit new` subcommand with cookiecutter-based scaffolding
- Add `qit list-templates` to list scaffold and method templates
- Add cookiecutter extra to setup.py
- Add method-templates guide (EN + ZH) with quickstart and customization
- Update glossary (EN + ZH) with method template terms
- Update docs.json navigation for both languages
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add LATSAgent + LATSCritic (MCTS with UCB1 scoring and reflection)
- Add MoAOrchestrator + MoACritic (parallel proposals → aggregation)
- Add MagenticOneOrchestrator + ProgressCritic (dual-ledger with stall detection)
- Add template assets (paper.md, config.yaml, agent.py) for all 3 methods
- Add eval_config.yaml for all 3 methods
- Update CLI _METHOD_TEMPLATES with lats, moa, magentic_one
- Update method-templates guide (EN + ZH) with all 3 method sections
- Update glossary (EN + ZH) with LATS, MoA, Magentic-One terms
- 60 new tests (23 LATS + 17 MoA + 20 Magentic-One)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
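For reference, the standard UCB1 rule that MCTS-style search uses to pick the next child is sketched below. This is the textbook formula, not code from LATSAgent; the dict-based node layout and the default exploration constant are assumptions.

```python
import math
from typing import Dict, List


def ucb1(total_value: float, visits: int, parent_visits: int,
         c: float = math.sqrt(2)) -> float:
    """UCB1 score: mean value plus an exploration bonus.

    Unvisited children score infinity so each is tried at least once.
    """
    if visits == 0:
        return math.inf
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children: List[Dict[str, float]], parent_visits: int) -> Dict[str, float]:
    """Pick the child with the highest UCB1 score."""
    return max(children, key=lambda ch: ucb1(ch["value"], int(ch["visits"]), parent_visits))


children = [
    {"id": 0, "value": 3.0, "visits": 4},
    {"id": 1, "value": 1.0, "visits": 1},
    {"id": 2, "value": 0.0, "visits": 0},   # never visited
]
chosen = select_child(children, parent_visits=5)
```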
Engine:
- Fix _action_runtime.py early-exit bug: blocked actions no longer kill sibling actions; collect-all-then-merge pattern
- Add submit_poc to _CONCURRENCY_SAFE_TOOLS for parallel submission
Harness:
- Add json_decision_multi_v1 protocol supporting native tool_calls
- GLM preset uses json_decision_multi_v1 with parallel_tool_calls=True
- Add recommended_request_kwargs to harness types
- Auto-merge preset kwargs in harness __init__
Render:
- Fix ClaudeStyleHook multi-action rendering with per-index dedup
- Add parallel action/observation visual indicators
- Extract _render_single_observation for multi-observation support
- Content renderer: action_summary/observation_summary support multi-action/multi-observation events
def test_delete(self):
    mem = InMemorySharedMemory()
    mem.write("key1", "value1")
    assert mem.delete("key1") is True
    assert mem.read("key1") is None
    assert mem.delete("key1") is False

def test_delete(self, tmp_path):
    mem = FileSharedMemory(tmp_path / "shared.json")
    mem.write("key1", "value1")
    assert mem.delete("key1") is True

    mem = InMemorySharedMemory()
    ns = SharedMemoryNamespace(mem, "agent_a")
    ns.write("key1", "value1")
    assert ns.delete("key1") is True

def test_delete_nonexistent(self):
    mem = InMemorySharedMemory()
    ns = SharedMemoryNamespace(mem, "agent_a")
    assert ns.delete("nope") is False
from qitos.core.state import StateSchema

@dataclass

def create_snowl_agent(**kwargs: Any) -> Any:
    """Create the {{cookiecutter.agent_name}} agent for Snowl evaluation."""
    from .src.agent import {{cookiecutter.agent_name | pascalcase}}Agent

from __future__ import annotations

from {{cookiecutter.agent_name}}.src.agent import (

from .backends import CacheBackend

class CachedModel(Model):
# Create the agent and run it
agent = DeepSeekCoderAgent()
engine = Engine(agent=agent, model=model, history=history)
Previously, reasoning_content (from DeepSeek/QwQ/etc.) was only used as a fallback when content was empty. Now it is:
1. Extracted into ModelResponse.reasoning_content (new field)
2. Included in model_output event payload
3. Displayed by ContentFirstRenderer as the agent's "thought" text
4. Propagated through _hooks_impl to the TUI
Also: ToolResult.to_dict() promotes string output as "content" key so ContentFirstRenderer can find it.
Previously _extract_response_text() returned "" when tool_calls were present, discarding any content the model produced alongside its actions. Now content is always extracted regardless of tool_calls. ContentFirstRenderer.thought_text() shows both reasoning_content and raw_output (content text) when present, separated by "---". This means the operator sees the model's complete reasoning: API-level chain-of-thought + any non-tool-call text the model generated.
The agent's task instruction was never shown in the TUI, making it impossible to understand what the agent was asked to do without checking external logs. Now the task text is rendered as a cyan "TASK" line right after the RUN banner.
Previously build_tool_spec() set every parameter description to "". The :param annotations in docstrings were only visible as part of the top-level tool description, forcing the model to parse unstructured text to understand parameters.
Now:
- _parse_param_descriptions() extracts :param name: desc from docstrings
- Each parameter gets its own description in the schema
- _strip_param_docs() removes :param lines from the top-level description to avoid duplication
This means the model sees structured per-parameter descriptions in the tools API parameter, not a wall of text it has to parse itself.
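The extraction step could look something like the sketch below, which handles only single-line `:param name: desc` entries. `split_docstring` is a hypothetical helper that combines what the commit attributes to _parse_param_descriptions() and _strip_param_docs(); the real functions may differ.

```python
import re
from typing import Dict, Tuple

# Assumption: one ":param name: description" per line, no type in the field.
_PARAM_RE = re.compile(r"^\s*:param\s+(\w+)\s*:\s*(.*)$")


def split_docstring(doc: str) -> Tuple[str, Dict[str, str]]:
    """Split a docstring into (top-level description, per-parameter descriptions)."""
    params: Dict[str, str] = {}
    kept = []
    for line in doc.splitlines():
        match = _PARAM_RE.match(line)
        if match:
            params[match.group(1)] = match.group(2).strip()
        else:
            kept.append(line.strip())        # non-:param lines stay in the description
    return "\n".join(kept).strip(), params


description, params = split_docstring(
    """Read a file from the workspace.

    :param path: File to read.
    :param limit: Maximum number of bytes.
    """
)
```

The per-parameter strings would then be attached to each property in the tool's JSON schema, while the stripped description becomes the tool-level text.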
- _state_stats extracts chain nodes/gates, vulnerability analysis, current hypothesis, and path trace from agent state
- _render_chain_summary produces compact multi-line display with ✓/?/✗ badges for confirmed/inferred/refuted gates
- _hooks_impl renders chain summary with color-coded lines: cyan for chain overview and nodes, red for refuted gates, yellow for open gates, magenta for hypothesis, blue for analysis
- _model_runtime.py: _state_stats() extracts constraint_board and task_memory from state.metadata (primary) or builds from state fields (fallback). Removes old _render_chain_summary().
- _hooks_impl.py: renders Constraint Board and Task Memory sections with semantic color-coding. Uses exact same text the LLM sees in observation packet, ensuring TUI-LLM consistency.
_state_stats() was trying to extract chain/gate data from the observation object, but the observation doesn't carry the live CyberGymState — it only has serialised data. Now the live state object is passed from run_decide() → _run_llm_decide() → _state_stats(), so constraint_board and task_memory are properly extracted for TUI.
…t loop
Sync agent source (commit f63ccad) including:
- Vul-only trigger is PARTIAL success: agent keeps refining for precision
- stop_criteria: only stop on is_verified(), not vul_crashed() alone
- New vul_only_triggered gate + vul_crashed_partial verdict
- ASAN trace fallback from raw_output when vul_stderr is empty
- Patch-diff-guided refinement feedback
- Candidate escape hatches, targeted PoC registration, submit validation
- Removed .gitignore exclusion for cybergym/agent directory
…raw_output GLM-5.1 puts identical content in both reasoning_content and raw_output, causing the TUI to display the same thought twice separated by "---". Now skips raw_output when it duplicates reasoning_content.
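A minimal sketch of the combine-and-dedup rule, assuming both fields arrive as plain strings. The function below is illustrative rather than the actual ContentFirstRenderer.thought_text(), and the exact separator layout is an assumption.

```python
def thought_text(reasoning_content: str, raw_output: str) -> str:
    """Join reasoning and content text with '---', skipping exact duplicates."""
    reasoning = (reasoning_content or "").strip()
    content = (raw_output or "").strip()
    parts = [reasoning] if reasoning else []
    # Some models echo the same text in both fields; show it only once.
    if content and content != reasoning:
        parts.append(content)
    return "\n---\n".join(parts)


distinct = thought_text("Check the parser first.", "Calling read_file next.")
duplicated = thought_text("Same text.", "Same text.")
```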
Add log_file parameter to ClaudeStyleHook that creates a _TeeConsole proxy writing to both terminal and a plain-text log file. Each task in the batch runner now saves tui.log in its trace directory, preserving the STEP/finish/tool_calls/ctx_used format for offline review.
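The tee idea can be sketched as a small writer that forwards each chunk to two streams. `TeeWriter` is a hypothetical name, and stripping ANSI colour codes for the log is an assumption about how the plain-text file is produced, not a description of _TeeConsole.

```python
import re
from typing import TextIO

# Matches SGR colour/style escape sequences such as "\x1b[36m".
_ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")


class TeeWriter:
    """Forward writes to a terminal stream and a plain-text log stream."""

    def __init__(self, terminal: TextIO, log: TextIO) -> None:
        self.terminal = terminal
        self.log = log

    def write(self, text: str) -> int:
        self.terminal.write(text)                  # terminal keeps its colours
        self.log.write(_ANSI_RE.sub("", text))     # log receives plain text
        return len(text)

    def flush(self) -> None:
        self.terminal.flush()
        self.log.flush()
```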
Brings the vendored agent up to date: constraint collection overhaul (3081def) plus the two fixes — submit_poc results keyed by (agent_id, poc) to stop cross-task and parallel-submit verdict leakage, and the missing CyberGymState.pending_reminders field (was causing unrecoverable_error). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… handling
- runner.py: CYBERGYM_USE_DOCKER_ENV gate (default off) runs each task's agent tools inside an ephemeral container via a same-path bind mount, so the agent cannot wander outside its task workspace while the host-side process (LLM/submit_poc) keeps working.
- docker_env.py: _inner_path now uses absolute paths verbatim instead of double-nesting them under the workdir.
Verified: 4 concurrent containers, per-task grading correct, no leakage.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
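The path fix can be illustrated with the sketch below. Because the workspace is bind-mounted at the same path inside the container, an absolute host path is already valid there and only relative paths need resolving. `inner_path` and its signature are assumptions, not the code in docker_env.py.

```python
from pathlib import PurePosixPath


def inner_path(workdir: str, path: str) -> str:
    """Map a tool-supplied path to its location inside the container."""
    candidate = PurePosixPath(path)
    if candidate.is_absolute():
        return str(candidate)                      # use verbatim, no double-nesting
    return str(PurePosixPath(workdir) / candidate)  # resolve relative to the workdir


relative = inner_path("/work/task1", "poc.bin")
absolute = inner_path("/work/task1", "/work/task1/poc.bin")
```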
…e-feedback fix) Vendored sync of the fix-side leak fix: under CYBERGYM_VUL_ONLY_FEEDBACK (default on) the agent sees vul-side feedback only and stops on the first vul-side crash; the server still grades fix-side into its DB for scoring but the verdict is no longer read back to the agent. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Without this file, Python treats this directory as a namespace package when the parent (traj_analyzer/) is on sys.path, shadowing the pip-installed qitos package and causing ImportError for AgentModule.
…e shadowing" This reverts commit 64207ef.
Write exceptions to CYBERGYM_TASK_TRACE_DIR/step_error.log at every exception handler in engine.py (DECIDE, ACT, init_state, setup_toolsets, setup_env) and add sys.excepthook in run_local.py so no exception can be silently lost regardless of stderr redirection.
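A last-resort hook of this kind might be built as below. `sys.excepthook` and `traceback.format_exception` are standard library; the factory name `make_logging_excepthook` and the way the log path is passed in are assumptions about run_local.py.

```python
import sys
import traceback
from pathlib import Path
from typing import Callable


def make_logging_excepthook(log_path: Path, previous: Callable) -> Callable:
    """Build an excepthook that appends the traceback to log_path, then chains."""

    def hook(exc_type, exc, tb) -> None:
        with log_path.open("a", encoding="utf-8") as fh:
            fh.write("".join(traceback.format_exception(exc_type, exc, tb)))
        previous(exc_type, exc, tb)     # keep the normal stderr behaviour

    return hook


def install(trace_dir: str) -> None:
    """Install the hook; trace_dir would come from CYBERGYM_TASK_TRACE_DIR."""
    log_path = Path(trace_dir) / "step_error.log"
    sys.excepthook = make_logging_excepthook(log_path, sys.excepthook)
```

Writing to the file before chaining means the record survives even when stderr is redirected or discarded.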
…oard
_model_runtime._state_stats() now extracts sink_candidates and objective from state.metadata. _hooks_impl.py renders them with semantic coloring:
- Sink Candidates: bright_magenta for high conf, yellow for nudge
- Objective: green inline text
…dering
- Extract _tui_phase from metadata in _state_stats for reliable Phase Badge
- Add task_context and allowed_tools to _state_stats extraction
- Render Task Context section (vulnerability, bug type, strategy, input format)
- Render Allowed Tools section with checkpoint-blocked highlighting
- Color-code CHECKPOINT BLOCKED text in bold red
- Render Suggested Sinks section in TUI (bright_blue styling)
- Add [AUTO] tag styling in Sink Candidates rendering
- Extract _tui_suggested_sinks from metadata in _state_stats
Raise body truncation limits from 220/2000 to 50000 for all observation types (tool results, error content, syntax highlighting, terminal output). When truncation does occur (extremely rare at 50K), show total char count. This ensures researchers can see full tool call results in TUI logs.
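The truncation behaviour amounts to something like this sketch. The helper name and the wording of the marker line are assumptions; only the 50000-character limit and the total char count come from the commit message.

```python
def truncate_body(text: str, limit: int = 50_000) -> str:
    """Return text unchanged if short enough, else cut it and report the total size."""
    if len(text) <= limit:
        return text
    return f"{text[:limit]}\n[truncated, {len(text)} chars total]"


short = truncate_body("hello", limit=10)
long = truncate_body("x" * 25, limit=10)
```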
Summary
Describe the change in 2-5 sentences. Focus on user-facing behavior and why this change matters.
What Changed
Why
What problem does this solve? What workflow, bug, or maintenance gap does it address?
Validation
pytest -q
Commands run:
# paste commands here
QitOS Design Impact
If relevant, explain how this PR affects:
Docs and Changelog
CHANGELOG.md updated
Checklist