
Qitos cybergym #23

Open
ravenSanstete wants to merge 69 commits into main from qitos_cybergym

Conversation

@ravenSanstete

Collaborator

Summary

Describe the change in 2-5 sentences. Focus on user-facing behavior and why this change matters.

What Changed

Why

What problem does this solve? What workflow, bug, or maintenance gap does it address?

Validation

  • pytest -q
  • relevant lint / type checks
  • examples or manual verification, if applicable

Commands run:

# paste commands here

QitOS Design Impact

If relevant, explain how this PR affects:

  • state design
  • prompt / parser / model protocol
  • tool surface / toolsets
  • memory / history / compaction
  • tracing / qita / observability

Docs and Changelog

  • docs updated
  • tutorials updated
  • CHANGELOG.md updated
  • not needed

Checklist

  • I scoped this PR to one coherent change.
  • I did not introduce unrelated cleanup.
  • I added or updated tests where behavior changed.
  • I preserved backward compatibility, or documented the break clearly.
  • I checked whether examples, docs, or qita flows need updates.

ravenSanstete and others added 30 commits April 16, 2026 08:36
- Remove AI-sounding phrasing and unnecessary quotes across all docs (EN + ZH)
- Add term explanations at first occurrence (kernel, protocol, transport, parser, harness, preset, critic, artifact, trace, etc.)
- Make Chinese docs more natural instead of machine-translated
- Add multi-agent handoff design plan to plans/

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace English terms with "中文(英文术语)" format at first occurrence, then Chinese only
- Eliminate translationese and English-Chinese mixed sentences
- Unify terminology: parser→解析器, protocol→协议, harness→适配层, preset→预设, agent→智能体, runtime→运行时, prompt→提示词, benchmark→基准测试, trace→追踪记录, etc.
- Keep code blocks, commands, and API/class names in English

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Multi-agent core:
- Add DelegateTool (1:1 delegation+return), FanOutTool (1:N parallel+reduce),
  Decision.handoff() for linear agent transfer
- Add AgentSpec, AgentRegistry, ContextStrategy, HandoffContext, StateAdapter
- Add SharedMemory for cross-agent state sharing
- Add _HandoffRuntime with loop detection and max-handoff guard
- Add agent_id to StepRecord, trace serialization, and manifest metadata
- Add kit patterns: manager_worker, planner_executor, proposer_verifier
- Add examples/patterns/ for delegate, fanout, handoff

AsyncEngine + Streaming (Phase 1):
- Add AsyncEngine with arun() and arun_stream() methods
- Add EngineEvent, EngineEventType, EventStream for structured streaming
- Add AsyncOpenAICompatibleModel and AsyncOpenAIModel with async API calls
- Add SSE endpoint /api/stream/{run_id} to qita
- Add live stream button to qita run detail page

TUI + qita multi-agent adaptation:
- ClaudeStyleHook renders HANDOFF/DELEGATE/FANOUT events with agent labels
- qita board shows multi-agent badges, agent filters, agent switch indicators
- qita run view: step agent badges, handoff/delegate/fanout event rendering
- qita replay: multi-agent kind detection and specialized card rendering

Docs (EN + ZH):
- Update engine concept docs with AsyncEngine section
- Update API reference with AsyncEngine, EngineEvent, EventStream
- Update observability guide with SSE streaming section
- Add multi-agent patterns guide and tutorial
- Update CHANGELOG.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove examples/real/claude_code/ (now in qitos-zoo/qitos_coder/)
- Remove qitos/examples/pentagi/ (now in qitos-zoo/qitos_cyber/)
- Remove qitos/benchmark/pentagi_e2e/ (now in qitos-zoo/qitos_cyber/)
- Remove sandbox/ and root-level debug/test scripts
- Remove agent-specific e2e tests from core tests/
- Add qitos_zoo as git submodule (github.com/Qitor/qitos-zoo)
- Update core test imports to use qitos_zoo.qitos_coder
- Update tests/conftest.py to add qitos_zoo to sys.path
- Update CORE_BOUNDARY.md with e2e ownership and zero-duplication rules
- Fix engine: native_text_final decision and set_stop on final mode

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…esponses

When native tool calling is preferred and the model returns plain text
without tool_calls, the json_decision_v1 parser would return wait(),
causing agents to loop forever without producing a final result.

Now, when the model returns plain text under native tool calling:
1. Try the parser chain first (honour act/final if it works)
2. Override parser wait() with native_text_final Decision.final()

This fixes the 6 of 12 e2e tests that previously failed; all 12 now pass.

Also updated test assertions: stop_reason "final" is now correctly
returned alongside "success" for agents that produce final answers.
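
The fallback described above can be sketched as follows. This is a minimal illustration, not the engine's code: `Decision`, `resolve_plain_text`, and both parsers are hypothetical stand-ins, and only the `native_text_final` label comes from the commit message.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Decision:
    """Hypothetical stand-in for a QitOS decision; not the real class."""
    mode: str  # "act", "final", or "wait"
    payload: Optional[str] = None
    reason: str = ""


def resolve_plain_text(text: str, parsers: List[Callable[[str], Decision]]) -> Decision:
    """Resolve a plain-text reply received while native tool calling is preferred."""
    # Step 1: try the parser chain and honour any act/final result.
    for parse in parsers:
        decision = parse(text)
        if decision.mode in ("act", "final"):
            return decision
    # Step 2: every parser returned wait(), so treat the text as the final answer
    # instead of looping forever.
    return Decision(mode="final", payload=text, reason="native_text_final")


def wait_parser(text: str) -> Decision:
    """Hypothetical parser that never recognises the text."""
    return Decision(mode="wait")


def json_final_parser(text: str) -> Decision:
    """Hypothetical parser that accepts anything that looks like a JSON object."""
    if text.lstrip().startswith("{"):
        return Decision(mode="final", payload=text, reason="parsed")
    return Decision(mode="wait")
```

With only `wait_parser` in the chain, a plain-text reply becomes a final decision tagged `native_text_final`; with `json_final_parser`, a JSON-looking reply keeps the parser's own result.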

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…Kimi/GLM

Key changes enabling real LLM tool calling in DeepAudit pipeline:

- openai.py: Add default_request_kwargs to OpenAIModel/OpenAICompatibleModel;
  relocate chat_template_kwargs into extra_body for vLLM compatibility
  (passing it as top-level kwarg causes TypeError, silently dropping tools)
- _presets.py: Set native_tool_call_preferred=True for deepseek, kimi, glm;
  change glm primary_delivery to api_parameter with json_decision_v1 protocol
- _model_runtime.py: Add logging for tool schema delivery failures instead
  of silent exception swallowing; add warning on TypeError fallback
- harness/_adapters.py, harness/__init__.py: Forward default_request_kwargs
  through the build_model_for_preset chain
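
A rough sketch of the `chat_template_kwargs` relocation, assuming the request options are held in a plain dict before being passed to an OpenAI-style client (whose `create()` call accepts an `extra_body` argument for non-standard fields). The helper name `relocate_chat_template_kwargs` is hypothetical and is not taken from `openai.py`.

```python
from typing import Any, Dict


def relocate_chat_template_kwargs(request_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with chat_template_kwargs nested under extra_body.

    The input dict is left untouched so callers can reuse their defaults.
    """
    kwargs = dict(request_kwargs)
    template_kwargs = kwargs.pop("chat_template_kwargs", None)
    if template_kwargs is not None:
        # Preserve anything the caller already placed in extra_body.
        extra_body = dict(kwargs.get("extra_body") or {})
        extra_body["chat_template_kwargs"] = template_kwargs
        kwargs["extra_body"] = extra_body
    return kwargs
```

The value `{"enable_thinking": False}` used when trying this out is only an illustrative payload; any template options would be moved the same way.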

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…cratch file

Remove 9 files from examples/real/ that are duplicated in
qitos-zoo/experimental/, violating CORE_BOUNDARY.md "zero duplication"
rule. Also delete live_test_parse.py scratch file with hardcoded path.

Duplicated files removed:
- _whitzard_memory.py, computer_use_agent.py, epub_reader_agent.py
- openai_cua_agent.py, skillhub_github_agent.py, swe_agent.py
- terminus_2.py, visual_inspect_agent.py, whitzard_agent.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add ToolCallLoopDetector to qitos/engine/__init__.py exports
- Update migration plan import to use public API

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… and more

Major additions:
- qitos/func/: Functional agent composition (agent, compose, infer, task)
- qitos/mcp/: MCP integration (bridge, filter, http, schema_convert, server, stdio)
- qitos/tracing/: Tracing system (config, console, json_processor, models, provider)
- qitos/checkpoint/: Enhanced checkpointing (durability, fork, sqlite_store, versioning)
- qitos/engine/stream/: Streaming lifecycle (lifecycle, messages, transformer)
- qitos/kit/interceptor/: Interceptors (cache, logging, retry)
- qitos/core/: New modules (channel, field_reducers, interceptor, tool_schema)
- qitos/engine/: New modules (approval, cancellation, interrupt, run_state)
- Export ToolCallLoopDetector as public API from qitos.engine
- New CI workflows: contribution-test.yml, zoo-test.yml
- ~870 additional tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The 'zoo' extra declared 'qitos' as a dependency, which is a circular
self-reference. Removed 'zoo' extra and removed 'qitos' from 'all' extra.
The correct dependency direction is that qitos-zoo depends on qitos, not the reverse.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
qitos setup.py's find_packages() was including the qitos_zoo/ git
submodule directory, which conflicts with the standalone qitos-zoo
package that uses the same qitos_zoo namespace. This caused
import resolution to pick up the submodule copy instead of the
standalone repo, breaking qitos-zoo imports.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add RetryPolicy to qitos.core.tool with exponential backoff
- Enhance qita CLI with step-by-step REPL and handoff commands
- Extend action_executor with metric tracking and recovery hooks
- Add 8 new e2e test files covering critic, delegation, hooks, tools
- Update function_tool_decorator with improved error handling
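
For readers unfamiliar with the pattern, a retry policy with exponential backoff might look like the sketch below. Only the name `RetryPolicy` appears in the commit; the fields, `delay_for`, and `call_with_retry` are assumptions made for illustration and may differ from what lives in `qitos.core.tool`.

```python
import time
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class RetryPolicy:
    """Illustrative policy; field names are assumptions, not the QitOS API."""
    max_attempts: int = 3
    base_delay: float = 0.5   # seconds before the first retry
    factor: float = 2.0       # multiplier applied after each failure
    max_delay: float = 8.0    # upper bound on any single delay

    def delay_for(self, attempt: int) -> float:
        """Delay after failed attempt number `attempt` (1-based)."""
        return min(self.base_delay * self.factor ** (attempt - 1), self.max_delay)


def call_with_retry(fn: Callable[[], T], policy: RetryPolicy,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, sleeping with exponential backoff between failed attempts."""
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == policy.max_attempts:
                raise  # budget exhausted: surface the last error
            sleep(policy.delay_for(attempt))
    raise RuntimeError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.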

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…expansion

Clean up dev-process artifacts and add missing tutorials to make the repo
professional for external users and contributors.

Root-level cleanup:
- Merge agent.md + agent_new.md into single AGENTS.md
- Delete DESIGN.md (off-topic Linear design system)
- Delete CORE_BOUNDARY.md (merged into ARCHITECTURE.md)
- Delete security_report.md (auto-generated scan artifact)
- Move DEVELOPMENT.md content to docs/contributing/development.mdx
- Move plans/ to docs/internal/plans/

Code cleanup:
- Add DeprecationWarning to qitos/debug/ and qitos/cache/
- Delete empty qitos/examples/ package
- Add per-adapter DeprecationWarning for 6 benchmark subpackages

New tutorials (EN + ZH):
- critic-system: CriticResult, @critic decorator, instruction/state patches
- hook-lifecycle: EngineHook callbacks, HookContext, custom hooks
- func-api: @function_tool decorator, type hint inference, ToolRegistry
- mcp-integration: MCPServerStdio/HTTP, ToolFilter, async lifecycle
- checkpoint-and-fork: CheckpointStore, fork_checkpoint, DurabilityMode

Docs updates:
- docs.json: add 5 tutorials + contributing page (EN + ZH)
- tutorials/index.mdx: add v0.4 framework deep-dives section
- concepts/glossary.mdx: add 15 new terms
- introduction.mdx: add Critics/hooks, Checkpoint, MCP capabilities

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…very

- Add FamilyPreset.override() for creating customized preset copies
- Add advisory fields (recommended_max_steps, max_tokens, retry_budget,
  temperature) to FamilyPreset with values for 5 gold presets
- Implement MaxTokensCriteria stop criterion (BUDGET_TOKENS)
- Pass total_tokens in runtime_info for stop criteria evaluation
- Consolidate context_registry to derive from FamilyPreset (remove
  duplicate patterns, keep explicit table for sub-variant precision)
- Add `qit bench list` and `qit bench presets` CLI commands
- Update model-family-matrix to all 10 presets with context window
  and recommended models columns
- Rewrite switch-model-families tutorial with override and CLI examples
- Update family-presets, add-a-family-preset, critics-and-stop-criteria
  docs (EN+ZH) with new features
- Add glossary entries for preset override, MaxTokensCriteria, advisory
  defaults
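
A token-budget stop criterion of the kind listed above could be as small as the following sketch. It assumes `runtime_info` is a dict carrying `total_tokens`, as the commit describes; the `should_stop` method name and its return shape are hypothetical.

```python
from typing import Any, Dict, Optional, Tuple


class MaxTokensCriteria:
    """Illustrative stop criterion; the method signature is an assumption."""

    def __init__(self, max_tokens: int) -> None:
        if max_tokens <= 0:
            raise ValueError("max_tokens must be positive")
        self.max_tokens = max_tokens

    def should_stop(self, runtime_info: Dict[str, Any]) -> Tuple[bool, Optional[str]]:
        """Return (stop, reason); reason is BUDGET_TOKENS once the budget is spent."""
        used = int(runtime_info.get("total_tokens") or 0)
        if used >= self.max_tokens:
            return True, "BUDGET_TOKENS"
        return False, None
```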

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add WandbTraceProcessor (TraceProcessor for W&B integration), recipes
README and eval_config.yaml files for each benchmark, updated
benchmarks-and-recipes guide with eval_config usage and W&B note,
and W&B integration docs (EN + ZH).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add CriticTrace/HandoffTrace on EngineResult, EngineConfig export,
ToolPermissionSpec export, MlflowTraceProcessor, qita cost/perf
metrics in run viewer, and MLflow integration docs (EN + ZH).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add SelfRefineAgent + SelfRefineCritic (generate→critique→refine loop)
- Add ReflexionAgent + ReflexionCritic (act→reflect→retry with memory)
- Add `qit new` subcommand with cookiecutter-based scaffolding
- Add `qit list-templates` to list scaffold and method templates
- Add cookiecutter extra to setup.py
- Add method-templates guide (EN + ZH) with quickstart and customization
- Update glossary (EN + ZH) with method template terms
- Update docs.json navigation for both languages

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add LATSAgent + LATSCritic (MCTS with UCB1 scoring and reflection)
- Add MoAOrchestrator + MoACritic (parallel proposals → aggregation)
- Add MagenticOneOrchestrator + ProgressCritic (dual-ledger with stall detection)
- Add template assets (paper.md, config.yaml, agent.py) for all 3 methods
- Add eval_config.yaml for all 3 methods
- Update CLI _METHOD_TEMPLATES with lats, moa, magentic_one
- Update method-templates guide (EN + ZH) with all 3 method sections
- Update glossary (EN + ZH) with LATS, MoA, Magentic-One terms
- 60 new tests (23 LATS + 17 MoA + 20 Magentic-One)
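
As background for the LATS entry, the standard UCB1 score used in MCTS selection is sketched below. This is the textbook formula, not an excerpt from `LATSAgent`; the dict-based node layout and the default exploration constant are assumptions.

```python
import math
from typing import Dict, List


def ucb1(total_reward: float, visits: int, parent_visits: int,
         exploration: float = math.sqrt(2)) -> float:
    """Mean reward plus an exploration bonus that shrinks as a node is visited."""
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    mean = total_reward / visits
    bonus = exploration * math.sqrt(math.log(parent_visits) / visits)
    return mean + bonus


def select_child(children: List[Dict], parent_visits: int) -> Dict:
    """Pick the child with the highest UCB1 score."""
    return max(children, key=lambda c: ucb1(c["reward"], c["visits"], parent_visits))
```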

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
bmz-q-q and others added 3 commits June 20, 2026 12:09
Engine:
- Fix _action_runtime.py early-exit bug: blocked actions no longer
  kill sibling actions; collect-all-then-merge pattern
- Add submit_poc to _CONCURRENCY_SAFE_TOOLS for parallel submission

Harness:
- Add json_decision_multi_v1 protocol supporting native tool_calls
- GLM preset uses json_decision_multi_v1 with parallel_tool_calls=True
- Add recommended_request_kwargs to harness types
- Auto-merge preset kwargs in harness __init__

Render:
- Fix ClaudeStyleHook multi-action rendering with per-index dedup
- Add parallel action/observation visual indicators
- Extract _render_single_observation for multi-observation support
- Content renderer: action_summary/observation_summary support
  multi-action/multi-observation events
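
The collect-all-then-merge pattern mentioned under Engine can be illustrated with a small sketch. The function and result layout here are hypothetical and only show the idea that a blocked action yields its own result instead of ending the batch.

```python
from typing import Any, Callable, Dict, List


def run_actions(actions: List[str],
                is_blocked: Callable[[str], bool],
                execute: Callable[[str], Any]) -> List[Dict[str, Any]]:
    """Run every action and return one result per action, in order."""
    results: List[Dict[str, Any]] = []
    for index, action in enumerate(actions):
        if is_blocked(action):
            # Record the block and keep going so sibling actions still run.
            results.append({"index": index, "status": "blocked", "output": None})
            continue
        results.append({"index": index, "status": "ok", "output": execute(action)})
    return results
```

An early `return` inside the blocked branch would reproduce the bug the commit describes: every action after the blocked one would be dropped.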
def test_delete(self):
    mem = InMemorySharedMemory()
    mem.write("key1", "value1")
    assert mem.delete("key1") is True
    mem.write("key1", "value1")
    assert mem.delete("key1") is True
    assert mem.read("key1") is None
    assert mem.delete("key1") is False

def test_delete(self, tmp_path):
    mem = FileSharedMemory(tmp_path / "shared.json")
    mem.write("key1", "value1")
    assert mem.delete("key1") is True

mem = InMemorySharedMemory()
ns = SharedMemoryNamespace(mem, "agent_a")
ns.write("key1", "value1")
assert ns.delete("key1") is True

def test_delete_nonexistent(self):
    mem = InMemorySharedMemory()
    ns = SharedMemoryNamespace(mem, "agent_a")
    assert ns.delete("nope") is False
from qitos.core.state import StateSchema


@dataclass

def create_snowl_agent(**kwargs: Any) -> Any:
    """Create the {{cookiecutter.agent_name}} agent for Snowl evaluation."""
    from .src.agent import {{cookiecutter.agent_name | pascalcase}}Agent

from __future__ import annotations

from {{cookiecutter.agent_name}}.src.agent import (
Comment thread qitos/cache/wrapper.py
from .backends import CacheBackend


class CachedModel(Model):

# Create the agent and run it
agent = DeepSeekCoderAgent()
engine = Engine(agent=agent, model=model, history=history)
ravenSanstete and others added 26 commits June 27, 2026 14:20
Previously, reasoning_content (from DeepSeek/QwQ/etc.) was only used
as a fallback when content was empty. Now it is:

1. Extracted into ModelResponse.reasoning_content (new field)
2. Included in model_output event payload
3. Displayed by ContentFirstRenderer as the agent's "thought" text
4. Propagated through _hooks_impl to the TUI

Also: ToolResult.to_dict() promotes string output as "content" key
so ContentFirstRenderer can find it.
Previously _extract_response_text() returned "" when tool_calls were
present, discarding any content the model produced alongside its
actions.  Now content is always extracted regardless of tool_calls.

ContentFirstRenderer.thought_text() shows both reasoning_content
and raw_output (content text) when present, separated by "---".
This means the operator sees the model's complete reasoning: API-level
chain-of-thought + any non-tool-call text the model generated.
The agent's task instruction was never shown in the TUI, making it
impossible to understand what the agent was asked to do without
checking external logs.  Now the task text is rendered as a cyan
"TASK" line right after the RUN banner.
Previously build_tool_spec() set every parameter description to "".
The :param annotations in docstrings were only visible as part of the
top-level tool description, forcing the model to parse unstructured
text to understand parameters.

Now:
- _parse_param_descriptions() extracts :param name: desc from docstrings
- Each parameter gets its own description in the schema
- _strip_param_docs() removes :param lines from the top-level
  description to avoid duplication

This means the model sees structured per-parameter descriptions in
the tools API parameter, not a wall of text it has to parse itself.
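
A minimal sketch of the docstring handling described above, assuming single-line `:param name: description` entries. The two helpers mirror the names in the commit without the leading underscore, but the regex and behaviour are guesses rather than the actual `build_tool_spec()` internals; typed entries such as `:param int count:` and multi-line descriptions are not handled here.

```python
import re
from typing import Dict, Optional

# Matches lines such as ":param path: File to read."
_PARAM_RE = re.compile(r"^\s*:param\s+(\w+)\s*:\s*(.*)$")


def parse_param_descriptions(doc: Optional[str]) -> Dict[str, str]:
    """Map each documented parameter name to its description."""
    descriptions: Dict[str, str] = {}
    for line in (doc or "").splitlines():
        match = _PARAM_RE.match(line)
        if match:
            descriptions[match.group(1)] = match.group(2).strip()
    return descriptions


def strip_param_docs(doc: Optional[str]) -> str:
    """Drop :param lines so the top-level description does not repeat them."""
    kept = [line for line in (doc or "").splitlines() if not _PARAM_RE.match(line)]
    return "\n".join(kept).strip()
```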
- _state_stats extracts chain nodes/gates, vulnerability analysis,
  current hypothesis, and path trace from agent state
- _render_chain_summary produces compact multi-line display with
  ✓/?/✗ badges for confirmed/inferred/refuted gates
- _hooks_impl renders chain summary with color-coded lines:
  cyan for chain overview and nodes, red for refuted gates,
  yellow for open gates, magenta for hypothesis, blue for analysis
- _model_runtime.py: _state_stats() extracts constraint_board and
  task_memory from state.metadata (primary) or builds from state fields
  (fallback). Removes old _render_chain_summary().
- _hooks_impl.py: renders Constraint Board and Task Memory sections with
  semantic color-coding. Uses exact same text the LLM sees in observation
  packet, ensuring TUI-LLM consistency.
_state_stats() was trying to extract chain/gate data from the
observation object, but the observation doesn't carry the live
CyberGymState — it only has serialised data. Now the live state
object is passed from run_decide() → _run_llm_decide() → _state_stats(),
so constraint_board and task_memory are properly extracted for TUI.
…t loop

Sync agent source (commit f63ccad) including:
- Vul-only trigger is PARTIAL success — agent keeps refining for precision
- stop_criteria: only stop on is_verified(), not vul_crashed() alone
- New vul_only_triggered gate + vul_crashed_partial verdict
- ASAN trace fallback from raw_output when vul_stderr is empty
- Patch-diff-guided refinement feedback
- Candidate escape hatches, targeted PoC registration, submit validation
- Removed .gitignore exclusion for cybergym/agent directory
…raw_output

GLM-5.1 puts identical content in both reasoning_content and raw_output,
causing the TUI to display the same thought twice separated by "---".
Now skips raw_output when it duplicates reasoning_content.
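
The combined behaviour (show both texts separated by "---", but only once when they are identical) might be expressed like this. The standalone function is a hypothetical simplification of `ContentFirstRenderer.thought_text()`, and comparing stripped strings is an assumption about how duplicates are detected.

```python
from typing import Optional


def thought_text(reasoning_content: Optional[str], raw_output: Optional[str]) -> str:
    """Merge API-level reasoning and content text for display."""
    reasoning = (reasoning_content or "").strip()
    content = (raw_output or "").strip()
    if reasoning and content and reasoning != content:
        return f"{reasoning}\n---\n{content}"
    # Either one side is empty or both are identical: show a single copy.
    return reasoning or content
```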
Add log_file parameter to ClaudeStyleHook that creates a _TeeConsole
proxy writing to both terminal and a plain-text log file. Each task in
the batch runner now saves tui.log in its trace directory, preserving
the STEP/finish/tool_calls/ctx_used format for offline review.
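
A tee-style writer along these lines is one way to get a terminal stream and a plain-text log from the same output. `TeeWriter` is a hypothetical name rather than the `_TeeConsole` in the commit, and stripping ANSI colour codes from the log copy is an assumption about what "plain-text" means here.

```python
import re
from typing import TextIO

# Matches ANSI colour/style sequences such as "\x1b[36m" and "\x1b[0m".
_ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")


class TeeWriter:
    """Write text to the terminal unchanged and to a log without colour codes."""

    def __init__(self, terminal: TextIO, log: TextIO) -> None:
        self.terminal = terminal
        self.log = log

    def write(self, text: str) -> int:
        self.terminal.write(text)
        self.log.write(_ANSI_RE.sub("", text))
        return len(text)

    def flush(self) -> None:
        self.terminal.flush()
        self.log.flush()
```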
Brings the vendored agent up to date: constraint collection overhaul
(3081def) plus the two fixes — submit_poc results keyed by (agent_id, poc)
to stop cross-task and parallel-submit verdict leakage, and the missing
CyberGymState.pending_reminders field (was causing unrecoverable_error).
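
To illustrate why keying results by `(agent_id, poc)` prevents leakage, here is a small sketch. `VerdictStore` is hypothetical, and hashing the PoC bytes to form the key is an assumption; the vendored agent may key on the PoC differently.

```python
import hashlib
from typing import Dict, Optional, Tuple


class VerdictStore:
    """Keep submit results separate per agent and per PoC."""

    def __init__(self) -> None:
        self._results: Dict[Tuple[str, str], str] = {}

    @staticmethod
    def _key(agent_id: str, poc: bytes) -> Tuple[str, str]:
        # Hash the PoC so large inputs still make a compact dictionary key.
        return agent_id, hashlib.sha256(poc).hexdigest()

    def record(self, agent_id: str, poc: bytes, verdict: str) -> None:
        self._results[self._key(agent_id, poc)] = verdict

    def lookup(self, agent_id: str, poc: bytes) -> Optional[str]:
        return self._results.get(self._key(agent_id, poc))
```

Two agents submitting the same bytes get independent entries, and a lookup for an agent that never submitted returns `None` instead of another task's verdict.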

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… handling

runner.py: CYBERGYM_USE_DOCKER_ENV gate (default off) runs each task's agent
tools inside an ephemeral container via a same-path bind mount, so the agent
cannot wander outside its task workspace while the host-side process
(LLM/submit_poc) keeps working. docker_env.py: _inner_path now uses absolute
paths verbatim instead of double-nesting them under the workdir.

Verified: 4 concurrent containers, per-task grading correct, no leakage.
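
The path rule for the same-path bind mount can be sketched as below. `inner_path` is a hypothetical reimplementation for illustration, not the code in `docker_env.py`; it assumes POSIX paths inside the container.

```python
from pathlib import PurePosixPath


def inner_path(workdir: str, path: str) -> str:
    """Resolve a tool path as seen from inside the container."""
    candidate = PurePosixPath(path)
    if candidate.is_absolute():
        # Host and container share the same absolute path, so use it verbatim
        # instead of nesting it under the workdir a second time.
        return str(candidate)
    return str(PurePosixPath(workdir) / candidate)
```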

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e-feedback fix)

Vendored sync of the fix-side leak fix: under CYBERGYM_VUL_ONLY_FEEDBACK
(default on) the agent sees vul-side feedback only and stops on the first
vul-side crash; the server still grades fix-side into its DB for scoring but
the verdict is no longer read back to the agent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Without this file, Python treats this directory as a namespace package
when the parent (traj_analyzer/) is on sys.path, shadowing the
pip-installed qitos package and causing ImportError for AgentModule.
Write exceptions to CYBERGYM_TASK_TRACE_DIR/step_error.log at every
exception handler in engine.py (DECIDE, ACT, init_state, setup_toolsets,
setup_env) and add sys.excepthook in run_local.py so no exception can
be silently lost regardless of stderr redirection.
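
One possible shape for this logging, assuming `CYBERGYM_TASK_TRACE_DIR` is read from the environment and falls back to the current directory when unset. The helper names and the `[stage]` prefix are hypothetical; only `step_error.log`, the environment variable, and `sys.excepthook` come from the commit.

```python
import os
import sys
import traceback
from pathlib import Path


def log_exception(exc_type, exc, tb, stage: str = "uncaught") -> None:
    """Append a traceback to step_error.log in the task's trace directory."""
    trace_dir = Path(os.environ.get("CYBERGYM_TASK_TRACE_DIR", "."))
    trace_dir.mkdir(parents=True, exist_ok=True)
    with open(trace_dir / "step_error.log", "a", encoding="utf-8") as handle:
        handle.write(f"[{stage}]\n")
        handle.write("".join(traceback.format_exception(exc_type, exc, tb)))


def install_excepthook() -> None:
    """Log uncaught exceptions to the file, then defer to the previous hook."""
    previous = sys.excepthook

    def hook(exc_type, exc, tb):
        log_exception(exc_type, exc, tb)
        previous(exc_type, exc, tb)

    sys.excepthook = hook
```

Inside an `except` block, `log_exception(*sys.exc_info(), stage="ACT")` records the handled error the same way.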
…oard

_model_runtime._state_stats() now extracts sink_candidates and
objective from state.metadata. _hooks_impl.py renders them with
semantic coloring:
- Sink Candidates: bright_magenta for high conf, yellow for nudge
- Objective: green inline text
…dering

- Extract _tui_phase from metadata in _state_stats for reliable Phase Badge
- Add task_context and allowed_tools to _state_stats extraction
- Render Task Context section (vulnerability, bug type, strategy, input format)
- Render Allowed Tools section with checkpoint-blocked highlighting
- Color-code CHECKPOINT BLOCKED text in bold red
- Render Suggested Sinks section in TUI (bright_blue styling)
- Add [AUTO] tag styling in Sink Candidates rendering
- Extract _tui_suggested_sinks from metadata in _state_stats
Raise body truncation limits from 220/2000 to 50000 for all observation
types (tool results, error content, syntax highlighting, terminal output).
When truncation does occur (extremely rare at 50K), show total char count.

This ensures researchers can see full tool call results in TUI logs.
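
A truncation helper matching this description might look like the following sketch. The function name and the exact wording of the suffix are assumptions; only the 50000 limit and the total character count come from the commit.

```python
def truncate_body(text: str, limit: int = 50_000) -> str:
    """Return text unchanged if it fits, else cut it and report the full size."""
    if len(text) <= limit:
        return text
    return f"{text[:limit]}\n... [truncated, {len(text)} chars total]"
```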

3 participants