Merged
7 changes: 5 additions & 2 deletions doc/summary.md
@@ -10,11 +10,12 @@ GitHub: https://github.com/pingp76/swoopcode

## Current Status

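**Completed stages**: basic REPL + LLM conversation + bash tool calls + file operation tools + message normalization + TODO task management + SubAgent + Skill system + LLM communication logging + context compression + permission management + Hook mechanism + Memory (long-term memory) + **Prompt Cache-friendly request layout** + LLM error recovery + ProjectContext + Session/Transcript raw event stream + persistent Task system + Async Run non-blocking run instances + **Schedule timed-run system** + OutputStore output handles + safe precise editing + time semantics consolidation + Runtime Hardening Round A (atomic writes and log rotation) + teaching comment enhancements (implementation-path comments filled in) + **PDD-16: model adaptation and Agent Runtime Policy abstraction layer** (Provider Profile + Foundation Model Profile + Runtime Policy + LLM Adapter + Context Budget + Stable Context Manager + ContextRanker + RepoClassifier + TaskIntentClassifier) + **PDD-17: Eval Harness foundation framework** (Eval Core, deterministic suite, real core tools, CLI driver) + **PDD-18: Replay/Live/Judge/Full-tools Eval** (replay, live smoke, judge/report, live regression, full-tools live E2E) + **PDD-19: MCP and Agent Team Eval Harness Prototype** (prototype suites are skipped by default so they are not mistaken for production capabilities) + **web tutorial prototype** (`tutorial/` static site + chapters 00/01 + three-column reading layout in the style of `web/temp/2/`) + **public PDD cleanup** (`doc/pdd-01-*.md` through `doc/pdd-19-*.md`, preserving the depth of the original PDDs; old refactor work notes have been merged back into the corresponding PDDs)
**Completed stages**: basic REPL + LLM conversation + bash tool calls + file operation tools + message normalization + TODO task management + SubAgent + Skill system + LLM communication logging + context compression + permission management + Hook mechanism + Memory (long-term memory) + **Prompt Cache-friendly request layout** + LLM error recovery + ProjectContext + Session/Transcript raw event stream + persistent Task system + Async Run non-blocking run instances + **Schedule timed-run system** + OutputStore output handles + safe precise editing + time semantics consolidation + Runtime Hardening Round A (atomic writes and log rotation) + teaching comment enhancements (implementation-path comments filled in) + **PDD-16: model adaptation and Agent Runtime Policy abstraction layer** (Provider Profile + Foundation Model Profile + Runtime Policy + LLM Adapter + Context Budget + Stable Context Manager + ContextRanker + RepoClassifier + TaskIntentClassifier) + **PDD-17: Eval Harness foundation framework** (Eval Core, deterministic suite, real core tools, CLI driver) + **PDD-18: Replay/Live/Judge/Full-tools Eval** (replay, live smoke, judge/report, live regression, full-tools live E2E) + **PDD-19: MCP and Agent Team Eval Harness Prototype** (prototype suites are skipped by default so they are not mistaken for production capabilities) + **Eval temporary artifact TTL cleanup** (manifest for artifacts kept on failure + whitelist GC CLI + trace file cleanup) + **web tutorial prototype** (`tutorial/` static site + chapters 00/01 + three-column reading layout in the style of `web/temp/2/`) + **public PDD cleanup** (`doc/pdd-01-*.md` through `doc/pdd-19-*.md`, preserving the depth of the original PDDs; old refactor work notes have been merged back into the corresponding PDDs)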

- **PDD-17 Eval Harness foundation**: Eval Core + Deterministic Suite + Real Core Tools + CLI Driver.
- **PDD-18 Eval regression capabilities**: Replay + Live Smoke + Judge/Report; Live Regression (Core Tools); Full-tools Live E2E.
- **PDD-19 Eval Prototype boundaries**: MCP fixture server + MCP runtime adapter + MCP trace/assertions; sequential supervisor Team driver + Team trace/assertions. Because the project has not yet implemented a production-grade MCP runtime or a real Agent Team runtime, all related MCP/Team tests are currently `describe.skip`.
- **Eval temporary artifact cleanup**: by default the workspace is deleted as soon as a passing case finishes; when `keepOnFailure` retains a failed workspace / agentHome, a `.eval-artifact.json` is written into it; `npm run eval:cleanup` removes eval leftovers in the OS tmpdir by whitelist prefix and TTL, and also removes expired `*.trace.json` files.

## Web Tutorial Site Prototype

@@ -150,7 +151,8 @@ src/
│ │ ├── trace.ts # TraceRecorder, RuntimeEvent
│ │ ├── assertions.ts # portable + instrumented assertion executor
│ │ ├── runner.ts # runEvalCase/runEvalSuite core runner
│ │ └── trace-writer.ts # JSON trace output
│ │ ├── trace-writer.ts # JSON trace output
│ │ └── temp-cleanup.ts # Eval temporary artifact TTL manifest and whitelist cleanup
│ ├── drivers/
│ │ ├── learn-claude-code/
│ │ │ ├── in-process-driver.ts # createAgent() driver for the current project
@@ -195,6 +197,7 @@ src/
│ ├── replay/
│ │ └── replay-llm.ts # Replay LLM client
│ ├── runner.test.ts # core + in-process driver integration tests
│ ├── cleanup-cli.ts # Eval temporary artifact cleanup CLI
│ └── README.md # Eval system usage documentation
skills/
├── code-review/
1 change: 1 addition & 0 deletions package.json
@@ -41,6 +41,7 @@
"test:eval:live:team": "vitest run src/eval/live/live-team-suite.test.ts",
"test:eval:live:team:mcp": "vitest run src/eval/live/live-team-suite.test.ts",
"test:eval:judge": "vitest run src/eval/judge/judge-suite.test.ts",
"eval:cleanup": "tsx src/eval/cleanup-cli.ts",
"typecheck": "tsc --noEmit",
"lint": "eslint src/",
"format": "prettier --write \"src/**/*.ts\"",
33 changes: 32 additions & 1 deletion src/eval/README.md
@@ -10,14 +10,20 @@ npm run test:eval

# Run all eval-related tests (including the runner integration tests)
npx vitest run src/eval/

# Clean up expired eval temporary artifacts (default: 7 days)
npm run eval:cleanup

# Preview what would be deleted first
npm run eval:cleanup -- --dry-run
```

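## Design Principles

- **Deterministic**: every case uses a scripted LLM and does not depend on a real model, so it passes reliably in any environment
- **Portable**: Eval Core does not depend directly on the project's internal modules (agent.ts, llm.ts, etc.); it only knows the `CodingAgentDriver` interface
- **Observable**: instrumented assertions verify internal behavior such as tool calls and permission confirmations
- **Isolated**: each case runs in its own temporary workspace, which is cleaned up automatically
- **Isolated**: each case runs in its own temporary workspace, which is cleaned up automatically by default; artifacts kept for failure debugging are reclaimed periodically via the TTL manifest and `npm run eval:cleanup`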

## Case Structure

@@ -200,6 +206,31 @@ EVAL_TRACE_DIR=./eval-traces npm run test:eval

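Trace files contain: case information, step traces, runtime events, and assertion results.

## Temporary Artifact Cleanup

Eval creates three kinds of temporary artifacts:

1. **workspace**: the isolated working directory for each case; by default it is deleted as soon as the case passes
2. **agentHome**: the temporary persistence root for full-tools / team eval, holding state such as Memory, Skill, Task, Schedule, and Output
3. **trace**: the `*.trace.json` files written when `trace.enabled` or `EVAL_TRACE_DIR` is turned on

When a case sets `workspace.keepOnFailure: true` and the run fails, the runner keeps the workspace; the full-tools / team drivers likewise keep their own temporary `agentHome`. A `.eval-artifact.json` file is written into each of these directories, recording `caseId`, `createdAt`, and `expiresAt` to support later cleanup.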
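
As a rough illustration of the manifest described above, the sketch below builds the record that could be serialized to `.eval-artifact.json`. The fields `caseId`, `createdAt`, and `expiresAt` come from the text, and `kind` mirrors the runner's call in this PR; the helper name `buildManifest`, the ISO-8601 timestamp format, and the exact type are assumptions, because the real implementation in `temp-cleanup.ts` is not part of this excerpt.

```typescript
// Hypothetical manifest shape; the real type lives in temp-cleanup.ts.
interface EvalArtifactManifest {
  caseId: string;
  kind: "workspace" | "agentHome";
  createdAt: string; // ISO-8601 timestamp (assumed format)
  expiresAt: string; // ISO-8601 timestamp (assumed format)
}

// 7 days, matching the default TTL stated in the README.
const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

// Hypothetical helper: derives expiresAt from the creation time plus a TTL.
function buildManifest(
  caseId: string,
  kind: EvalArtifactManifest["kind"],
  now: Date = new Date(),
  ttlMs: number = SEVEN_DAYS_MS,
): EvalArtifactManifest {
  return {
    caseId,
    kind,
    createdAt: now.toISOString(),
    expiresAt: new Date(now.getTime() + ttlMs).toISOString(),
  };
}

// Example: the JSON a kept workspace might carry.
const manifest = buildManifest(
  "demo-case",
  "workspace",
  new Date("2026-06-12T00:00:00.000Z"),
);
console.log(JSON.stringify(manifest, null, 2));
```

Storing an absolute `expiresAt` instead of only a creation time lets a later cleanup run decide expiry without knowing which TTL was in effect when the artifact was kept.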

Periodic cleanup commands:

```bash
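# Delete eval artifacts older than 7 days under the default OS tmpdir
npm run eval:cleanup

# Common in CI: delete leftovers older than 24 hours
npm run eval:cleanup -- --older-than 24h

# Preview before wiring it into a local workflow
npm run eval:cleanup -- --dry-run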
```

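The cleaner scans only whitelisted prefixes (such as `eval-`, `learn-claude-eval-home-`, and `learn-claude-team-home-`) and does not recursively scan arbitrary temporary directories. Inside the `eval-traces` directory it deletes only expired `*.trace.json` files, so other notes placed there by hand are not removed by mistake.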
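
The selection rules above can be sketched as three small pure functions. This is only an illustration under stated assumptions: the prefix list contains just the three prefixes quoted in the text (the real list may be longer), the names `classifyEntry`, `isArtifactExpired`, and `isExpiredTrace` are hypothetical, and the exact boundary comparison (`>` versus `>=`) is a guess. The manifest-before-mtime ordering follows the behavior the tests in this PR describe.

```typescript
// Prefixes quoted from the README; the real whitelist may contain more.
const ASSUMED_PREFIXES = [
  "eval-",
  "learn-claude-eval-home-",
  "learn-claude-team-home-",
];
const TRACE_DIR_NAME = "eval-traces";

type EntryKind = "traceDir" | "artifactDir" | "ignored";

// The trace directory also starts with "eval-", so it is checked first and
// handled file by file instead of being removed as a whole.
function classifyEntry(name: string): EntryKind {
  if (name === TRACE_DIR_NAME) return "traceDir";
  if (ASSUMED_PREFIXES.some((prefix) => name.startsWith(prefix))) {
    return "artifactDir";
  }
  return "ignored";
}

// A readable manifest expiresAt wins; otherwise fall back to mtime age.
function isArtifactExpired(
  expiresAt: string | undefined,
  mtimeMs: number,
  nowMs: number,
  olderThanMs: number,
): boolean {
  if (expiresAt !== undefined) {
    const expiresMs = Date.parse(expiresAt);
    if (!Number.isNaN(expiresMs)) return expiresMs <= nowMs;
  }
  return nowMs - mtimeMs > olderThanMs;
}

// Inside the trace directory only old *.trace.json files qualify.
function isExpiredTrace(
  fileName: string,
  mtimeMs: number,
  nowMs: number,
  olderThanMs: number,
): boolean {
  return fileName.endsWith(".trace.json") && nowMs - mtimeMs > olderThanMs;
}

const nowMs = Date.parse("2026-06-12T00:00:00.000Z");
const weekMs = 7 * 24 * 60 * 60 * 1000;
console.log(classifyEntry("eval-old-case"), classifyEntry("project-cache-old"));
console.log(isArtifactExpired(undefined, nowMs - weekMs - 1000, nowMs, weekMs));
console.log(isExpiredTrace("notes.txt", nowMs - weekMs - 1000, nowMs, weekMs));
```

Keeping these checks free of filesystem calls makes the deletion boundary easy to unit-test separately from the code that actually removes directories.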

## Notes on Writing Core Tool Cases

1. **Scripted LLM Responses**: each tool call needs at least 2 responses
127 changes: 127 additions & 0 deletions src/eval/cleanup-cli.ts
@@ -0,0 +1,127 @@
/**
 * cleanup-cli.ts: Eval temporary artifact cleanup command
 *
 * Responsibility: expose temp-cleanup.ts as an npm script entry point so it can be run periodically, locally and in CI.
 *
 * Usage:
 * - npm run eval:cleanup
 * - npm run eval:cleanup -- --older-than 24h
 * - npm run eval:cleanup -- --dry-run
 */

import { fileURLToPath } from "node:url";
import { tmpdir } from "node:os";
import {
  cleanupEvalArtifacts,
  DEFAULT_EVAL_ARTIFACT_TTL_MS,
  parseEvalCleanupDuration,
} from "./core/temp-cleanup.js";

interface CleanupCliOptions {
  rootDir: string;
  olderThanMs: number;
  dryRun: boolean;
}

async function main(argv: string[]): Promise<void> {
  const options = parseArgs(argv);
  const result = await cleanupEvalArtifacts(options);

  console.log(
    [
      `Eval cleanup root: ${result.rootDir}`,
      `Mode: ${result.dryRun ? "dry-run" : "delete"}`,
      `Scanned: ${result.scanned}`,
      `Deleted: ${result.deleted.length}`,
      `Kept: ${result.kept.length}`,
      `Errors: ${result.errors.length}`,
    ].join("\n"),
  );

  for (const entry of result.deleted) {
    console.log(`[deleted] ${entry.path} (${entry.reason})`);
  }
  for (const error of result.errors) {
    console.error(`[error] ${error.path}: ${error.message}`);
  }

  if (result.errors.length > 0) {
    process.exitCode = 1;
  }
}

function parseArgs(argv: string[]): CleanupCliOptions {
  const options: CleanupCliOptions = {
    rootDir: process.env["EVAL_TEMP_ROOT"] ?? tmpdir(),
    olderThanMs: DEFAULT_EVAL_ARTIFACT_TTL_MS,
    dryRun: false,
  };

  for (let i = 0; i < argv.length; i++) {
    const arg = argv[i];
    if (arg === undefined) {
      continue;
    }
    if (arg === "--dry-run") {
      options.dryRun = true;
      continue;
    }
    if (arg === "--root") {
      options.rootDir = readNextArg(argv, i, "--root");
      i++;
      continue;
    }
    if (arg.startsWith("--root=")) {
      options.rootDir = arg.slice("--root=".length);
      continue;
    }
    if (arg === "--older-than") {
      options.olderThanMs = parseEvalCleanupDuration(
        readNextArg(argv, i, "--older-than"),
      );
      i++;
      continue;
    }
    if (arg.startsWith("--older-than=")) {
      options.olderThanMs = parseEvalCleanupDuration(
        arg.slice("--older-than=".length),
      );
      continue;
    }
    if (arg === "--help" || arg === "-h") {
      printHelp();
      process.exit(0);
    }
    throw new Error(`Unknown argument: ${arg}`);
  }

  return options;
}

function readNextArg(argv: string[], index: number, flag: string): string {
  const value = argv[index + 1];
  if (value === undefined || value.startsWith("--")) {
    throw new Error(`${flag} requires a value.`);
  }
  return value;
}

function printHelp(): void {
  console.log(`Usage: npm run eval:cleanup -- [options]

Options:
  --older-than <duration>  Delete artifacts older than this duration. Default: 7d.
                           Supported units: ms, s, m, h, d.
  --dry-run                Print what would be deleted without deleting.
  --root <path>            Scan this directory instead of EVAL_TEMP_ROOT or OS tmpdir.
  -h, --help               Show this help.
`);
}

const currentFile = fileURLToPath(import.meta.url);
if (process.argv[1] === currentFile) {
  main(process.argv.slice(2)).catch((err: unknown) => {
    console.error(err instanceof Error ? err.message : String(err));
    process.exitCode = 1;
  });
}
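
The CLI above delegates duration parsing to `parseEvalCleanupDuration`, whose body lives in `temp-cleanup.ts` and is not shown in this excerpt. A parser accepting the units listed in the help text (`ms`, `s`, `m`, `h`, `d`) might look roughly like the sketch below. The name `parseDurationSketch`, the acceptance of decimal amounts, and the rejection of bare numbers are assumptions, not the project's confirmed behavior.

```typescript
// Milliseconds per supported unit, taken from the help text's unit list.
const UNIT_MS: Record<string, number> = {
  ms: 1,
  s: 1_000,
  m: 60_000,
  h: 3_600_000,
  d: 86_400_000,
};

// Hypothetical stand-in for parseEvalCleanupDuration.
function parseDurationSketch(input: string): number {
  // "ms" is listed before "m" and "s" so that "500ms" is not misread.
  const match = /^(\d+(?:\.\d+)?)(ms|s|m|h|d)$/.exec(input.trim());
  if (match === null) {
    throw new Error(`Invalid duration: ${input}`);
  }
  const amount = Number(match[1]);
  const factor = UNIT_MS[match[2] ?? ""];
  if (factor === undefined) {
    throw new Error(`Unsupported unit in duration: ${input}`);
  }
  return amount * factor;
}

console.log(parseDurationSketch("24h"));
console.log(parseDurationSketch("7d"));
```

Failing loudly on malformed input matters here, because a silently misparsed `--older-than` value could delete artifacts that were meant to be kept.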
14 changes: 13 additions & 1 deletion src/eval/core/runner.ts
@@ -40,6 +40,7 @@ import { createTraceRecorder } from "./trace.js";
import { runAssertions } from "./assertions.js";
import { writeEvalTrace } from "./trace-writer.js";
import { runJudge } from "../judge/judge.js";
import { writeEvalArtifactManifest } from "./temp-cleanup.js";

/**
 * runEvalCase: runs a single eval case
@@ -269,7 +270,18 @@ export async function runEvalCase(
  // 12. Clean up the workspace (keep it if the case failed and keepOnFailure is set)
  const shouldKeep =
    evalCase.workspace?.keepOnFailure === true && status !== "passed";
  if (!shouldKeep) {
  if (shouldKeep) {
    try {
      // Keeping the workspace is useful when debugging a failure, but cross-run GC needs expiry information;
      // otherwise, leaving keepOnFailure on long term lets the system temp directory grow without bound.
      await writeEvalArtifactManifest(workspace.root, {
        caseId: evalCase.id,
        kind: "workspace",
      });
    } catch {
      // A failed manifest write does not change the eval result; later cleanup can still fall back to mtime.
    }
  } else {
    try {
      await workspace.cleanup();
    } catch {
143 changes: 143 additions & 0 deletions src/eval/core/temp-cleanup.test.ts
@@ -0,0 +1,143 @@
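/**
 * temp-cleanup.test.ts: tests for the Eval temporary artifact cleaner
 *
 * These tests are not about whether "rm works"; they verify the cleanup boundaries:
 * - only whitelisted eval directories are deleted
 * - the manifest takes precedence over mtime
 * - for traces, only old trace JSON files are deleted
 * - dry-run performs no real deletion
 */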

import { existsSync } from "node:fs";
import { mkdir, mkdtemp, rm, utimes, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { afterEach, describe, expect, it } from "vitest";
import {
  cleanupEvalArtifacts,
  parseEvalCleanupDuration,
  writeEvalArtifactManifest,
} from "./temp-cleanup.js";

describe("cleanupEvalArtifacts", () => {
  const roots: string[] = [];
  const baseNow = new Date("2026-06-12T00:00:00.000Z");
  const sevenDays = parseEvalCleanupDuration("7d");

  afterEach(async () => {
    const pending = roots.splice(0);
    await Promise.all(
      pending.map((root) => rm(root, { recursive: true, force: true })),
    );
  });

  it("removes only old directories with eval whitelist prefixes", async () => {
    const root = await createRoot();
    const oldEval = join(root, "eval-old-case");
    const youngEval = join(root, "eval-young-case");
    const unrelated = join(root, "project-cache-old");
    await mkdir(oldEval);
    await mkdir(youngEval);
    await mkdir(unrelated);
    await touchMtime(oldEval, new Date(baseNow.getTime() - sevenDays - 1000));
    await touchMtime(youngEval, baseNow);
    await touchMtime(unrelated, new Date(baseNow.getTime() - sevenDays - 1000));

    const result = await cleanupEvalArtifacts({
      rootDir: root,
      olderThanMs: sevenDays,
      now: baseNow,
    });

    expect(result.errors).toHaveLength(0);
    expect(result.deleted.map((entry) => entry.path)).toContain(oldEval);
    expect(existsSync(oldEval)).toBe(false);
    expect(existsSync(youngEval)).toBe(true);
    expect(existsSync(unrelated)).toBe(true);
  });

  it("uses manifest expiresAt before falling back to mtime", async () => {
    const root = await createRoot();
    const expiredHome = join(root, "learn-claude-eval-home-expired");
    const activeHome = join(root, "learn-claude-eval-home-active");
    await mkdir(expiredHome);
    await mkdir(activeHome);
    await writeEvalArtifactManifest(expiredHome, {
      caseId: "expired-case",
      kind: "agentHome",
      now: new Date(baseNow.getTime() - 2 * sevenDays),
      ttlMs: sevenDays,
    });
    await writeEvalArtifactManifest(activeHome, {
      caseId: "active-case",
      kind: "agentHome",
      now: baseNow,
      ttlMs: sevenDays,
    });

    const result = await cleanupEvalArtifacts({
      rootDir: root,
      olderThanMs: sevenDays,
      now: baseNow,
    });

    expect(result.errors).toHaveLength(0);
    expect(existsSync(expiredHome)).toBe(false);
    expect(existsSync(activeHome)).toBe(true);
  });

  it("removes old trace JSON files without touching unrelated trace files", async () => {
    const root = await createRoot();
    const traceDir = join(root, "eval-traces");
    const oldTrace = join(traceDir, "case-a.trace.json");
    const youngTrace = join(traceDir, "case-b.trace.json");
    const note = join(traceDir, "notes.txt");
    await mkdir(traceDir);
    await writeFile(oldTrace, "{}", "utf-8");
    await writeFile(youngTrace, "{}", "utf-8");
    await writeFile(note, "keep", "utf-8");
    await touchMtime(oldTrace, new Date(baseNow.getTime() - sevenDays - 1000));
    await touchMtime(youngTrace, baseNow);
    await touchMtime(note, new Date(baseNow.getTime() - sevenDays - 1000));

    const result = await cleanupEvalArtifacts({
      rootDir: root,
      olderThanMs: sevenDays,
      now: baseNow,
    });

    expect(result.errors).toHaveLength(0);
    expect(result.deleted.map((entry) => entry.path)).toContain(oldTrace);
    expect(existsSync(oldTrace)).toBe(false);
    expect(existsSync(youngTrace)).toBe(true);
    expect(existsSync(note)).toBe(true);
  });

  it("reports deletions in dry-run mode without removing files", async () => {
    const root = await createRoot();
    const oldEval = join(root, "eval-dry-run-case");
    await mkdir(oldEval);
    await touchMtime(oldEval, new Date(baseNow.getTime() - sevenDays - 1000));

    const result = await cleanupEvalArtifacts({
      rootDir: root,
      olderThanMs: sevenDays,
      now: baseNow,
      dryRun: true,
    });

    expect(result.dryRun).toBe(true);
    expect(result.deleted.map((entry) => entry.path)).toContain(oldEval);
    expect(existsSync(oldEval)).toBe(true);
  });

  async function createRoot(): Promise<string> {
    const root = await mkdtemp(join(tmpdir(), "eval-cleanup-test-"));
    roots.push(root);
    return root;
  }

  async function touchMtime(path: string, time: Date): Promise<void> {
    await utimes(path, time, time);
  }
});