Merged

419 #57

21 commits
1b3a1ce
chore: cleanup internal paths, fix dependencies and upgrade onboard d…
xACE123 Apr 18, 2026
2214af7
chore: address repro report feedback, split requirements, update read…
xACE123 Apr 18, 2026
1cfc024
fix(env): enhance init script robustness, update SustainDC clone logi…
xACE123 Apr 18, 2026
41a0f0e
fix(env): comprehensively fix all task-local dependencies and unify e…
xACE123 Apr 18, 2026
316a601
fix(env): update init and merged env scripts; track codex/cursor skil…
xACE123 Apr 19, 2026
ceb22f3
docs: add run.txt with smoke, full-run, and Windows UTF-8 notes
xACE123 Apr 19, 2026
f7646be
docs(run): add v1 full batch sequence at top of run.txt
xACE123 Apr 19, 2026
eee4b0a
chore: remove obsolete local repro and env setup report files
xACE123 Apr 19, 2026
b345e48
feat(batch): merge v1 matrices into v1.yaml; env-driven LLM; run.md; …
xACE123 Apr 19, 2026
e587e32
chore: add v1 batch matrix self-check script and ignore debug logs
xACE123 Apr 19, 2026
c3bd4dc
feat(batch): v1.yaml 47 tasks — add MLA/TriMul, drop Muon; trim READM…
xACE123 Apr 19, 2026
42d6143
feat(scripts): add run_v1_batch.sh one-shot v1 batch launcher; update…
xACE123 Apr 20, 2026
00d1f31
docs(run): fix run.md markdown, restore sections, keep section 一 for …
xACE123 Apr 20, 2026
7dd2445
docs(run): remove sections 五 and 六
xACE123 Apr 20, 2026
1dca1a4
docs(run): tweak 集成脚本 heading
xACE123 Apr 20, 2026
4a8e8e9
docs: drop 一键 wording for v1 batch and run_v1_batch references
xACE123 Apr 20, 2026
02ca831
docs: trim README/run.md, remove smoke-test sections
xACE123 Apr 20, 2026
b9df267
docs(zh-CN): simplify header links, fix leaderboard table and dc-rl line
xACE123 Apr 20, 2026
ca99740
docs: restore API key / .env instructions in run.md and README
xACE123 Apr 20, 2026
c192693
docs: add run_en.md (English run guide) and cross-links
xACE123 Apr 20, 2026
b6a4e12
docs: align run guide naming with README (run.md, run_zh-CN.md)
xACE123 Apr 20, 2026
2 changes: 1 addition & 1 deletion .codex/skills/frontier-contributor.md
@@ -12,4 +12,4 @@ When asked to contribute or update a benchmark:
4. Run:
- `python verification/evaluator.py scripts/init.py`
- `python -m frontier_eval task=unified task.benchmark=<Domain>/<Task> algorithm=openevolve algorithm.iterations=0`
5. Keep runtime overrides unchanged and avoid secrets or machine-local paths.
5. Keep runtime overrides unchanged and avoid secrets or machine-local paths.
2 changes: 1 addition & 1 deletion .codex/skills/frontier-evaluator.md
@@ -6,7 +6,7 @@ user_invocable: true

When asked to run or debug evaluation:

1. Read `frontier_eval/README.md` and benchmark README instructions first.
1. Read `frontier_eval/README.md` and benchmark README first.
2. Discover env docs with:
- `python .claude/skills/scripts/discover_env_docs.py <Domain>`
- `python .claude/skills/scripts/discover_env_docs.py <Domain>/<Task>`
2 changes: 1 addition & 1 deletion .cursor/skills/frontier-contributor.md
@@ -12,4 +12,4 @@ When asked to contribute or update a benchmark:
4. Run:
- `python verification/evaluator.py scripts/init.py`
- `python -m frontier_eval task=unified task.benchmark=<Domain>/<Task> algorithm=openevolve algorithm.iterations=0`
5. Keep runtime overrides unchanged and avoid secrets or machine-local paths.
5. Keep runtime overrides unchanged and avoid secrets or machine-local paths.
19 changes: 12 additions & 7 deletions .gitignore
@@ -31,9 +31,9 @@ SKILL_zh-CN.md
# Local personal agent configuration
/AGENT.md
/AGENT_zh-CN.md
/.codex/
/.claude/
/.cursor/
# /.codex/
# /.claude/
# /.cursor/
!/.claude/
!/.claude/skills/
!/.claude/skills/*.md
@@ -43,7 +43,12 @@ submission.json
outputlog.txt
frontier_eval/conf/batch/*
!frontier_eval/conf/batch/example_matrix.yaml
!frontier_eval/conf/batch/v1_cpu_openevolve_p8_i100_gemini-3.1-pro-preview.yaml
!frontier_eval/conf/batch/v1_engdesign_openevolve_qwen3codernext_100.yaml
!frontier_eval/conf/batch/v1_flashattention_openevolve_qwen3codernext_100.yaml
!frontier_eval/conf/batch/v1_gpu_openevolve_qwen3codernext_100.yaml
!frontier_eval/conf/batch/v1.yaml
metrics.json
artifacts.json
debug-*.log

# Task outputs and artifacts
**/outputs/
**/artifacts/
**/last_eval.json
34 changes: 28 additions & 6 deletions README.md
@@ -22,11 +22,33 @@ Frontier-Eng evaluates agents on problems where genuine improvement requires int

## Getting Started

```bash
bash init.sh && conda activate frontier-eval-2
```
Setup is split between a small **driver** conda env and per-task **runtime** envs.

- **Driver** (`frontier-eval-2`): from `init.sh`; schedules jobs only.
- **Runtimes** (`frontier-v1-main`, `frontier-v1-kernel`, …): where benchmarks actually run. Install merged runtimes with `bash scripts/setup_v1_merged_task_envs.sh`.
- Before long runs: `export PYTHONNOUSERSITE=1` so user-site packages do not leak into tasks.
- Default task launch uses `task.runtime.use_conda_run=false` and `task.runtime.python_path=conda-env:<env_name>`.
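
The driver/runtime split above can be sketched as a small command builder. This is only an illustration, not the harness's own launcher: `build_task_command` and `build_task_env` are hypothetical helper names, and combining the `algorithm=openevolve algorithm.iterations=0` invocation from the skill files with the `task.runtime.*` overrides in one call is an assumption. Since the README describes those runtime values as the defaults, passing them explicitly mainly serves to select the runtime env.

```python
import os
from typing import Dict, List, Optional


def build_task_command(benchmark: str, runtime_env: str, iterations: int = 0) -> List[str]:
    """Assemble the key=value overrides quoted in the README.

    Hypothetical helper: the real harness may build its command differently.
    """
    return [
        "python", "-m", "frontier_eval",
        "task=unified",
        f"task.benchmark={benchmark}",
        "algorithm=openevolve",
        f"algorithm.iterations={iterations}",
        # The driver env only schedules; the task itself runs in the runtime env.
        "task.runtime.use_conda_run=false",
        f"task.runtime.python_path=conda-env:{runtime_env}",
    ]


def build_task_env(base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Copy an environment and set PYTHONNOUSERSITE=1 so user-site packages stay out."""
    env = dict(os.environ if base_env is None else base_env)
    env["PYTHONNOUSERSITE"] = "1"
    return env


if __name__ == "__main__":
    # "<Domain>/<Task>" is the placeholder used in the skill files, not a real task.
    cmd = build_task_command("<Domain>/<Task>", "frontier-v1-main")
    print(" ".join(cmd))
```

A GPU kernel task would pass `frontier-v1-kernel` as `runtime_env` instead; the resulting list and environment could then be handed to `subprocess.run(cmd, env=build_task_env())` from inside the activated driver env.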

**Task-specific bits**

- **DuckDB / EV2Gym**: need their local verifier deps (see each task dir).
- **Optics**: extra requirements under `benchmarks/Optics/` (also reflected in merged configs).
- **MolecularMechanics**: OpenFF stack (e.g. `openff-toolkit`); see task README.
- **GPU kernel tasks** (FlashAttention, MLA, …): need `frontier-v1-kernel`.

**External assets**

- **`dc-rl`**: clone + patch; paths under `third_party/` and `benchmarks/SustainableDataCenterControl/.../sustaindc/`.
- **PhySense**, **SustainDC**, **CarAerodynamicsSensing**: need downloaded models/data/checkpoints or they fail at runtime.

**Known issues**

- **ReactionOptimisation**: `frontier-v1-summit` pip resolution can fail; treat as env noise, not necessarily a bug in the harness.
- **EngDesign**: Docker tasks need a working Docker setup; use local mode if you cannot access the socket.

**LLM / API keys**: copy `.env.example` to `.env` and set at least **`OPENAI_API_KEY`** (and `OPENAI_API_BASE` / `OPENAI_MODEL` if you use a compatible gateway). Details: **[run.md](run.md)** · [中文 run_zh-CN.md](run_zh-CN.md).
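
A pre-flight check for the `.env` step might look like the sketch below. It assumes `.env` holds plain `KEY=VALUE` lines; how the harness itself loads the file is not shown in this PR, and `parse_env_file` and `missing_llm_settings` are hypothetical names used only for illustration.

```python
from pathlib import Path
from typing import Dict, Iterable, List


def parse_env_file(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and one layer of matching-style quotes.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_llm_settings(
    values: Dict[str, str],
    required: Iterable[str] = ("OPENAI_API_KEY",),
) -> List[str]:
    """Return the required names that are absent or empty."""
    return [name for name in required if not values.get(name)]


if __name__ == "__main__":
    env_path = Path(".env")
    if not env_path.exists():
        print("No .env found; copy .env.example to .env first.")
    else:
        missing = missing_llm_settings(parse_env_file(env_path.read_text()))
        print("Missing settings:", missing or "none")
```

When a compatible gateway is in use, `OPENAI_API_BASE` and `OPENAI_MODEL` can be added to `required` so the check covers them as well.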

Per-task runs, batch matrices, and runtime overrides are in **[frontier_eval/README.md](frontier_eval/README.md)**.
Per-task commands, batch matrices, and overrides: **[frontier_eval/README.md](frontier_eval/README.md)**. **v1 batch** wrapper and host notes: **[run.md](run.md)** · [中文 run_zh-CN.md](run_zh-CN.md) (`bash scripts/run_v1_batch.sh`).

## Leaderboard

@@ -51,9 +73,9 @@ The full task list by domain is in **[TASK_DETAILS.md](TASK_DETAILS.md)**.

The best solutions produced by our agent runs (across experiments, algorithms, models, and tasks) are archived in **[baseline_archive/README.md](baseline_archive/README.md)**. These serve as reference baselines for the community.

## Join the Community
## Community

Welcome to our developer community! Whether you want to discuss new engineering problem concepts, find task collaborators, or encounter technical issues, reach us via [Feishu](https://applink.feishu.cn/client/chat/chatter/add_by_link?link_token=21ak5858-60ba-44fd-9085-01f165c8771c) or [Discord](https://discord.gg/hxeVhZNN).
[Feishu](https://applink.feishu.cn/client/chat/chatter/add_by_link?link_token=21ak5858-60ba-44fd-9085-01f165c8771c) · [Discord](https://discord.gg/hxeVhZNN)

## Contributing

62 changes: 43 additions & 19 deletions README_zh-CN.md
@@ -2,7 +2,7 @@

[English](README.md) | 简体中文

[![Homepage](https://img.shields.io/badge/主页-lab.einsia.ai-0969DA?style=flat-square&logo=homepage&logoColor=white)](https://lab.einsia.ai/frontier-eng/) [![arXiv](https://img.shields.io/badge/arXiv-2604.12290-b31b1b?style=flat-square&logo=arxiv&logoColor=white)](http://arxiv.org/abs/2604.12290) [![Feishu](https://img.shields.io/badge/飞书-加入讨论群-3370FF?style=flat-square)](https://applink.feishu.cn/client/chat/chatter/add_by_link?link_token=21ak5858-60ba-44fd-9085-01f165c8771c) [![Discord](https://img.shields.io/badge/Discord-加入-5865F2?style=flat-square&logo=discord&logoColor=white)](https://discord.gg/hxeVhZNN)
[Homepage](https://lab.einsia.ai/frontier-eng/) [arXiv](http://arxiv.org/abs/2604.12290) [Feishu](https://applink.feishu.cn/client/chat/chatter/add_by_link?link_token=21ak5858-60ba-44fd-9085-01f165c8771c) [Discord](https://discord.gg/hxeVhZNN)

## News

Expand All @@ -20,29 +20,53 @@ Frontier-Eng 将这种范式形式化为 **generative optimization**,并指出

Frontier-Eng 要求 Agent 在只读、不可篡改的 verifier 下,将领域知识、受约束代码合成与迭代 refinement 紧密结合。

## Getting Started
## 上手

```bash
bash init.sh && conda activate frontier-eval-2
```
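The environment has two layers: a **driver conda env used for scheduling** and **per-task runtimes**.

For running specific tasks, batches, and environment overrides, see **[frontier_eval/README_zh-CN.md](frontier_eval/README_zh-CN.md)**.
- **Driver** (`frontier-eval-2`): created by `init.sh`; handles scheduling only.
- **Runtime** (`frontier-v1-main`, `frontier-v1-kernel`, etc.): where tasks actually run. Merged install: `bash scripts/setup_v1_merged_task_envs.sh`.
- Before long runs, set `export PYTHONNOUSERSITE=1` so packages from the local user-site directory do not leak into task processes.
- By default, processes are launched with `task.runtime.use_conda_run=false` and `task.runtime.python_path=conda-env:<env_name>`.

**Extra per-task setup**

- **DuckDB / EV2Gym**: install the verifier dependencies listed in each task directory.
- **Optics**: see the dependency notes under `benchmarks/Optics/`.
- **MolecularMechanics**: OpenFF and related packages; see the task README.
- **GPU kernel tasks** (FlashAttention, etc.): require `frontier-v1-kernel`; the main env alone is not enough.

**External resources**

- **dc-rl**: clone + patch as documented; paths are under `third_party/` and `benchmarks/SustainableDataCenterControl/.../sustaindc/`.
- **PhySense, SustainDC, CarAerodynamicsSensing**: you must supply the data, models, or weights yourself; runs fail without them.

**Known issues**

- **ReactionOptimisation**: pip resolution on `frontier-v1-summit` can fail; treat it first as an environment/dependency problem.
- **EngDesign**: depends on Docker; if you lack access, switch to the local mode described in the docs.

**LLM / API keys**: copy `.env.example` to `.env` and fill in at least **`OPENAI_API_KEY`**; when using a compatible gateway, set **`OPENAI_API_BASE`** and **`OPENAI_MODEL`** as needed. Details: **[run_zh-CN.md](run_zh-CN.md)** · [English run.md](run.md).

Single tasks, batch matrices, and overrides: **[frontier_eval/README_zh-CN.md](frontier_eval/README_zh-CN.md)**. v1 batch script and host notes: **[run_zh-CN.md](run_zh-CN.md)** · [English run.md](run.md) (`bash scripts/run_v1_batch.sh`).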

## Leaderboard

Full leaderboard: **[lab.einsia.ai/frontier-eng/leaderboard.html](https://lab.einsia.ai/frontier-eng/leaderboard.html)**. **Frontier Models**: average within-task rank (47 tasks; the table below is sorted by average rank, lowest first).

| Rank | Model | Average Rank |
| :--: | :--- | --: |
| 1 | Claude Opus 4.6 | 3.18 |
| 2 | GLM-5 | 4.02 |
| 3 | DeepSeek V3.2 | 4.41 |
| 4 | GPT-OSS-120B | 4.46 |
| 5 | Gemini 3.1 Pro Preview | 5.34 |
| 6 | Grok 4.20 | 5.60 |
| 7 | SEED 2.0 Pro | 5.63 |
| 8 | GPT-5.4 | 5.68 |
| 9 | Qwen3 Coder Next | 6.68 |

| Rank | Model                  | Average Rank |
| --- | ---------------------- | ------------ |
| 1 | Claude Opus 4.6 | 3.18 |
| 2 | GLM-5 | 4.02 |
| 3 | DeepSeek V3.2 | 4.41 |
| 4 | GPT-OSS-120B | 4.46 |
| 5 | Gemini 3.1 Pro Preview | 5.34 |
| 6 | Grok 4.20 | 5.60 |
| 7 | SEED 2.0 Pro | 5.63 |
| 8 | GPT-5.4 | 5.68 |
| 9 | Qwen3 Coder Next | 6.68 |


## Task Details

@@ -52,9 +76,9 @@ bash init.sh && conda activate frontier-eval-2

The best code produced by our runs across each experiment / algorithm / model / task combination is archived in **[baseline_archive/README.md](baseline_archive/README.md)** and can serve as a reference baseline for the community.

## Join the Community
## Community

Welcome to our developer community! Whether you want to discuss new engineering problem ideas, find task collaborators, or have run into technical issues, reach us directly via [Feishu](https://applink.feishu.cn/client/chat/chatter/add_by_link?link_token=21ak5858-60ba-44fd-9085-01f165c8771c) or [Discord](https://discord.gg/hxeVhZNN).
[Feishu](https://applink.feishu.cn/client/chat/chatter/add_by_link?link_token=21ak5858-60ba-44fd-9085-01f165c8771c) · [Discord](https://discord.gg/hxeVhZNN)

## Contributing
