diff --git a/.claude/skills/mobius-run/SKILL.md b/.claude/skills/mobius-run/SKILL.md
index dff7cf2..a52c624 100644
--- a/.claude/skills/mobius-run/SKILL.md
+++ b/.claude/skills/mobius-run/SKILL.md
@@ -2,37 +2,153 @@
 name: mobius-run
 description: Use when the user says "compete", "mobius run", or wants to pit agents against each other on a task.
 user-invocable: true
-argument-hint:
+argument-hint: [--free] [--api]
 ---
 
 # Mobius Competition Runner
 
 You are the orchestrator for Mobius, an adversarial agent swarm system. The user wants to run a competition.
 
-## What to do
+## Determine mode
+
+- **`--free`** (DEFAULT): Run the competition entirely within Claude Code using subagents. Zero API cost. You generate challenger personas on the fly, spawn them as haiku subagents, collect outputs, and judge them yourself.
+- **`--api`**: Run via the CLI with real API calls (cross-family diversity, costs money).
+
+If neither flag is given, default to `--free` mode.
+
+---
+
+## MODE: --free (Subagent Competition)
+
+This is the exciting part. You ARE the competition engine — no API calls needed.
+
+### Step 1: Initialize
-1. Check that Mobius is initialized:
 ```bash
 python -m mobius.cli stats
 ```
-2. If not initialized, run:
+If not initialized: `python -m mobius.cli init`
+
+### Step 2: Choose agents
+
+You have three options. Use whichever fits:
+
+**Option A — Use existing agents from the registry:**
+```bash
+python .claude/skills/mobius-run/scripts/create_match.py "<task>" --count 6
+```
+This returns JSON with agent details, including each agent's system_prompt. Use these prompts for the subagents.
+
+**Option B — Generate fresh challengers on the fly (PREFERRED for interesting results):**
+Analyze the task and design 4-8 complementary approaches that attack it from deliberately different angles. Think about which dimensions of variation would produce genuinely diverse solutions — not just "creative vs analytical" but specific strategic differences relevant to THIS task.
+
+For each challenger, write a short but specific system prompt (3-5 sentences) that defines its approach. Then register them:
+```bash
+python .claude/skills/mobius-seed/scripts/create_agent.py '{"name":"...", "slug":"...", "description":"...", "system_prompt":"...", "specializations":[...], "provider":"anthropic", "model":"claude-haiku-4-5-20251001"}'
+```
+
+Then create the match:
+```bash
+python .claude/skills/mobius-run/scripts/create_match.py "<task>" --agents slug1,slug2,slug3,...
+```
+
+**Option C — Mix both:** Pull veterans from the registry AND generate fresh challengers. Pit them against each other.
+
+### Step 3: Spawn subagents
+
+For each agent in the match JSON, spawn a haiku subagent using the Agent tool:
+- Set `model: "haiku"` on each subagent
+- Pass the agent's system_prompt as context, plus the competition task
+- Use `subagent_type: "general-purpose"`
+- **IMPORTANT: Launch ALL subagents in a SINGLE message** so they run in parallel
+- Structure each subagent prompt as:
+
+```
+You are competing in a Mobius adversarial swarm competition.
+
+YOUR IDENTITY AND APPROACH:
+<system_prompt>
+
+YOUR TASK:
+<task>
+
+Produce your best solution. Be thorough but focused. Output ONLY your solution.
+```
+
+If you have more than 6 agents, batch them: spawn the first 6, wait for results, then spawn the next batch.
+
+### Step 4: Record outputs
+
+After each subagent returns, pipe its output into the match record:
+```bash
+echo "<output>" | python .claude/skills/mobius-run/scripts/record_outputs.py <match_id> <agent_id>
+```
+
+You can record outputs incrementally as agents finish — each call merges into the existing record. Or record all at once with `--bulk`:
 ```bash
-python -m mobius.cli init
+echo '{"<agent_id>": "<output>", ...}' | python .claude/skills/mobius-run/scripts/record_outputs.py <match_id> --bulk
 ```
-3. If no agents exist, suggest running `/mobius-seed` first.
+
+### Step 5: Judge
+
+You ARE the judge. Score each output on:
+- **Correctness** (0-10): Does it solve the task accurately?
+- **Quality** (0-10): Is it well-structured, readable, and aligned with best practices?
+- **Completeness** (0-10): Does it fully address all aspects of the task?
+
+Be ruthless and fair. Don't let positional bias affect you — judge purely on merit.
+
+### Step 6: Record verdict
-4. Run the competition with the user's task:
+```bash
+python .claude/skills/mobius-judge/scripts/record_verdict.py \
+  --match <match_id> \
+  <winner_agent_id> \
+  '{"agent_id_1": 28.5, "agent_id_2": 22.0, ...}' \
+  "Your detailed reasoning"
+```
+
+Use the match_id from Step 2 to ensure the verdict is recorded against the correct match.
+
+### Step 7: Show results
+
+```bash
+python -m mobius.cli leaderboard
+```
+
+Present: the winner, your reasoning, Elo changes, and the winning solution.
+
+---
+
+## MODE: --api (CLI Competition)
+
+Traditional mode using real API calls.
+
+1. Check initialization:
+```bash
+python -m mobius.cli stats
+```
+
+2. If no agents exist, suggest `/mobius-seed` first.
+
+3. Run the competition:
 ```bash
 python -m mobius.cli run "<task>"
 ```
-5. After the competition, show the explain output:
+4. Show results:
 ```bash
 python -m mobius.cli explain
 ```
-6. Present the winning output to the user along with the judge reasoning.
+5. Present the winning output and judge reasoning to the user.
+
+---
+
+## Tips
-If the user didn't provide a task argument, ask them what they want the agents to compete on.
+
+- For `--free` mode, you can scale to 12+ agents easily — haiku is fast and cheap (free on Pro)
+- Generate challengers that are *orthogonal*, not just variations. Each should have a genuinely different strategy.
+- If an existing champion agent loses to a fresh challenger, that's interesting — note it for the user
+- The `--free` mode integrates with the same Elo system as `--api` — results are comparable
diff --git a/.claude/skills/mobius-run/scripts/create_match.py b/.claude/skills/mobius-run/scripts/create_match.py
new file mode 100644
index 0000000..f7b067e
--- /dev/null
+++ b/.claude/skills/mobius-run/scripts/create_match.py
@@ -0,0 +1,114 @@
+"""Create a match record for a free (subagent-based) competition.
+
+Usage:
+    python create_match.py "<task>" [--agents <slugs>] [--count N]
+
+Modes:
+    --agents slug1,slug2   Use specific agents from the registry by slug
+    --count N              Pick the top N agents by Elo (default: 5)
+
+Outputs JSON with match_id and agent details for the skill to orchestrate.
+"""
+
+import json
+import sys
+
+sys.path.insert(0, "src")
+
+from mobius.config import get_config
+from mobius.db import init_db
+from mobius.models import MatchRecord
+from mobius.registry import Registry
+
+
+def main():
+    args = sys.argv[1:]
+    if not args:
+        print("Usage: python create_match.py '<task>' [--agents s1,s2] [--count N]")
+        sys.exit(1)
+
+    task = args[0]
+    slugs = None
+    count = 5
+
+    i = 1
+    while i < len(args):
+        if args[i] == "--agents" and i + 1 < len(args):
+            slugs = [s.strip() for s in args[i + 1].split(",")]
+            i += 2
+        elif args[i] == "--count" and i + 1 < len(args):
+            count = int(args[i + 1])
+            i += 2
+        else:
+            i += 1
+
+    config = get_config()
+    conn, _ = init_db(config)
+    registry = Registry(conn, config)
+
+    # Select agents
+    agents = []
+    if slugs:
+        for slug in slugs:
+            agent = registry.get_agent_by_slug(slug)
+            if agent:
+                agents.append(agent)
+            else:
+                print(f"Warning: agent '{slug}' not found, skipping", file=sys.stderr)
+    else:
+        all_agents = registry.list_agents()
+        all_agents.sort(key=lambda a: a.elo_rating, reverse=True)
+        agents = all_agents[:count]
+
+    if len(agents) < 2:
+        print(json.dumps({"error": "Need at least 2 agents", "agent_count": len(agents)}))
+        sys.exit(1)
+
+    # Create match record (outputs empty — skill will fill them)
+    match = MatchRecord(
+        task_description=task,
+        competitor_ids=[a.id for a in agents],
+    )
+
+    conn.execute(
+        """INSERT INTO matches (id, task_description, competitor_ids, outputs, judge_models,
+                                judge_reasoning, winner_id, scores, voided, created_at)
+           VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)""",
+        (
+            match.id,
+            match.task_description,
+            json.dumps(match.competitor_ids),
+            json.dumps({}),
+            json.dumps([]),
+            "",
+            None,
+            json.dumps({}),
+            0,
+            match.created_at.isoformat(),
+        ),
+    )
+    conn.commit()
+
+    # Output agent details for the skill
+    result = {
+        "match_id": match.id,
+        "task": task,
+        "agents": [
+            {
+                "id": a.id,
+                "name": a.name,
+                "slug": a.slug,
+                "system_prompt": a.system_prompt,
+                "specializations": a.specializations,
+                "elo_rating": a.elo_rating,
+            }
+            for a in agents
+        ],
+    }
+
+    print(json.dumps(result, indent=2))
+    conn.close()
+
+
+if __name__ == "__main__":
+    main()
diff --git a/.claude/skills/mobius-run/scripts/record_outputs.py b/.claude/skills/mobius-run/scripts/record_outputs.py
new file mode 100644
index 0000000..99402c0
--- /dev/null
+++ b/.claude/skills/mobius-run/scripts/record_outputs.py
@@ -0,0 +1,61 @@
+"""Record agent outputs for a free competition match.
+
+Usage:
+    echo "output text" | python record_outputs.py <match_id> <agent_id>
+    echo '{"id1": "out1", "id2": "out2"}' | python record_outputs.py <match_id> --bulk
+
+Reads output from stdin to avoid shell escaping issues.
+"""
+
+import json
+import sys
+
+sys.path.insert(0, "src")
+
+from mobius.config import get_config
+from mobius.db import init_db
+
+
+def main():
+    if len(sys.argv) < 3:
+        print("Usage:", file=sys.stderr)
+        print("  echo 'output' | python record_outputs.py <match_id> <agent_id>", file=sys.stderr)
+        print("  echo '{...}' | python record_outputs.py <match_id> --bulk", file=sys.stderr)
+        sys.exit(1)
+
+    match_id = sys.argv[1]
+    mode = sys.argv[2]
+    sys.stdin.reconfigure(encoding="utf-8", errors="replace")
+    stdin_data = sys.stdin.read()
+
+    config = get_config()
+    conn, _ = init_db(config)
+
+    row = conn.execute(
+        "SELECT id, outputs FROM matches WHERE id LIKE ?", (f"{match_id}%",)
+    ).fetchone()
+    if not row:
+        print(f"Match '{match_id}' not found.", file=sys.stderr)
+        sys.exit(1)
+
+    full_id = row[0]
+    existing = json.loads(row[1]) if row[1] else {}
+
+    if mode == "--bulk":
+        new_outputs = json.loads(stdin_data)
+        existing.update(new_outputs)
+    else:
+        agent_id = mode
+        existing[agent_id] = stdin_data.strip()
+
+    conn.execute(
+        "UPDATE matches SET outputs = ? WHERE id = ?",
+        (json.dumps(existing), full_id),
+    )
+    conn.commit()
+    print(f"Recorded {len(existing)} outputs for match {full_id[:8]}")
+    conn.close()
+
+
+if __name__ == "__main__":
+    main()
diff --git a/README.md b/README.md
index 6c9c4aa..2ec8607 100644
--- a/README.md
+++ b/README.md
@@ -7,7 +7,6 @@
 Mobius is an adversarial swarm orchestrator that pits AI agents against each other,
 judges them with a cross-family panel, and evolves the winners — across Anthropic, Google, and OpenAI.
-
 [![CI](https://github.com/AaronGoldsmith/mobius/actions/workflows/ci.yml/badge.svg)](https://github.com/AaronGoldsmith/mobius/actions/workflows/ci.yml)
 [![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
 [![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
@@ -18,16 +17,16 @@ judges them with a cross-family panel, and evolves the winners — across Anthro
 Most agent frameworks run one model, hope for the best, and call it done.
 Mobius takes a different approach: **competition drives quality.**
 
-- **5 agents** tackle every task in parallel — different providers, different strategies
-- **3 judges** from different model families score the outputs (no home-field advantage)
+- **5 agents** (configurable) tackle every task in parallel — different providers, different strategies
+- **3 judges** from different model families score the outputs (reduces single-provider bias)
 - **Elo ratings** track who's actually good, not who's hyped
 - **Evolution** breeds winners into new variants; underperformers get retired
 - **Memory** remembers which agents won on similar tasks, so selection gets smarter over time
 
 ```
-Task → Selector → Swarm (parallel) → Judge Panel → Elo Update → Memory
-                      ↓
-          Evolve / Retire / Promote
+Task → Memory Query → Selector → Swarm (parallel) → Judge Panel → Elo Update
+                                     ↓
+                         Evolve / Retire / Promote
 ```
 
 ## Quick start
@@ -36,7 +35,8 @@ Task → Selector → Swarm (parallel) → Judge Panel → Elo Update → Memory
 ```bash
 pip install -e ".[dev]"
 cp .env.example .env   # Add your API keys
 mobius init            # Create database
-mobius bootstrap       # Seed agents (~$1.50) — or /mobius-seed for free
+mobius bootstrap       # Seed agents via API (~$1.50)
+# OR use /mobius-seed for free (requires Claude Code Pro)
 mobius run "Build a CLI that converts CSV to JSON"
 ```
 
@@ -102,12 +102,24 @@ mobius loop --rounds 10   # Self-improvement loop
 mobius leaderboard        # Elo rankings
 mobius scout ./src        # Auto-generate domain agents from your code
 mobius evolve backend     # Improve underperformers in a specialization
+mobius train "task" --rounds N   # Iterative training on a single challenge
 mobius explain            # Show last match's judge reasoning
 mobius stats              # Overview
 mobius agent list         # Browse agents
 mobius agent show <id>    # Agent details
 ```
 
+## Claude Code Skills (Free)
+
+If you use [Claude Code](https://claude.com/claude-code) with a Pro subscription, these skills replace the paid API equivalents:
+
+| Skill | Replaces | What it does |
+|-------|----------|-------------|
+| `/mobius-seed` | `mobius bootstrap` | Opus creates agents directly — same quality, $0 |
+| `/mobius-run --free` | `mobius run` | Haiku subagents compete, Opus judges — $0 |
+| `/mobius-judge` | API judge panel | Opus evaluates outputs locally — $0 |
+| `/mobius-audit` | manual testing | Health checks and guided exploration |
+
 ## How it works
 
 ### Agents
@@ -133,10 +145,14 @@ A panel of three models from different families scores each output on correctnes
 
 After every N matches, Mobius:
 1. Identifies underperformers (low win rate over recent matches)
-2. Takes the best agents and breeds refined variants via Opus
+2. Refines underperformers using judge feedback via Opus
 3. Retires agents on long losing streaks
 4. Promotes consistent winners to champion status
 
+### Memory
+
+After each competition, the winning agent's task is embedded (all-MiniLM-L6-v2) and stored. Future selections query this vector memory to find agents that won on similar past tasks — so the system gets smarter with every match.
+
 ## Architecture
 
 ```
@@ -181,6 +197,8 @@ Mobius is designed to be cheap to run:
 | `MOBIUS_DATA_DIR` | `data` | Where the database lives |
 | `MOBIUS_SWARM_SIZE` | `5` | Agents per competition |
 | `MOBIUS_BUDGET_USD` | `50.0` | Global spending cap |
+| `MOBIUS_AGENT_TIMEOUT_SECONDS` | `120` | Max time per agent execution |
+| `MOBIUS_AGENT_MAX_TURNS` | `10` | Max tool-use turns per agent |
 
 See [`config.py`](src/mobius/config.py) for all tunable parameters.
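
Note on the Elo bookkeeping this patch leans on: the README and skill both promise "Elo changes" after each verdict, but never show the arithmetic. The sketch below is the textbook logistic Elo update, offered purely for orientation. It is an assumption, not Mobius's actual code: the function names, the K-factor of 32, and the 400-point scale are standard defaults, and the project's real parameters would live in its own source (per the README, `src/mobius/config.py` holds the tunables).

```python
# Textbook Elo update (illustrative sketch; Mobius's real rule and K-factor
# are assumptions here, not taken from its source).

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the logistic Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_winner: float, r_loser: float, k: float = 32.0) -> tuple[float, float]:
    """Return new (winner, loser) ratings after one decisive match.

    The winner gains k * (1 - expected); the loser loses the same amount,
    so total rating across the pair is conserved.
    """
    delta = k * (1.0 - expected_score(r_winner, r_loser))
    return r_winner + delta, r_loser - delta
```

Under this rule two evenly rated agents (1500 vs 1500) exchange exactly 16 points at K=32, while an upset win over a much higher-rated opponent transfers more, which is why a fresh challenger beating a champion moves the leaderboard noticeably.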