Add skill forge lab, /mobius-profile skill, and ideas backlog#6
AaronGoldsmith merged 8 commits into `main`
Conversation
…flow
- Add "when Mobius is worth it" section with honest trade-off framing
- Add prerequisites section with mandatory vs optional API keys
- Show all three bootstrap options (API, Claude Code, scout) upfront
- Add Trade-off column to "Why Mobius?" table
- Replace architecture file listing with orchestrator flow diagram
- Add context and caveats to cost table (token range, date, local embeddings)
- Note env-overridable vs code-only config options
- Remove duplicate `mobius train` command entry

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ount
- Fix cost date from March 2025 to March 2026
- Add `mobius agent export` to commands list (exists in CLI but was undocumented)
- Fix scout demo showing 4 agents when `--count 5` was requested
- Fix scout bootstrap: "free API cost" → "~$0.50" (uses Opus API)
- Add scout row to cost table
- Clarify Claude Code note: CLI commands still require API keys
- Standardize subscription naming to "Pro/Team" everywhere
- Clarify Anthropic needed for default judge panel (not just "recommended")
- Change "rate limiting" to "concurrency control" (semaphore in swarm.py)
Covers Pro, Max, and Team without needing updates as plans change.
…uirements, and trade-off wording
First two-phase competition: 6 agents designed Claude Code skills, then 6 testers validated them against the live DB. The generate → test pattern proved more valuable than any individual skill produced.

Relates to #5 (mobius improve self-diagnosing loop)
Pull request overview
Adds documentation and a new Claude Code skill focused on analyzing a single Mobius agent’s performance, plus updates the project README to better explain when/how to use Mobius and how to bootstrap agents.
Changes:
- Added a lab write-up documenting the “Skill Forge” two-phase (generate → functional test) competition pattern.
- Added a new `/mobius-profile` Claude Code skill (`SKILL.md` + `show_profile.py`) to display agent match history and recommend challengers.
- Added an `ideas.md` backlog and refreshed README positioning, quickstart, and architecture/cost sections.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `labs/2026-03-14-skill-forge.md` | New lab entry documenting the competition and findings (formatting issue in markdown tables). |
| `ideas.md` | New ideas/backlog document capturing follow-on concepts from the competition. |
| `README.md` | Expanded framing, quickstart, and architecture/cost explanations; updated command list and skill notes. |
| `.claude/skills/mobius-profile/scripts/show_profile.py` | New script that prints agent stats/matches and suggests challengers (has correctness issues in loss attribution and match filtering). |
| `.claude/skills/mobius-profile/SKILL.md` | New skill documentation instructing how to run the profile script and interpret results. |
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0f454f1507
- Remove unused imports (`json`, `row_to_dict`)
- Skip voided/undecided matches (`winner_id=None`) instead of counting them as losses
- Only count losses against the actual match winner, not all opponents
- Key win/loss counters by slug instead of name to avoid collisions
- Filter retired agents (`elo_rating=0`) and untested agents out of challenger recommendations
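A minimal sketch of the corrected win/loss attribution described in these bullets. The row shape and field names (`participant_slugs`, `winner_slug`) are hypothetical stand-ins for whatever `show_profile.py` actually reads from the DB; the point is the three fixes: skip undecided matches, attribute a loss only to the actual winner, and key counters by slug.

```python
from collections import defaultdict

def tally_records(matches, agent_slug):
    """Count one agent's wins and losses, keyed by opponent slug.

    Voided/undecided matches (winner is None) are skipped rather than
    counted as losses, and a loss is attributed only to the match
    winner, not to every opponent in the match.
    """
    wins = defaultdict(int)
    losses = defaultdict(int)
    for m in matches:
        participants = m["participant_slugs"]
        if agent_slug not in participants or m["winner_slug"] is None:
            continue  # not our match, or voided/undecided
        if m["winner_slug"] == agent_slug:
            for opp in participants:
                if opp != agent_slug:
                    wins[opp] += 1  # a win counts against each opponent
        else:
            losses[m["winner_slug"]] += 1  # a loss counts only against the winner
    return dict(wins), dict(losses)
```

With this shape, a three-way match the agent loses produces exactly one loss entry (against the winner), and a voided match produces nothing.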
Summary
- `labs/2026-03-14-skill-forge.md` — First two-phase competition: 6 agents designed Claude Code skills, then 6 testers functionally validated them against the live DB. Full results, scores, and findings documented.
- `/mobius-profile` — Competition winner. Single-script skill that shows agent deep-dives with match history, win/loss analysis, and challenger recommendations. Tested and working.
- `ideas.md` — Living backlog of concepts that emerged from the competition (forge skill, match replay, agent factory, self-audit).

Key Finding
The generate → test two-phase pattern was more valuable than any individual skill. Design scores and functional scores diverged significantly — one skill scored 2nd in design but last in functional testing (shipped with crashing bugs). This pattern should become repeatable.
Relates to #5 (mobius improve self-diagnosing loop)
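The generate → test pattern above can be sketched as a small ranking function. This is an illustrative model, not the actual competition harness: the names `designs` and `run_functional_tests` are invented, and the real lab scored agents, not dictionaries. It shows the key finding in miniature: ranking by functional score can reorder a design-score ranking.

```python
def two_phase_competition(designs, run_functional_tests):
    """Rank candidate skills by functional test results, not design scores.

    `designs` maps skill name -> design-phase score; `run_functional_tests`
    returns a functional score for a skill name. The final ranking is
    decided by functional score, with design score as a tiebreaker.
    """
    results = []
    for name, design_score in designs.items():
        functional_score = run_functional_tests(name)
        results.append((name, design_score, functional_score))
    results.sort(key=lambda r: (r[2], r[1]), reverse=True)
    return results
```

Under this model, a skill that ranks 2nd in design but crashes in functional testing drops to last place, which is the divergence the lab observed.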
Test plan
- `/mobius-profile` tested against live DB with 40 agents

🤖 Generated with Claude Code