Add skill forge lab, /mobius-profile skill, and ideas backlog #6

Merged
AaronGoldsmith merged 8 commits into main from feature/skill-forge-lab
Mar 15, 2026

Conversation

@AaronGoldsmith
Owner

Summary

  • labs/2026-03-14-skill-forge.md — First two-phase competition: 6 agents designed Claude Code skills, then 6 testers functionally validated them against the live DB. Full results, scores, and findings documented.
  • /mobius-profile — Competition winner. Single-script skill that shows agent deep-dives with match history, win/loss analysis, and challenger recommendations. Tested and working.
  • ideas.md — Living backlog of concepts that emerged from the competition (forge skill, match replay, agent factory, self-audit).

Key Finding

The generate → test two-phase pattern was more valuable than any individual skill it produced. Design scores and functional scores diverged significantly: one skill ranked 2nd in design but last in functional testing (it shipped with crashing bugs). This pattern should become repeatable.
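
One way to make that divergence visible is to rank every skill twice, once per phase, and compare the ranks. The sketch below is illustrative only: the `SkillResult` fields, the skill names, and the scores are made up for the example and are not the competition's actual data or Mobius code.

```python
from dataclasses import dataclass


@dataclass
class SkillResult:
    name: str                # hypothetical skill identifier
    design_score: float      # phase 1: score for the written design
    functional_score: float  # phase 2: score from running it against the DB


def rank(results, key):
    """Return {name: 1-based rank}, highest score first."""
    ordered = sorted(results, key=key, reverse=True)
    return {r.name: i + 1 for i, r in enumerate(ordered)}


def rank_divergence(results):
    """Functional rank minus design rank; positive means it dropped in testing."""
    design = rank(results, lambda r: r.design_score)
    functional = rank(results, lambda r: r.functional_score)
    return {name: functional[name] - design[name] for name in design}


# Made-up scores, for illustration only.
results = [
    SkillResult("skill-a", design_score=9.0, functional_score=8.5),
    SkillResult("skill-b", design_score=8.5, functional_score=2.0),
    SkillResult("skill-c", design_score=7.0, functional_score=7.5),
]
divergence = rank_divergence(results)
```

Here `skill-b` looks strong on paper but falls to last place once it is actually run, which is the kind of gap a design-only review would miss.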

Relates to #5 (mobius improve self-diagnosing loop)

Test plan

  • /mobius-profile tested against live DB with 40 agents
  • Edge cases verified (0 matches, nonexistent agents)
  • Lab entry reviewed for accuracy against actual competition results

🤖 Generated with Claude Code

AaronGoldsmith and others added 7 commits March 14, 2026 19:48
…flow

- Add "when Mobius is worth it" section with honest trade-off framing
- Add prerequisites section with mandatory vs optional API keys
- Show all three bootstrap options (API, Claude Code, scout) upfront
- Add Trade-off column to "Why Mobius?" table
- Replace architecture file listing with orchestrator flow diagram
- Add context and caveats to cost table (token range, date, local embeddings)
- Note env-overridable vs code-only config options
- Remove duplicate `mobius train` command entry

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ount

- Fix cost date from March 2025 to March 2026
- Add `mobius agent export` to commands list (exists in CLI but was undocumented)
- Fix scout demo showing 4 agents when --count 5 was requested

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix scout bootstrap: "free API cost" → "~$0.50" (uses Opus API)
- Add scout row to cost table
- Clarify Claude Code note: CLI commands still require API keys
- Standardize subscription naming to "Pro/Team" everywhere
- Clarify Anthropic needed for default judge panel (not just "recommended")
- Change "rate limiting" to "concurrency control" (semaphore in swarm.py)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Covers Pro, Max, and Team without needing updates as plans change.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uirements, and trade-off wording

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
First two-phase competition: 6 agents designed Claude Code skills,
then 6 testers validated them against the live DB. The generate→test
pattern proved more valuable than any individual skill produced.

Relates to #5 (mobius improve self-diagnosing loop)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 15, 2026 04:10

Copilot AI left a comment

Pull request overview

Adds documentation and a new Claude Code skill focused on analyzing a single Mobius agent’s performance, plus updates the project README to better explain when/how to use Mobius and how to bootstrap agents.

Changes:

  • Added a lab write-up documenting the “Skill Forge” two-phase (generate → functional test) competition pattern.
  • Added a new /mobius-profile Claude Code skill (SKILL.md + show_profile.py) to display agent match history and recommend challengers.
  • Added ideas.md backlog and refreshed README positioning, quickstart, and architecture/cost sections.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

Summary per file

| File | Description |
| --- | --- |
| labs/2026-03-14-skill-forge.md | New lab entry documenting the competition and findings (formatting issue in markdown tables). |
| ideas.md | New ideas/backlog document capturing follow-on concepts from the competition. |
| README.md | Expanded framing, quickstart, and architecture/cost explanations; updated command list and skill notes. |
| .claude/skills/mobius-profile/scripts/show_profile.py | New script that prints agent stats/matches and suggests challengers (has correctness issues in loss attribution and match filtering). |
| .claude/skills/mobius-profile/SKILL.md | New skill documentation instructing how to run the profile script and interpret results. |


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0f454f1507

- Remove unused imports (json, row_to_dict)
- Skip voided/undecided matches (winner_id=None) instead of counting as LOSS
- Only count losses against the actual match winner, not all opponents
- Key win/loss counters by slug instead of name to avoid collisions
- Filter retired agents (elo_rating=0) and untested agents from challenger recommendations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
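
The tallying rules described in this commit can be sketched roughly as follows. This is not the code from show_profile.py: the record shapes and the key names (`participants`, `winner_slug`, `total_matches`) are assumptions made for the example, and only the `elo_rating=0` convention for retired agents comes from the commit message.

```python
from collections import Counter


def tally_record(agent_slug, matches):
    """Count wins, plus losses keyed by the slug of the agent that actually won.

    Each match is a dict with hypothetical keys:
      "participants": list of agent slugs in the match
      "winner_slug":  slug of the winner, or None if voided/undecided
    """
    wins = 0
    losses_to = Counter()
    for match in matches:
        if agent_slug not in match["participants"]:
            continue
        winner = match["winner_slug"]
        if winner is None:
            continue  # voided or undecided: neither a win nor a loss
        if winner == agent_slug:
            wins += 1
        else:
            losses_to[winner] += 1  # blame only the winner, not every opponent
    return wins, losses_to


def eligible_challengers(agents, exclude_slug):
    """Drop retired agents (elo_rating == 0) and agents with no matches yet."""
    return [
        a["slug"]
        for a in agents
        if a["slug"] != exclude_slug
        and a["elo_rating"] > 0
        and a["total_matches"] > 0
    ]


# Made-up data, for illustration only.
matches = [
    {"participants": ["a", "b", "c"], "winner_slug": "a"},
    {"participants": ["a", "b", "c"], "winner_slug": "b"},
    {"participants": ["a", "b"], "winner_slug": None},
    {"participants": ["b", "c"], "winner_slug": "c"},
]
agents = [
    {"slug": "a", "elo_rating": 1500, "total_matches": 3},
    {"slug": "b", "elo_rating": 1520, "total_matches": 3},
    {"slug": "old", "elo_rating": 0, "total_matches": 10},
    {"slug": "new", "elo_rating": 1500, "total_matches": 0},
]
wins, losses_to = tally_record("a", matches)
challengers = eligible_challengers(agents, exclude_slug="a")
```

Keying the counters by slug rather than display name means two agents that happen to share a name are never merged into one row.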
@AaronGoldsmith AaronGoldsmith merged commit 96f0075 into main Mar 15, 2026
2 checks passed