
Most agent frameworks run one model, hope for the best, and call it done. Mobius takes a different approach: **competition drives quality.**

Run multiple agents in parallel, have independent judges score them, and evolve the winners.

- **5 agents** (configurable) tackle every task in parallel — different providers, different strategies
- **3 judges** from different model families score the outputs (reduces single-provider bias)
- **Elo ratings** track who's actually good, not who's hyped
- **Evolution** breeds winners into new variants; underperformers get retired
- **Memory** remembers which agents won on similar tasks, so selection gets smarter over time

**When Mobius is worth it:**
- You're uncertain which model/provider is best for your task
- Output quality varies between runs and you need consistency
- You want long-term performance tracking as models change
- You're willing to spend 3-5x more tokens for statistically better outputs

For simple tasks with predictable outputs, a single good model is fine. Mobius is for problems where consistency and optimization matter.

```
Task → Memory Query → Selector → Swarm (parallel) → Judge Panel → Elo Update
           ↓
    Memory Write (winning task embedded for future selection)
```

## Quick start

### Prerequisites

- **Python 3.12+**
- At least one LLM API key (more keys = better judge diversity):
  - **Anthropic** (Claude) — recommended, included in default judge panel
  - **Google** (Gemini) — optional, adds cross-family judging
  - **OpenAI** (GPT) — optional, adds cross-family judging
- **Claude Code subscription** — optional, enables free agent seeding and judging via skills

### Install & run

```bash
pip install -e ".[dev]"
cp .env.example .env # Add your API keys (see .env.example for all options)
mobius init # Create database
```

Bootstrap agents (choose one):

```bash
# Option A: API-driven (~$1.50 one-time cost, automated)
mobius bootstrap

# Option B: Claude Code (free, requires any Claude Code subscription)
# In Claude Code, type: /mobius-seed

# Option C: Domain-specific (uses Opus API to analyze your codebase, ~$0.50)
mobius scout ./my-project --count 5
```

Run your first competition:

```bash
mobius run "Build a CLI that converts CSV to JSON"
```

```
Created: Protocol Engineer (anthropic/claude-sonnet-4-6) specs=[coding, architecture]
Created: Test Specialist (anthropic/claude-sonnet-4-6) specs=[testing, debugging]
Created: Dashboard Dev (anthropic/claude-sonnet-4-6) specs=[frontend, coding]
Created: Spec Auditor (anthropic/claude-sonnet-4-6) specs=[security, code-review]
Created: Perf Optimizer (google/gemini-2.5-flash) specs=[backend, algorithms]

Scout created 5 agents for my-project.
```

> Record your own: `asciinema rec demo.cast` then run some competitions. Scripts in `scripts/`.

## Why Mobius?

| Problem | How Mobius solves it | Trade-off |
|---------|---------------------|-----------|
| Which model is best? | Run all in parallel; judges pick best per task | 3-5x token cost vs. single call |
| High output variance | Consensus scoring from 3 judges reduces outliers | Consensus can be noisy with small panels; ties resolved by score aggregation |
| Creative tasks lack ground truth | Cross-provider judges (Claude + Gemini + GPT) reduce vendor bias | Judges add ~$0.10 per match |
| Agents degrade silently | Elo tracks performance over time; losers get evolved or retired | Requires match history; cold start is random |
| Selection is manual guesswork | Vector memory recalls what worked on similar past tasks | Embedding similarity isn't perfect for all task types |
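The consensus step in the table above can be sketched as plain score aggregation. The function names and the 0-10 scale here are illustrative, not Mobius's actual judging API:

```python
from statistics import median

def consensus_score(judge_scores: dict[str, list[float]]) -> dict[str, float]:
    """Collapse per-judge scores (one list per agent) into one score per agent.

    Median damps a single outlier judge, which a mean would not.
    """
    return {agent: median(scores) for agent, scores in judge_scores.items()}

def rank_agents(judge_scores: dict[str, list[float]]) -> list[str]:
    # Best-first ranking by consensus score.
    scores = consensus_score(judge_scores)
    return sorted(scores, key=scores.get, reverse=True)

# Three judges score two agents; agent-a's outlier 2.0 is damped by the median.
scores = {"agent-a": [8.0, 7.5, 2.0], "agent-b": [6.0, 6.5, 7.0]}
print(rank_agents(scores))  # -> ['agent-a', 'agent-b']
```

With only three judges a single eccentric score still shifts the median, which is exactly the "noisy with small panels" trade-off noted in the table.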

## Commands

```bash
mobius loop --rounds 10 # Self-improvement loop across varied tasks
mobius leaderboard # Elo rankings
mobius scout ./src # Auto-generate domain agents from your code
mobius evolve backend # Improve underperformers in a specialization
mobius train "task" --rounds N # Iterative training on a single challenge
mobius explain # Show last match's judge reasoning
mobius stats # Overview
mobius agent list # Browse agents
mobius agent show <slug> # Agent details
mobius agent export <slug> # Export agent as .claude/agents/ markdown
```

## Claude Code Skills (Free)

If you use [Claude Code](https://claude.com/claude-code) with any Claude Code subscription, these skills replace the paid API equivalents:

| Skill | Replaces | What it does |
|-------|----------|-------------|
Expand Down Expand Up @@ -156,40 +194,45 @@ After each competition, the winning agent's task is embedded (all-MiniLM-L6-v2)

## Architecture

### Request → Execution → Judgment → Learning

```
┌──────────────────────────────────────────────────────────────┐
│ CLI (cli.py) — Typer command handlers                        │
└──────────────────────────┬───────────────────────────────────┘
                           ↓
┌──────────────────────────────────────────────────────────────┐
│ Orchestrator (orchestrator.py) — Coordinates a match         │
│  1. Select agents via selector.py (uses memory.py)           │
│  2. Run agents in parallel via swarm.py                      │
│  3. Judge outputs via judge.py + runner.py                   │
│  4. Update Elo ratings via tournament.py                     │
│  5. Log results to database (db.py)                          │
└──────────────────────────────────────────────────────────────┘
```

**Provider abstraction** — All model calls go through `providers/` (anthropic, google, openai, openrouter), each implementing the same async interface. Concurrency is controlled at the swarm layer via semaphore.
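A rough sketch of that contract and the semaphore-capped fan-out (illustrative names, not the real `providers/base.py` interface):

```python
import asyncio
from abc import ABC, abstractmethod

class Provider(ABC):
    """One implementation per vendor (anthropic, google, openai, openrouter)."""

    @abstractmethod
    async def complete(self, model: str, prompt: str) -> str: ...

async def run_swarm(provider: Provider, model: str,
                    prompts: list[str], limit: int = 5) -> list[str]:
    sem = asyncio.Semaphore(limit)  # caps in-flight requests, as the swarm layer does

    async def one(prompt: str) -> str:
        async with sem:
            return await provider.complete(model, prompt)

    # gather preserves input order, so results line up with prompts.
    return list(await asyncio.gather(*(one(p) for p in prompts)))
```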

**Data layer** — Pydantic models (`models.py`), SQLite with WAL mode (`db.py`), agent CRUD (`registry.py`), and vector similarity via sqlite-vec (`memory.py` + local Sentence-Transformers embeddings).

**Evolution** — `agent_builder.py` uses Opus to create/refine agents; `tournament.py` handles Elo math; `selector.py` picks agents via Specialist, Diverse, or Ensemble strategies.
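The Elo math itself is the standard update rule. K=32 below is illustrative; the actual K-factor is one of the `config.py` tunables:

```python
def expected(r_a: float, r_b: float) -> float:
    # Probability that A beats B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected(r_a, r_b))
    # Zero-sum: whatever A gains, B loses.
    return r_a + delta, r_b - delta

# Equal ratings: the winner gains exactly K/2.
print(elo_update(1500, 1500, a_won=True))  # -> (1516.0, 1484.0)
```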

## Cost

Costs below are for typical tasks (500-2000 tokens per agent attempt):


| What | Model tier | Cost |
|------|-----------|------|
| Swarm execution | Haiku / Flash / GPT-4o-mini | ~$0.01-0.05 per agent |
| Judge panel (API) | Opus + Gemini Pro + GPT-4o | ~$0.10 per match |
| Full competition | 5 agents + 3 judges | ~$0.15-0.35 |
| Bootstrap (one-time) | Opus | ~$1.50 |
**Claude Code users**: `/mobius-seed` and `/mobius-judge` use your Opus subscription directly — same quality, zero API cost. CLI commands (`mobius run`, `mobius bootstrap`, `mobius scout`) still require API keys.

*Costs estimated March 2026. Verify current pricing in your provider dashboards.*

## Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| `MOBIUS_DATA_DIR` | `data` | Where the database lives |
| `MOBIUS_SWARM_SIZE` | `5` | Agents per competition |
| `MOBIUS_BUDGET_USD` | `50.0` | Global spending cap |
| `MOBIUS_AGENT_TIMEOUT_SECONDS` | `120` | Max time per agent execution |
| `MOBIUS_AGENT_MAX_TURNS` | `10` | Max tool-use turns per agent |

See [`config.py`](src/mobius/config.py) for all tunable parameters.
These are the env-overridable settings. Additional parameters (Elo K-factor, promotion thresholds, embedding model, evolution triggers, retirement streaks, and more) can be tuned directly in [`config.py`](src/mobius/config.py).

## Contributing
