What if you could optimize any software system the same way you train a neural network?
AutoPilot is a PyTorch/Lightning-inspired framework for generalized optimization. It brings the rigor and developer experience of deep learning to non-differentiable systems. Structured feedback replaces numerical gradients. State mutations (like code edits or config updates) replace weight updates. The same forward -> loss -> backward -> optimizer.step() loop that trains neural networks now optimizes prompts, heuristics, rule engines, agents, and configurations -- deterministically, with memory, rollback, and policy gating.
Building complex, non-differentiable systems -- like AI agents, RAG pipelines, fraud detection heuristics, or rule-based engines -- is a manual, informal process today. You tweak a prompt or a regex rule, run the system, look at the output, decide if it got better, and repeat.
This process lacks the structured feedback loop that made deep learning iteration so fast:
- No Memory: There is no automatic log of what was already tried. You often re-try the same failed strategy multiple times.
- No Structured Feedback: Evaluation is often "looks right to me." There is no quantitative tracking of metrics across held-out validation sets.
- No Automatic Rollback: When a change makes things worse, you undo it by hand. If you're not sure, you guess.
- No Scalability: One person manually iterating is slow. There is no way to run this overnight, no way to hand it to an autonomous system, and no way to reproduce what happened three experiments ago.
Karpathy's autoresearch proved the loop works. Give an agent code, let it modify and evaluate, keep improvements, discard regressions, repeat. It ran 100 experiments overnight on a single file. But the entire orchestration lives in a markdown prompt. When to keep, when to discard, how to log results, when to revert are all natural language instructions the agent has to interpret correctly every time.
AutoPilot solves this by formalizing the iteration loop into the same structural abstractions that powered the deep learning revolution.
Optimizing any iterative system follows the same structure as training a neural network. AutoPilot formalizes this mapping into a real, typed interface.
In deep learning, you pass data through a model (forward pass). A loss function scores the output. Backpropagation computes gradients that explain how parameters should change. An optimizer reads those gradients and updates the weights. You repeat this in epochs, validate on held-out data, and checkpoint good states.
AutoPilot applies this exact structure to general software optimization:
- Module is your system (agent, rule engine, pipeline), exactly like `nn.Module`.
- Loss wraps an evaluator (Judge, profiler, or test suite) that produces structured feedback (gradients).
- Parameters mark what can be edited (prompts, JSON configs, source files via `PathParameter`).
- Optimizer applies changes based on gradients -- this can be an AI coding agent or a deterministic algorithm.
- Backward propagates structured feedback through the computation graph.
- Step triggers the update to the underlying parameters.
The difference is in what flows through the loop. Gradients can be text, JSON, or any arbitrary Python object. Weight updates can be code edits, file rewrites, or config tweaks. But the structure, the separation of concerns, and the lifecycle are identical to the PyTorch experience you already know.
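The mapping can be made concrete with a toy loop that imports nothing from AutoPilot. This is only a sketch under stated assumptions: `TextParam`, `keyword_loss`, and `optimizer_step` are hypothetical names invented for illustration, not AutoPilot classes. It shows a "gradient" that is a structured record of what is missing, and a "weight update" that is a plain state mutation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextParam:
    """Hypothetical stand-in for an editable parameter (e.g. a prompt)."""
    data: str
    grad: Optional[dict] = None  # structured feedback, not a number


def forward(param: TextParam, item: str) -> str:
    # "Forward pass": the system output is the prompt applied to one item.
    return f"{param.data} {item}"


def keyword_loss(output: str, required: list) -> dict:
    # "Loss": records WHAT is missing instead of returning only a score.
    missing = [kw for kw in required if kw not in output]
    return {"score": len(missing), "missing": missing}


def optimizer_step(param: TextParam) -> None:
    # "Optimizer step": a deterministic state mutation driven by the feedback.
    if param.grad and param.grad["missing"]:
        param.data += " " + " ".join(param.grad["missing"])


prompt = TextParam(data="Summarize:")
required = ["concisely", "in English"]
history = []

for epoch in range(3):
    output = forward(prompt, "the quarterly report")
    feedback = keyword_loss(output, required)
    history.append(feedback["score"])
    prompt.grad = feedback   # "backward": feedback lands on the parameter
    optimizer_step(prompt)   # "step": apply the mutation
    prompt.grad = None       # "zero_grad": clear feedback for the next epoch

print(prompt.data)
print(history)
```

The loss drops to zero after the first update because the optimizer acted on the named gaps rather than on a bare score.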
AutoPilot is built for extreme extensibility. As long as you can define a forward pass and a way to score the result, you can optimize it:
- Prompt & AI Pipelines: Tune system prompts, RAG chunking parameters, or multi-agent routing logic based on LLM-judged evaluations.
- Heuristic & Rule Engines: Evolve fraud detection thresholds, spam filters, or trading algorithms where loss is based on precision/recall metrics.
- Configuration Tuning: Optimize database settings, cache eviction policies, or compiler flags using performance profiling reports as structured gradients.
- Simulation & Game Balancing: Adjust unit stats, physics parameters, or generation seeds based on win-rate or equilibrium metrics.
- Code Performance: Refactor SQL queries or tight loops using `EXPLAIN ANALYZE` plans and profiler outputs as structured feedback for a coding optimizer.
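For the heuristic and rule-engine case, the loss can be built from ordinary classification metrics. The following is a standalone sketch, not AutoPilot code: `evaluate_threshold` and the sample scores are made up for this example. It scores a `score >= threshold` fraud rule by precision and recall and returns the misclassified item indices as structured feedback.

```python
def evaluate_threshold(scores, labels, threshold):
    """Score a `score >= threshold` rule against ground-truth labels."""
    tp = 0
    false_positives, false_negatives = [], []
    for index, (score, is_fraud) in enumerate(zip(scores, labels)):
        flagged = score >= threshold
        if flagged and is_fraud:
            tp += 1
        elif flagged:
            false_positives.append(index)   # flagged a legitimate item
        elif is_fraud:
            false_negatives.append(index)   # missed a fraudulent item
    fp, fn = len(false_positives), len(false_negatives)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
    }


# Illustrative data only.
scores = [0.1, 0.4, 0.6, 0.8, 0.9]
labels = [False, True, False, True, True]

loose = evaluate_threshold(scores, labels, threshold=0.5)
strict = evaluate_threshold(scores, labels, threshold=0.7)
print(loose)
print(strict)
```

Raising the threshold removes the false positive at index 2 but leaves the missed item at index 1, which is the kind of per-item detail a scalar score would hide.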
AutoPilot isn't just a borrowed analogy; it's a structural equivalent that transfers everything ML practitioners know about training loops directly to software engineering:
| ML workflow | AutoPilot workflow |
|---|---|
| Training data | Eval dataset (test cases with ground truth) |
| Forward pass (`model(x)`) | Run the system on eval items (`module(batch)`) |
| Loss computation | Evaluator scores outputs, accumulates structured feedback |
| Backward (`loss.backward()`) | Feedback flows back to fill `param.grad` with "gradients" |
| Optimizer step (`optimizer.step()`) | Optimizer reads gradients and applies state mutations |
| Validation | Run on held-out split to check for regressions |
| Epoch | One full cycle: run all items -> judge -> gradient -> update -> redeploy |
| Overfitting | System tuned for train set quirks, failing on val/test |
| Checkpoint | Store snapshots code/config at each epoch, enabling rollback |
- Model your system as a `Module` with `forward(batch)`. Declare what can change as `Parameter` attributes -- files via `PathParameter`, or custom subclasses for configs, prompts, thresholds.
- Define a Loss that accumulates per-batch feedback in `forward()` and fills `param.grad` with a structured `Gradient` in `backward()`. This isn't just a number -- it tells the optimizer WHERE something failed and WHAT to fix.
- Choose an Optimizer: deterministic (like `RuleOptimizer` -- reads gradients, applies heuristic fixes with zero LLM calls) or LLM-backed (`AgentOptimizer` with `ClaudeCodeAgent` -- reads gradients, edits code and prompts).
- Run the loop -- either a manual PyTorch-style `for epoch` loop, or `Trainer.fit()` which handles batching, validation, callbacks, and gradient accumulation automatically.
- Wire experiment lifecycle for production: `Experiment` is a context manager (`with experiment:`) managing lifecycle and optional `Store` for content-addressed snapshots. `StoreCheckpointCallback` auto-snapshots each epoch. `CheckpointCallback` saves training checkpoints for resumption via `fit(ckpt_path=...)`. `Policy` gates progression and triggers rollback on regression.
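The loss step in the list above (accumulate in `forward()`, assign in `backward()`) can be sketched without the framework. Everything here is hypothetical and only illustrates the pattern: `Failure`, `Param`, and `FeedbackLoss` are invented names, not AutoPilot's `Gradient`, `Parameter`, or `Loss`.

```python
from dataclasses import dataclass, field


@dataclass
class Failure:
    item_id: str
    where: str  # name of the parameter blamed for the failure
    what: str   # description of the change that is needed


@dataclass
class Param:
    name: str
    grad: list = field(default_factory=list)


class FeedbackLoss:
    """Collects failures per batch; hands them to parameters on backward()."""

    def __init__(self, params):
        self._params = {p.name: p for p in params}
        self._pending = []

    def forward(self, outputs, expected):
        # Record one Failure per wrong item and return the batch error rate.
        errors = 0
        for out in outputs:
            want = expected[out["id"]]
            if out["label"] != want:
                errors += 1
                self._pending.append(
                    Failure(out["id"], out["rule"], f"expected {want!r}, got {out['label']!r}")
                )
        return errors / len(outputs)

    def backward(self):
        # Route each accumulated failure to the parameter it blames, then reset.
        for failure in self._pending:
            self._params[failure.where].grad.append(failure)
        self._pending.clear()


greeting, refund = Param("greeting_rule"), Param("refund_rule")
loss = FeedbackLoss([greeting, refund])
expected = {"a": "greeting", "b": "refund", "c": "refund"}

rate1 = loss.forward(
    [{"id": "a", "rule": "greeting_rule", "label": "greeting"},
     {"id": "b", "rule": "refund_rule", "label": "other"}],
    expected,
)
rate2 = loss.forward([{"id": "c", "rule": "refund_rule", "label": "refund"}], expected)
loss.backward()  # feedback from both batches is assigned in one pass
print(refund.grad)
```

Calling `forward()` twice before `backward()` is the accumulation case: the optimizer would see feedback from both batches on `refund_rule` and nothing on `greeting_rule`.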
Two entry points: library (import and compose in Python) and CLI (uv run autopilot ...) for workspace operations -- experiments, store history, status, proposals, and diagnostics.
A hand-rolled `for epoch: run(); eval(); if bad: revert()` loop works for one-off tweaking. It breaks down when you need:
- Structured feedback that tells the optimizer WHERE and WHAT to fix -- `Loss.backward()` produces typed `Gradient` on each `Parameter`, not just "accuracy dropped"
- Gradient accumulation across batches with correct step boundaries -- `accumulate_grad_batches` on `Trainer`, automatic `_should_step` logic in `EpochLoop`
- Train/val split discipline with separate metric phases -- `EpochLoop` switches `module.eval()`, runs `validation_step`, calls `experiment.on_validation_complete` after val
- Policy gating with automatic rollback to the last passing epoch via content-addressed snapshots -- `Policy` returns pass/fail; `EpochOrchestrator` calls `experiment.rollback(last_accepted_epoch)`
- Experiment Records: Reproducibility and rollback via Forest/Tree persistence, `Experiment.state_dict()`, Logger events, and Store snapshots -- no manual bookkeeping
- The same Module working in both a manual loop and an automated Trainer -- progressive disclosure from explicit to orchestrated
AutoPilot standardizes all of this into a composable protocol with the same separation of Module / Loss / Optimizer / Trainer that made PyTorch productive for ML.
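To make the gating-and-rollback item concrete, here is a minimal sketch in plain Python. It is an assumption-laden illustration rather than AutoPilot's `Policy` or `EpochOrchestrator`: `min_gate` and `run_with_rollback` are hypothetical helpers, and snapshots are in-memory deep copies instead of a content-addressed store.

```python
import copy


def min_gate(metric, threshold):
    """Hypothetical gate: passes when metrics[metric] >= threshold."""
    return lambda metrics: metrics.get(metric, float("-inf")) >= threshold


def run_with_rollback(state, updates, evaluate, gate):
    last_good = copy.deepcopy(state)  # baseline snapshot before any update
    last_accepted_epoch = None
    journal = []
    for epoch, update in enumerate(updates):
        update(state)
        if gate(evaluate(state)):
            last_good = copy.deepcopy(state)
            last_accepted_epoch = epoch
            decision = "accepted"
        else:
            # Restore the last passing state in place.
            state.clear()
            state.update(copy.deepcopy(last_good))
            decision = "rolled_back"
        journal.append((epoch, decision, state["threshold"]))
    return last_accepted_epoch, journal


def evaluate(state):
    # Toy metric: accuracy peaks when the threshold is 0.7.
    return {"accuracy": 1.0 - abs(state["threshold"] - 0.7)}


state = {"threshold": 0.5}
updates = [
    lambda s: s.update(threshold=0.6),  # improves accuracy
    lambda s: s.update(threshold=0.9),  # regresses below the gate
    lambda s: s.update(threshold=0.7),  # improves again
]
last_accepted_epoch, journal = run_with_rollback(
    state, updates, evaluate, min_gate("accuracy", 0.85)
)
print(last_accepted_epoch, journal)
```

The second update fails the gate, so the state returns to the value accepted in epoch 0 before the third update is tried.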
Like PyTorch + Lightning, AutoPilot offers two orchestration layers:
Manual loop (PyTorch-style) -- full control, plain Python objects:
```python
from autopilot.ai.agents.claude_code import ClaudeCodeAgent
from autopilot.ai.gradient import ConcatCollator
from autopilot.ai.loss import JudgeLoss
from autopilot.ai.optimizer import AgentOptimizer
from autopilot.core.module.module import Module

# MyModule (a Module subclass), MyJudge, and train_loader come from your project.
module = MyModule()
loss = JudgeLoss(judge=MyJudge(), collator=ConcatCollator())
optimizer = AgentOptimizer(agent=ClaudeCodeAgent(), parameters=module.parameters())

module.train()
for epoch in range(5):
    for batch in train_loader:
        data = module(batch)
        loss(data, batch)   # accumulates per-batch feedback
    loss.backward()         # structured feedback fills param.grad
    optimizer.step()        # optimizer applies improvements (e.g. edits code)
    optimizer.zero_grad()
```

Non-agent optimizer (deterministic) -- no LLM required:
```python
from autopilot.core.optimizer import Optimizer
from autopilot.core.parameter import Parameter


class ThresholdOptimizer(Optimizer):
    def step(self):
        for param in self._parameters:
            if param.grad is not None:
                old = float(param.data)
                param.data = str(old + 0.01 * param.grad.value)
```

Automated loop (Lightning-style) -- define steps, let Trainer handle the rest:
```python
from autopilot.ai.agents.claude_code import ClaudeCodeAgent
from autopilot.ai.optimizer import AgentOptimizer
from autopilot.core.module.autopilot_module import AutoPilotModule
from autopilot.core.trainer.trainer import Trainer


class MyModule(AutoPilotModule):
    def training_step(self, batch, batch_idx):
        return self.forward(batch)

    def configure_optimizers(self):
        return AgentOptimizer(agent=ClaudeCodeAgent(), parameters=self.parameters())


# my_policy, my_experiment, and loader come from your project.
module = MyModule()
trainer = Trainer(callbacks=[...], policy=my_policy, experiment=my_experiment)
trainer.fit(module, train_dataloaders=loader, max_epochs=10)
```

| PyTorch / Lightning | AutoPilot |
|---|---|
| `nn.Module` | `Module` |
| `LightningModule` | `AutoPilotModule` |
| Lightning `Trainer` | `Trainer` |
| `nn.CrossEntropyLoss` | `Loss` / `JudgeLoss` |
| `optim.Adam` | `Optimizer` / `AgentOptimizer` |
| `nn.Parameter` | `Parameter` / `PathParameter` |
| `Tensor` | `Datum` / `Gradient` (can be any object) |
| `torchmetrics.Metric` | `Metric` |
| `EarlyStopping` | `Policy` + `Gate` |
| `ModelCheckpoint` | `Store` + `StoreCheckpointCallback` |
| Autograd engine | `Operator` / `Context` (graph via `Datum.grad_fn`; experiment tree uses `core.node.Node` separately) |
| `Dataset` / `DataLoader` | `Dataset` / `ListDataset` / `DataLoader` |
| Lightning `Callback` | `Callback` |
| Lightning `FitLoop` | `Loop` / `EpochLoop` |
| Data pipeline agent | `GeneratorAgent` (`autopilot.ai.evaluation.generator`) |
| Evaluation agent | `JudgeAgent` (`autopilot.ai.evaluation.judge`) |
See examples/ for runnable, self-contained projects:
- textmatch -- Deterministic rule optimization. Optimizes regex rules using a `RuleOptimizer` and zero LLM calls. Shows the framework without AI dependencies.
- protim -- Agent-driven prompt optimization. Optimizes a prompt file using `AgentOptimizer` and Claude Code.
- multi_module -- Multi-module trainer example. Uses `Trainer.fit()`, `EvalDatum`, `DataModule`; see `data.py`.

Each example directory is its own uv project; run `uv run python run.py` or `uv run python run_trainer.py` per that example's README.
The repo uses an editable src/autopilot layout. From the repo root:
```bash
uv sync && uv run autopilot --help
```

Cold-start CLI sequence (empty disk to first optimization run):

```bash
uv run autopilot workspace init --context 'initialize workspace'          # create .autopilot/ layout
uv run autopilot project init myproj --context 'bootstrap project'        # scaffold project skeleton
uv run autopilot experiment create exp-1 --context 'initial experiment'   # register an experiment
uv run autopilot optimize train --context 'first training epoch'          # run one training epoch
```

Minimal Python example:
```python
from autopilot.core.types import Datum
from autopilot.data.dataset import ListDataset
from autopilot.data.dataloader import DataLoader

loader = DataLoader(ListDataset([Datum(), Datum(), Datum()]), batch_size=1)
batch = next(iter(loader))
print(batch.id)
```

- Uniform, Typed Interface: Compose systems the same way you compose PyTorch components. No string registries, no YAML configs. Instantiate objects, pass them in, call methods.
- Structured Feedback: `backward()` fills `param.grad` with actionable feedback, not just opaque scores. The optimizer reads `param.grad.render()` and `param.render()` to make targeted fixes.
- Real Code/State Versioning: `FileStore` uses SHA-256 content addressing, snapshot manifests, and atomic writes. `store.checkout(epoch)` restores any previous state.
- Experiment History: Reproducibility and rollback via `Experiment` (context manager with `state_dict()`), Forest/Tree persistence, `Logger` for metrics/events, and `Store` for content-addressed snapshots. `StoreCheckpointCallback` auto-snapshots each epoch. Training checkpoints via `CheckpointIO`/`JSONCheckpointIO` and `Trainer.save_checkpoint`/`fit(ckpt_path=...)`.
- Policy Gating: Use `MinGate`, `MaxGate`, `RangeGate`, and `CustomGate` to enforce quality bars and automate early stopping with rollback.
- Experiment Lifecycle: `Experiment` is a context manager (`with experiment:`) that manages store, lifecycle hooks (`on_epoch_complete`, `on_validation_complete`), rollback, and last-accepted-epoch tracking above the training loop.
- Decision Traceability: Every mutating action carries an explicit reason via `--context`. Experiments accumulate an append-only decision journal (`ContextLog`) recording why each epoch was accepted, rejected, or rolled back. Internal components (policy gates, optimizer, trainer) emit context entries automatically through a callback hook. Inspect the journal with `experiment show --context-log`.
- Production Infrastructure: Built-in CLI for experiments, project health, dataset management, diagnostics, and audit trails via `--expose`.
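The versioning bullet combines three ideas: SHA-256 content addressing, per-epoch manifests, and atomic writes. The sketch below shows how they can fit together using only the standard library. It is not the `FileStore` implementation: `TinyStore`, `atomic_write`, and the `objects/` and `manifests/` directory names are assumptions made for this example.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def atomic_write(path: Path, data: bytes) -> None:
    """Write to a temp file in the target directory, then rename into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        os.replace(tmp, path)  # readers see the old file or the new one, never a partial write
    except BaseException:
        os.unlink(tmp)
        raise


class TinyStore:
    """Hypothetical content-addressed snapshot store."""

    def __init__(self, root: Path):
        self.root = root

    def snapshot(self, epoch: int, files: dict) -> dict:
        manifest = {}
        for name, content in files.items():
            digest = hashlib.sha256(content).hexdigest()
            blob = self.root / "objects" / digest
            if not blob.exists():  # identical content is stored only once
                atomic_write(blob, content)
            manifest[name] = digest
        payload = json.dumps(manifest, sort_keys=True).encode()
        atomic_write(self.root / "manifests" / f"{epoch}.json", payload)
        return manifest

    def checkout(self, epoch: int) -> dict:
        manifest = json.loads((self.root / "manifests" / f"{epoch}.json").read_bytes())
        return {
            name: (self.root / "objects" / digest).read_bytes()
            for name, digest in manifest.items()
        }


with tempfile.TemporaryDirectory() as tmpdir:
    store = TinyStore(Path(tmpdir))
    store.snapshot(0, {"prompt.txt": b"v1", "rules.json": b"{}"})
    store.snapshot(1, {"prompt.txt": b"v2", "rules.json": b"{}"})
    restored = store.checkout(0)
    blob_count = len(list((Path(tmpdir) / "objects").iterdir()))

print(restored, blob_count)
```

Two snapshots of two files produce only three blobs because the unchanged `rules.json` hashes to the same object, which is what keeps per-epoch snapshots cheap.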
| Command | Role |
|---|---|
| `workspace` | Workspace layout and management |
| `project` | Create, list, and check project health |
| `experiment` | Create, list, and manage experiment slugs |
| `optimize` | Drive the optimization loop |
| `ai` | Dataset generation and judging |
| `store` | Content-addressed code versioning |
| `status` | Experiment overview (epoch, metrics, stop reason) |
| `tree` | Manage exploration trees |
| `query` | Query experiments with composable filters |
| `checkout` | Navigate to an experiment (set HEAD + restore state) |
| `stabilize` | Stabilize experiment results into project root |
| `execute` | Execute Python code/files/modules with tracking |
| `debug` | Debug data collection and execution inspection |
| `diagnose` | Trace diagnostics and node heatmaps |
| `propose` | Create, verify, revert, and list proposals |
| `policy` | Policy checks and explanations |
| `dataset` | Dataset registry and splits |
| `report` | Reports and comparisons |
| `trace` | Trace collection and inspection |
Run uv run autopilot <command> --help for subcommands and flags.
Every mutating CLI command requires --context 'reason' explaining why the action is taken. This reason is recorded in two places:
- Execution record (`executions.jsonl`): the `context` field on the dispatch-level `ExecutionRecord`.
- Experiment journal: when an experiment is active, the reason is appended to the experiment's `ContextLog` via `add_context(source='user')`.
Internal components also emit context entries automatically:
- Trainer: emits on experiment completion and failure.
- Policy gates: emit on epoch accept/reject decisions.
- AgentOptimizer: emits after successful agentic steps with gradient summaries.
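An append-only journal with per-source entries can be sketched in a few lines. This is an illustration only: `Journal` and `ContextEntry` are hypothetical names and do not reflect the real `ContextLog` API. The `source` and `limit` filters are loosely modeled on the idea of filtering a journal by emitter and recency.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ContextEntry:
    source: str   # who emitted the entry: "user", "policy", "trainer", ...
    reason: str   # why the action was taken
    epoch: Optional[int] = None
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


class Journal:
    """Hypothetical append-only journal: entries are added and read, never edited."""

    def __init__(self):
        self._entries = []

    def add(self, source, reason, epoch=None):
        entry = ContextEntry(source=source, reason=reason, epoch=epoch)
        self._entries.append(entry)
        return entry

    def entries(self, source=None, limit=None):
        selected = [e for e in self._entries if source is None or e.source == source]
        if limit is not None:
            selected = selected[max(len(selected) - limit, 0):]  # most recent N
        return tuple(selected)  # immutable view, so callers cannot mutate the log

    def to_jsonl(self):
        return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in self._entries)


journal = Journal()
journal.add("user", "initial experiment")
journal.add("policy", "accuracy 0.91 >= 0.85, accepted", epoch=0)
journal.add("policy", "accuracy 0.80 < 0.85, rolled back", epoch=1)
journal.add("trainer", "experiment complete")

policy_entries = journal.entries(source="policy")
latest = journal.entries(limit=1)
lines = journal.to_jsonl().splitlines()
print(len(policy_entries), latest[0].source, len(lines))
```

Frozen entries plus a tuple-returning reader keep the history tamper-resistant in process, while one JSON object per line keeps it easy to append to and search on disk.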
Context logs are append-only and JSON-serializable. Inspect them with:
```bash
uv run autopilot experiment show exp-1 --context-log                           # full journal
uv run autopilot experiment show exp-1 --context-log --context-source policy   # filter by source
uv run autopilot experiment show exp-1 --context-log --limit 5                 # most recent 5 entries
uv run autopilot debug executions list --context-contains 'rollback'           # search execution records
```

```python
from autopilot.core.types import Datum, EvalDatum
from autopilot.core.module.module import Module
from autopilot.core.module.autopilot_module import AutoPilotModule
from autopilot.core.parameter import Parameter
from autopilot.core.gradient import Gradient
from autopilot.core.loss import Loss
from autopilot.core.optimizer import Optimizer
from autopilot.core.trainer.trainer import Trainer
from autopilot.core.metric import Metric
from autopilot.core.experiment import Experiment
from autopilot.core.store.base import Store
from autopilot.core.environment import Environment, LocalEnvironment
from autopilot.core.checkpoint import CheckpointIO, JSONCheckpointIO
from autopilot.core.callbacks.callback import Callback
from autopilot.core.loops.epoch import EpochLoop
from autopilot.data.dataset import ListDataset
from autopilot.data.dataloader import DataLoader
from autopilot.data.datamodule import DataModule, Stage
from autopilot.data.sampler import RandomSampler, SequentialSampler, BatchSampler, WeightedSampler
from autopilot.ai.parameter import PathParameter
from autopilot.ai.loss import JudgeLoss
from autopilot.ai.optimizer import AgentOptimizer
from autopilot.ai.agents.claude_code import ClaudeCodeAgent
from autopilot.ai.environment import IsolatedEnvironment
from autopilot.ai.gradient import ConcatCollator
from autopilot.ai.evaluation.generator import GeneratorAgent
from autopilot.ai.evaluation.judge import JudgeAgent
from autopilot.policy.policy import Policy
```
Development uses uv sync / uv run ... from the repo root; the editable install matches [tool.hatch.build.targets.wheel] / packages = ["src/autopilot"].
```
src/autopilot/
  core/       # Module, Trainer, Loss, Optimizer, Parameter, Gradient, Graph, Experiment, Store,
              # Environment, LocalEnvironment, CheckpointIO, JSONCheckpointIO, Logger, loops, callbacks
  data/       # Dataset, ListDataset, DataLoader, DataModule, Stage,
              # Sampler, RandomSampler, BatchSampler, WeightedSampler,
              # IncrementalSplitter, SplitAssignment
  ai/         # agents, GeneratorAgent, JudgeAgent, optimizers, loss, gradient,
              # AutoPilotExperiment, IsolatedEnvironment, MergeAgent, DatasetFingerprint
  cli/        # argparse entry, commands, context, output
  tracking/   # I/O helpers (utc_now_iso, read_json_dict, atomic/append), execution tracking
  policy/     # Policy, Gate base classes
```
AutoPilot supports multiple projects in one workspace under autopilot/projects/<name>/:
```
workspace/
  autopilot/
    pyproject.toml
    projects/
      my-project/
        cli.py
        trainer.py
        ai/
        experiments/
        datasets/
```
Each project has a cli.py that subclasses AutoPilotCLI and wires components in __init__:
```python
from autopilot.cli.main import AutoPilotCLI


# my_module, MyGenerator, and MyJudge are defined elsewhere in the project.
class MyCLI(AutoPilotCLI, project='my-project'):
    def __init__(self):
        super().__init__()
        self.module = my_module
        self.generator = MyGenerator()  # project adapter; canonical eval type is GeneratorAgent
        self.judge = MyJudge()          # project adapter; canonical eval type is JudgeAgent


MyCLI()()
```

Comprehensive documentation lives in source docstrings. See PHILOSOPHY.md for design principles. CLI command details are in the cli-conventions skill and source docstrings.