Add autoresearch auto data pruning workflow #2

Draft
ahazeemi wants to merge 4 commits into main from codex-auto-data-pruning-plan-l2DVi

ahazeemi (Owner) commented Apr 5, 2026

Summary

Add an autoresearch/ workflow for iterative auto data pruning experiments.

What changed

  • add the autoresearch-style experiment harness, docs, and runner
  • stream tokenization to shard files to avoid holding full corpora in memory
  • align supported scorer paths and TinyStories-specific documentation
  • clean up redundant comments and dead code in the new workflow
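
The streaming idea in the second bullet can be sketched roughly as follows. This is a minimal illustration, not the code in this branch: `byte_tokenize`, `write_shards`, the shard file naming, and the 16-bit token width are all assumptions made for the example. Documents are tokenized one at a time and full shards are flushed to disk, so memory holds at most one shard plus one document.

```python
import os
from array import array
from typing import Callable, Iterable, List


def byte_tokenize(text: str) -> List[int]:
    """Hypothetical stand-in tokenizer: one token id per UTF-8 byte."""
    return list(text.encode("utf-8"))


def write_shards(
    docs: Iterable[str],
    out_dir: str,
    tokenize: Callable[[str], List[int]] = byte_tokenize,
    shard_tokens: int = 1_000_000,
) -> List[str]:
    """Tokenize documents lazily and flush fixed-size shards to disk."""
    os.makedirs(out_dir, exist_ok=True)
    buffer = array("H")  # unsigned 16-bit ids; assumes vocab size < 65536
    paths: List[str] = []

    def flush(count: int) -> None:
        # Write the first `count` buffered tokens and drop them from memory.
        path = os.path.join(out_dir, f"shard_{len(paths):05d}.bin")
        with open(path, "wb") as f:
            buffer[:count].tofile(f)
        del buffer[:count]
        paths.append(path)

    for doc in docs:  # `docs` can be a generator, so the corpus is never materialized
        buffer.extend(tokenize(doc))
        while len(buffer) >= shard_tokens:
            flush(shard_tokens)
    if buffer:  # final partial shard
        flush(len(buffer))
    return paths
```

Passing a generator that reads the dataset lazily keeps peak memory bounded by `shard_tokens` plus the longest document.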

Why

The branch adds a focused harness for exploring data-pruning strategies with a fixed-budget training loop while keeping the workflow runnable on the default TinyStories setup.

Validation

  • python3 -m compileall autoresearch
  • bash -n autoresearch/run.sh

Notes

This PR is opened as a draft.

claude and others added 4 commits March 15, 2026 00:08
Describes how to adapt Karpathy's autoresearch autonomous experiment
loop to explore data pruning strategies using dPrune's scorer/pruner
pipeline, rather than model architecture changes.

https://claude.ai/code/session_01VGVrW3yjH69rVLLqafhjjc

Implements the autonomous experiment loop for data pruning using dPrune,
following Karpathy's autoresearch pattern:

- prepare.py: Data download, BPE tokenizer, dataloaders, val_bpb eval
- prune.py: Mutable dPrune pipeline (agent's target file)
- train.py: Minimal GPT with fixed 5-min time budget
- program.md: AI agent instructions with search strategy guide
- run.sh: Experiment runner with keep-or-revert git logic
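
Two pieces of the file list above lend themselves to a short sketch: the fixed time budget in train.py and the val_bpb metric in prepare.py. The code below is an illustration under stated assumptions, not the contents of those files. It assumes val_bpb means validation bits per byte, computed from a summed cross-entropy in nats; `bits_per_byte`, `train_with_budget`, and `step_fn` are hypothetical names.

```python
import math
import time
from typing import Callable, Tuple


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed negative log-likelihood (nats) into bits per byte."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)


def train_with_budget(
    step_fn: Callable[[], float],
    budget_seconds: float = 300.0,  # 5 minutes, matching the commit message
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[int, float]:
    """Run training steps until the wall-clock budget is used up.

    `step_fn` is a hypothetical callable that performs one optimizer
    step and returns the training loss for that step.
    """
    deadline = clock() + budget_seconds
    steps = 0
    last_loss = float("nan")
    while clock() < deadline:
        last_loss = step_fn()
        steps += 1
    return steps, last_loss
```

Normalizing by bytes rather than tokens keeps the metric comparable across tokenizers, and a wall-clock budget makes every pruning variant pay the same compute cost regardless of subset size.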

The agent edits prune.py to explore scorer/pruner/ratio combinations,
trains on the pruned subset, and keeps changes that improve val_bpb.

https://claude.ai/code/session_01VGVrW3yjH69rVLLqafhjjc
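
The keep-or-revert step described in this commit can be sketched in Python as below. The actual logic lives in run.sh and may differ; `keep_or_revert`, `run_git`, and the commit message format are assumptions for illustration. The only behavior taken from the commit message is that a change to prune.py is kept when val_bpb improves and reverted otherwise.

```python
import subprocess
from typing import Callable, Optional, Sequence


def run_git(args: Sequence[str]) -> None:
    """Run a git subcommand in the current repository, raising on failure."""
    subprocess.run(["git", *args], check=True)


def keep_or_revert(
    new_bpb: float,
    best_bpb: Optional[float],
    target: str = "prune.py",
    git: Callable[[Sequence[str]], None] = run_git,
) -> float:
    """Commit `target` if val_bpb improved, else restore it. Returns the best score."""
    if best_bpb is None or new_bpb < best_bpb:  # lower bits per byte is better
        git(["commit", "-m", f"keep: val_bpb={new_bpb:.4f}", "--", target])
        return new_bpb
    git(["checkout", "--", target])  # discard the agent's uncommitted edit
    return best_bpb
```

Injecting the `git` callable keeps the decision rule testable without touching a real repository.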
@ahazeemi ahazeemi changed the title [codex] Add autoresearch auto data pruning workflow Add autoresearch auto data pruning workflow Apr 5, 2026

2 participants