discovery-forge

A hands-on demo of the annotation → prompt-improvement loop on top of W&B Weave.

A single agent runs once a day to survey autonomous research tools in the experiment-automation space and emit a structured feed (items/*.json + manifest.json). Tools accumulate into a global registry, so each run only profiles tools it hasn't seen before. Humans review/annotate the per-tool Weave traces, and that feedback drives edits to one prompt (researcher.md).

This Is the Hands-On Starting Point

For workshop participants: This repository's main branch is the hands-on starting point. There is no separate setup branch to check out — clone the repo, stay on main, and follow the steps below.

discovery-forge is built to be edited live during the session. You will:

Run the daily ResearcherAgent loop and watch the Weave traces it produces.
Annotate those research_run_<i> traces in Weave.
Use the project skills to turn that feedback into a prompt-only change to researcher.md, then rerun and compare.
Close the loop with offline evaluation against a published Weave dataset.

Everything you touch lives in this repo: the agent prompt (researcher.md) and the improvement skills (skills/).

Companion Project: AI Engineering Dojo

This hands-on is the companion to the AI Engineering Dojo: https://github.com/wandb/ai_engineering_dojo.

The AI Engineering Dojo is the learning-content repository for building, evaluating, monitoring, and continuously improving Auto Research Agents. Its main learning subject is Discovery Forge — the Dojo owns the learning content structure, domain specification, evaluation/monitoring concepts, and W&B Weave integration guidance, while this repository is the Discovery Forge codebase that participants actually run and improve.

In short: read the AI Engineering Dojo for the lesson flow, domain spec, and evaluation design, and use this repo as the live ResearcherAgent implementation you edit during the session.

Overview

A single agent handles both discovery and profiling. It runs sequentially, up to --max-tools times.

main.py
  └─ run_research()
      └─ orchestrator.run_briefing()
          └─ ResearcherAgent (×N)  — find ONE experiment-automation tool,
                                     verify it, and save a canonical profile
                                     (or reject it)
                                     → _registry/profiles/{slug}.md
                                     → items/{slug}.json + manifest.json

Each run is told what is already covered (an exclusion list built from the registry + this run's rejections) so the N runs don't converge on the same tools.
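
As a rough illustration of how such an exclusion list might be assembled, the sketch below merges slugs read from a registry index with the slugs rejected so far in the current run. The `slug` field name and both helper functions are assumptions made for this example; the repo's own orchestrator may build the list differently.

```python
import json
from pathlib import Path


def load_registry_slugs(tools_index: Path) -> list[str]:
    """Read slugs from a JSONL index (assumed: one JSON object per line with a "slug" key)."""
    if not tools_index.exists():
        return []
    slugs = []
    for line in tools_index.read_text(encoding="utf-8").splitlines():
        if line.strip():
            slugs.append(json.loads(line)["slug"])
    return slugs


def build_exclusion_list(registry_slugs: list[str], rejected_slugs: list[str]) -> list[str]:
    """Merge known and rejected slugs, dropping duplicates while keeping first-seen order."""
    return list(dict.fromkeys([*registry_slugs, *rejected_slugs]))
```

Keeping first-seen order makes the list stable between runs, which helps when it is pasted into a prompt and compared across traces.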

Scope: only tools that automate the "hypothesis → experiment → result → write" cycle. Deep-research tools (web-search + summarization only), curated lists, and survey pages are rejected by the agent's scope filter.

Agent Architecture

The orchestrator (orchestrator.py) loads the tool registry once, then runs the ResearcherAgent sequentially up to --max-tools times. Each run gets an exclusion list (registry + this run's saves) and a recency hint, writes its own search_web queries, verifies a candidate, and either profiles or rejects it. A CostBudget enforces --max-cost-usd with graceful shutdown, and after the loop the feed builder emits items/*.json + manifest.json.
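
The budget-bounded loop described above can be pictured with a minimal sketch. The `CostBudget` class here is a hypothetical stand-in written for this example, not the class from orchestrator.py, and `run_once` is an assumed callable that returns the cost in USD of one ResearcherAgent run.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CostBudget:
    """Hypothetical cost tracker; the real CostBudget in the repo may differ."""

    max_cost_usd: float
    spent_usd: float = 0.0

    def add(self, cost_usd: float) -> None:
        self.spent_usd += cost_usd

    @property
    def exceeded(self) -> bool:
        # True once cumulative spend has gone past the ceiling.
        return self.spent_usd > self.max_cost_usd


def run_loop(max_tools: int, budget: CostBudget, run_once: Callable[[int], float]) -> int:
    """Run up to max_tools iterations and stop early once the budget is exceeded."""
    completed = 0
    for i in range(1, max_tools + 1):
        if budget.exceeded:
            break  # graceful shutdown: skip the remaining runs, keep earlier outputs
        budget.add(run_once(i))
        completed += 1
    return completed
```

Checking the budget before each run, rather than aborting mid-run, is what lets outputs already written stay usable for the feed builder.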

The ResearcherAgent (gpt-5.4-mini) has one job: find a new in-scope tool, verify it, and profile or reject it. Its tools are search_web, is_known_tool, fetch_github_metadata_tool, save_source_tool, save_tool_profile_tool, save_rejected_profile_tool, and report_no_new_tool. The scope filter ("does this tool actually run experiments?") is the most important safety net and the main target of the feedback loop — deep-research tools and curated lists slipping in is the most common failure, and exactly what reviewers annotate.

Each run becomes one independent Weave root call, so reviewers can open and annotate it directly:

📦 research_run_1                  ← independent root call for human review
  └─ ResearcherAgent               ← agent span for the tool investigation
      ├─ openai.responses.create   ← model decides next action
      ├─ search_web                ← configured search backend tool
      ├─ fetch_github_metadata_tool
      └─ save_tool_profile_tool    ← accepted profile persistence
📦 research_run_2                  ← one root call per tool (rejected tools too)

For the lesson narrative around this agent (what it does and why), see the AI Engineering Dojo understand-the-agent chapter.

Daily Accumulation Model

The registry is global and persistent; each day writes its own run folder. Known tools (is_known_tool(url)) are skipped to save search / LLM cost, accepted profiles go straight into the registry, and metadata changes (stars, last commit) are logged to _updated_tools.jsonl. Every run is recorded in _profile_runs.jsonl with the run_id, status, workflow_name, agent_trace_id, prompt hash, and weave_call_id so a human can review and annotate that tool's trace directly.

daily_runs/
├── _registry/                  ← global, persistent across days
│   ├── tools.jsonl             # cumulative tool index
│   └── profiles/{slug}.md      # canonical ToolProfile per tool
│
└── {day}/                      ← per-day run + change log
    ├── run_metadata.json       # run_id, prompt hashes/refs, counts, tokens, cost
    ├── manifest.json           # Agentforge feed manifest
    ├── items/{slug}.json       # structured feed items
    ├── _profile_runs.jsonl     # per-tool trace links (incl. weave_call_id)
    ├── _new_candidates.jsonl   # tools profiled for the first time
    ├── _updated_tools.jsonl    # tools whose stars/last_commit changed
    └── _rejected_profiles.jsonl  # out-of-scope tools (with reasons)
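
To find the traces to review for a given day, one option is to read that day's _profile_runs.jsonl and list the Weave call ids. This is a minimal sketch that assumes the keys are spelled `workflow_name` and `weave_call_id` as in the field list above. The helper names are made up for this example, and the path reuses the example day from this README.

```python
import json
from pathlib import Path


def parse_jsonl(text: str) -> list[dict]:
    """Parse JSONL text into a list of records, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def review_links(records: list[dict]) -> list[tuple[str, str]]:
    """Return (workflow_name, weave_call_id) pairs for records that carry a call id."""
    return [
        (r.get("workflow_name", "?"), r["weave_call_id"])
        for r in records
        if r.get("weave_call_id")
    ]


if __name__ == "__main__":
    path = Path("daily_runs/2026-05-19/_profile_runs.jsonl")
    if path.exists():
        for name, call_id in review_links(parse_jsonl(path.read_text(encoding="utf-8"))):
            print(f"{name}\t{call_id}")
```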

The agent writes its own search_web queries from the Query Example Pool in agents/researcher.md plus the exclusion list and recency hint, so query strategy stays prompt-editable. Skill-guided prompt improvement records live beside the prompt, not under daily_runs: src/discovery_forge/agents/improve_history/<day>/{plan,applied}.md.

Install

Requires Python 3.11+ and uv.

Clone the repo and stay on main — that branch is the hands-on starting point.

git clone <repo>
cd discovery-forge
git checkout main   # the hands-on starting point; no separate setup branch
uv sync
npx skills add wandb/skills   # install the official W&B skills
cp .env.example .env
# Fill OPENAI_API_KEY, WANDB_API_KEY, WANDB_ENTITY, WANDB_PROJECT
# and SERPER_API_KEY for the default backend.

Environment variables

| Variable | Required | Purpose |
| --- | --- | --- |
| `OPENAI_API_KEY` | ✅ | OpenAI model calls (and the `openai` search backend) |
| `SERPER_API_KEY` | ✅ | Default ResearcherAgent web search |
| `WANDB_API_KEY` | ✅ | W&B Weave tracing |
| `WANDB_ENTITY` | ✅ | Your W&B entity for Weave traces and eval results |
| `WANDB_PROJECT` | ✅ | W&B project; `.env.example` uses `discovery-forge` for the hands-on |

The minimal .env.example contains the required/default hands-on keys. Add optional keys only when using the matching backend or integration.
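
Before the first run it can help to confirm that the required variables are actually visible to the process. The following is a small sketch, not part of the repo: `missing_keys` is a hypothetical helper, and it only inspects the process environment, so it assumes the values from .env have already been exported or loaded by your tooling.

```python
import os
from collections.abc import Mapping

# Names taken from the environment variable table in this README.
REQUIRED_KEYS = (
    "OPENAI_API_KEY",
    "SERPER_API_KEY",
    "WANDB_API_KEY",
    "WANDB_ENTITY",
    "WANDB_PROJECT",
)


def missing_keys(env: Mapping[str, str], required: tuple[str, ...] = REQUIRED_KEYS) -> list[str]:
    """Return the required names that are unset or blank in env."""
    return [key for key in required if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_keys(os.environ)
    print("Missing: " + ", ".join(missing) if missing else "All required variables are set.")
```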

Usage

Hands-On Default Flow

Start with the two root entrypoint files, main.py and evaluate.py. They are intentionally small so a learner can open each one and follow the flow into src/discovery_forge/.

# 1. Run the daily ResearcherAgent loop
uv run python main.py \
  --day 2026-05-19 \
  --max-tools 5 \
  --max-cost-usd 5 \
  --search-backend serper

# 2. Improve from Weave annotations with the project skill
# read and follow skills/annotation-improvement/SKILL.md

# 3. Run offline eval against the published Weave dataset
uv run python evaluate.py \
  --verdict-dataset-ref '<verdict-dataset-ref>'

# 4. Improve from offline eval failures with the project skill
# read and follow skills/offline-eval-improvement/SKILL.md

Run a daily briefing

uv run python main.py --day 2026-05-19

The default search backend is serper. For a keys-minimal hands-on run, pass --search-backend openai explicitly.

Flags:

| Flag | Default | Purpose |
| --- | --- | --- |
| `--day` | today | ISO day id (e.g. `2026-05-19`) |
| `--output-dir` | `daily_runs` | Base output directory |
| `--max-tools` | 10 | Maximum ResearcherAgent runs (≈ max tools profiled) |
| `--max-cost-usd` | 20.0 | Hard cost ceiling; graceful shutdown on overage |
| `--search-backend` | `serper` | Search backend: `serper`, `perplexity`, or `openai` |
| `--since` | `month` | Restrict search results to this recency window: `day`, `week`, `month`, `year`, or `all` (no date filter). Honored by `serper`/`perplexity`; the `openai` backend can only be nudged via the prompt. |
| `--dry-run` | false | Validate the pipeline with no LLM calls (synthetic profiles) |
| `--rerun` | false | Allow re-running an existing day (auto-backs up the previous folder) |
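
For readers who want to see how the flags and defaults above fit together, here is an argparse sketch reconstructed only from the table. It is an illustration under that assumption and not a copy of main.py, which may define its options differently.

```python
import argparse
from datetime import date


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (reconstruction, not the repo's code)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--day", default=date.today().isoformat(), help="ISO day id")
    parser.add_argument("--output-dir", default="daily_runs")
    parser.add_argument("--max-tools", type=int, default=10)
    parser.add_argument("--max-cost-usd", type=float, default=20.0)
    parser.add_argument(
        "--search-backend",
        choices=["serper", "perplexity", "openai"],
        default="serper",
    )
    parser.add_argument(
        "--since",
        choices=["day", "week", "month", "year", "all"],
        default="month",
    )
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--rerun", action="store_true")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```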

Improve From Weave Annotations

After annotating per-tool research_run_<i> calls in Weave, run the prompt-only improvement loop by following the project skill:

skills/annotation-improvement/SKILL.md

The skill reads the real trace feedback via the Weave SDK, plans the change, edits researcher.md, writes improve_history/<day>/{plan,applied}.md, and publishes the updated prompt. Datasets and scorers stay read-only evidence — the loop only edits the prompt. The Dojo improve-from-annotations and review-and-annotation chapters cover the queue setup and reviewer fields.

Offline evaluation

The eval dataset is published to Weave as verdict_quality_dataset; evaluate.py loads the pinned ref from src/discovery_forge/evaluation/evaluation_config.yaml:

uv run python evaluate.py                                  # default: verdict_quality ref from evaluation_config.yaml
uv run python evaluate.py --verdict-dataset-key verdict_quality
uv run python evaluate.py --verdict-dataset-ref '<ref>'    # override

To publish a new dataset version, use discovery_forge.evaluation.datasets.publish_eval_dataset and update the matching ref in evaluation_config.yaml. To improve the prompt from failed eval rows, follow skills/offline-eval-improvement/SKILL.md with the dataset, start prompt, baseline evaluation ID, and max iteration count from the same config. See the Dojo build-eval-dataset chapter for dataset design.

Prompt versioning

Every non-dry run publishes the instruction file to Weave as researcher_instructions, and the agent uses that registered prompt for the run. run_metadata.json records both prompt_hashes and prompt_refs, and each trace includes the prompt ref.
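
A prompt hash makes it easy to tell whether two runs used the same instruction text. The sketch below shows one way such a hash could be computed. The choice of SHA-256 truncated to 12 hex characters is an assumption for illustration, since this README does not state the scheme the repo uses, and `prompt_hash` is a hypothetical helper.

```python
import hashlib


def prompt_hash(text: str, length: int = 12) -> str:
    """Return a short, stable fingerprint of the prompt text (assumed scheme: truncated SHA-256)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:length]


if __name__ == "__main__":
    before = "Only accept tools that actually run experiments."
    after = before + " Reject curated lists."
    # Any edit to the prompt text yields a different fingerprint.
    print(prompt_hash(before), prompt_hash(after))
```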

If you configure Weave monitors/scorers in the UI, use the Researcher-* names (Researcher-quality-check, Researcher-category-check, Quality-classifiers). Older Profiler-* feedback is legacy evidence only.

Tests

# Unit tests (LLM mocked, free)
uv run pytest tests/unit/

# E2E smoke test (dry-run, no real API calls)
uv run pytest -m expensive tests/e2e/

# Everything except E2E
uv run pytest tests/ --ignore=tests/e2e

E2E tests are gated behind @pytest.mark.expensive so CI skips them by default.

Cost guardrail: --max-cost-usd (default $20) triggers a graceful shutdown when cumulative API spend exceeds the threshold. Outputs already on disk are kept and the feed is still built from whatever profiles exist.

Customizing the agent prompt

The agent instructions live as Markdown beside the agent implementation under src/discovery_forge/agents/, so you can tune the prompt without touching Python code.

agents/
├── researcher.py        # ResearcherAgent construction
├── researcher_tools.py  # ResearcherAgent tool wiring
├── researcher.md        # scope definition, Query Example Pool, profiling + citation rules
└── improve_history/     # dated prompt improvement plans and applied summaries

Non-goals

Auto-rewriting prompts without any human annotation step
GitHub releases / RSS notifications
HTML / dashboard output
Fully self-learning agents
