
fix: --resume now reuses existing run directory #192

Closed
jecanore wants to merge 2 commits into aiming-lab:main from jecanore:fix/resume-reuses-existing-run

Conversation

@jecanore
Contributor

Summary

  • Fixed --resume creating a new run directory instead of reusing the existing one. Previously, cmd_run() always generated a fresh run_id/run_dir before checking checkpoints, so --resume without --output would always start from scratch.
  • Added _find_latest_run() helper that auto-detects the most recent run in artifacts/ with a checkpoint.json, reads the original run_id from it, and resumes from the correct stage.
  • Increased ACP timeout from 600s to 1200s to prevent CODE_GENERATION stage timeouts on complex experiment prompts.
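The auto-detection described above could look roughly like the following. This is a minimal sketch, not the PR's actual `_find_latest_run()`: the function name `find_latest_run_sketch`, the use of checkpoint modification time to decide which run is "most recent", and the `"run_id"` key inside checkpoint.json are all assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Optional, Tuple


def find_latest_run_sketch(artifacts_dir: Path) -> Optional[Tuple[Path, str]]:
    """Return (run_dir, run_id) for the newest run with a usable checkpoint.json.

    Hypothetical helper: "newest" is judged by the checkpoint file's mtime,
    and the checkpoint is assumed to be a JSON object with a "run_id" key.
    """
    if not artifacts_dir.is_dir():
        return None

    candidates = []
    for run_dir in artifacts_dir.iterdir():
        checkpoint = run_dir / "checkpoint.json"
        if run_dir.is_dir() and checkpoint.is_file():
            candidates.append((checkpoint.stat().st_mtime, run_dir, checkpoint))

    # Try the most recently written checkpoint first, falling back to older ones.
    for _, run_dir, checkpoint in sorted(candidates, key=lambda c: c[0], reverse=True):
        try:
            data = json.loads(checkpoint.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or corrupt checkpoint: skip this run
        run_id = data.get("run_id") if isinstance(data, dict) else None
        if isinstance(run_id, str) and run_id:
            return run_dir, run_id
    return None
```

A caller handling `--resume` without `--output` would treat a `None` result as "no resumable run" and print an error rather than silently starting a fresh run.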

Test plan

  • 5 new unit tests for _find_latest_run() and resume error handling
  • All 1264 existing tests pass (1 pre-existing failure in test_search_arxiv_mock unrelated to this change)
  • Manual: researchclaw run --resume finds latest run and resumes from checkpoint
  • Manual: researchclaw run --resume with no existing runs prints clear error

🤖 Generated with Claude Code

jecanore and others added 2 commits March 16, 2026 17:36
- Add resolve_config_path() to search for config.arc.yaml then config.yaml
- Change --config default to None (auto-detect) on run/validate/doctor
- Add _resolve_config_or_exit() helper with init hint on missing config
- Add `researchclaw init` subcommand with interactive provider selection
- String-based template replacement preserves YAML comments
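
The config lookup order named in this commit (config.arc.yaml first, then config.yaml) can be sketched as below. This is an illustration under stated assumptions, not the project's `resolve_config_path()`: the name `resolve_config_path_sketch`, its parameters, and the choice to search only a single directory are hypothetical.

```python
from pathlib import Path
from typing import Optional

# Search order taken from the commit message; everything else is assumed.
CONFIG_CANDIDATES = ("config.arc.yaml", "config.yaml")


def resolve_config_path_sketch(
    explicit: Optional[str] = None,
    search_dir: Optional[Path] = None,
) -> Optional[Path]:
    """Return an explicit --config path if given, else the first candidate found."""
    if explicit is not None:
        return Path(explicit)  # user-supplied path wins; existence is checked by the caller
    base = search_dir if search_dir is not None else Path.cwd()
    for name in CONFIG_CANDIDATES:
        candidate = base / name
        if candidate.is_file():
            return candidate
    return None  # caller can print a hint to run `researchclaw init`
```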
… new one

Previously, `researchclaw run --resume` always generated a fresh run_id and
run_dir before checking for checkpoints, so it would look for a checkpoint
in the new empty directory, find nothing, and start from scratch.

Now --resume auto-detects the most recent run in artifacts/ that has a
checkpoint.json, reads the original run_id from it, and resumes from the
next stage. Also increases ACP timeout from 600s to 1200s to prevent
CODE_GENERATION timeouts on complex experiments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Jiaaqiliu
Collaborator

Thanks for the contribution! The core improvements in this PR (resolve_config_path, cmd_init, config auto-detection, resume improvements) have already been incorporated into the current codebase through prior merges. The remaining unique change (_find_latest_run with generic matching) differs from the current approach, which matches by topic hash and is a more precise strategy. Closing as superseded.

@Jiaaqiliu Jiaaqiliu closed this Apr 1, 2026