Capture SWE-bench patches from failed runs #751

Draft
neubig wants to merge 4 commits into main from capture-swebench-error-patches

neubig commented Jun 12, 2026
Member

Summary

  • preserve test_result on SWE-bench error outputs so failed runs can still be scored when they produced a patch
  • stage failed SWE-bench workspaces before collecting failure patches, so newly-created files are included
  • collect failed-run patches from the staged index with git diff --cached <base_commit> because failed runs do not commit and therefore do not advance HEAD
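
A minimal sketch of the stage-then-diff sequence described above. The `Runner` callable is a hypothetical stand-in for `RemoteWorkspace.execute_command()`, whose real signature and return type are not shown in this conversation; `staged_patch` is likewise an illustrative name, not a function from the PR.

```python
import shlex
from typing import Callable, Optional, Tuple

# Hypothetical runner: takes a shell command, returns (exit_code, stdout).
# It stands in for RemoteWorkspace.execute_command(); the real API may differ.
Runner = Callable[[str], Tuple[int, str]]


def staged_patch(run: Runner, repo_path: str, base_commit: str) -> Optional[str]:
    """Stage every change, then diff the index against base_commit."""
    cd = f"cd {shlex.quote(repo_path)}"
    # `git add -A` stages new, modified, and deleted files without committing,
    # so files the agent created but never tracked end up in the patch.
    code, _ = run(f"{cd} && git add -A")
    if code != 0:
        return None
    # A failed run never commits, so HEAD still points at the starting commit.
    # Comparing the staged index to base_commit captures everything changed.
    code, out = run(f"{cd} && git diff --cached {shlex.quote(base_commit)}")
    return out if code == 0 else None
```

Passing the runner in as a parameter keeps the sketch independent of any particular workspace backend, which mirrors the backend-agnostic intent stated under Scope.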

Scope

This PR is workspace-backend agnostic. It uses the existing RemoteWorkspace.execute_command() API and applies to Docker, Apptainer, and other remote workspaces where /workspace/<repo> is a git repo.

Apptainer-specific workspace/image changes are kept out of this PR and live in #745.

Verification

  • PYTHONPATH="$PWD" /home/gneubig/work/openhands-benchmarks-venv/bin/python -m pytest tests/test_swebench_failure_patch_capture.py -q
  • ADP partial-scoring smoke test: a captured error patch from a stuck run was applied and scored successfully by the SWE-bench scorer.

Draft while the full ADP SWE-bench experiment continues running.

Issue

Related: #749

neubig changed the title from "Capture SWE-bench patches on failed runs" to "Capture SWE-bench failure patches and bind Apptainer workspaces" on Jun 12, 2026
neubig force-pushed the capture-swebench-error-patches branch from 75930f2 to 127766b on June 12, 2026 13:00
neubig changed the title from "Capture SWE-bench failure patches and bind Apptainer workspaces" to "Capture SWE-bench patches from failed runs" on Jun 12, 2026

neubig commented Jun 12, 2026

Member Author

@OpenHands /review

Please run an AI review of this PR, which is now based on main and contains only the generic failed-run patch capture. CI is green, and the Apptainer-specific changes have been split into #745.

openhands-ai Bot commented Jun 12, 2026

Uh oh! There was an unexpected error starting the job :(

neubig commented Jun 12, 2026
Member Author

@OpenHands /review

openhands-ai Bot commented Jun 12, 2026

I'm on it! neubig can track my progress at all-hands.dev

neubig commented Jun 12, 2026

Member Author

Code Review

Overall: ✅ LGTM

The PR is well-designed and implements a clean solution for capturing git patches from failed SWE-bench runs.

What this PR does:

  1. Adds a collect_failure_test_result() hook to the base Evaluation class that benchmarks can override
  2. SWE-bench implementation stages the workspace with git add -A and captures a patch with git diff --cached
  3. Preserves test_result in error outputs so failed runs can still be scored
  4. Properly handles exceptions during patch collection with warnings
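
To make the hook pattern in the list above concrete, here is a simplified sketch. The method signatures, the `error_output` helper, and the `git_patch` key are assumptions made for illustration; only the names `Evaluation` and `collect_failure_test_result()` come from this PR, and the real classes may look different.

```python
from typing import Any, Dict, Optional


class Evaluation:
    """Simplified stand-in for the base class; not the real implementation."""

    def collect_failure_test_result(
        self, instance_id: str, error: Exception
    ) -> Dict[str, Any]:
        # Default hook: nothing to recover. `error` is unused on purpose,
        # because benchmark subclasses are expected to override this method.
        return {}

    def error_output(
        self,
        instance_id: str,
        error: Exception,
        test_result: Optional[Dict[str, Any]] = None,
    ) -> Dict[str, Any]:
        # dict(x or {}) copies the mapping and tolerates None, so the error
        # record always carries a dict under "test_result".
        return {
            "instance_id": instance_id,
            "error": str(error),
            "test_result": dict(test_result or {}),
        }


class SWEBenchEvaluation(Evaluation):
    def collect_failure_test_result(
        self, instance_id: str, error: Exception
    ) -> Dict[str, Any]:
        # Placeholder patch text; the real override would read the staged
        # diff from the failed workspace instead.
        return {"git_patch": "diff --git a/f.py b/f.py\n"}
```

With this shape, a failed instance still produces an output record whose `test_result` a scorer can read, while benchmarks that do not override the hook keep an empty `test_result`.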

Strengths:

  • Clean architecture: The hook is in the base class, making it reusable across benchmarks
  • Proper error handling: Both git add and git diff failures are handled gracefully with warnings
  • Good separation of concerns: get_repo_path(), get_git_patch(), get_staged_git_patch(), and collect_failure_test_result() are all separate, focused methods
  • Defensive programming: Uses dict(test_result or {}) to avoid issues with None values
  • Comprehensive tests: Two tests covering both base class preservation and SWE-bench-specific behavior
  • Documentation: Good comments explaining the intent

Verified:

  • ✅ Pre-commit checks pass (Ruff format, Ruff lint, PEP8, Pyright)
  • ✅ Tests pass (2/2)

Minor notes (not blockers):

  • The unused error parameter in the base collect_failure_test_result hook is fine since it's meant to be overridden
  • The cast(RemoteWorkspace, workspace) in tests is necessary for type checking

This is a solid implementation that cleanly solves the problem of capturing partial outputs from failed runs.

openhands-ai Bot commented Jun 12, 2026

OpenHands encountered an error: **Failed to send message to agent server: HTTP 503 error: no available server**

See the conversation for more information.

neubig commented Jun 12, 2026
Member Author

@OpenHands /review

openhands-ai Bot commented Jun 12, 2026

Uh oh! There was an unexpected error starting the job :(

neubig commented Jun 12, 2026
Member Author

@OpenHands /review

openhands-ai Bot commented Jun 12, 2026

I'm on it! neubig can track my progress at all-hands.dev

neubig commented Jun 12, 2026

Member Author

Code Review

Overall: ✅ LGTM

The PR is well-designed and implements a clean solution for capturing git patches from failed SWE-bench runs. The tests pass and pre-commit checks are clean.

What this PR does:

  1. Adds a collect_failure_test_result() hook to the base Evaluation class that benchmarks can override
  2. SWE-bench implementation stages the workspace with git add -A and captures a patch with git diff --cached
  3. Preserves test_result in error outputs so failed runs can still be scored
  4. Properly handles exceptions during patch collection with appropriate warnings

Strengths:

  • Clean hook design: The collect_failure_test_result() hook in Evaluation is a minimal, well-documented interface that allows other benchmarks to add their own failure recovery logic
  • Workspace-agnostic: Uses the existing RemoteWorkspace.execute_command() API, making it applicable to Docker, Apptainer, and other backends
  • Preserves workspace state: git add -A stages changes without committing, keeping the failed workspace in its final agent-visible state
  • Comprehensive error handling: Both git add failures and patch collection errors are caught and logged as warnings, never blocking evaluation
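
The "never blocking evaluation" behavior noted above could look roughly like the wrapper below. `safe_collect` and the log message are hypothetical names chosen for this sketch, not code from the PR.

```python
import logging
from typing import Any, Callable, Dict

logger = logging.getLogger(__name__)


def safe_collect(collect: Callable[[], Dict[str, Any]]) -> Dict[str, Any]:
    """Run a failure-result collector; log and return {} if it raises."""
    try:
        # Copy the result and tolerate a collector that returns None.
        return dict(collect() or {})
    except Exception as exc:
        # Patch capture is best effort: a broken workspace should produce
        # a warning, not a second failure on top of the original error.
        logger.warning("Failed to collect failure test_result: %s", exc)
        return {}
```

Catching the broad `Exception` here is a deliberate trade-off for a best-effort recovery path; the original run error is still what gets reported for the instance.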

Files changed:

  • benchmarks/utils/evaluation.py: +34 lines for the hook and integration
  • benchmarks/swebench/run_infer.py: +84 lines for SWE-bench specific implementation
  • tests/test_swebench_failure_patch_capture.py: +94 lines for comprehensive test coverage

Test results:

tests/test_swebench_failure_patch_capture.py::test_error_output_preserves_failure_test_result PASSED
tests/test_swebench_failure_patch_capture.py::test_swebench_collect_failure_test_result_gets_git_patch PASSED

Recommendation:

Approve - the implementation is clean, well-tested, and follows the existing codebase patterns. Ready for merge when the draft status is lifted.

openhands-ai Bot commented Jun 12, 2026

OpenHands encountered an error: **Failed to send message to agent server: HTTP 503 error: no available server**

See the conversation for more information.
