Capture SWE-bench patches from failed runs #751
@OpenHands /review Please run an AI review for this now-main-based generic failed-run patch capture PR. CI is green and Apptainer-specific changes have been split into #745.

Uh oh! There was an unexpected error starting the job :(

@OpenHands /review

I'm on it! neubig can track my progress at all-hands.dev
Code Review
Overall: ✅ LGTM
The PR is well-designed and implements a clean solution for capturing git patches from failed SWE-bench runs.
What this PR does:
Strengths:
Verified:
Minor notes (not blockers):
This is a solid implementation that cleanly solves the problem of capturing partial outputs from failed runs.

OpenHands encountered an error: **Failed to send message to agent server: HTTP 503 error: no available server.** See the conversation for more information.

@OpenHands /review

Uh oh! There was an unexpected error starting the job :(

@OpenHands /review

I'm on it! neubig can track my progress at all-hands.dev
Code Review
Overall: ✅ LGTM
The PR is well-designed and implements a clean solution for capturing git patches from failed SWE-bench runs. The tests pass and pre-commit checks are clean.
What this PR does:
Strengths:
Files changed:
Test results:
Recommendation:
Approve - the implementation is clean, well-tested, and follows the existing codebase patterns. Ready for merge when the draft status is lifted.

OpenHands encountered an error: **Failed to send message to agent server: HTTP 503 error: no available server.** See the conversation for more information.
Summary
- `test_result` on SWE-bench error outputs so failed runs can still be scored when they produced a patch
- `git diff --cached <base_commit>` because failed runs do not commit and therefore do not advance `HEAD`

Scope
This PR is workspace-backend agnostic. It uses the existing `RemoteWorkspace.execute_command()` API and applies to Docker, Apptainer, and other remote workspaces where `/workspace/<repo>` is a git repo.

Apptainer-specific workspace/image changes are kept out of this PR and live in #745.
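
For reviewers who want the idea in one place, here is a minimal sketch of the capture step, not the code in this PR. It assumes a hypothetical `Executor` callable that takes a shell command and returns `(exit_code, stdout)`; this stands in for `RemoteWorkspace.execute_command()`, whose real signature and return type are not shown in this description. The `git add -A` staging step, the helper names `capture_patch` and `error_output`, and the `git_patch` key are also assumptions made for illustration.

```python
import shlex
from typing import Callable, Optional, Tuple

# Hypothetical executor type: takes a shell command, returns (exit_code, stdout).
# It stands in for a workspace call such as RemoteWorkspace.execute_command();
# the real API may look different.
Executor = Callable[[str], Tuple[int, str]]


def capture_patch(run: Executor, repo_dir: str, base_commit: str) -> Optional[str]:
    """Return the staged diff against base_commit, or None if capture fails."""
    repo = shlex.quote(repo_dir)
    # Assumption: stage everything first so new, untracked files appear in the diff.
    code, _ = run(f"git -C {repo} add -A")
    if code != 0:
        return None
    # Compare the index with the base commit. HEAD is not consulted, which
    # matters because a failed run never committed its changes.
    code, out = run(f"git -C {repo} diff --cached {shlex.quote(base_commit)}")
    if code != 0 or not out.strip():
        return None
    return out


def error_output(instance_id: str, error: str, patch: Optional[str]) -> dict:
    """Build an error record that still carries the patch when one exists."""
    record = {"instance_id": instance_id, "error": error, "test_result": {}}
    if patch is not None:
        # "git_patch" is an assumed key name, used here only for illustration.
        record["test_result"]["git_patch"] = patch
    return record
```

Because the executor is injected, the same function would work against any backend that can run a shell command inside `/workspace/<repo>`, and it can be exercised in unit tests with a fake executor instead of a live workspace.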
Verification
PYTHONPATH="$PWD" /home/gneubig/work/openhands-benchmarks-venv/bin/python -m pytest tests/test_swebench_failure_patch_capture.py -q

Draft while the full ADP SWE-bench experiment continues running.
Issue
Related: #749