Capture SWE-bench patches on failed runs #748

Draft
neubig wants to merge 2 commits into main from capture-swebench-error-patches-main
neubig commented Jun 10, 2026

Member

Summary

  • add a generic failure-result hook that runs before workspace cleanup
  • use the hook in SWE-bench to stage workspace changes and recover a git diff from failed/stuck runs
  • preserve recovered patches in error rows via test_result.git_patch so they can still be scored
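
A minimal sketch of the flow described above, for reviewers who want the ordering spelled out. It is not the code in this PR: the names `recover_patch`, `run_instance`, `on_failure`, and the row layout are hypothetical, and the only detail taken from the summary is that the recovered diff ends up under `test_result.git_patch` and that the hook fires before cleanup.

```python
import subprocess
from typing import Callable, Optional

# Hypothetical runner type: takes git arguments and a working directory, returns stdout.
Runner = Callable[[list, str], str]


def git_runner(args: list, cwd: str) -> str:
    """Run a git command in cwd and return its stdout (raises on non-zero exit)."""
    result = subprocess.run(
        ["git", *args], cwd=cwd, capture_output=True, text=True, check=True
    )
    return result.stdout


def recover_patch(workspace_dir: str, run: Runner = git_runner) -> Optional[str]:
    """Stage every workspace change, then diff the index against HEAD."""
    try:
        run(["add", "-A"], workspace_dir)
        patch = run(["diff", "--cached", "HEAD"], workspace_dir)
    except (subprocess.CalledProcessError, OSError):
        # Recovery is best effort: a broken workspace must not mask the original error.
        return None
    return patch or None


def run_instance(
    instance_id: str,
    work: Callable[[], dict],
    cleanup: Callable[[], None],
    on_failure: Callable[[], Optional[str]],
) -> dict:
    """Run one instance; on failure, call the hook while the workspace still exists."""
    try:
        return {"instance_id": instance_id, "test_result": work(), "error": None}
    except Exception as exc:
        row = {"instance_id": instance_id, "test_result": {}, "error": str(exc)}
        patch = on_failure()  # runs before the finally block below
        if patch:
            row["test_result"]["git_patch"] = patch
        return row
    finally:
        cleanup()  # workspace teardown happens only after the hook has run
```

In a real run, `on_failure` would be something like `lambda: recover_patch(workspace_dir)`; passing the runner as a parameter is only there so the ordering can be exercised without a git checkout.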

Validation

  • /home/gneubig/work/openhands-benchmarks-venv/bin/python -m pytest -q tests/test_swebench_failure_patch_capture.py tests/test_error_output_serialization.py
  • /home/gneubig/work/openhands-benchmarks-venv/bin/python -m ruff check benchmarks/utils/evaluation.py benchmarks/swebench/run_infer.py tests/test_swebench_failure_patch_capture.py
  • /home/gneubig/work/openhands-benchmarks-venv/bin/python -m ruff format --check benchmarks/utils/evaluation.py benchmarks/swebench/run_infer.py tests/test_swebench_failure_patch_capture.py
  • /home/gneubig/work/openhands-benchmarks-venv/bin/python -m py_compile benchmarks/utils/evaluation.py benchmarks/swebench/run_infer.py tests/test_swebench_failure_patch_capture.py

This PR was created by Codex on behalf of Graham Neubig.

Issue

Closes #749


Development

Successfully merging this pull request may close these issues.

Track PR #748: Capture SWE-bench patches on failed runs
