Add SWE-bench Apptainer image builds and runtime wiring#745
Conversation
|
@OpenHands /review Please run an AI review for this Apptainer runtime/image-build PR. CI is green and the generic failed-run patch capture has been split into #751. |
|
Uh oh! There was an unexpected error starting the job :( |
|
@OpenHands /review |
|
I'm on it! neubig can track my progress at all-hands.dev |
|
@OpenHands /review |
|
Uh oh! There was an unexpected error starting the job :( |
|
@OpenHands /review |
|
I'm on it! neubig can track my progress at all-hands.dev |
|
@OpenHands /review |
|
I'm on it! neubig can track my progress at all-hands.dev |
|
@OpenHands /review |
|
I'm on it! neubig can track my progress at all-hands.dev |
|
@OpenHands /review |
|
I'm on it! neubig can track my progress at all-hands.dev |
Codex AI ReviewOverall: LGTM - Recommend approveReviewed current head
I do not see remaining blocking issues in this PR. |
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review: Add SWE-bench Apptainer image builds and runtime wiring
🟡 Acceptable - Solid feature implementation with minor areas for improvement.
Summary
This PR adds local Apptainer SIF builds for SWE-bench evaluation when registry images are unavailable, plus tokenizer and condenser configuration wiring. The feature is well-designed for its intended use case (HPC environments without Docker access).
[CRITICAL ISSUES]
None identified.
[IMPROVEMENT OPPORTUNITIES]
- [benchmarks/swebench/run_infer.py, Line 117] Robustness:
getattr(self, "current_attempt", 1)defaults to 1, but this attribute may not be initialized in all code paths. Consider validating or documenting this behavior.
[TESTING GAPS]
- Tests use
FakeApptainerWorkspaceto capture constructor arguments - this is acceptable for the integration patterns being tested. - Consider adding a test that verifies the workspace directory is cleaned up after use (if applicable to the workflow).
[OTHER OBSERVATIONS]
-
Deleted
apptainer_eval.py: The removal of local Apptainer evaluation is intentional and documented in the README. This is fine. -
SDK Submodule Update: 1.24.0 → 1.27.0 is a significant jump. The SDK commits reference
Fix Apptainer tokenizer binds and token condensation,Count tokens with tokenizer chat templates, andAdd OpenAI chat completions gateway. The addition ofopenaito agent-server dependencies is expected based on the chat completions gateway feature. No concerns. -
Build Caching Strategy: The SIF caching by
(sdk_sha, dockerfile_hash, custom_tag, target)is good - ensures rebuilds when SDK or Dockerfile changes.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟢 LOW
This PR introduces a fallback build path for HPC environments without Docker. The changes are additive and don't affect existing functionality (registry images are still preferred). The Apptainer build logic has proper error handling and test coverage.
VERDICT:
✅ Worth merging - Core logic is sound, minor improvement suggested above.
KEY INSIGHT:
The multi-package-manager support in _package_install_script() is pragmatic and necessary for HPC environments with varied base images.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
|
@OpenHands /review Addressed the |
|
I'm on it! neubig can track my progress at all-hands.dev |
|
@OpenHands /review Pushed 1706c8e to fix the CI-only fake evaluation setup; ready for review on the current head. |
|
I'm on it! neubig can track my progress at all-hands.dev |
|
🔍 Review in progress... |
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review: Add SWE-bench Apptainer image builds and runtime wiring
🟢 Good taste - Well-structured feature with elegant fallback design.
Summary
This PR adds local Apptainer SIF builds for HPC environments where Docker registries are unavailable. The implementation is clean: registry images are preferred, and local builds serve as a graceful fallback. The tokenizer bind mounting and workspace directory isolation are solid additions.
[CRITICAL ISSUES]
None identified.
[IMPROVEMENT OPPORTUNITIES]
- [benchmarks/swebench/run_infer.py, Line 146] Robustness:
os.makedirs(mount_dir, mode=0o700, exist_ok=False)will raiseFileExistsErrorif the directory already exists. Since the mount_dir includes a UUID suffix ({uuid.uuid4().hex[:8]}), collisions should be rare, but in high-concurrency scenarios or quick retries, this could fail unexpectedly. Consider usingexist_ok=Trueas a precaution, or at minimum catch and handleFileExistsErrorgracefully.
[STYLE NOTES]
- The multi-package-manager shell script in
_package_install_script()is intentionally verbose for portability. Acceptable. - The 326-line
apptainer_build.pyis on the larger side but the organization is logical. Consider splitting if it grows further.
[TESTING GAPS]
None identified. The test coverage for:
- Unsupported build targets returning errors
- Definition file content verification
- Forced rebuild preserving existing SIFs
- Workspace/caching behavior with monkeypatching
...is appropriate for this feature.
[OTHER OBSERVATIONS]
-
SDK Submodule Update: 1.24.0 → 1.27.0. The
openaiaddition toopenhands-agent-serverdependencies aligns with the "Add OpenAI chat completions gateway" feature mentioned in commit messages. The version bump is expected given the Apptainer changes. -
Condenser Config: The
condenser_max_tokensandcondenser_max_output_tokensadditions are straightforward CLI/model parameter wiring. No concerns. -
README Documentation: Clear distinction between Option 1 (pre-built) and Option 2 (local SIF build) is helpful for HPC users.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟢 LOW
Additive changes with graceful fallback behavior. No breaking changes to existing workflows. The Apptainer build logic has proper error handling and the local SIF caching prevents redundant builds.
VERDICT:
✅ Worth merging - Core logic is sound, one minor robustness suggestion above.
KEY INSIGHT:
The dual-path approach (registry image → local SIF fallback) with environment-variable control (OPENHANDS_APPTAINER_FORCE_BUILD) is a pragmatic design that balances performance (prefer cached registry images) with flexibility (build locally when needed).
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
|
✅ Review complete. This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here. |
all-hands-bot
left a comment
There was a problem hiding this comment.
Code Review: Add SWE-bench Apptainer image builds and runtime wiring
🟡 Acceptable - Functionally sound implementation with a few areas for improvement.
[CRITICAL ISSUES]
None identified.
[IMPROVEMENT OPPORTUNITIES]
-
[benchmarks/swebench/apptainer_build.py, Line 32-33] Security: The
_sanitize_filenamefunction only allows alphanumeric characters, dots, underscores, and hyphens. Consider adding validation to ensure the final path cannot traverse outside_build_root(). -
[benchmarks/swebench/apptainer_build.py, Lines 163-165] Complexity: The f-string definition file content includes shell scripts. Consider extracting the
%postsection shell script into a separate helper function for readability and testing. -
[benchmarks/swebench/run_infer.py, Lines 139-151] Robustness: The
get_apptainer_mount_dirfunction silently retries on collision. Consider logging when collisions occur for debugging.
[TESTING GAPS]
None - Comprehensive test coverage including: unsupported targets, definition content, forced rebuild, workspace initialization, tokenizer mounts, and collision handling.
[RISK ASSESSMENT]
- [Overall PR]
⚠️ Risk Assessment: 🟢 LOW
This PR adds a fallback mechanism for building Apptainer images locally when registry images are unavailable. The implementation is self-contained and only activates when explicitly needed.
[VERDICT]
✅ Worth merging - Core functionality is solid, tests are comprehensive, and the fallback mechanism provides good resilience.
[KEY INSIGHT]
The Apptainer build fallback is well-designed: it only triggers when registry images are missing, respects environment variables for caching/build forcing, and properly handles SDK version pinning through git SHA hashing.
This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation
Summary
transformersinto built agent-server SIFs so HF chat-template token counting works inside Apptainercustom_tokenizerpaths into Apptainer workspaces and pass condenser token/output limits through the SWE-bench harness/workspacefor Apptainer inference so repo setup does not depend on mutating the SIF filesystem under fakeroot/compat modevendor/software-agent-sdkto43376f18, which is proposed in Fix Apptainer tokenizer binds and token condensation software-agent-sdk#3641Scope
This PR contains Apptainer/HPC runtime and image-building support for SWE-bench. Generic failed-run patch capture is split out into #751 so it can be reviewed independently and applies to Docker as well.
Validation
IMAGE_TAG_PREFIX=test PYTHONPATH="$PWD" /home/gneubig/work/openhands-benchmarks-venv/bin/python -m pytest tests/test_swebench_apptainer_build.py tests/test_swebench_apptainer_mount_dir.py tests/test_condenser_config.py -q/workspacepermission failures are currently observed.Note: the local alternate worktree does not have the SDK submodule checked out, so
IMAGE_TAG_PREFIX=testis used for the focused test command to avoid reading the SDK Dockerfile only for image tag construction.Issue
Closes #746