Add SWE-bench Apptainer image builds and runtime wiring by neubig · Pull Request #745 · OpenHands/benchmarks

neubig · 2026-06-09T21:51:47Z

Summary

add local Apptainer agent-server SIF builds for SWE-bench images when the matching registry image is missing
reuse local SIFs by SDK SHA, Dockerfile content hash, SWE-bench image tag, and build target
install transformers into built agent-server SIFs so HF chat-template token counting works inside Apptainer
bind host custom_tokenizer paths into Apptainer workspaces and pass condenser token/output limits through the SWE-bench harness
bind a per-instance writable host directory onto /workspace for Apptainer inference so repo setup does not depend on mutating the SIF filesystem under fakeroot/compat mode
update vendor/software-agent-sdk to 43376f18, which is proposed in Fix Apptainer tokenizer binds and token condensation software-agent-sdk#3641

Scope

This PR contains Apptainer/HPC runtime and image-building support for SWE-bench. Generic failed-run patch capture is split out into #751 so it can be reviewed independently and applies to Docker as well.

Validation

IMAGE_TAG_PREFIX=test PYTHONPATH="$PWD" /home/gneubig/work/openhands-benchmarks-venv/bin/python -m pytest tests/test_swebench_apptainer_build.py tests/test_swebench_apptainer_mount_dir.py tests/test_condenser_config.py -q
prior validation from the same branch: py_compile, ruff check, and ruff format check on the touched benchmark files
ADP full SWE-bench run is currently using these Apptainer runtime/image changes with 2x L40S and 4 workers; no missing-image or /workspace permission failures are currently observed.

Note: the local alternate worktree does not have the SDK submodule checked out, so IMAGE_TAG_PREFIX=test is used for the focused test command to avoid reading the SDK Dockerfile only for image tag construction.

Issue

Closes #746

neubig · 2026-06-12T13:08:41Z

@OpenHands /review

Please run an AI review for this Apptainer runtime/image-build PR. CI is green and the generic failed-run patch capture has been split into #751.

openhands-ai · 2026-06-12T13:09:26Z

Uh oh! There was an unexpected error starting the job :(

neubig · 2026-06-12T13:11:28Z

@OpenHands /review

openhands-ai · 2026-06-12T13:11:58Z

I'm on it! neubig can track my progress at all-hands.dev

neubig · 2026-06-12T13:27:18Z

@OpenHands /review

openhands-ai · 2026-06-12T13:28:02Z

Uh oh! There was an unexpected error starting the job :(

neubig · 2026-06-12T14:50:28Z

@OpenHands /review

openhands-ai · 2026-06-12T14:50:52Z

I'm on it! neubig can track my progress at all-hands.dev

neubig · 2026-06-12T15:11:56Z

@OpenHands /review

openhands-ai · 2026-06-12T15:12:20Z

I'm on it! neubig can track my progress at all-hands.dev

neubig · 2026-06-12T16:42:15Z

@OpenHands /review

openhands-ai · 2026-06-12T16:42:39Z

I'm on it! neubig can track my progress at all-hands.dev

neubig · 2026-06-12T16:55:50Z

@OpenHands /review

openhands-ai · 2026-06-12T16:56:15Z

I'm on it! neubig can track my progress at all-hands.dev

neubig · 2026-06-12T18:00:45Z

Codex AI Review

Overall: LGTM - Recommend approve

Reviewed current head c301324d3120ef57c78777f4e2fbbec65539b852. The PR is a coherent Apptainer runtime/image-build change set:

Adds local SIF build fallback when the current agent-server image is not available in the registry.
Preserves existing SIFs on failed forced rebuilds and has regression coverage for that behavior.
Wires writable /workspace bind mounts and tokenizer bind mounts for Apptainer workspaces.
Adds condenser/tokenizer settings needed by the evaluated models.
Keeps the generic failed-run patch capture split into Capture SWE-bench patches from failed runs #751.
CI is green: pre-commit and tests both pass.

I do not see remaining blocking issues in this PR.

all-hands-bot · 2026-06-16T19:28:12Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

Code Review: Add SWE-bench Apptainer image builds and runtime wiring

🟡 Acceptable - Solid feature implementation with minor areas for improvement.

Summary

This PR adds local Apptainer SIF builds for SWE-bench evaluation when registry images are unavailable, plus tokenizer and condenser configuration wiring. The feature is well-designed for its intended use case (HPC environments without Docker access).

[CRITICAL ISSUES]

None identified.

[IMPROVEMENT OPPORTUNITIES]

[benchmarks/swebench/run_infer.py, Line 117] Robustness: getattr(self, "current_attempt", 1) defaults to 1, but this attribute may not be initialized in all code paths. Consider validating or documenting this behavior.

[TESTING GAPS]

Tests use FakeApptainerWorkspace to capture constructor arguments - this is acceptable for the integration patterns being tested.
Consider adding a test that verifies the workspace directory is cleaned up after use (if applicable to the workflow).

[OTHER OBSERVATIONS]

Deleted apptainer_eval.py: The removal of local Apptainer evaluation is intentional and documented in the README. This is fine.
SDK Submodule Update: 1.24.0 → 1.27.0 is a significant jump. The SDK commits reference Fix Apptainer tokenizer binds and token condensation, Count tokens with tokenizer chat templates, and Add OpenAI chat completions gateway. The addition of openai to agent-server dependencies is expected based on the chat completions gateway feature. No concerns.
Build Caching Strategy: The SIF caching by (sdk_sha, dockerfile_hash, custom_tag, target) is good - ensures rebuilds when SDK or Dockerfile changes.

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟢 LOW
This PR introduces a fallback build path for HPC environments without Docker. The changes are additive and don't affect existing functionality (registry images are still preferred). The Apptainer build logic has proper error handling and test coverage.

VERDICT:
✅ Worth merging - Core logic is sound, minor improvement suggested above.

KEY INSIGHT:
The multi-package-manager support in _package_install_script() is pragmatic and necessary for HPC environments with varied base images.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

neubig · 2026-06-16T19:59:51Z

@OpenHands /review

Addressed the current_attempt suggestion in 1c1573e; ready for another automated review.

openhands-ai · 2026-06-16T20:00:25Z

I'm on it! neubig can track my progress at all-hands.dev

neubig · 2026-06-16T20:04:41Z

@OpenHands /review

Pushed 1706c8e to fix the CI-only fake evaluation setup; ready for review on the current head.

openhands-ai · 2026-06-16T20:05:07Z

I'm on it! neubig can track my progress at all-hands.dev

all-hands-bot · 2026-06-16T22:06:16Z

🔍 Review in progress...

all-hands-bot · 2026-06-16T22:06:27Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

Code Review: Add SWE-bench Apptainer image builds and runtime wiring

🟢 Good taste - Well-structured feature with elegant fallback design.

Summary

This PR adds local Apptainer SIF builds for HPC environments where Docker registries are unavailable. The implementation is clean: registry images are preferred, and local builds serve as a graceful fallback. The tokenizer bind mounting and workspace directory isolation are solid additions.

[CRITICAL ISSUES]

None identified.

[IMPROVEMENT OPPORTUNITIES]

[benchmarks/swebench/run_infer.py, Line 146] Robustness: os.makedirs(mount_dir, mode=0o700, exist_ok=False) will raise FileExistsError if the directory already exists. Since the mount_dir includes a UUID suffix ({uuid.uuid4().hex[:8]}), collisions should be rare, but in high-concurrency scenarios or quick retries, this could fail unexpectedly. Consider using exist_ok=True as a precaution, or at minimum catch and handle FileExistsError gracefully.

[STYLE NOTES]

The multi-package-manager shell script in _package_install_script() is intentionally verbose for portability. Acceptable.
The 326-line apptainer_build.py is on the larger side but the organization is logical. Consider splitting if it grows further.

[TESTING GAPS]

None identified. The test coverage for:

Unsupported build targets returning errors
Definition file content verification
Forced rebuild preserving existing SIFs
Workspace/caching behavior with monkeypatching

...is appropriate for this feature.

[OTHER OBSERVATIONS]

SDK Submodule Update: 1.24.0 → 1.27.0. The openai addition to openhands-agent-server dependencies aligns with the "Add OpenAI chat completions gateway" feature mentioned in commit messages. The version bump is expected given the Apptainer changes.
Condenser Config: The condenser_max_tokens and condenser_max_output_tokens additions are straightforward CLI/model parameter wiring. No concerns.
README Documentation: Clear distinction between Option 1 (pre-built) and Option 2 (local SIF build) is helpful for HPC users.

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟢 LOW
Additive changes with graceful fallback behavior. No breaking changes to existing workflows. The Apptainer build logic has proper error handling and the local SIF caching prevents redundant builds.

VERDICT:
✅ Worth merging - Core logic is sound, one minor robustness suggestion above.

KEY INSIGHT:
The dual-path approach (registry image → local SIF fallback) with environment-variable control (OPENHANDS_APPTAINER_FORCE_BUILD) is a pragmatic design that balances performance (prefer cached registry images) with flexibility (build locally when needed).

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

all-hands-bot · 2026-06-16T22:14:21Z

✅ Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

all-hands-bot

Code Review: Add SWE-bench Apptainer image builds and runtime wiring

🟡 Acceptable - Functionally sound implementation with a few areas for improvement.

[CRITICAL ISSUES]

None identified.

[IMPROVEMENT OPPORTUNITIES]

[benchmarks/swebench/apptainer_build.py, Line 32-33] Security: The _sanitize_filename function only allows alphanumeric characters, dots, underscores, and hyphens. Consider adding validation to ensure the final path cannot traverse outside _build_root().
[benchmarks/swebench/apptainer_build.py, Lines 163-165] Complexity: The f-string definition file content includes shell scripts. Consider extracting the %post section shell script into a separate helper function for readability and testing.
[benchmarks/swebench/run_infer.py, Lines 139-151] Robustness: The get_apptainer_mount_dir function silently retries on collision. Consider logging when collisions occur for debugging.

[TESTING GAPS]

None - Comprehensive test coverage including: unsupported targets, definition content, forced rebuild, workspace initialization, tokenizer mounts, and collision handling.

[RISK ASSESSMENT]

[Overall PR] ⚠️ Risk Assessment: 🟢 LOW

This PR adds a fallback mechanism for building Apptainer images locally when registry images are unavailable. The implementation is self-contained and only activates when explicitly needed.

[VERDICT]

✅ Worth merging - Core functionality is solid, tests are comprehensive, and the fallback mechanism provides good resilience.

[KEY INSIGHT]

The Apptainer build fallback is well-designed: it only triggers when registry images are missing, respects environment variables for caching/build forcing, and properly handles SDK version pinning through git SHA hashing.

This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

Add local Apptainer SWE-bench image builds

0ffb6fd

neubig mentioned this pull request Jun 10, 2026

Track PR #745: Add local Apptainer SWE-bench image builds #746

Open

Wire SWE-bench Apptainer tokenizer and condenser settings

0afb8c5

neubig changed the title ~~Add local Apptainer SWE-bench image builds~~ Add SWE-bench Apptainer image builds and tokenizer wiring Jun 10, 2026

neubig added 2 commits June 10, 2026 15:49

Fix Apptainer SWE-bench CI checks

5b5901c

Use SIFs for local SWE-bench Apptainer builds

d1c3162

neubig mentioned this pull request Jun 12, 2026

Capture SWE-bench patches from failed runs #751

Draft

Bind writable Apptainer workspaces for SWE-bench

d82cf71

neubig changed the title ~~Add SWE-bench Apptainer image builds and tokenizer wiring~~ Add SWE-bench Apptainer image builds and runtime wiring Jun 12, 2026

Preserve Apptainer SIF on failed rebuild

c301324

neubig marked this pull request as ready for review June 16, 2026 19:25

neubig added the review-this label Jun 16, 2026

all-hands-bot reviewed Jun 16, 2026

View reviewed changes

Comment thread benchmarks/swebench/run_infer.py

chore: address PR review feedback (#745)

1c1573e

fix: CI failure on PR #745

1706c8e

neubig added review-this and removed review-this labels Jun 16, 2026

all-hands-bot approved these changes Jun 16, 2026

View reviewed changes

Comment thread benchmarks/swebench/run_infer.py

chore: address Apptainer workspace review feedback (#745)

7a0658a

neubig added review-this and removed review-this labels Jun 16, 2026

all-hands-bot reviewed Jun 16, 2026

View reviewed changes

Comment thread benchmarks/swebench/apptainer_build.py

Comment thread benchmarks/swebench/apptainer_build.py

Conversation

neubig commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Scope

Validation

Issue

Uh oh!

neubig commented Jun 12, 2026

Uh oh!

openhands-ai Bot commented Jun 12, 2026

Uh oh!

neubig commented Jun 12, 2026

Uh oh!

openhands-ai Bot commented Jun 12, 2026

Uh oh!

neubig commented Jun 12, 2026

Uh oh!

openhands-ai Bot commented Jun 12, 2026

Uh oh!

neubig commented Jun 12, 2026

Uh oh!

openhands-ai Bot commented Jun 12, 2026

Uh oh!

neubig commented Jun 12, 2026

Uh oh!

openhands-ai Bot commented Jun 12, 2026

Uh oh!

neubig commented Jun 12, 2026

Uh oh!

openhands-ai Bot commented Jun 12, 2026

Uh oh!

neubig commented Jun 12, 2026

Uh oh!

openhands-ai Bot commented Jun 12, 2026

Uh oh!

neubig commented Jun 12, 2026

Codex AI Review

Overall: LGTM - Recommend approve

Uh oh!

all-hands-bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Code Review: Add SWE-bench Apptainer image builds and runtime wiring

Summary

Uh oh!

Uh oh!

neubig commented Jun 16, 2026

Uh oh!

openhands-ai Bot commented Jun 16, 2026

Uh oh!

neubig commented Jun 16, 2026

Uh oh!

openhands-ai Bot commented Jun 16, 2026

Uh oh!

all-hands-bot commented Jun 16, 2026

Uh oh!

all-hands-bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Code Review: Add SWE-bench Apptainer image builds and runtime wiring

Summary

Uh oh!

Uh oh!

all-hands-bot commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

Code Review: Add SWE-bench Apptainer image builds and runtime wiring

[CRITICAL ISSUES]

[IMPROVEMENT OPPORTUNITIES]

[TESTING GAPS]

[RISK ASSESSMENT]

[VERDICT]

[KEY INSIGHT]

neubig commented Jun 9, 2026 •

edited

Loading

all-hands-bot commented Jun 16, 2026 •

edited

Loading

all-hands-bot commented Jun 16, 2026 •

edited

Loading

all-hands-bot commented Jun 16, 2026 •

edited

Loading