Skip to content

Add SWE-bench Apptainer image builds and runtime wiring#745

Open
neubig wants to merge 9 commits into
mainfrom
add-swebench-apptainer-build
Open

Add SWE-bench Apptainer image builds and runtime wiring#745
neubig wants to merge 9 commits into
mainfrom
add-swebench-apptainer-build

Conversation

@neubig

@neubig neubig commented Jun 9, 2026

Copy link
Copy Markdown
Member

Summary

  • add local Apptainer agent-server SIF builds for SWE-bench images when the matching registry image is missing
  • reuse local SIFs by SDK SHA, Dockerfile content hash, SWE-bench image tag, and build target
  • install transformers into built agent-server SIFs so HF chat-template token counting works inside Apptainer
  • bind host custom_tokenizer paths into Apptainer workspaces and pass condenser token/output limits through the SWE-bench harness
  • bind a per-instance writable host directory onto /workspace for Apptainer inference so repo setup does not depend on mutating the SIF filesystem under fakeroot/compat mode
  • update vendor/software-agent-sdk to 43376f18, which is proposed in Fix Apptainer tokenizer binds and token condensation software-agent-sdk#3641

Scope

This PR contains Apptainer/HPC runtime and image-building support for SWE-bench. Generic failed-run patch capture is split out into #751 so it can be reviewed independently and applies to Docker as well.

Validation

  • IMAGE_TAG_PREFIX=test PYTHONPATH="$PWD" /home/gneubig/work/openhands-benchmarks-venv/bin/python -m pytest tests/test_swebench_apptainer_build.py tests/test_swebench_apptainer_mount_dir.py tests/test_condenser_config.py -q
  • prior validation from the same branch: py_compile, ruff check, and ruff format check on the touched benchmark files
  • ADP full SWE-bench run is currently using these Apptainer runtime/image changes with 2x L40S and 4 workers; no missing-image or /workspace permission failures are currently observed.

Note: the local alternate worktree does not have the SDK submodule checked out, so IMAGE_TAG_PREFIX=test is used for the focused test command to avoid reading the SDK Dockerfile only for image tag construction.

Issue

Closes #746

@neubig neubig changed the title Add local Apptainer SWE-bench image builds Add SWE-bench Apptainer image builds and tokenizer wiring Jun 10, 2026
@neubig neubig changed the title Add SWE-bench Apptainer image builds and tokenizer wiring Add SWE-bench Apptainer image builds and runtime wiring Jun 12, 2026
@neubig

neubig commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

@OpenHands /review

Please run an AI review for this Apptainer runtime/image-build PR. CI is green and the generic failed-run patch capture has been split into #751.

@openhands-ai

openhands-ai Bot commented Jun 12, 2026

Copy link
Copy Markdown

Uh oh! There was an unexpected error starting the job :(

@neubig

neubig commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

@OpenHands /review

@openhands-ai

openhands-ai Bot commented Jun 12, 2026

Copy link
Copy Markdown

I'm on it! neubig can track my progress at all-hands.dev

@neubig

neubig commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

@OpenHands /review

@openhands-ai

openhands-ai Bot commented Jun 12, 2026

Copy link
Copy Markdown

Uh oh! There was an unexpected error starting the job :(

@neubig

neubig commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

@OpenHands /review

@openhands-ai

openhands-ai Bot commented Jun 12, 2026

Copy link
Copy Markdown

I'm on it! neubig can track my progress at all-hands.dev

@neubig

neubig commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

@OpenHands /review

@openhands-ai

openhands-ai Bot commented Jun 12, 2026

Copy link
Copy Markdown

I'm on it! neubig can track my progress at all-hands.dev

@neubig

neubig commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

@OpenHands /review

@openhands-ai

openhands-ai Bot commented Jun 12, 2026

Copy link
Copy Markdown

I'm on it! neubig can track my progress at all-hands.dev

@neubig

neubig commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

@OpenHands /review

@openhands-ai

openhands-ai Bot commented Jun 12, 2026

Copy link
Copy Markdown

I'm on it! neubig can track my progress at all-hands.dev

@neubig

neubig commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

Codex AI Review

Overall: LGTM - Recommend approve

Reviewed current head c301324d3120ef57c78777f4e2fbbec65539b852. The PR is a coherent Apptainer runtime/image-build change set:

  • Adds local SIF build fallback when the current agent-server image is not available in the registry.
  • Preserves existing SIFs on failed forced rebuilds and has regression coverage for that behavior.
  • Wires writable /workspace bind mounts and tokenizer bind mounts for Apptainer workspaces.
  • Adds condenser/tokenizer settings needed by the evaluated models.
  • Keeps the generic failed-run patch capture split into Capture SWE-bench patches from failed runs #751.
  • CI is green: pre-commit and tests both pass.

I do not see remaining blocking issues in this PR.

@neubig neubig marked this pull request as ready for review June 16, 2026 19:25

all-hands-bot commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: Add SWE-bench Apptainer image builds and runtime wiring

🟡 Acceptable - Solid feature implementation with minor areas for improvement.

Summary

This PR adds local Apptainer SIF builds for SWE-bench evaluation when registry images are unavailable, plus tokenizer and condenser configuration wiring. The feature is well-designed for its intended use case (HPC environments without Docker access).


[CRITICAL ISSUES]

None identified.


[IMPROVEMENT OPPORTUNITIES]

  • [benchmarks/swebench/run_infer.py, Line 117] Robustness: getattr(self, "current_attempt", 1) defaults to 1, but this attribute may not be initialized in all code paths. Consider validating or documenting this behavior.

[TESTING GAPS]

  • Tests use FakeApptainerWorkspace to capture constructor arguments - this is acceptable for the integration patterns being tested.
  • Consider adding a test that verifies the workspace directory is cleaned up after use (if applicable to the workflow).

[OTHER OBSERVATIONS]

  1. Deleted apptainer_eval.py: The removal of local Apptainer evaluation is intentional and documented in the README. This is fine.

  2. SDK Submodule Update: 1.24.0 → 1.27.0 is a significant jump. The SDK commits reference Fix Apptainer tokenizer binds and token condensation, Count tokens with tokenizer chat templates, and Add OpenAI chat completions gateway. The addition of openai to agent-server dependencies is expected based on the chat completions gateway feature. No concerns.

  3. Build Caching Strategy: The SIF caching by (sdk_sha, dockerfile_hash, custom_tag, target) is good - ensures rebuilds when SDK or Dockerfile changes.


[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟢 LOW
    This PR introduces a fallback build path for HPC environments without Docker. The changes are additive and don't affect existing functionality (registry images are still preferred). The Apptainer build logic has proper error handling and test coverage.

VERDICT:
Worth merging - Core logic is sound, minor improvement suggested above.

KEY INSIGHT:
The multi-package-manager support in _package_install_script() is pragmatic and necessary for HPC environments with varied base images.


This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

Comment thread benchmarks/swebench/run_infer.py
@neubig

neubig commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

@OpenHands /review

Addressed the current_attempt suggestion in 1c1573e; ready for another automated review.

@openhands-ai

openhands-ai Bot commented Jun 16, 2026

Copy link
Copy Markdown

I'm on it! neubig can track my progress at all-hands.dev

@neubig

neubig commented Jun 16, 2026

Copy link
Copy Markdown
Member Author

@OpenHands /review

Pushed 1706c8e to fix the CI-only fake evaluation setup; ready for review on the current head.

@openhands-ai

openhands-ai Bot commented Jun 16, 2026

Copy link
Copy Markdown

I'm on it! neubig can track my progress at all-hands.dev

Copy link
Copy Markdown
Collaborator

🔍 Review in progress...

all-hands-bot commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: Add SWE-bench Apptainer image builds and runtime wiring

🟢 Good taste - Well-structured feature with elegant fallback design.

Summary

This PR adds local Apptainer SIF builds for HPC environments where Docker registries are unavailable. The implementation is clean: registry images are preferred, and local builds serve as a graceful fallback. The tokenizer bind mounting and workspace directory isolation are solid additions.


[CRITICAL ISSUES]

None identified.


[IMPROVEMENT OPPORTUNITIES]

  • [benchmarks/swebench/run_infer.py, Line 146] Robustness: os.makedirs(mount_dir, mode=0o700, exist_ok=False) will raise FileExistsError if the directory already exists. Since the mount_dir includes a UUID suffix ({uuid.uuid4().hex[:8]}), collisions should be rare, but in high-concurrency scenarios or quick retries, this could fail unexpectedly. Consider using exist_ok=True as a precaution, or at minimum catch and handle FileExistsError gracefully.

[STYLE NOTES]

  • The multi-package-manager shell script in _package_install_script() is intentionally verbose for portability. Acceptable.
  • The 326-line apptainer_build.py is on the larger side but the organization is logical. Consider splitting if it grows further.

[TESTING GAPS]

None identified. The test coverage for:

  • Unsupported build targets returning errors
  • Definition file content verification
  • Forced rebuild preserving existing SIFs
  • Workspace/caching behavior with monkeypatching

...is appropriate for this feature.


[OTHER OBSERVATIONS]

  1. SDK Submodule Update: 1.24.0 → 1.27.0. The openai addition to openhands-agent-server dependencies aligns with the "Add OpenAI chat completions gateway" feature mentioned in commit messages. The version bump is expected given the Apptainer changes.

  2. Condenser Config: The condenser_max_tokens and condenser_max_output_tokens additions are straightforward CLI/model parameter wiring. No concerns.

  3. README Documentation: Clear distinction between Option 1 (pre-built) and Option 2 (local SIF build) is helpful for HPC users.


[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟢 LOW
    Additive changes with graceful fallback behavior. No breaking changes to existing workflows. The Apptainer build logic has proper error handling and the local SIF caching prevents redundant builds.

VERDICT:
Worth merging - Core logic is sound, one minor robustness suggestion above.

KEY INSIGHT:
The dual-path approach (registry image → local SIF fallback) with environment-variable control (OPENHANDS_APPTAINER_FORCE_BUILD) is a pragmatic design that balances performance (prefer cached registry images) with flexibility (build locally when needed).


This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

Comment thread benchmarks/swebench/run_infer.py

all-hands-bot commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

Review complete.

This review was performed through OpenHands Cloud Automation. You can log in and view the conversation here.

@all-hands-bot all-hands-bot left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: Add SWE-bench Apptainer image builds and runtime wiring

🟡 Acceptable - Functionally sound implementation with a few areas for improvement.


[CRITICAL ISSUES]

None identified.


[IMPROVEMENT OPPORTUNITIES]

  • [benchmarks/swebench/apptainer_build.py, Line 32-33] Security: The _sanitize_filename function only allows alphanumeric characters, dots, underscores, and hyphens. Consider adding validation to ensure the final path cannot traverse outside _build_root().

  • [benchmarks/swebench/apptainer_build.py, Lines 163-165] Complexity: The f-string definition file content includes shell scripts. Consider extracting the %post section shell script into a separate helper function for readability and testing.

  • [benchmarks/swebench/run_infer.py, Lines 139-151] Robustness: The get_apptainer_mount_dir function silently retries on collision. Consider logging when collisions occur for debugging.


[TESTING GAPS]

None - Comprehensive test coverage including: unsupported targets, definition content, forced rebuild, workspace initialization, tokenizer mounts, and collision handling.


[RISK ASSESSMENT]

  • [Overall PR] ⚠️ Risk Assessment: 🟢 LOW

This PR adds a fallback mechanism for building Apptainer images locally when registry images are unavailable. The implementation is self-contained and only activates when explicitly needed.


[VERDICT]

Worth merging - Core functionality is solid, tests are comprehensive, and the fallback mechanism provides good resilience.

[KEY INSIGHT]

The Apptainer build fallback is well-designed: it only triggers when registry images are missing, respects environment variables for caching/build forcing, and properly handles SDK version pinning through git SHA hashing.


This review was generated by an AI agent (OpenHands) on behalf of the user through OpenHands Automation. View conversation

Comment thread benchmarks/swebench/apptainer_build.py
Comment thread benchmarks/swebench/apptainer_build.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Track PR #745: Add local Apptainer SWE-bench image builds

2 participants