Agent Diagnostic
Investigated using the `create-spike` skill backed by `principal-engineer-reviewer`. Skills loaded: `debug-openshell-cluster`, `openshell-cli`, `create-spike`.
Investigation summary:
- Traced `sandbox create --from` through `run.rs → build.rs → push.rs`
- Confirmed three discrete full-image heap allocations in `push_local_images()`
- Verified bollard's `upload_to_container` API supports streaming via `body_try_stream`; no API constraint blocks the fix
- Confirmed the `tar` crate's size-in-header constraint requires a seekable (disk) intermediate
- Identified that `tempfile` is already a dev-dep; it needs promotion to a prod dep
- No test coverage exists for `push.rs`
Description
Running `openshell sandbox create --from <dockerfile-dir>` on a large sandbox image (~3.7 GB) causes the `openshell` process to be killed by the Linux OOM killer. The image export pipeline in `crates/openshell-bootstrap/src/push.rs` buffers the entire Docker image tar three times in memory before importing it into the gateway, producing ~11 GB peak allocation.
Expected: The export → tar wrap → upload pipeline streams data in O(chunk) memory.
Actual: ~3× image size held in memory simultaneously → OOM kill on images ≥ ~4 GB on machines with ≤ 16 GB free RAM.
Root Cause
Data flow
```text
sandbox create --from ./my-image/
└─ run.rs:build_from_dockerfile()
   └─ build.rs:build_and_push_image()
      ├─ build_image()         ← streaming, no issue
      └─ push_local_images()
         ├─ collect_export()   ← BUFFER 1: 3.7 GB Vec<u8>
         ├─ wrap_in_tar()      ← BUFFER 2: 3.7 GB Vec<u8> (both live simultaneously)
         └─ upload_archive()   ← BUFFER 3: 3.7 GB Bytes::copy_from_slice
```
Buffer 1 — collect_export (push.rs:97–107)
```rust
async fn collect_export(docker: &Docker, images: &[&str]) -> Result<Vec<u8>> {
    let mut stream = docker.export_images(images); // already a Stream<Item = Result<Bytes>>
    let mut buf = Vec::new();
    while let Some(chunk) = stream.next().await {
        buf.extend_from_slice(&chunk?); // ← 3.7 GB accumulates here
    }
    Ok(buf)
}
```
Buffer 2 — wrap_in_tar (push.rs:114–131)
```rust
fn wrap_in_tar(file_path: &str, data: &[u8]) -> Result<Vec<u8>> {
    let mut builder = tar::Builder::new(Vec::new());
    let mut header = tar::Header::new_gnu();
    header.set_path(file_path)?;
    header.set_size(data.len() as u64);
    header.set_cksum();
    builder.append(&header, data)?; // ← copies full 3.7 GB into a new Vec<u8>
    Ok(builder.into_inner()?)       // ← returns the second 3.7 GB allocation
}
```
Both `image_tar` (Buffer 1) and `outer_tar` (Buffer 2) are live simultaneously at push.rs:54. Peak ~7.4 GB at this point.
Buffer 3 — upload_archive (push.rs:135–151)
```rust
docker.upload_to_container(
    container_name,
    Some(options),
    bollard::body_full(Bytes::copy_from_slice(archive)), // ← third 3.7 GB copy
)
```
All three allocations overlap in scope. Peak ~11 GB.
| Step | Allocation | Size |
| --- | --- | --- |
| `collect_export` | `Vec<u8>` (image tar) | ~3.7 GB |
| `wrap_in_tar` | `Vec<u8>` (outer tar) | ~3.7 GB |
| `upload_archive` | `Bytes` (HTTP body) | ~3.7 GB |
| **Total peak** | | **~11 GB** |
Streaming Fix: Feasibility Confirmed
Bollard supports streaming
`upload_to_container` accepts a `BodyType`, which has three constructors:
| Constructor | Notes |
| --- | --- |
| `bollard::body_full(Bytes)` | Fully materialized body; the current call, causes Buffer 3 |
| `bollard::body_stream(impl Stream<Item = Bytes>)` | Infallible stream |
| `bollard::body_try_stream(impl Stream<Item = Result<Bytes, io::Error>>)` | Fallible stream; best fit here |
There is no API constraint blocking streaming.
Tar size-in-header constraint requires temp file
The POSIX tar format encodes the entry byte size in the 512-byte header, written before the data. The `tar` crate's `append_writer` escape hatch requires a seekable destination. This means:
- A fully in-memory zero-copy pipeline is not feasible without knowing the export size in advance
- Writing the raw export to a temp file first, then constructing the outer tar header from the known file size, then streaming the header + file chunks is the correct approach
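To make the size-in-header constraint concrete, here is a std-only sketch of the POSIX ustar header layout: both the 12-byte octal size field and the checksum over the whole 512-byte block must be finalized before the first data byte is emitted. `ustar_header`, `write_octal`, and `tar_padding` are illustrative names, not code from `push.rs`; the real fix should keep using `tar::Header` rather than hand-rolling this.

```rust
// Hypothetical helpers (not from push.rs) illustrating the ustar layout.

/// Write `v` as zero-padded ASCII octal into the first `digits` bytes of `dst`.
fn write_octal(dst: &mut [u8], v: u64, digits: usize) {
    let s = format!("{:0width$o}", v, width = digits);
    dst[..digits].copy_from_slice(s.as_bytes());
}

/// Build a 512-byte ustar header for a regular file of known `size`.
fn ustar_header(name: &str, size: u64) -> [u8; 512] {
    assert!(name.len() <= 100, "ustar name field is 100 bytes");
    let mut h = [0u8; 512];
    h[..name.len()].copy_from_slice(name.as_bytes()); // name
    h[100..107].copy_from_slice(b"0000644");          // mode (octal)
    write_octal(&mut h[108..], 0, 7);                 // uid
    write_octal(&mut h[116..], 0, 7);                 // gid
    write_octal(&mut h[124..], size, 11);             // size: must be known up front
    write_octal(&mut h[136..], 0, 11);                // mtime
    h[156] = b'0';                                    // typeflag: regular file
    h[257..263].copy_from_slice(b"ustar\0");          // magic
    h[263..265].copy_from_slice(b"00");               // version
    // The checksum is computed over the header with the chksum field treated
    // as spaces, so it too is fixed before any data can be streamed.
    h[148..156].copy_from_slice(b"        ");
    let sum: u64 = h.iter().map(|&b| u64::from(b)).sum();
    write_octal(&mut h[148..], sum, 6);
    h[154] = 0;   // chksum terminator: NUL then space
    h[155] = b' ';
    h
}

/// Entry data is zero-padded to the next 512-byte boundary.
fn tar_padding(size: u64) -> usize {
    ((512 - size % 512) % 512) as usize
}
```

Because the header is immutable once written, the only way to stream an export of unknown length is to land it on disk first, stat it, and then emit header + chunks + padding, which is exactly the temp-file pipeline proposed in the next section.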
Proposed streaming pipeline
```text
export_images() stream
      │ (chunked Bytes)
      ▼
write to NamedTempFile            ← sync via spawn_blocking or async write
      │ (flush + seek to start)
      ▼
stat file → build 512-byte tar header in memory (set_size = file.len())
      │
      ▼
Stream: [header_bytes] ++ [file content chunks] ++ [tar padding]
      │ (impl Stream<Item = Result<Bytes, io::Error>>)
      ▼
bollard::body_try_stream(...)
      ▼
upload_to_container()             ← single pass, O(chunk) RAM
```
Peak memory drops from ~11 GB to O(stream chunk size) + one temp file on disk (~3.7 GB disk, ~few MB RAM).
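The single-pass composition can be sketched synchronously with std only; the actual fix would express the same header ++ file chunks ++ padding sequence as an `impl Stream<Item = Result<Bytes, io::Error>>` (e.g. a `tokio_util::io::ReaderStream` chained with the header and padding) and hand it to `bollard::body_try_stream`. `stream_tar_entry` and the 8 KiB chunk size are illustrative assumptions, not existing code.

```rust
use std::io::{self, Read, Write};

/// Illustrative sketch: emit one tar entry (prebuilt 512-byte header, file
/// contents in fixed-size chunks, zero padding to a 512-byte boundary) in a
/// single pass. Memory use is bounded by the chunk buffer, not the file size.
fn stream_tar_entry<R: Read, W: Write>(
    mut src: R,
    header: &[u8; 512],
    dst: &mut W,
) -> io::Result<u64> {
    dst.write_all(header)?;
    let mut buf = [0u8; 8192]; // chunk size is an arbitrary choice
    let mut copied: u64 = 0;
    loop {
        let n = src.read(&mut buf)?;
        if n == 0 {
            break;
        }
        dst.write_all(&buf[..n])?;
        copied += n as u64;
    }
    // tar entry data is zero-padded to the next 512-byte boundary
    let pad = ((512 - copied % 512) % 512) as usize;
    dst.write_all(&[0u8; 512][..pad])?;
    Ok(copied)
}
```

In the async version, `src` would be the temp file opened as `tokio::fs::File`, `dst` would disappear in favor of yielding each chunk as `Bytes`, and the error type would flow through the stream items.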
Dependencies
| Primitive | Status |
| --- | --- |
| `tempfile::NamedTempFile` | Promote from dev-dep to prod dep in `Cargo.toml` |
| `tokio::fs::File` + `AsyncReadExt` | Available (tokio is already a dep) |
| `futures::stream::try_unfold` | Available (futures is already a dep) |
| `tokio_util::io::ReaderStream` | Optional (already in the lockfile as a transitive dep; cleaner ergonomics) |
Design Questions
- **Single vs. two temp files:** Writing the raw export to file A, then the outer tar to file B, keeps the logic simple. Writing the export to file A, constructing the 512-byte header in memory, then streaming `header_bytes ++ file_chunks ++ padding` avoids file B entirely. The latter is preferable: no second disk write and no second ~3.7 GB of disk space.
- **Should `OPENSHELL_SCRATCH_DIR` be supported?** Some CI environments mount `/tmp` as tmpfs (in RAM), defeating the purpose. An env-var override is a nice-to-have but not blocking for v1; users can redirect `$TMPDIR`.
- **Disk space pre-check:** A proactive check (`docker image inspect` → `VirtualSize` vs. available disk) before starting the export would give a clear error instead of a cryptic OS write failure. Nice-to-have.
- **`build.rs` build context:** `create_build_context_tar` materializes the Dockerfile build context into a `Vec<u8>`. Same pattern, much smaller scale (MBs, not GBs). Out of scope for this fix but worth tracking as a follow-up.
Reproduction Steps
- Build a large sandbox image (≥ 3 GB uncompressed), e.g.:
  `openshell sandbox create --from sandboxes/gemini`
- Observe:
  ```
  [progress] Exported 3745 MiB
  Killed
  ```
- Exit code: `137` (SIGKILL from the OOM killer)
Environment
- Image size: ~3.7 GB (base sandbox + `@google/gemini-cli@0.34.0`)
- Host RAM: 23 GB total, ~13 GB available at time of failure
- Swap: 8 GB total, ~12 MB free at time of failure
- OOM killer log: `Out of memory: Killed process (openshell) total-vm:19913320kB, anon-rss:13626992kB`
Scope
| Dimension | Assessment |
| --- | --- |
| Type | perf / bug: correct behavior, resource-fatal on large inputs |
| Complexity | Medium: clear fix, async streaming around a sync tar with the size-in-header constraint |
| Confidence | High: no API blockers, the temp-file path is proven |
| Primary files | `push.rs` (logic), `Cargo.toml` (promote `tempfile` to a prod dep) |
| Tests needed | Unit tests for streaming helpers (tar header construction, stream reassembly); no Docker daemon required |
Agent-First Checklist
- Skills loaded (`debug-openshell-cluster`, `openshell-cli`, `create-spike`)