fix(engine): respect cgroup memory limits in low-memory detection by miguel-heygen · Pull Request #1373 · heygen-com/hyperframes

miguel-heygen · 2026-06-12T14:55:54Z

Problem

getSystemTotalMb() returned os.totalmem() — the host's physical RAM, not the container limit. A 4 GB Docker container on a 32 GB host never auto-flagged as low-memory, so the low-memory render profile (frame-cache sizing, Chrome heap flags, worker count) failed to activate exactly where it's needed most: Docker, the documented fallback when host rendering fails. The blind spot was already documented in the module's own comment; it produced three identical Docker-timeout reports in a single day (#1193, #1194, #1195) plus #1236, with PRODUCER_LOW_MEMORY_MODE as the only escape hatch.

Fix

Read the cgroup v2 limit (/sys/fs/cgroup/memory.max) with a v1 fallback (memory.limit_in_bytes, its no-limit sentinel ignored via a 2^60-byte cutoff), parse via a pure exported function, and return min(host, cgroup).
Best-effort probe: any unreadable cgroup file degrades to host-RAM detection — it can never fail the render pipeline. Non-Linux platforms never touch /sys.
Public API unchanged; config.ts, browserManager.ts, and parallelCoordinator.ts inherit the fix through getSystemTotalMb(). The now-outdated caveat comment is updated, keeping the PRODUCER_LOW_MEMORY_MODE guidance for environments without readable cgroup files.

Tests

Full matrix: v2 numeric limit honored, v2 max ignored, v1 numeric honored, v1 sentinel ignored, absent files → host value, malformed content → host value, limit larger than host → host wins, v2 precedence over v1, and non-Linux platforms never read /sys (asserted via a throwing mock). The headline repro: 4 GiB cgroup limit on a 32 GiB host now flags low-memory.

Engine suite 715/715, typecheck clean, oxlint/oxfmt clean.

Refs #1193, #1194, #1195, #1236

james-russo-rames-d-jusso

Solid fix, clean tests, full cgroup v1/v2 sentinel matrix covered. Low-memory render profile will finally activate inside the documented Docker fallback path. No blockers. Two concerns worth a tweak before merge (observability gap on cgroup-read failure, and getSystemTotalMb() now does sync FS reads on every call with no caching across ~7 production call sites), plus a few small nits.

Concerns (non-blocking but worth addressing):

packages/engine/src/services/systemMemory.ts:80-88 — readCgroupFile() swallows every error path silently. The PR body emphasizes "best-effort probe… can never fail the render pipeline" which is correct, but the rubric on this file is "if detection degrades, we should know." Today: a misconfigured container where /sys/fs/cgroup/memory.max exists but is EACCES → silently falls back to os.totalmem(), identical observable behavior to a host with no cgroup at all. The repro from the PR body (4 GiB container on 32 GiB host) would silently not engage low-memory mode and we'd be back to the #1193/#1194/#1195 timeout shape with no log signal. Suggest a one-shot log.warn (debug-level fine) on first failure, gated to fire at most once per process — keeps the never-fail-pipeline guarantee while giving oncall a needle to grep.
packages/engine/src/services/systemMemory.ts:14-20 (getSystemTotalMb) — no module-level cache. Production call sites I traced: config.ts:242, 249, 283 (via isLowMemorySystem default arg), browserManager.ts:520, 527, renderOrchestrator.ts:1702 — that's 5-6 sync readFileSync calls per resolveConfig() + per-job log. Sysfs is fast (~µs each) so this isn't a render-perf regression, but the cgroup limit is invariant for the process lifetime — a let cached: number | null = null memo at module scope would eliminate the repeated reads with zero behavior change. Same shape probeNvidiaVramMb already uses via _cachedVramMb at browserManager.ts:507, so there's a precedent in the codebase.

Nits:

packages/engine/src/services/systemMemory.ts:16 — CGROUP_V1_NO_LIMIT_CUTOFF_BYTES = 2n ** 60n (~1.15 EiB) is a clever sentinel-catcher (handles 9223372036854771712, LONG_MAX, and unsigned -1 all at once), but the rationale isn't in a comment. A future reader will wonder why 2^60 specifically vs the literal sentinel. One-liner: // Catches the canonical v1 "no limit" sentinel (~8 EiB) plus LONG_MAX / unsigned -1 variants; well above any plausible real container limit.
PR body says parallelCoordinator.ts… inherits the fix through getSystemTotalMb(). I grepped both packages/engine/src/services/parallelCoordinator.ts and packages/producer/src/services/parallelCoordinator.ts — neither imports systemMemory or calls getSystemTotalMb/isLowMemorySystem directly. They presumably get it via cfg.lowMemoryMode flowing through, which is fine but the body's phrasing implies a direct call. Trivial, but worth fixing for the next person reading the PR for context.
systemMemory.ts:43-47 (the updated caveat comment) — "should set PRODUCER_LOW_MEMORY_MODE explicitly rather than relying on auto-detection" still tells Lambda users to set the env var, which is now only true for the no-readable-cgroup case (your fix). Reads slightly stale — Lambda actually does expose cgroup-v2 these days. Minor wording cleanup, optional.

Things NOT in the diff worth checking:

No log emission anywhere when cgroup IS read successfully. When the fix engages (cgroup limit < host RAM, low-memory mode activates), the existing renderOrchestrator.ts:1697-1703 log fires with totalMemMb — but that just shows the final number, not where it came from. Adding source: "cgroup_v2" | "cgroup_v1" | "host" would make the next "why didn't low-memory engage on this container?" investigation trivial. Especially important since the whole motivation here is "we couldn't tell which detection path was running."
Edge case not in the test matrix: container with cgroup file present but 0 bytes content. Some sandboxes / chroots create the file but never populate it. The current parsePositiveByteLimitMb treats empty string as malformed (!/^\d+$/.test("") → returns null → falls back to host). Good behavior, but it's only tested via the malformed-"" parametrized case at systemMemory.test.ts:204-219, not the full getSystemTotalMb flow with both files-empty. Belt-and-suspenders nice-to-have, not a real gap.
Memory-cgroup nesting via systemd-managed user slices. On dev laptops running rootless Docker via systemd-user, the path becomes /sys/fs/cgroup/user.slice/user-1000.slice/.../memory.max, not the canonical paths the diff reads. The fix correctly falls back to host RAM (which is what dev laptops want anyway, since the threshold catches them), so this isn't a regression — but if anyone later tries to "improve" this to read /proc/self/cgroup to find the right path, that's a classic foot-gun (the v1 controller line vs v2 unified line format differ, format changes across kernel versions). Worth a code comment near CGROUP_V2_MEMORY_MAX_PATH saying "we deliberately read the well-known paths, not /proc/self/cgroup-derived paths, to avoid the parsing complexity."
Failing CI shard (regression, regression-shards (shard-5)) — both look unrelated to this diff (touches no rendering / regression-suite code), but worth a glance at the shard-5 log to confirm before merge so it doesn't get retconned into "memory detection broke regression."
Determinism contract — verified, no concern. All 7 production callers are config-resolution / per-job startup. Nothing render-hot calls getSystemTotalMb, so the new sync FS read can't leak into render-time and doesn't violate the no-IO-in-render-path contract.

Review by Rames D Jusso

miguel-heygen · 2026-06-12T15:34:41Z

Both reviews addressed:

parallelCoordinator: confirmed the direct os.totalmem() reads — calculateOptimalWorkers and getSystemResources didn't inherit anything via cfg, so worker sizing was still host-RAM-based in containers. Both now use getSystemTotalMb(). CLI telemetry reports the effective total too (the cli→engine edge already existed; cli/tsconfig.json gains the gcp-cloud-run/sdk source alias matching the existing producer/aws-lambda entries so cli typechecks from source). A repo-wide grep confirms the only remaining totalmem() call sites live inside systemMemory.ts.
Probe detectability + cost: the cgroup result is now cached for the process lifetime (mirroring the _cachedVramMb pattern, with a test reset hook), and a present-but-unreadable cgroup file warns once per process instead of silently falling back — absence (ENOENT) stays silent. Added the 2^60 cutoff rationale comment and refreshed the stale serverless caveat.

Regression shard failures were the org-wide buildx cache outage (image build died before any render ran); the force-push retriggers a clean run.

james-russo-rames-d-jusso

R2 — ready to stamp. Both R1 concerns resolved (warn-once + cached probe), three of five nits/should-checks landed, the two CI failures at HEAD are pre-existing main-branch breakage from an unrelated studio PR (not from this diff). Two cosmetic items still open (#6 source log on success path, #8 /proc/self/cgroup foot-gun comment) — both genuinely optional, not blocking. Bonus: you pulled in a real worker-sizing fix I didn't catch in R1.

R1 carry-over status:

✅ EACCES log emission — systemMemory.ts:117-124 warnCgroupReadFailure(): once-per-process gated via _warnedCgroupReadFailure, only fires on non-ENOENT/ENOTDIR codes (absence stays silent — exactly the rubric), test at systemMemory.test.ts:255-275 verifies once-only + EACCES surfaced + ENOENT silent.
✅ Module-level cache — systemMemory.ts:21 _cachedCgroupLimitMb + getCgroupLimitMb() short-circuit at :77. Same shape as _cachedVramMb. Test-reset hook _resetCgroupLimitCacheForTests exposed at :71. Subtle but correct: undefined = not yet probed, null = probed/no limit, number = cached limit.
✅ 2^60 rationale comment — systemMemory.ts:18 now reads // Kernel no-limit sentinel is page-rounded 2^63-1 (~9223372036854771712); >= 2^60 is implausible as a real limit.
✅ PR body fixed — parallelCoordinator.ts now genuinely does call getSystemTotalMb() directly (parallelCoordinator.ts:29,157,433). Went one better than my suggestion: instead of correcting the misclaim, made it true. calculateOptimalWorkers + getSystemResources now respect container limits — that was a real worker-sizing bug in containers I missed in R1.
✅ Lambda caveat refresh — systemMemory.ts:157-163 now reads "Lambda tiers with readable cgroup ceilings, inherit the tighter container limit" and reserves the env-var guidance for environments that hide cgroup files. Accurate.
❌ Source-of-truth log (source: "cgroup_v2" | "cgroup_v1" | "host" on the renderOrchestrator success path) — not added. renderOrchestrator.ts:1697-1703 untouched. Non-blocking; can be a follow-up if it bites again.
✅ Empty-bytes edge case — systemMemory.test.ts:280-310 it.each(["", "garbage", "-1", "0"]) now runs the full getSystemTotalMb flow with file content mocked, not just the parser directly. Belt-and-suspenders covered.
❌ /proc/self/cgroup foot-gun comment — not added near CGROUP_V2_MEMORY_MAX_PATH (:16). Pure cosmetic, leave it.
✅ CI shard failures at R1 — diagnosed as org-wide buildx cache outage, force-push retriggered. The remaining failures at HEAD aren't from this PR (see below).
N/A — determinism contract, no concern, already verified in R1.

Bonus catch: cli/src/telemetry/system.ts now also reports getSystemTotalMb() instead of totalmem(), and cli/tsconfig.json gains the gcp-cloud-run/sdk source alias (matching existing producer/aws-lambda entries) so a fresh checkout cli-typecheck resolves from source. Repo-wide grep confirms totalmem() now only lives inside systemMemory.ts itself — single source of truth.

CI status at HEAD c22c8b8:

Engine Test ✅, Typecheck ✅, Build ✅, CLI smoke (required) ✅, Smoke: global install ✅, runtime + studio + SDK smokes all ✅, CodeQL all ✅.
regression / preview-regression jobs ❌ at HEAD — but the Preflight (lint + format) sub-jobs failed (not the actual regression shards), which means the failures cascade from:
CI / Lint ❌ — packages/studio/src/components/editor/DomEditOverlay.tsx: oxlint says onSelectElementById declared but never used. Untouched by this PR.
CI / Format ❌ — same file DomEditOverlay.tsx has an oxfmt formatting issue. Untouched by this PR.
Verdict: both CI failures are pre-existing main-branch breakage from a separate PR's studio work. Nothing in #1373 touches packages/studio/. Heads-up to merge-main once that's fixed; otherwise the regression shards stay red and the green state is masked.

Anything NEW worth flagging:

parallelCoordinator.ts:157 (calculateOptimalWorkers) and :433 (getSystemResources) now correctly use getSystemTotalMb() — this is meaningful beyond what the original PR scope advertised. Workers-per-container will be sized off the cgroup limit, not host RAM. Worth verifying once in a real 4 GiB container that recommendedWorkers drops accordingly — but no smoke gap to block on, MEMORY_PER_WORKER_MB-based math is well-tested.

R2 review by Rames D Jusso

getSystemTotalMb returned os.totalmem() — the host's physical RAM — so a 4GB Docker container on a 32GB host never auto-flagged as low-memory and the low-memory render profile didn't activate exactly where it's needed most. Read the cgroup v2 limit (/sys/fs/cgroup/memory.max, with the v1 fallback and its no-limit sentinel handled) and use min(host, cgroup). The probe is best-effort and non-Linux platforms never touch /sys. Review follow-ups: worker sizing (calculateOptimalWorkers) and the getSystemResources diagnostics previously read os.totalmem() directly and now use getSystemTotalMb(), so container limits actually govern parallel spawn decisions; CLI telemetry reports the effective total as well. The cgroup probe result is cached for the process lifetime (the limit is immutable per process) with a test reset hook; a detected limit logs once so operators can see which source governs, and a present-but-unreadable cgroup file warns once instead of failing silently — absence stays silent. The root-path-vs-/proc/self/cgroup trade-off is documented at the path constants. cli/tsconfig.json gains the gcp-cloud-run/sdk source alias (matching the existing producer and aws-lambda entries) so the cli typecheck resolves from source in a fresh checkout. Refs #1193, #1194, #1195, #1236

miguel-heygen · 2026-06-12T15:59:06Z

R2 follow-ups: rebased onto current main (past the DomEditOverlay lint breakage), plus the two cosmetic items — a detected cgroup limit now logs once on the success path so operators can see which source governs memory-adaptive behaviour, and the root-paths-vs-/proc/self/cgroup trade-off is documented at the path constants (container namespace root is the case this probe exists for; nested systemd slices on bare hosts intentionally fall back to total-RAM detection).

jrusso1020

LGTM.

james-russo-rames-d-jusso

R3 — green to merge. Item #8 fully addressed, item #6 addressed via a different (legitimate) approach, all required CI green at HEAD 7d1bd99.

R2 open-item status:

✅ Item #8 (foot-gun comment) — packages/engine/src/services/systemMemory.ts:16-22. Multi-line comment above CGROUP_V2_MEMORY_MAX_PATH explicitly warns future readers off /proc/self/cgroup: explains namespace-root mount semantics, calls out the nested-slice case on systemd hosts, and justifies why bare-host nesting is acceptable not-covered. Exactly what was asked.
⚠️ Item #6 (source-of-truth log) — partially-addressed, alternative approach. The [Render] Low-memory render profile active … log at producer/src/services/renderOrchestrator.ts:1697-1703 still emits only {totalMemMb, thresholdMb} — no source: cgroup_v2 | cgroup_v1 | host field. Instead added a one-shot detection log at systemMemory.ts:96-100: [SystemMemory] cgroup memory limit detected: N MiB — it governs memory-adaptive render behaviour instead of host RAM. Operationally this gives oncalls the same signal (cgroup engaged vs silent fallback to host), gated by _cachedCgroupLimitMb !== null so it fires once per process. Minor gap: the message doesn't disambiguate v1 vs v2 ("detected" is the same string for both paths). Non-blocking — the value the source-tag would unlock (knowing which detection path engaged) is mostly there.

CI status post-rebase:

All 7 required checks pass at HEAD 7d1bd99: Build, Lint, Format, Typecheck, Test, Test: runtime contract, Render/Tests on windows-latest. The R2 DomEditOverlay.tsx Lint/Format failure is gone — fix landed independently as #1377 (fix(studio): remove unused onSelectElementById after indicator removal) and the rebase pulled it in. The 8 regression-shards are still in-progress but not in the required-merge set.

Anything NEW since R2:

Single rebased commit 7d1bd99. PR scope unchanged: 5 files (cli/telemetry/system.ts, cli/tsconfig.json, engine/parallelCoordinator.ts, engine/systemMemory.ts, engine/systemMemory.test.ts). No drive-by additions, no scope creep.
Nice net-new touch: getCgroupLimitMb() now logs once on first success AND warnCgroupReadFailure() warns once on present-but-unreadable files (gated by _warnedCgroupReadFailure), with absence staying silent. The dual-flag pattern means a misconfigured container surfaces a one-line WARN rather than silently falling back. Good defensive choice.

R3 review by Rames D Jusso

vanceingalls

LGTM

james-russo-rames-d-jusso reviewed Jun 12, 2026

View reviewed changes

miguel-heygen force-pushed the fix/cgroup-memory-limit branch from ccda96a to c22c8b8 Compare June 12, 2026 15:32

james-russo-rames-d-jusso reviewed Jun 12, 2026

View reviewed changes

james-russo-rames-d-jusso mentioned this pull request Jun 12, 2026

fix(engine): real back-pressure in StreamingEncoder.writeFrame #1372

Merged

miguel-heygen force-pushed the fix/cgroup-memory-limit branch from c22c8b8 to 233b467 Compare June 12, 2026 15:53

miguel-heygen force-pushed the fix/cgroup-memory-limit branch from 233b467 to 7d1bd99 Compare June 12, 2026 15:56

jrusso1020 approved these changes Jun 12, 2026

View reviewed changes

james-russo-rames-d-jusso mentioned this pull request Jun 12, 2026

fix(producer): pass resolved engine config through every encode path #1371

Merged

james-russo-rames-d-jusso reviewed Jun 12, 2026

View reviewed changes

vanceingalls approved these changes Jun 12, 2026

View reviewed changes

miguel-heygen merged commit 2ce5b42 into main Jun 12, 2026
42 checks passed

miguel-heygen deleted the fix/cgroup-memory-limit branch June 12, 2026 16:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(engine): respect cgroup memory limits in low-memory detection#1373

fix(engine): respect cgroup memory limits in low-memory detection#1373
miguel-heygen merged 1 commit into
mainfrom
fix/cgroup-memory-limit

miguel-heygen commented Jun 12, 2026

Uh oh!

james-russo-rames-d-jusso left a comment

Uh oh!

miguel-heygen commented Jun 12, 2026

Uh oh!

james-russo-rames-d-jusso left a comment

Uh oh!

miguel-heygen commented Jun 12, 2026

Uh oh!

jrusso1020 left a comment

Uh oh!

james-russo-rames-d-jusso left a comment

Uh oh!

vanceingalls left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

miguel-heygen commented Jun 12, 2026

Problem

Fix

Tests

Uh oh!

james-russo-rames-d-jusso left a comment

Choose a reason for hiding this comment

Uh oh!

miguel-heygen commented Jun 12, 2026

Uh oh!

james-russo-rames-d-jusso left a comment

Choose a reason for hiding this comment

Uh oh!

miguel-heygen commented Jun 12, 2026

Uh oh!

jrusso1020 left a comment

Choose a reason for hiding this comment

Uh oh!

james-russo-rames-d-jusso left a comment

Choose a reason for hiding this comment

Uh oh!

vanceingalls left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants