fix(engine): real back-pressure in StreamingEncoder.writeFrame#1372
Conversation
james-russo-rames-d-jusso
left a comment
There was a problem hiding this comment.
Solid fix. The drain-vs-exit race correctly handles the listener-leak case the prior fire-and-forget design couldn't. One real concern about microtask accumulation on the shared exitPromise over long renders; everything else is small.
Concerns:
-
packages/engine/src/services/streamingEncoder.ts:452-472—exitPromise.then(...)registered per back-pressuredwriteFramecall leaks closures until FFmpeg exits. Every slow-path write addsexitPromise.then(() => { drainAbort.abort(); return "exit"; })to the sharedexitPromise's reaction list. V8 doesn't GC.thencallbacks on an unsettled Promise — so on a 1h render at 30fps under steady back-pressure you'd accumulate ~108K registrations, each pinning a freshAbortController+ closure (~hundreds of bytes each → tens of MB). That's well under the 80 GB OOM this PR fixes, but it directly works against the PR's heap-stability goal and is the exact kind of slow leak the multi-worker cap was masking. Fix is small: keep one persistentexitPromiseconsumer outsidewriteFrame, e.g. set anexitedFlag+ anArray<AbortController>purged on each call, OR havewaitForDrainOrExitusePromise.race([oncePromise, new Promise(r => ffmpeg.once('close', r))])with a one-shot listener that auto-removes. The latter matches the symmetry of the currentonce(stdin, 'drain', { signal })and avoids the shared-Promise reaction list entirely. -
packages/producer/src/services/render/stages/captureStreamingStage.ts:198,266+ HDR loops —falsereturn value fromwriteFrameis still discarded by every caller. If FFmpeg exits mid-render,writeFrameshort-circuits and returnsfalseinstantly, but the producer loop keeps capturing frames and walking the reorder buffer untilclose()eventually surfaces the failure. Not a regression from this PR (old code had the same issue) — but the newPromise<boolean>signature is the natural seam to short-circuit on. Aif (!(await currentEncoder.writeFrame(buffer))) break;per loop would let the failure propagate ~one frame faster and stop wasting capture work. Non-blocking, but worth one line.
Nits:
streamingEncoder.ts:468-471— theif (result === "drain") drainAbort.abort();defensive abort is harmless but redundant:once()with{ signal }removes its listener on resolve too, andAbortController.abort()after resolve is a no-op. Could drop the conditional for clarity, or comment why the defense is there.- Comment on
streamingEncoder.ts:489-497is great but the phrase "the shared exit promise wins the race" is the exact line that masks the leak in concern #1 — worth nuancing once the fix lands.
Things NOT in the diff worth checking:
-
Observability for time-spent-back-pressured. HDR loops already roll drain time into
encoderWriteMs, butcaptureStreamingStagehas no equivalent. If a render gets slower post-merge, oncall can't distinguish capture-bound vs encoder-bound without code-reading. A counter or histogram on the slow path (incremented whenaccepted === false) would pay for itself the first time someone asks "why is this render slow now?". -
close()racing with a pendingwriteFrame.close()callsstdin.end()which flushes; if the kernel pipe drains, the pending writeFrame resolvestrue, but the producer loop is already tearing down — the result is consumed by no one and the buffer copy was wasted. Probably fine in practice but no test asserts the behavior. Worth one test case: spawn → write returns false →close()→ assert pending writeFrame settles (either to true or false, but doesn't hang). -
Determinism contract holds. Verified — no
setTimeout/Date.now()introduced in the back-pressure path;'drain'is event-driven.-framerateis configured at FFmpeg spawn so output frame timestamps remain deterministic regardless of producer wall-clock blocking. Vai's lane on the broader runtime semantics; from a determinism-contract reading this is clean. -
Concurrent
writeFramecallers and drain-listener accumulation. Confirmed not reachable: every caller goes throughreorderBuffer.waitForFrame→writeFrame→advanceTo, which strictly serializes. If a future caller bypasses the reorder buffer (e.g. an audio side-channel),once(stdin, 'drain')from two concurrent callers would both register listeners and both resolve on the same'drain'event — currently safe (both returntrue), but worth a "callers must serialize" docstring on the encoder interface so it doesn't get bypassed silently later.
Review by Rames D Jusso
79d193c to
5647a5b
Compare
|
Addressed both review points:
|
james-russo-rames-d-jusso
left a comment
There was a problem hiding this comment.
R2 — concerns resolved, ship once CI unblocks. The load-bearing closure-leak fix landed exactly as prescribed — one-shot once() listeners with abort-in-finally, plus the close-before-attach re-check. All five writeFrame call sites now go through ensureFrameWritten. Two non-blocking items I flagged for "things not in the diff" are still open (observability counter, serialization docstring). CI is red but on an unrelated file — DomEditOverlay.tsx lint/format breakage that has nothing to do with this PR.
R1 carry-over status:
- ✅ Closure leak (
streamingEncoder.ts:452-485) — resolved.waitForDrainOrExitnow usesonce(stdin,'drain',{signal})+once(ffmpeg,'close',{signal}), aborts infinally, re-checksexitStatus !== "running"after listener attachment. Zero.then()on the sharedexitPromise. The new test atstreamingEncoder.test.ts:632-666loops 12 back-pressured writes and assertsproc.listenerCount("close")returns to baseline each cycle — the exact V8-reaction-list shape I was worried about, verified. - ✅ Callers discard
false(captureStreamingStage.ts:198,266 + captureHdrHybridLoop.ts:188 + captureHdrSequentialLoop.ts:193,210) — resolved. All five sites:ensureFrameWritten(await encoder.writeFrame(buf), i). Throws a frame-indexedError("Streaming encoder exited before frame N was written")instead of discarding. Stricter than thebreak;I suggested — actually surfaces the failure up the stack. - ✅ Redundant defensive abort — N/A. Entirely refactored away; new code aborts unconditionally in
finally. - ✅ Misleading "shared exit promise wins the race" comment — resolved. New comment block at
streamingEncoder.ts:455-461explicitly explains why they don't race the shared exit promise (V8 retains.thenentries on unsettled promises). - ❌ Back-pressure observability counter on
captureStreamingStage— still open. HDR loops still getaddHdrTiming(hdrPerf, "encoderWriteMs", writeStart); streaming stage has no equivalent. Non-blocking, follow-up. - ✅
close()racing pendingwriteFrametest — resolved. New test "resolves false when close fires after write returns false before await attaches listeners" (the close-before-attach hang window) atstreamingEncoder.test.ts:696-723, plus "resolves false instead of hanging when FFmpeg exits before drain" at670-694. - ✅ Determinism — N/A (verified clean in R1).
- ❌ "Callers must serialize" docstring on
StreamingEncoder.writeFrame— still open. Non-blocking; today every caller goes throughreorderBuffer.waitForFrameso it's not reachable.
CI status — red but unrelated to this PR:
CI / Lint—oxlinterror onpackages/studio/src/components/editor/DomEditOverlay.tsx: "Parameter 'onSelectElementById' is declared but never used."CI / Format—oxfmt --checkcomplains about the sameDomEditOverlay.tsx.regression(required) andWindows render verificationare SKIPPED+FAILURE-cascade because preflight (lint+format) gated them out.- Every check from this PR's actual diff is green:
Test(1m46s),Typecheck,Build,Test: runtime contract,CLI smoke (required),SDK: unit + contract + smoke,CodeQL,Fallow audit,Smoke: global install. - Same
DomEditOverlay.tsxlint hit is blocking #1373 too — same studio main-branch breakage from a separate PR.
Anything NEW worth flagging:
- Drive-by swap of
writeFileSync→writeFileExclusiveSyncincaptureHdrHybridLoop.ts:331andcaptureHdrSequentialLoop.ts:204for the HDR debug-dump path. Unrelated to back-pressure but defensible — prevents two parallel HDR workers from stomping the sameframe_NNNN_final_rgb48le.binif the debug-dump cadences ever overlap. - The
ensureFrameWrittenhelper lives incaptureHdrFrameShared.ts:212and is used by both the streaming stage and HDR loops — good shared seam, well-placed. - Commit message is unusually thorough and accurate — correctly names the V8 reaction-list mechanism, the ~108K closure estimate, and the
MULTI_WORKER_MAX_DURATION_SECONDSfollow-up framing.
R2 review by Rames D Jusso
5647a5b to
dc6073c
Compare
writeFrame returned the stdin.write boolean synchronously; when FFmpeg encoded slower than workers captured, Node's writable buffer grew without bound (multi-worker worst case ~80GB over a 1h render) until the kernel OOM-killed the process. writeFrame is now async: a buffered write awaits the drain event before resolving, so back-pressure propagates through the frame reorder buffer to the capture loops and in-flight frames stay bounded. Inactivity-timer semantics are preserved: no reset before drain, so a hung FFmpeg still trips SIGTERM. The drain wait races one-shot drain/close listeners (aborted in a finally) rather than chaining onto the shared exit promise — V8 retains reaction-list entries on unsettled promises, so per-frame .then chains would accumulate ~108K closures over a 1h back-pressured render. An exit-status re-check after listener attachment closes the close-before-attach hang window. All five writeFrame call sites (streaming stage and HDR loops) check the result via a shared ensureFrameWritten guard and stop the render with a frame-indexed error when the encoder is gone instead of discarding the boolean. The MULTI_WORKER_MAX_DURATION_SECONDS cap can be relaxed in a follow-up now that buffering is bounded. Fixes #1353
dc6073c to
8b5f476
Compare
|
R2 follow-ups: rebased onto current main (past the DomEditOverlay lint breakage fixed in #1377), and the "callers must serialize" contract is now documented on |
jrusso1020
left a comment
There was a problem hiding this comment.
LGTM. The drain-race-against-exitPromise shape with abort-controller-driven listener cleanup is exactly the durable fix this surface needed.
james-russo-rames-d-jusso
left a comment
There was a problem hiding this comment.
R3 — item #8 ✅ landed cleanly; item #5 regression-shards in progress as expected for a render PR). Non-blocking to ship: #5 was already flagged R2 as non-blocking concern. Worth a one-line ack about the observability gap before stamp.
R2 open-item status:
⚠️ #5 Back-pressure observability counter — not added.packages/producer/src/services/render/stages/captureStreamingStage.ts:199and:267still callensureFrameWritten(await currentEncoder.writeFrame(buffer), frameIndex)with no surroundingaddStreamingTiming/ counter / histogram. HDR loops do still wrap withaddHdrTiming(hdrPerf, "encoderWriteMs", writeStart)after the await (captureHdrHybridLoop.ts:189,captureHdrSequentialLoop.ts:193/:210), so the HDR path does roll drain time intoencoderWriteMs— but the streaming stage emits nothing. When chat-host renders get slower post-merge, oncall still won't be able to distinguish capture-bound vs encoder-bound on the streaming path. Same gap as R2.- ✅ #8 "Callers must serialize" docstring — added on the interface declaration at
packages/engine/src/services/streamingEncoder.ts:130-136. JSDoc onStreamingEncoder.writeFramereads: "Callers must serialize calls — one in-flight writeFrame per encoder (the frame reorder buffer provides this ordering); concurrent calls would interleave frame bytes on the pipe and race the drain wait." Lands on the interface, so IDE hover surfaces it on every call site. Exactly the shape requested.
CI status post-rebase:
All required checks green at HEAD 8b5f476:
CI / Lint✅,CI / Format✅,CI / Build✅,CI / Typecheck✅,CI / Test✅,CI / CLI smoke (required)✅,CI / SDK: unit + contract + smoke✅,CI / Test: runtime contract✅,CI / Studio: load smoke✅,CI / Smoke: global install✅,CI / Semantic PR title✅,CI / File size check✅,CI / Fallow audit✅CodeQL✅ (all three analyzers),Windows render verification✅,player-perf✅,preview-regression✅regression-shards (shard-1..8)are IN_PROGRESS — normal for a render-pipeline PR; the lint/format failure R2 flagged onDomEditOverlay.tsxis gone post-rebase.
Anything NEW:
Single squashed commit since R2 (8b5f476), 6 files / 247 ins / 31 del. Diff cross-check:
- Closure-leak fix preserved:
streamingEncoder.ts:462-484still uses one-shotonce(stdin, 'drain', { signal })+once(ffmpeg, 'close', { signal })withAbortController.abort()in finally — not.thenon the sharedexitPromise. Comment at line 462 explicitly documents not racingexitPromise.then(...)to avoid V8 reaction-list retention. The only.thencalls (lines 471, 474) are on the per-writeonce(...)promises that settle when drain/close fires or get aborted in finally — not long-lived. Load-bearing fix intact. - Close-before-attach hang window closed:
streamingEncoder.ts:478-480re-checksexitStatus !== "running"after listener attachment, as the R2 review prescribed. - Shared
ensureFrameWrittenguard atcaptureHdrFrameShared.ts:309-313is the right shape — single helper, frame-indexed error, no callers swallow the boolean. Applied at all five write sites. - No new feature flags, no API/wire changes, no test deletions.
streamingEncoder.test.tsgot +166 lines covering the new async path.
R3 review by Rames D Jusso
Problem
writeFramereturnedstdin.write()'s boolean synchronously. Node's highWaterMark is advisory, so when FFmpeg encodes slower than workers capture, the writable buffer grows without bound — the #1353 / PR #1351 review numbers put the multi-worker worst case at ~80 GB over a 1h render, OOM-killing the producer before the render finishes. The 1800s multi-worker cap bounds the practical risk but is a band-aid.Fix
writeFrameis now(buffer: Buffer) => Promise<boolean>:accepted === true: unchanged fast path (reset inactivity timer, resolvetrue).accepted === false: awaitdrainraced against the shared FFmpeg exit promise — if FFmpeg dies before draining, the promise resolvesfalseinstead of hanging forever, and the drain listener is removed either way (listener-leak safe across thousands of frames).exitStatusis re-checked after drain.All five call sites now await, so back-pressure propagates through the frame reorder buffer to the capture loops:
captureStreamingStage.ts:198/266,captureHdrSequentialLoop.ts:192/209,captureHdrHybridLoop.ts:188(the HDR loops weren't listed in the issue — found via caller sweep).MULTI_WORKER_MAX_DURATION_SECONDSis deliberately untouched; relaxing it is the follow-up this PR enables.Tests
drain, then resolvestrue; drain listener count returns to 0.falsepromptly (no hang), listener cleaned up.Fixes #1353