fix(producer): pass resolved engine config through every encode path#1371
Conversation
james-russo-rames-d-jusso
left a comment
There was a problem hiding this comment.
Fix is right and the call-site coverage is complete — verified all consumers of encodeFramesFromDir / encodeFramesChunkedConcat. runEncodeStage is the single chokepoint, so the orchestrator and the distributed renderChunk path both inherit the fix. One CI test failure to address before merge; otherwise non-blocking notes.
Blockers:
packages/engine/src/services/chunkEncoder.test.ts:100— the"uses the default timeout when config is omitted"test is failing on CI (Testjob 81055571378, exit code 1). It assertsresult.success === trueon a real ffmpeg spawn with default timeout, and the PR body says "Engine suite 698/698" but actual is 1 failed / 691 passed / 6 skipped. Either the GitHub Actions runner is too slow/under-provisioned to encode 2 frames in 600s of default timeout (unlikely but possible — worth printingresult.erroron failure), or the tiny 2x2 PNG fixture is hitting an ffmpeg pixel-format / dimension constraint that doesn't repro locally. Needs investigation before merge — green CI is the load-bearing claim of the fix.
Nits / non-blocking:
packages/producer/src/services/render/stages/encodeStage.ts:244—engineCfg = job.config.producerConfig ?? resolveConfig()is now resolved a second time per-job (the orchestrator already did this atrenderOrchestrator.ts:1518and threadscfgthrough other stages). Matches the GIF path's prior pattern as noted in the PR body, so consistent, but the cleaner fix is to add anengineCfgfield toEncodeStageInputand have the orchestrator pass its existingcfgin. Punt to a follow-up.packages/engine/src/services/chunkEncoder.ts:598— the final concat step inencodeFramesChunkedConcat(stream-copy mux) still has no timeout. In practice concat is fast, but it's the one ffmpeg spawn left in this file without timeout coverage — for consistency consider gating it onffmpegProcessTimeout(the lighter "process" timeout, not encode).packages/engine/src/services/chunkEncoder.ts:548-567— per-chunk loop doesn't listen tosignal.abortthe wayencodeFramesFromDirat line 419 does. Pre-existing, not introduced by this PR, but now that the chunked path is timeout-correct an aborted job will still wait for the active chunk's encode timeout instead of SIGTERMing immediately.- Test mock
encodeStage.test.ts:36-38stubsresolveConfigto return{ ffmpegEncodeTimeout: 12_345 }and asserts arg-position pass-through. That validates plumbing but not that the resolved value actually drives ffmpeg's kill timer — the engine-sidechunkEncoder.test.tstests do that with real ffmpeg, so coverage is split correctly across the two suites. Worth a one-line comment in the producer test pointing at the engine-side tests for completeness.
Things NOT in the diff worth checking:
- Confirmed all call sites of
encodeFramesFromDirandencodeFramesChunkedConcat: onlyencodeStage.ts:291and:300. Thepackages/producer/src/services/chunkEncoder.tsshim is a pure re-export with no callers outside the engine package. No unguarded sites — call-site coverage is clean. spawnStreamingEncoder(separate path, gated byenableStreamingEncode) already correctly threadsffmpegStreamingTimeoutconfig viacaptureStreamingStage.ts:164andcaptureHdrStage.ts:296. Not in scope here, but the PR title "every encode path" implies it — worth a one-liner in the PR body clarifying scope is frame-dir encode paths, not the streaming-encoder path which already had timeout coverage.- Backwards-compat:
EngineConfigimport + new optional trailing param onencodeFramesChunkedConcatis additive — no external callers broken (verified grep above). - Determinism:
resolveConfig()reads env vars at call time, and this PR re-calls it insiderunEncodeStage. IfFFMPEG_ENCODE_TIMEOUT_MSever mutated mid-render, two stages would see different values — paranoid edge case, but it's the kind of thing the HF determinism contract usually quotes-into-config-at-job-start. SnapshottingengineCfginto the job at orchestrator entry (per the first nit above) closes this too. - Observability: the SIGTERM-via-timer path doesn't log "timed out at Nms" before killing — when this fires in prod, the only signal is
FFmpeg exited with code 255 / signal 15which is exactly the failure mode #1348 reported. Worth a one-linelog.warn("ffmpeg encode timeout reached", { timeoutMs })at the kill site so future timeouts are diagnosable without log-spelunking.
Review by Rames D Jusso
8bd3d21 to
6b9602e
Compare
|
All review points addressed:
Engine suite 702/702, producer encode-stage tests 6/6, both typechecks clean. |
james-russo-rames-d-jusso
left a comment
There was a problem hiding this comment.
R2 — ship it. Every R1 blocker + nit either resolved or resolved-differently-in-a-better-way. Test rewrite is the highlight — correctly identified that the failing test was environment-fragile-by-construction and replaced the whole real-ffmpeg suite with a mocked-spawn + fake-timers suite that asserts the kill boundary deterministically (no kill at 999ms, SIGTERM at 1000ms, default 600s honored when config omitted). Engine 702/702 is now real.
R1 carry-over status:
- ✅ CI blocker
chunkEncoder.test.ts:100— resolved. Full rewrite to mockedchild_process.spawn+vi.useFakeTimers(). Real-ffmpeg dependency removed from all encode-timeout tests. Test job 81065023999 = pass (1m54s). 8 new tests covering kill-boundary + non-timeout exit + default-timeout-when-omitted + concat-timeout for both direct and chunked paths. - ✅
encodeStage.ts:244engineCfg double-resolve — resolved exactly as prescribed. New optionalEncodeStageInput.engineConfigfield (encodeStage.ts:88), orchestrator passesengineConfig: cfgatrenderOrchestrator.ts:2257, stage usesinput.engineConfig ?? job.config.producerConfig ?? resolveConfig()— fallback preserved for distributed chunks + direct callers. - ✅ Concat spawn no timeout
chunkEncoder.ts:598— resolved.chunkEncoder.ts:631-637addssetTimeout(...encodeTimeout)+clearTimeouton both close and error handlers, sameappendEncodeTimeoutMessagesemantics as the per-chunk and direct paths. ⚠️ Per-chunk loopsignal.abortlistener — still pre-existing. Outer loop checkssignal?.abortedat iteration boundary (chunkEncoder.ts:527) but not during in-flight chunk encode. The new encode-timeout-via-SIGTERM partially mitigates (an aborted job will SIGTERM no later thanffmpegEncodeTimeoutms in), but a true mid-encode abort listener would be ideal. Non-blocking, fine as-is; surface for a follow-up if it bites.- ❌
encodeStage.test.ts:36-38cross-suite comment — not added. Tests were rewritten end-to-end so the original line range no longer exists; no top-of-file comment pointing at engine-side tests. Cosmetic, non-blocking — coverage is solid on both sides regardless. - ✅ No new call sites — confirmed. Still only
encodeStage.ts:298and:305post-edit. Thepackages/producer/src/services/chunkEncoder.tsre-export shim remains caller-less outside the engine package. - ❌ PR body scope clarification — not updated. PR body unchanged;
spawnStreamingEncoderpath still implicitly out of scope. Cosmetic. - ✅ Backwards-compat — additive trailing param + additive
EncodeStageInputfield, no breaks. - ✅ Determinism (engineCfg snapshot mid-render) — closed by item 2.
cfgis resolved once at orchestrator entry and threaded; env-mutation-mid-render no longer affects the orchestrator path. - ✅
log.warn("ffmpeg encode timeout reached")— resolved-differently (arguably better). No log line added, butappendEncodeTimeoutMessage(chunkEncoder.ts:42-45) appendsFFmpeg killed after exceeding ffmpegEncodeTimeout (N ms)to theEncodeResult.errorstring itself, which propagates intoformatFfmpegError's output. Future timeouts will be diagnosable from the failure record rather than requiring log correlation. Functionally closes the "bare exit-255" gap.
CI status:
All CI-suite required checks green at HEAD 6b9602e: Build, Test, Typecheck, Lint, Fallow audit, SDK unit/contract/smoke, CLI smoke, Studio load smoke, Smoke global install. CodeQL all green. Two long-runners still IN_PROGRESS (regression-shards × 8 + Windows render/tests, started 15:43) — these are the normal slow shards, not blockers for this fix's correctness claim.
Anything NEW worth flagging:
appendEncodeTimeoutMessagehelper +timedOutflag (chunkEncoder.ts:42, 436, 450) cleanly disambiguate SIGTERM-via-encode-timeout from user-abort. Previously a timeout-then-abort race could surface as either "Aborted" or "FFmpeg exited with code 143" depending on ordering; now thesignal?.aborted && !timedOutguard at line 450 forces timeout to win, which is the right behavior (the timeout fired the kill, the abort is a downstream consequence). Tests cover the disambiguation inchunkEncoder.test.ts:130-160.
R2 review by Rames D Jusso
encodeFramesFromDir was called with 5 of its 6 args, dropping the config param — the encode timeout always fell back to the hardcoded 600s default and FFMPEG_ENCODE_TIMEOUT_MS was silently ignored, so any encode over 600s wall time was deterministically SIGTERM-killed. Resolve the engine config once in the encode stage and pass it to the non-chunked, chunked, and GIF paths. The chunked per-chunk encodes previously had no timeout at all; they now honor the same config value. Review follow-ups: the chunked path's final concat spawn gains the same config-driven timeout (it previously had none); every encode-timeout kill now appends 'FFmpeg killed after exceeding ffmpegEncodeTimeout (N ms)' to the failure instead of surfacing a bare exit-255; and the orchestrator threads its already-resolved config into the encode stage via an optional EncodeStageInput.engineConfig field (direct callers and distributed chunks keep the producerConfig ?? resolveConfig() fallback). The encode-timeout tests run against a mocked child_process spawn with fake timers, removing the real-ffmpeg dependency that made the previous default-timeout test environment-fragile in CI. Fixes #1348
6b9602e to
3c84383
Compare
james-russo-rames-d-jusso
left a comment
There was a problem hiding this comment.
R3 — ship-it stance from R2 holds. All R2 open items are cosmetic-status-unchanged (#4 still pre-existing, #5 not added, #7 PR body unchanged) — none were blockers and none became blockers. Core fix + the R2 review follow-ups (concat-spawn timeout, appendEncodeTimeoutMessage, orchestrator engineConfig thread-through, mocked-spawn tests) are all intact at HEAD 3c84383. CI: all required green, the 8 regression-shards are still in-progress from the rebase ~10min ago.
R2 open-item status:
⚠️ #4 per-chunk mid-encode abort —packages/engine/src/services/chunkEncoder.ts:526-527, 559-597. Outer loop still only checkssignal?.abortedat iteration boundary; nosignal.addEventListener('abort', ...)wired into the per-chunkspawn(contrast with the non-chunked path at:427-431, 484which does register/remove an abort listener). Unchanged from R2 — pre-existing pattern, non-blocking, abort still lands at the next iteration. Worth a follow-up issue if you care; not for this PR.- ❌ #5 cross-suite test comment —
packages/producer/src/services/render/stages/encodeStage.test.ts:1-8jumps straight from imports intoresolvedEngineConfig. No comment pointing at engine-sidechunkEncoder.test.tsfor the kill-timer side. Pure docs-nit, doesn't affect coverage (engine-side:116-282exercises bothencodeFramesFromDirandencodeFramesChunkedConcattimeout kills with fake timers). - ❌ #7 PR body scope clarification — body still says "all three encode paths" / "every encode path" without flagging that streaming-encoder already had timeout coverage. Cosmetic.
CI status post-rebase:
- Required green:
Build,Test,Typecheck,Lint,Format,CLI smoke (required),SDK: unit + contract + smoke,Test: runtime contract,Tests on windows-latest,Render on windows-latest,Analyze (javascript-typescript),Analyze (python),preview-regression,player-perf,Studio: load smoke,Smoke: global install. - In-progress:
regression-shardsshards 1-8 (started 15:53:59Z, ~6-15min typical). Nothing failing. DomEditOverlay.tsxlint hit from R2 was #1372/#1373 only — confirmed absent here.
Anything NEW since R2:
- Single squashed commit
3c84383(collapsed everything into one). 5 files / +581/-15. Touches exactly the expected surface:chunkEncoder.{ts,test.ts},encodeStage.{ts,test.ts},renderOrchestrator.ts:2257(theengineConfig: cfgthread-through). - Verified intact:
appendEncodeTimeoutMessagehelper atchunkEncoder.ts:42used at:469, :491, :578, :590, :646, :658(all 4 spawn paths — non-chunked main, non-chunked error, per-chunk main, per-chunk error, concat main, concat error).engineConfig?: Pick<EngineConfig, "ffmpegEncodeTimeout">field atencodeStage.ts:88, resolution chain at:247(input.engineConfig ?? job.config.producerConfig ?? resolveConfig()). - No surprises — rebased + did the cosmetics-or-skip dance only.
R3 review by Rames D Jusso
Problem
encodeFramesFromDirwas called with 5 of its 6 arguments (encodeStage.ts:297), silently dropping the trailingconfigparam. The encode timeout therefore always fell back to the hardcoded 600s default —FFMPEG_ENCODE_TIMEOUT_MS(and the rest ofresolveConfig()) never reached this path, so any encode whose wall time exceeded 600s was deterministically SIGTERM-killed (FFmpeg exited with code 255, ffmpeg loggingExiting normally, received signal 15). Verified exactly as reported in #1348.Fix
job.config.producerConfig ?? resolveConfig()— same pattern the orchestrator already uses) and pass it to all three encode paths: non-chunked, chunked, and GIF.encodeFramesChunkedConcatgains an optional trailingconfigparam; its per-chunk ffmpeg spawns previously had no timeout at all and now honor the same value.producerConfig?.ffmpegEncodeTimeout ?? DEFAULT, which also ignored env overrides whenproducerConfigwas unset; it now uses the resolved config.Tests
ffmpegEncodeTimeoutkills both the direct and the chunked encode; omitting config still succeeds on the default.runEncodeStagetests assert the resolved config object reachesencodeFramesFromDir(arg 5),encodeFramesChunkedConcat(arg 6), and both GIF ffmpeg invocations.Fixes #1348