Skip to content

fix(producer): pass resolved engine config through every encode path#1371

Merged
miguel-heygen merged 1 commit into
mainfrom
fix/1348-encode-timeout-config
Jun 12, 2026
Merged

fix(producer): pass resolved engine config through every encode path#1371
miguel-heygen merged 1 commit into
mainfrom
fix/1348-encode-timeout-config

Conversation

@miguel-heygen

Copy link
Copy Markdown
Collaborator

Problem

encodeFramesFromDir was called with 5 of its 6 arguments (encodeStage.ts:297), silently dropping the trailing config param. The encode timeout therefore always fell back to the hardcoded 600s default — FFMPEG_ENCODE_TIMEOUT_MS (and the rest of resolveConfig()) never reached this path, so any encode whose wall time exceeded 600s was deterministically SIGTERM-killed (FFmpeg exited with code 255, ffmpeg logging Exiting normally, received signal 15). Verified exactly as reported in #1348.

Fix

  • Resolve the engine config once in the encode stage (job.config.producerConfig ?? resolveConfig() — same pattern the orchestrator already uses) and pass it to all three encode paths: non-chunked, chunked, and GIF.
  • encodeFramesChunkedConcat gains an optional trailing config param; its per-chunk ffmpeg spawns previously had no timeout at all and now honor the same value.
  • The GIF path previously read producerConfig?.ffmpegEncodeTimeout ?? DEFAULT, which also ignored env overrides when producerConfig was unset; it now uses the resolved config.

Tests

  • Engine (real ffmpeg): a 1ms ffmpegEncodeTimeout kills both the direct and the chunked encode; omitting config still succeeds on the default.
  • Producer: runEncodeStage tests assert the resolved config object reaches encodeFramesFromDir (arg 5), encodeFramesChunkedConcat (arg 6), and both GIF ffmpeg invocations.
  • Engine suite 698/698, producer typecheck clean, oxlint/oxfmt clean.

Fixes #1348

@james-russo-rames-d-jusso james-russo-rames-d-jusso left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix is right and the call-site coverage is complete — verified all consumers of encodeFramesFromDir / encodeFramesChunkedConcat. runEncodeStage is the single chokepoint, so the orchestrator and the distributed renderChunk path both inherit the fix. One CI test failure to address before merge; otherwise non-blocking notes.

Blockers:

  • packages/engine/src/services/chunkEncoder.test.ts:100 — the "uses the default timeout when config is omitted" test is failing on CI (Test job 81055571378, exit code 1). It asserts result.success === true on a real ffmpeg spawn with default timeout, and the PR body says "Engine suite 698/698" but actual is 1 failed / 691 passed / 6 skipped. Either the GitHub Actions runner is too slow/under-provisioned to encode 2 frames in 600s of default timeout (unlikely but possible — worth printing result.error on failure), or the tiny 2x2 PNG fixture is hitting an ffmpeg pixel-format / dimension constraint that doesn't repro locally. Needs investigation before merge — green CI is the load-bearing claim of the fix.

Nits / non-blocking:

  • packages/producer/src/services/render/stages/encodeStage.ts:244engineCfg = job.config.producerConfig ?? resolveConfig() is now resolved a second time per-job (the orchestrator already did this at renderOrchestrator.ts:1518 and threads cfg through other stages). Matches the GIF path's prior pattern as noted in the PR body, so consistent, but the cleaner fix is to add an engineCfg field to EncodeStageInput and have the orchestrator pass its existing cfg in. Punt to a follow-up.
  • packages/engine/src/services/chunkEncoder.ts:598 — the final concat step in encodeFramesChunkedConcat (stream-copy mux) still has no timeout. In practice concat is fast, but it's the one ffmpeg spawn left in this file without timeout coverage — for consistency consider gating it on ffmpegProcessTimeout (the lighter "process" timeout, not encode).
  • packages/engine/src/services/chunkEncoder.ts:548-567 — per-chunk loop doesn't listen to signal.abort the way encodeFramesFromDir at line 419 does. Pre-existing, not introduced by this PR, but now that the chunked path is timeout-correct an aborted job will still wait for the active chunk's encode timeout instead of SIGTERMing immediately.
  • Test mock encodeStage.test.ts:36-38 stubs resolveConfig to return { ffmpegEncodeTimeout: 12_345 } and asserts arg-position pass-through. That validates plumbing but not that the resolved value actually drives ffmpeg's kill timer — the engine-side chunkEncoder.test.ts tests do that with real ffmpeg, so coverage is split correctly across the two suites. Worth a one-line comment in the producer test pointing at the engine-side tests for completeness.

Things NOT in the diff worth checking:

  • Confirmed all call sites of encodeFramesFromDir and encodeFramesChunkedConcat: only encodeStage.ts:291 and :300. The packages/producer/src/services/chunkEncoder.ts shim is a pure re-export with no callers outside the engine package. No unguarded sites — call-site coverage is clean.
  • spawnStreamingEncoder (separate path, gated by enableStreamingEncode) already correctly threads ffmpegStreamingTimeout config via captureStreamingStage.ts:164 and captureHdrStage.ts:296. Not in scope here, but the PR title "every encode path" implies it — worth a one-liner in the PR body clarifying scope is frame-dir encode paths, not the streaming-encoder path which already had timeout coverage.
  • Backwards-compat: EngineConfig import + new optional trailing param on encodeFramesChunkedConcat is additive — no external callers broken (verified grep above).
  • Determinism: resolveConfig() reads env vars at call time, and this PR re-calls it inside runEncodeStage. If FFMPEG_ENCODE_TIMEOUT_MS ever mutated mid-render, two stages would see different values — paranoid edge case, but it's the kind of thing the HF determinism contract usually quotes-into-config-at-job-start. Snapshotting engineCfg into the job at orchestrator entry (per the first nit above) closes this too.
  • Observability: the SIGTERM-via-timer path doesn't log "timed out at Nms" before killing — when this fires in prod, the only signal is FFmpeg exited with code 255 / signal 15 which is exactly the failure mode #1348 reported. Worth a one-line log.warn("ffmpeg encode timeout reached", { timeoutMs }) at the kill site so future timeouts are diagnosable without log-spelunking.

Review by Rames D Jusso

@miguel-heygen miguel-heygen force-pushed the fix/1348-encode-timeout-config branch from 8bd3d21 to 6b9602e Compare June 12, 2026 15:42
@miguel-heygen

Copy link
Copy Markdown
Collaborator Author

All review points addressed:

  • Red Test check (root cause + fix): the failing uses the default timeout when config is omitted test ran a real ffmpeg encode, which passed locally but failed in the CI sandbox at default settings — environment-fragile by construction (and the source of the 698/698-local vs 1-fail-CI discrepancy). All encode-timeout tests now run against a mocked child_process spawn with fake timers (same idiom as streamingEncoder.test.ts), asserting deterministically at the boundary: no kill at 999ms, SIGTERM at 1000ms, default 600000ms honored when config is omitted. No system-ffmpeg dependency remains in these tests.
  • Untimed concat step: the chunked path's final concat spawn now carries the same config-driven timeout with clearTimeout on both close and error — it previously had none.
  • Bare exit-255 on timeout: every encode-timeout kill (direct, per-chunk, concat) appends FFmpeg killed after exceeding ffmpegEncodeTimeout (N ms) to the failure, and a timedOut flag prevents a timeout from being misreported as user-abort.
  • Config re-resolution nit: the orchestrator now threads its already-resolved config into the stage via an optional EncodeStageInput.engineConfig (single call site); the producerConfig ?? resolveConfig() fallback stays for distributed chunks and direct callers.

Engine suite 702/702, producer encode-stage tests 6/6, both typechecks clean.

@james-russo-rames-d-jusso james-russo-rames-d-jusso left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

R2 — ship it. Every R1 blocker + nit either resolved or resolved-differently-in-a-better-way. Test rewrite is the highlight — correctly identified that the failing test was environment-fragile-by-construction and replaced the whole real-ffmpeg suite with a mocked-spawn + fake-timers suite that asserts the kill boundary deterministically (no kill at 999ms, SIGTERM at 1000ms, default 600s honored when config omitted). Engine 702/702 is now real.

R1 carry-over status:

  1. CI blocker chunkEncoder.test.ts:100 — resolved. Full rewrite to mocked child_process.spawn + vi.useFakeTimers(). Real-ffmpeg dependency removed from all encode-timeout tests. Test job 81065023999 = pass (1m54s). 8 new tests covering kill-boundary + non-timeout exit + default-timeout-when-omitted + concat-timeout for both direct and chunked paths.
  2. encodeStage.ts:244 engineCfg double-resolve — resolved exactly as prescribed. New optional EncodeStageInput.engineConfig field (encodeStage.ts:88), orchestrator passes engineConfig: cfg at renderOrchestrator.ts:2257, stage uses input.engineConfig ?? job.config.producerConfig ?? resolveConfig() — fallback preserved for distributed chunks + direct callers.
  3. Concat spawn no timeout chunkEncoder.ts:598 — resolved. chunkEncoder.ts:631-637 adds setTimeout(...encodeTimeout) + clearTimeout on both close and error handlers, same appendEncodeTimeoutMessage semantics as the per-chunk and direct paths.
  4. ⚠️ Per-chunk loop signal.abort listener — still pre-existing. Outer loop checks signal?.aborted at iteration boundary (chunkEncoder.ts:527) but not during in-flight chunk encode. The new encode-timeout-via-SIGTERM partially mitigates (an aborted job will SIGTERM no later than ffmpegEncodeTimeout ms in), but a true mid-encode abort listener would be ideal. Non-blocking, fine as-is; surface for a follow-up if it bites.
  5. encodeStage.test.ts:36-38 cross-suite comment — not added. Tests were rewritten end-to-end so the original line range no longer exists; no top-of-file comment pointing at engine-side tests. Cosmetic, non-blocking — coverage is solid on both sides regardless.
  6. No new call sites — confirmed. Still only encodeStage.ts:298 and :305 post-edit. The packages/producer/src/services/chunkEncoder.ts re-export shim remains caller-less outside the engine package.
  7. PR body scope clarification — not updated. PR body unchanged; spawnStreamingEncoder path still implicitly out of scope. Cosmetic.
  8. Backwards-compat — additive trailing param + additive EncodeStageInput field, no breaks.
  9. Determinism (engineCfg snapshot mid-render) — closed by item 2. cfg is resolved once at orchestrator entry and threaded; env-mutation-mid-render no longer affects the orchestrator path.
  10. log.warn("ffmpeg encode timeout reached") — resolved-differently (arguably better). No log line added, but appendEncodeTimeoutMessage (chunkEncoder.ts:42-45) appends FFmpeg killed after exceeding ffmpegEncodeTimeout (N ms) to the EncodeResult.error string itself, which propagates into formatFfmpegError's output. Future timeouts will be diagnosable from the failure record rather than requiring log correlation. Functionally closes the "bare exit-255" gap.

CI status:

All CI-suite required checks green at HEAD 6b9602e: Build, Test, Typecheck, Lint, Fallow audit, SDK unit/contract/smoke, CLI smoke, Studio load smoke, Smoke global install. CodeQL all green. Two long-runners still IN_PROGRESS (regression-shards × 8 + Windows render/tests, started 15:43) — these are the normal slow shards, not blockers for this fix's correctness claim.

Anything NEW worth flagging:

  • appendEncodeTimeoutMessage helper + timedOut flag (chunkEncoder.ts:42, 436, 450) cleanly disambiguate SIGTERM-via-encode-timeout from user-abort. Previously a timeout-then-abort race could surface as either "Aborted" or "FFmpeg exited with code 143" depending on ordering; now the signal?.aborted && !timedOut guard at line 450 forces timeout to win, which is the right behavior (the timeout fired the kill, the abort is a downstream consequence). Tests cover the disambiguation in chunkEncoder.test.ts:130-160.

R2 review by Rames D Jusso

encodeFramesFromDir was called with 5 of its 6 args, dropping the config
param — the encode timeout always fell back to the hardcoded 600s default
and FFMPEG_ENCODE_TIMEOUT_MS was silently ignored, so any encode over
600s wall time was deterministically SIGTERM-killed. Resolve the engine
config once in the encode stage and pass it to the non-chunked, chunked,
and GIF paths. The chunked per-chunk encodes previously had no timeout at
all; they now honor the same config value.

Review follow-ups: the chunked path's final concat spawn gains the same
config-driven timeout (it previously had none); every encode-timeout
kill now appends 'FFmpeg killed after exceeding ffmpegEncodeTimeout
(N ms)' to the failure instead of surfacing a bare exit-255; and the
orchestrator threads its already-resolved config into the encode stage
via an optional EncodeStageInput.engineConfig field (direct callers and
distributed chunks keep the producerConfig ?? resolveConfig() fallback).
The encode-timeout tests run against a mocked child_process spawn with
fake timers, removing the real-ffmpeg dependency that made the previous
default-timeout test environment-fragile in CI.

Fixes #1348
@miguel-heygen miguel-heygen force-pushed the fix/1348-encode-timeout-config branch from 6b9602e to 3c84383 Compare June 12, 2026 15:52

@jrusso1020 jrusso1020 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@james-russo-rames-d-jusso james-russo-rames-d-jusso left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

R3 — ship-it stance from R2 holds. All R2 open items are cosmetic-status-unchanged (#4 still pre-existing, #5 not added, #7 PR body unchanged) — none were blockers and none became blockers. Core fix + the R2 review follow-ups (concat-spawn timeout, appendEncodeTimeoutMessage, orchestrator engineConfig thread-through, mocked-spawn tests) are all intact at HEAD 3c84383. CI: all required green, the 8 regression-shards are still in-progress from the rebase ~10min ago.

R2 open-item status:

  • ⚠️ #4 per-chunk mid-encode abortpackages/engine/src/services/chunkEncoder.ts:526-527, 559-597. Outer loop still only checks signal?.aborted at iteration boundary; no signal.addEventListener('abort', ...) wired into the per-chunk spawn (contrast with the non-chunked path at :427-431, 484 which does register/remove an abort listener). Unchanged from R2 — pre-existing pattern, non-blocking, abort still lands at the next iteration. Worth a follow-up issue if you care; not for this PR.
  • #5 cross-suite test commentpackages/producer/src/services/render/stages/encodeStage.test.ts:1-8 jumps straight from imports into resolvedEngineConfig. No comment pointing at engine-side chunkEncoder.test.ts for the kill-timer side. Pure docs-nit, doesn't affect coverage (engine-side :116-282 exercises both encodeFramesFromDir and encodeFramesChunkedConcat timeout kills with fake timers).
  • #7 PR body scope clarification — body still says "all three encode paths" / "every encode path" without flagging that streaming-encoder already had timeout coverage. Cosmetic.

CI status post-rebase:

  • Required green: Build, Test, Typecheck, Lint, Format, CLI smoke (required), SDK: unit + contract + smoke, Test: runtime contract, Tests on windows-latest, Render on windows-latest, Analyze (javascript-typescript), Analyze (python), preview-regression, player-perf, Studio: load smoke, Smoke: global install.
  • In-progress: regression-shards shards 1-8 (started 15:53:59Z, ~6-15min typical). Nothing failing.
  • DomEditOverlay.tsx lint hit from R2 was #1372/#1373 only — confirmed absent here.

Anything NEW since R2:

  • Single squashed commit 3c84383 (collapsed everything into one). 5 files / +581/-15. Touches exactly the expected surface: chunkEncoder.{ts,test.ts}, encodeStage.{ts,test.ts}, renderOrchestrator.ts:2257 (the engineConfig: cfg thread-through).
  • Verified intact: appendEncodeTimeoutMessage helper at chunkEncoder.ts:42 used at :469, :491, :578, :590, :646, :658 (all 4 spawn paths — non-chunked main, non-chunked error, per-chunk main, per-chunk error, concat main, concat error). engineConfig?: Pick<EngineConfig, "ffmpegEncodeTimeout"> field at encodeStage.ts:88, resolution chain at :247 (input.engineConfig ?? job.config.producerConfig ?? resolveConfig()).
  • No surprises — rebased + did the cosmetics-or-skip dance only.

R3 review by Rames D Jusso

@vanceingalls vanceingalls left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@miguel-heygen miguel-heygen merged commit 5917c03 into main Jun 12, 2026
43 checks passed
@miguel-heygen miguel-heygen deleted the fix/1348-encode-timeout-config branch June 12, 2026 16:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

encodeFramesFromDir is called without its config param — FFMPEG_ENCODE_TIMEOUT_MS is silently ignored, every encode > 600s is killed (signal 15)

4 participants