Skip to content

fix(engine): add encode timeout to encodeFramesChunkedConcat ffmpeg invocations#1361

Open
leorivastech wants to merge 4 commits into
heygen-com:mainfrom
leorivastech:fix/chunked-encode-timeout
Open

fix(engine): add encode timeout to encodeFramesChunkedConcat ffmpeg invocations#1361
leorivastech wants to merge 4 commits into
heygen-com:mainfrom
leorivastech:fix/chunked-encode-timeout

Conversation

@leorivastech

@leorivastech leorivastech commented Jun 11, 2026

Copy link
Copy Markdown

Summary

Follow-up to #1352, as suggested in review: the chunked-concat encode path (encodeFramesChunkedConcat) had no timeout at all — neither the per-chunk ffmpeg spawns nor the final -f concat spawn were guarded. A hung ffmpeg process on this path stalled the render indefinitely. (#1348 assumed the 600s default applied per chunk; reading the source shows there was no guard on this path whatsoever.)

Fixes #1348.

Supersedes #1352 (contains its commits), per review recommendation.

Changes

  • chunkEncoder.ts: encodeFramesChunkedConcat now takes the same trailing config?: Partial<Pick<EngineConfig, "ffmpegEncodeTimeout">> parameter as encodeFramesFromDir, and both inner ffmpeg invocations (each chunk + the final concat) run through the shared runFfmpeg helper, which already implements the spawn/stderr/timeout/abort pattern (SIGTERM on expiry). The timeout applies per ffmpeg invocation, not to the whole sequence — chunks are short, so only a genuinely hung process trips it. Default behavior without config is unchanged (DEFAULT_CONFIG.ffmpegEncodeTimeout). Net −12 lines.
  • Reusing runFfmpeg also wires the abort signal into the in-flight process: cancelling mid-chunk now kills the running ffmpeg instead of letting the chunk run to completion (previously the signal was only checked between chunks).
  • encodeStage.ts: resolves producerConfig ?? resolveConfig() once and threads it into both the chunked and non-chunked encode calls, so FFMPEG_ENCODE_TIMEOUT_MS now applies on the chunked path too.
  • Tests:
    • Producer: chunked-path case in encodeStage-encodeTimeoutConfig.test.ts (enableChunkedEncode: true) asserting the trailing config reaches encodeFramesChunkedConcat.
    • Engine: chunkEncoder.test.ts cases asserting encodeFramesChunkedConcat threads ffmpegEncodeTimeout into every runFfmpeg invocation (2 chunks + concat, 42_000 sentinel), plus the no-config default.

Verification

  • Engine chunkEncoder.test.ts — 65/65 pass (includes the new timeout-threading cases)
  • Producer encodeStage-encodeTimeoutConfig.test.ts — 4/4 pass
  • typecheck passes on engine and producer; oxlint / oxfmt --check clean; fallow audit gate clean on changed files

@jrusso1020 jrusso1020 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

REQUEST_CHANGES — the diff doesn't match the title/body. The title is fix(engine): add encode timeout to encodeFramesChunkedConcat ffmpeg invocations and the summary describes engine-level changes to chunkEncoder.ts (timeout + config? param on each chunk spawn + the final -f concat spawn). I see zero changes to packages/engine/src/services/chunkEncoder.ts in the diff — git show pr-1361 -- packages/engine/src/services/chunkEncoder.ts is empty.

What the PR actually contains:

  • The encodeStage.ts call-site change (already on the branch from #1352).
  • A new encodeStage-encodeTimeoutConfig.test.ts with 3 tests — producerConfig present, producerConfig absent → resolveConfig(), GIF two-pass. All test runEncodeStage against a mocked encodeFramesChunkedConcat stub, so the chunked path is never exercised. Both makeEncodeInput paths set enableChunkedEncode: false. The body's "added a chunked-path case... asserting the trailing config reaches encodeFramesChunkedConcat" doesn't match — there's no such case.

The new tests themselves are solid quality (good mock isolation, the GIF two-pass assertion is a nice catch beyond what #1352 had), but they lock the #1352 surface, not the chunked-concat surface this PR claims to fix.

Pick one:

  1. Add the actual engine changes — give encodeFramesChunkedConcat the same trailing config? param as encodeFramesFromDir, wrap each spawn("ffmpeg", ...) (the per-chunk loop AND the final -f concat -c copy invocation) in setTimeout → kill("SIGTERM") → clearTimeout matching chunkEncoder.ts:428-433's pattern. Then add a real chunked-path test (set enableChunkedEncode: true + assert the captured config flows through).
  2. Rescope the PR — retitle to test(producer): lock encode-stage config threading for ffmpegEncodeTimeout and rewrite the body to match. Leave the chunkEncoder fix for a separate PR.

Either lands cleanly; the current shape is a mismatch between intent and contents.

Also worth verifying once the engine changes land: timeout applies per-spawn vs total (PR body says per-invocation — the setTimeout reset on close is the existing pattern there and matches). And the final concat spawn often runs in seconds even on long renders since it's just stream-copy — a 600s default is generous, but make sure it doesn't fire prematurely on edge cases like 10k-chunk renders.

Review by Jerrai

@leorivastech

Copy link
Copy Markdown
Author

Good catch — apologies for the noise: the engine commit was rejected by the pre-commit audit (fallow flagged duplication in my hand-rolled timer blocks) and I pushed without noticing it hadn't landed.

d0dc402 now contains the actual changes:

  • encodeFramesChunkedConcat takes the same trailing config param as encodeFramesFromDir, and both ffmpeg invocations (each chunk + the final concat) now run through the shared runFfmpeg helper, which already implements the timeout (SIGTERM on expiry) — so FFMPEG_ENCODE_TIMEOUT_MS applies per invocation on this path too. Net −12 lines vs. duplicating the timer pattern.
  • Reusing runFfmpeg also wires the abort signal into the in-flight process, so the "abort mid-chunk" note from the PR body is addressed by the same change (body updated).
  • The chunked-path test is in: enableChunkedEncode: true asserting the captured trailing config reaches encodeFramesChunkedConcat — 4/4 passing, engine suite 63/63.

On the concat-timeout note: the timeout applies per invocation and the concat is stream-copy, so the 600s default leaves generous headroom even at high chunk counts; FFMPEG_ENCODE_TIMEOUT_MS raises it if an extreme case ever needs it.

@miguel-heygen miguel-heygen left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Verdict: real fix, worth merging (supersedes #1352)

What it fixes

  1. encodeFramesFromDir missing config (#1348): passes job.config.producerConfig ?? resolveConfig() as the 6th arg so FFMPEG_ENCODE_TIMEOUT_MS is actually honored. Without this, every non-chunked encode > 600s gets SIGTERM'd.

  2. encodeFramesChunkedConcat raw spawn with no timeout: the chunked path was using bare spawn("ffmpeg", args) — no abort signal propagation, no timeout. Replaced with runFfmpeg(args, { signal, timeout: encodeTimeout }), which is what every other FFmpeg invocation uses. This is arguably a bigger fix than (1) since the chunked path had zero timeout management.

  3. Cleanup: DRYs the error result construction into a fail() helper, removing ~60 lines of duplicated boilerplate.

Test coverage

Tests are solid — 4 cases covering:

  • producerConfig present → threads custom timeout (no resolveConfig call)
  • producerConfig absent → falls back to resolveConfig()
  • Chunked path → config threaded to encodeFramesChunkedConcat
  • GIF path → both runFfmpeg passes get the resolved timeout

What's missing

No unit test for the encodeFramesChunkedConcat implementation change itself — the test mocks encodeFramesChunkedConcat at the call site level but doesn't verify that runFfmpeg inside chunkEncoder.ts actually receives the timeout. Consider adding a test in packages/engine/src/services/chunkEncoder.test.ts that:

  1. Spawns encodeFramesChunkedConcat with a custom config { ffmpegEncodeTimeout: 42_000 }
  2. Mocks runFfmpeg and asserts each chunk invocation + the concat invocation receives timeout: 42_000

This would catch regressions if someone refactors the runFfmpeg call inside encodeFramesChunkedConcat without threading config.

Producer baseline tests are not needed — this is a timeout configuration fix, not a visual output change. Frame content is identical.

Relationship to #1352

This PR supersedes #1352 (same author). #1352 only fixes encodeFramesFromDir; this one fixes both paths. Recommend closing #1352 if this merges.

@leorivastech

Copy link
Copy Markdown
Author

Added the engine-side test suggested in review (96722d4): encodeFramesChunkedConcat with ffmpegEncodeTimeout: 42_000 asserting all three ffmpeg invocations (2 chunks + concat) receive the configured timeout, plus the no-config default case. 65/65 passing. Body updated with Fixes #1348 since this PR supersedes #1352.

…ENCODE_TIMEOUT_MS is honored

encodeFramesFromDir takes a trailing config param carrying
ffmpegEncodeTimeout, but the encode stage called it with only 5 of its
6 arguments, so the timeout always fell back to the hardcoded 600000ms
default and FFMPEG_ENCODE_TIMEOUT_MS was silently ignored. Any encode
exceeding 600s wall time (e.g. vp9 alpha on CPU-only machines) was
deterministically killed with signal 15.

Pass `job.config.producerConfig ?? resolveConfig()` at the call site,
matching the existing pattern in renderOrchestrator. The GIF encode
path had the same flaw — falling back to DEFAULT_CONFIG instead of the
env-aware resolveConfig() — and is fixed the same way.

Fixes heygen-com#1348
…meout

Covers the heygen-com#1348 regression surface: asserts the encode stage passes
producerConfig as the trailing config argument to encodeFramesFromDir
when present, falls back to resolveConfig() for in-process renders, and
threads the resolved timeout into both GIF encode passes. Verified the
suite fails (3/3) against the pre-fix call site.
…nvocations

The chunked-concat encode path had no timeout at all: neither the
per-chunk ffmpeg spawns nor the final concat spawn were guarded, so a
hung ffmpeg process stalled the render forever. Both spawns now go
through the shared runFfmpeg helper, which already implements the
timer + SIGTERM + abort-signal pattern, with the timeout resolved from
the new trailing config parameter (ffmpegEncodeTimeout, env-overridable
via FFMPEG_ENCODE_TIMEOUT_MS) and defaulting to DEFAULT_CONFIG.

Reusing runFfmpeg also wires the abort signal into the in-flight
process, so cancelling mid-chunk now kills ffmpeg instead of letting
the chunk run to completion.

The encode stage resolves producerConfig ?? resolveConfig() once and
threads it into both the chunked and non-chunked encode calls.
…meout to every invocation

Covers the implementation side of the chunked-path timeout fix: with a
custom config, each chunk encode and the final concat must receive the
configured timeout through runFfmpeg; without config, the default
600000ms applies. Locks the contract so a refactor of the runFfmpeg
calls inside encodeFramesChunkedConcat cannot silently drop the
config threading.
@leorivastech leorivastech force-pushed the fix/chunked-encode-timeout branch from 96722d4 to f991503 Compare June 12, 2026 06:28
@leorivastech

Copy link
Copy Markdown
Author

Rebased onto main to resolve the conflict with #1365 — no functional changes; the runFfmpeg routing picks up the new getFfmpegBinary() resolution automatically.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

encodeFramesFromDir is called without its config param — FFMPEG_ENCODE_TIMEOUT_MS is silently ignored, every encode > 600s is killed (signal 15)

3 participants