Gracefully drain egress on high CPU before hard kill #1247

Open
chalapathi444 wants to merge 1 commit into livekit:main from chalapathi444:feat/graceful-cpu-stop

@chalapathi444

What

Today, when an instance crosses the CPU kill threshold (0.95), the monitor calls KillProcess, which terminates the offending egress and reports it as FAILED — losing whatever was recorded.

This PR adds a graceful path that drains and uploads the egress instead, so it finishes as COMPLETE with its output preserved, and only falls back to the existing hard kill when load is critical.

How

  • Soft threshold (0.85): once load holds above it for a few update cycles, the monitor picks the highest-CPU egress and gracefully stops it.
  • Hard threshold (0.95): the action is switched from KillProcess to GracefulStop as well, so output is preserved wherever possible (the SIGINT/Kill path is still available for true emergencies).
  • cmd/server now listens for SIGTERM and calls Handler.GracefulStop, leaving SIGINT for the existing immediate Kill.
  • ProcessManager.GracefulStop signals SIGTERM to the handler process, triggering a normal pipeline drain + upload.
  • A new end reason "CPU limit reached" (types.EndReasonCPULimit) is set on EOS so webhooks can distinguish CPU-triggered stops from a StopEgress API call. It only annotates the result; the ACTIVE → ENDING → COMPLETE flow is unchanged.
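
A minimal sketch of the escalation described in the list above, not the actual diff. The `escalator` type, the `highestCPU` helper, and `softCycles = 3` (standing in for "a few update cycles") are assumptions for illustration; only the 0.85 / 0.95 values come from this PR.

```go
package main

import "sort"

const (
	softThreshold = 0.85 // soft threshold from this PR
	hardThreshold = 0.95 // hard threshold from this PR
	softCycles    = 3    // assumption: stands in for "a few update cycles"
)

// escalator counts consecutive update cycles with load above the soft threshold.
// Hypothetical type; the real monitor keeps its own state.
type escalator struct {
	cyclesAbove int
}

// update returns the ID of the egress to stop gracefully, or "" for no action.
// usage maps egress ID to that egress's CPU usage.
func (e *escalator) update(load float64, usage map[string]float64) string {
	if load <= softThreshold {
		e.cyclesAbove = 0
		return ""
	}
	e.cyclesAbove++
	if load <= hardThreshold && e.cyclesAbove < softCycles {
		return "" // above the soft threshold, but not for long enough yet
	}
	e.cyclesAbove = 0 // start counting again after requesting a drain
	return highestCPU(usage)
}

// highestCPU picks the egress with the largest usage; ties go to the
// lexicographically smallest ID so the choice is deterministic.
func highestCPU(usage map[string]float64) string {
	ids := make([]string, 0, len(usage))
	for id := range usage {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	best := ""
	for _, id := range ids {
		if best == "" || usage[id] > usage[best] {
			best = id
		}
	}
	return best
}
```

In this sketch a load above the hard threshold acts on the first cycle, while a load between the two thresholds must persist for `softCycles` consecutive cycles; both cases end in a graceful stop rather than a kill.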

The generated servicefakes fake is updated for the new ProcessManager method.
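
For reviewers, the signal split between SIGTERM and SIGINT could look roughly like the sketch below. It assumes a Unix target; `dispatch`, `watchSignals`, `signalGracefulStop`, and the two callbacks are hypothetical names, not the functions in this PR.

```go
package main

import (
	"os"
	"os/signal"
	"syscall"
)

// Signals as described in this PR: SIGTERM drains, SIGINT kills immediately.
var (
	gracefulSignal os.Signal = syscall.SIGTERM
	killSignal     os.Signal = syscall.SIGINT
)

// dispatch routes one signal to the matching callback and reports
// whether the signal was handled.
func dispatch(sig os.Signal, gracefulStop, kill func()) bool {
	switch sig {
	case gracefulSignal:
		gracefulStop() // drain the pipeline and upload the output
		return true
	case killSignal:
		kill() // existing immediate path
		return true
	}
	return false
}

// watchSignals is the handler-side loop: it blocks until done is closed,
// dispatching every SIGTERM or SIGINT it receives.
func watchSignals(gracefulStop, kill func(), done <-chan struct{}) {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, gracefulSignal, killSignal)
	defer signal.Stop(ch)
	for {
		select {
		case sig := <-ch:
			dispatch(sig, gracefulStop, kill)
		case <-done:
			return
		}
	}
}

// signalGracefulStop is the manager-side half: it asks a handler
// process to drain by sending it SIGTERM.
func signalGracefulStop(p *os.Process) error {
	return p.Signal(syscall.SIGTERM)
}
```

Keeping the routing in a small pure function makes the mapping testable without delivering real signals.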

Notes

  • Behavior below 0.85 is unchanged.
  • gofmt clean. Verified the code compiles and interfaces are satisfied locally; I was not able to run the GStreamer-dependent integration tests in my environment, so a CI run / maintainer test is appreciated.
  • Thresholds (0.85 / 0.95) are currently constants — happy to move them into CPUCostConfig if you'd prefer them configurable.
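
If configurable thresholds are preferred, one possible shape is sketched below. `CPUThresholds`, its field names, and the YAML keys are hypothetical; they are not existing fields of CPUCostConfig, and the defaults simply mirror the constants in this PR.

```go
package main

import "fmt"

// CPUThresholds is a hypothetical config block for the two load thresholds.
type CPUThresholds struct {
	Soft float64 `yaml:"soft_threshold"` // graceful drain after sustained load
	Hard float64 `yaml:"hard_threshold"` // act immediately
}

// WithDefaults fills unset (zero) fields with the constants used in this PR.
func (c CPUThresholds) WithDefaults() CPUThresholds {
	if c.Soft == 0 {
		c.Soft = 0.85
	}
	if c.Hard == 0 {
		c.Hard = 0.95
	}
	return c
}

// Validate rejects thresholds outside (0, 1] and requires Soft < Hard,
// so the soft path always triggers before the hard one.
func (c CPUThresholds) Validate() error {
	if c.Soft <= 0 || c.Hard > 1 || c.Soft >= c.Hard {
		return fmt.Errorf("invalid CPU thresholds: soft=%v hard=%v", c.Soft, c.Hard)
	}
	return nil
}
```

An empty config keeps today's behavior through `WithDefaults`, and `Validate` catches a soft threshold set at or above the hard one.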

Adds a graceful-stop path that drains and uploads an egress when CPU
load crosses a soft threshold (0.85), before the existing hard-kill
threshold (0.95). A graceful stop sends EOS with a new "CPU limit
reached" end reason, so the egress finishes with status COMPLETE and
its output is uploaded, instead of being killed and marked FAILED.

- cmd/server listens for SIGTERM and calls Handler.GracefulStop,
  leaving SIGINT for the existing immediate Kill path.
- ProcessManager.GracefulStop signals SIGTERM to the handler process,
  triggering a pipeline drain + upload.
- The monitor escalates by CPU load: drain one egress once load holds
  above the soft threshold, and switch the hard-threshold action from
  Kill to GracefulStop so output is preserved where possible.
chalapathi444 requested a review from a team as a code owner June 6, 2026 06:45
@CLAassistant

CLAassistant commented Jun 6, 2026

CLA assistant check
All committers have signed the CLA.

