Gracefully drain egress on high CPU before hard kill #1247
Open
chalapathi444 wants to merge 1 commit into
Adds a graceful-stop path that drains and uploads an egress when CPU load crosses a soft threshold (0.85), before the existing hard-kill threshold (0.95). A graceful stop sends EOS with a new "CPU limit reached" end reason, so the egress finishes with status COMPLETE and its output is uploaded, instead of being killed and marked FAILED.

- cmd/server listens for SIGTERM and calls Handler.GracefulStop, leaving SIGINT for the existing immediate Kill path.
- ProcessManager.GracefulStop signals SIGTERM to the handler process, triggering a pipeline drain + upload.
- The monitor escalates by CPU load: drain one egress once load holds above the soft threshold, and switch the hard-threshold action from Kill to GracefulStop so output is preserved where possible.
What
Today, when an instance crosses the CPU kill threshold (0.95), the monitor calls KillProcess, which terminates the offending egress and reports it as FAILED, losing whatever was recorded.

This PR adds a graceful path that drains and uploads the egress instead, so it finishes as COMPLETE with its output preserved, and only falls back to the existing hard kill when load is critical.
How
- Soft threshold (0.85): once load holds above it for a few update cycles, the monitor picks the highest-CPU egress and gracefully stops it.
- Hard threshold (0.95): the action is switched from KillProcess to GracefulStop as well, so output is preserved wherever possible (the SIGINT/Kill path is still available for true emergencies).
- cmd/server now listens for SIGTERM → Handler.GracefulStop, leaving SIGINT for the existing immediate Kill.
- ProcessManager.GracefulStop signals SIGTERM to the handler process, triggering a normal pipeline drain + upload.
- A new end reason "CPU limit reached" (types.EndReasonCPULimit) is set on EOS so webhooks can distinguish CPU-triggered stops from a StopEgress API call. It only annotates the result; the ACTIVE → ENDING → COMPLETE flow is unchanged.

The generated servicefakes fake is updated for the new ProcessManager method.

Notes
- Behavior below 0.85 is unchanged.
- gofmt clean. Verified the code compiles and interfaces are satisfied locally; I was not able to run the GStreamer-dependent integration tests in my environment, so a CI run / maintainer test is appreciated.
- The thresholds (0.85/0.95) are currently constants; happy to move them into CPUCostConfig if you'd prefer them configurable.
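For reviewers, here is a rough sketch of the escalation logic with the thresholds pulled into a config struct, as floated in the last note. It is not the code in this PR: the `thresholds` and `monitor` types, the `HoldCycles` value of 3, the counter reset after a drain, and the `>=` comparison at the hard threshold are all assumptions made for illustration. Only the 0.85 and 0.95 values and the "highest-CPU egress" selection come from the description.

```go
package main

import "fmt"

// thresholds is a hypothetical config shape; the PR keeps 0.85 and 0.95
// as constants.
type thresholds struct {
	Soft       float64 // drain one egress once load holds above this
	Hard       float64 // act immediately at or above this (assumed comparison)
	HoldCycles int     // consecutive updates above Soft before draining (assumed)
}

// monitor tracks how long load has stayed above the soft threshold.
type monitor struct {
	cfg   thresholds
	above int // consecutive updates with load above cfg.Soft
}

// update consumes one load sample and the per-egress CPU usage. It returns
// the ID of the egress to stop gracefully, or "" when nothing should happen.
func (m *monitor) update(load float64, usage map[string]float64) string {
	if load <= m.cfg.Soft {
		m.above = 0
		return ""
	}
	m.above++
	if load < m.cfg.Hard && m.above < m.cfg.HoldCycles {
		return "" // above soft, but not for long enough yet
	}
	m.above = 0 // assumed: restart the count so the drain has time to take effect
	return highestCPU(usage)
}

// highestCPU returns the ID with the largest CPU usage, breaking ties by
// the smaller ID so the result does not depend on map iteration order.
func highestCPU(usage map[string]float64) string {
	best, bestCPU := "", -1.0
	for id, cpu := range usage {
		if cpu > bestCPU || (cpu == bestCPU && id < best) {
			best, bestCPU = id, cpu
		}
	}
	return best
}

func main() {
	m := &monitor{cfg: thresholds{Soft: 0.85, Hard: 0.95, HoldCycles: 3}}
	usage := map[string]float64{"egress-a": 0.5, "egress-b": 1.5}
	for _, load := range []float64{0.80, 0.90, 0.90, 0.90, 0.96} {
		if id := m.update(load, usage); id != "" {
			fmt.Printf("load %.2f: gracefully stop %s\n", load, id)
		} else {
			fmt.Printf("load %.2f: no action\n", load)
		}
	}
}
```

With this shape, both thresholds lead to the same graceful-stop action; the hard threshold only skips the hold period.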