Skip to content

fix(envd): eliminate OOM score and nice priority race in child processes#2028

Open
arkamar wants to merge 6 commits intomainfrom
fix/envd-oom-race
Open

fix(envd): eliminate OOM score and nice priority race in child processes#2028
arkamar wants to merge 6 commits intomainfrom
fix/envd-oom-race

Conversation

@arkamar
Copy link
Contributor

@arkamar arkamar commented Feb 28, 2026

Summary

  • Wraps user commands in a shell that sets oom_score_adj and resets nice priority
    before exec, eliminating the race where grandchildren could inherit envd's
    protected OOM score (-1000) or elevated CPU priority (nice -20)
  • Sets Nice=-20 on the envd systemd service for improved daemon responsiveness under load
  • Logs the user's original command instead of the internal wrapper in all error messages
    and structured logs

Motivation

Previously, adjustOomScore and resetNice were called after cmd.Start(),
leaving a race window where the child process (and any grandchildren it spawns quickly)
could run with envd's inherited OOM score of -1000 and nice of -20. Under specific
scenarios — e.g. shells that immediately fork subprocesses — grandchildren would be
born before the parent's values were corrected, making them effectively immune to the
OOM killer and running at elevated CPU priority.

The fix uses an inline /bin/sh wrapper:

echo 100 > /proc/$$/oom_score_adj && exec nice -n 0 "$@"

This sets the OOM score and nice value atomically before exec-ing the user's command,
so all descendants inherit the correct values from the very first instruction.

Note on the logging commit

The commit that strips the wrapper from logged commands (cf306862e) can be excluded
from this PR if desired. It is included because without it, users would see confusing
log output like /bin/sh -c echo 100 > /proc/$$/oom_score_adj && exec nice -n 0 ...
instead of their actual command. The tradeoff is that if the command fails inside the
wrapper (e.g. /bin/sh itself errors), the error message won't show the full wrapped
command. I believe the cleaner user experience is worth it in practice.

Obsoletes

Obsoletes #2003 — this PR subsumes the nice priority feature from that PR and fixes
the race condition in its resetNice approach.

Closes: #2003


Note

Cursor Bugbot is generating a summary for commit 9336012. Configure here.

The previous approach set the child's oom_score_adj after cmd.Start(),
leaving a race window where grandchildren could spawn and inherit the
parent's protected OOM score (-1000) before it was corrected to 100.

Replace the post-start adjustOomScore call with an inline /bin/sh wrapper
that writes oom_score_adj before exec-ing the user command, ensuring all
descendants inherit the correct score from the start.
…rocesses

Set Nice=-20 in envd.service to improve responsiveness under load.
Reset nice to the default level (0) for user processes, PTY processes,
and socat port-forwarding processes so only envd itself benefits from
the elevated priority.
Error messages and log entries were showing the full wrapped command
(sh -c 'echo 100 > /proc/...' ...) instead of the user's actual command.
Extract userCommand() helper to format the original cmd+args for logging.
Copy link

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 93360127dc

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

// Wrap the command in a shell that sets the OOM score and nice value before exec-ing the actual command.
// This eliminates the race window where grandchildren could inherit the parent's protected OOM score (-1000)
// or high CPU priority (nice -20) before the post-start calls had a chance to correct them.
oomWrapperScript := fmt.Sprintf(`echo %d > /proc/$$/oom_score_adj && exec nice -n %d "${@}"`, defaultOomScore, defaultNice)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If writing to /proc/$$/oom_score_adj fails (e.g., capability restrictions in certain environments), the && causes the user's command to silently not run at all. The previous adjustOomScore logged the error and continued executing. Consider using ; instead of &&, or redirecting stderr and ignoring the write failure, to preserve the old behavior where an OOM adjustment failure is non-fatal.

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

@arkamar arkamar marked this pull request as draft February 28, 2026 14:15
nice(1) applies a relative adjustment, not absolute. Since envd runs at
nice -20, 'nice -n 0' was a no-op. Read the current nice via Getpriority
and compute the delta to defaultNice, so child processes are correctly
reset to nice 0. Also use absolute path /usr/bin/nice to avoid PATH issues.
@arkamar arkamar marked this pull request as ready for review February 28, 2026 20:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants