fix: detect conflicting host kubelet before gateway start#433
fix: detect conflicting host kubelet before gateway start#433BenediktSchackenberg wants to merge 1 commit intoNVIDIA:mainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdded host kubelet detection and warning: new Changes
Sequence Diagram(s)sequenceDiagram
rect rgba(220,230,241,0.5)
participant User as User (TTY / Non‑TTY)
participant Script as Setup Script
participant Host as Host (processes / systemd)
participant Docker as Docker / container runtime
end
Script->>Host: detect_kubelet_conflict() — probe for kubelet/kubelite/k3s,\nmicrok8s status, systemd services
alt conflict found
Script->>User: warn_kubelet_conflict(detail) — emit multi-line guidance
alt interactive (TTY)
User-->>Script: reply (Y/N)
alt confirmed (Y)
Script->>Docker: proceed with cgroupns change / restart
else not confirmed
Script-->>User: abort/exit
end
else non-interactive
Script-->>User: fail fast with stop/retry instructions
end
else no conflict
Script->>Docker: proceed with cgroupns change / restart
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Pull request overview
Adds host-side preflight detection of conflicting Kubernetes kubelets to reduce openshell gateway start CrashLoopBackOff scenarios caused by cgroupns=host cgroup contention (kubepods), per issue #431.
Changes:
- Add a non-blocking kubelet/k3s warning in
scripts/setup.shbefore starting the gateway. - Add a blocking (with interactive override) conflicting-kubelet detector in
scripts/setup-spark.shbefore configuring Docker forcgroupns=host.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| scripts/setup.sh | Adds a lightweight preflight warning if a host kubelet/k3s is detected prior to gateway start. |
| scripts/setup-spark.sh | Adds a preflight detector that aborts (or prompts) when a conflicting host Kubernetes is running before applying Docker cgroupns changes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
scripts/setup-spark.sh
Outdated
| # Check for k3s service | ||
| if systemctl is-active --quiet k3s 2>/dev/null; then | ||
| found="k3s service is active" | ||
| fi |
There was a problem hiding this comment.
The k3s detection only checks systemctl is-active k3s, which can miss hosts running k3s as k3s-agent (common on agent-only nodes) or as a k3s process without that specific unit. Consider checking k3s-agent as well (and/or pgrep -x k3s) so the conflict warning triggers reliably for k3s installations.
There was a problem hiding this comment.
Good catch — added pgrep -x k3s and systemctl is-active --quiet k3s-agent to both scripts.
scripts/setup-spark.sh
Outdated
| warn "" | ||
|
|
||
| if [ -t 0 ]; then | ||
| read -rp "Continue anyway? [y/N] " reply |
There was a problem hiding this comment.
Because the script runs with set -e, read -rp ... will cause an immediate exit on EOF/error (e.g., user hits Ctrl-D) before your intended abort message runs. Handle read failure explicitly (e.g., if ! read ...; then fail ...; fi) so the script exits with a clear, consistent error.
| read -rp "Continue anyway? [y/N] " reply | |
| if ! read -rp "Continue anyway? [y/N] " reply; then | |
| fail "Aborted. Stop the conflicting Kubernetes service and retry." | |
| fi |
There was a problem hiding this comment.
Fixed — read failure is now handled explicitly with if ! read ... so Ctrl-D/EOF gets a clean error message instead of a cryptic set -e exit.
scripts/setup.sh
Outdated
| if pgrep -x kubelet > /dev/null 2>&1 || pgrep -x kubelite > /dev/null 2>&1 || systemctl is-active --quiet k3s 2>/dev/null; then | ||
| warn "⚠️ A Kubernetes kubelet is running on this host." | ||
| warn "The gateway's embedded k3s may conflict over cgroup paths (kubepods)." | ||
| warn "If the gateway fails with CrashLoopBackOff, stop the host Kubernetes first:" | ||
| warn " sudo microk8s stop / sudo systemctl stop k3s / sudo systemctl stop kubelet" | ||
| fi |
There was a problem hiding this comment.
This preflight check looks for kubelet/kubelite processes and an active k3s systemd unit, but it can miss k3s installations running as k3s-agent (and/or just a k3s process). Consider checking systemctl is-active --quiet k3s-agent as well (and/or pgrep -x k3s) to avoid false negatives when k3s is the conflicting host Kubernetes.
There was a problem hiding this comment.
Addressed in the same commit — both pgrep -x k3s and k3s-agent systemd unit are now checked.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
scripts/setup.sh (1)
95-102: Consider centralizing kubelet-conflict detection to avoid drift.This block now overlaps with
scripts/setup-spark.shbut with different checks/behavior. Moving shared detection/message formatting intoscripts/lib/runtime.shwould keep both setup paths consistent and easier to maintain.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/setup.sh` around lines 95 - 102, The kubelet-conflict detection block in scripts/setup.sh is duplicated with different behavior in scripts/setup-spark.sh; extract the logic into a shared function placed in scripts/lib/runtime.sh (e.g., detect_kubelet_conflict() which runs the pgrep/systemctl checks and returns status, and warn_kubelet_conflict() which emits the standardized warn messages), then replace the inline checks in both scripts (setup.sh and setup-spark.sh) with calls to detect_kubelet_conflict && warn_kubelet_conflict to keep detection and message formatting centralized and consistent across both setup paths.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@scripts/setup.sh`:
- Around line 95-102: The kubelet-conflict detection block in scripts/setup.sh
is duplicated with different behavior in scripts/setup-spark.sh; extract the
logic into a shared function placed in scripts/lib/runtime.sh (e.g.,
detect_kubelet_conflict() which runs the pgrep/systemctl checks and returns
status, and warn_kubelet_conflict() which emits the standardized warn messages),
then replace the inline checks in both scripts (setup.sh and setup-spark.sh)
with calls to detect_kubelet_conflict && warn_kubelet_conflict to keep detection
and message formatting centralized and consistent across both setup paths.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7e8e7be6-e6bc-40c7-a32e-aa55b532778d
📒 Files selected for processing (2)
scripts/setup-spark.shscripts/setup.sh
|
Addressed CodeRabbit's nitpick — extracted |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
scripts/setup-spark.sh (1)
67-87: Consider moving conflict preflight before user/group mutation.Right now, a detected conflict can abort after
usermod -aG dockerhas already changed host state. Running this preflight earlier would make failure paths fully fail-fast and side-effect free.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/setup-spark.sh` around lines 67 - 87, Move the Kubernetes conflict preflight (the detect_kubelet_conflict check and its warn_kubelet_conflict/interactive fail path) to run before any host user/group mutation occurs (e.g., before the usermod -aG docker invocation); specifically, relocate the entire if detect_kubelet_conflict ... fi block so it executes prior to any calls that change system state, ensuring the script fails fast and leaves the host unmodified when a conflict is detected.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@scripts/lib/runtime.sh`:
- Line 260: The assignment of local detail in warn_kubelet_conflict() uses
${1:-$KUBELET_CONFLICT_DETAIL} which can trigger an unbound-variable error under
set -u if KUBELET_CONFLICT_DETAIL is not defined; change the assignment to use a
nested default so it falls back to an empty value when both $1 and
KUBELET_CONFLICT_DETAIL are unset (i.e., set detail to first argument, otherwise
to KUBELET_CONFLICT_DETAIL, otherwise to an empty string) so the function is
defensive when called before detect_kubelet_conflict() initializes the variable.
---
Nitpick comments:
In `@scripts/setup-spark.sh`:
- Around line 67-87: Move the Kubernetes conflict preflight (the
detect_kubelet_conflict check and its warn_kubelet_conflict/interactive fail
path) to run before any host user/group mutation occurs (e.g., before the
usermod -aG docker invocation); specifically, relocate the entire if
detect_kubelet_conflict ... fi block so it executes prior to any calls that
change system state, ensuring the script fails fast and leaves the host
unmodified when a conflict is detected.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e6b4e37b-5c8b-48e1-ad11-68542573dd65
📒 Files selected for processing (3)
scripts/lib/runtime.shscripts/setup-spark.shscripts/setup.sh
✅ Files skipped from review due to trivial changes (1)
- scripts/setup.sh
|
All bot feedback addressed across 4 commits:
The final diff is clean:
Detects: Ready for human review 🙂 |
7303195 to
c6cb2da
Compare
|
@cv Rebased onto latest main (clean, 4 commits → 3 files). All review comments from Copilot were already addressed in earlier commits (k3s-agent detection, read EOF handling). Would appreciate a CI trigger when you get a chance — first-time contributor. Thanks! |
51f3d32 to
6342c4e
Compare
|
Rebased onto latest main — squashed to 1 clean commit with conventional format and Signed-off-by. Note: |
6342c4e to
2a9c962
Compare
Signed-off-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>
2a9c962 to
6abdffe
Compare
Summary
When another Kubernetes distribution (MicroK8s, k3s, kubeadm) is running on the host, the gateway's embedded k3s inside Docker (started with
--cgroupns=host) fights over/sys/fs/cgroup/kubepodswith the host kubelet. Both kubelets see each other's pods as orphaned and kill them, resulting in rapid CrashLoopBackOff cycles.Changes
setup-spark.sh:
detect_conflicting_kubelet()function that checks for:kubelet/kubeliteprocessesmicrok8s status)setup.sh:
openshell gateway startWhy not just switch to
--cgroupns=private?That would be the ideal long-term fix (suggested in #431), but it requires changes in OpenShell's gateway container startup logic, which is outside this repo's scope. The preflight check is the safest fix we can ship now without upstream changes.
Fixes #431
AI-assisted: Built with Claude, reviewed by human.
Summary by CodeRabbit