Skip to content

fix: detect conflicting host kubelet before gateway start#433

Open
BenediktSchackenberg wants to merge 1 commit intoNVIDIA:mainfrom
BenediktSchackenberg:fix/kubelet-cgroup-conflict
Open

fix: detect conflicting host kubelet before gateway start#433
BenediktSchackenberg wants to merge 1 commit intoNVIDIA:mainfrom
BenediktSchackenberg:fix/kubelet-cgroup-conflict

Conversation

@BenediktSchackenberg
Copy link
Copy Markdown
Contributor

@BenediktSchackenberg BenediktSchackenberg commented Mar 19, 2026

Summary

When another Kubernetes distribution (MicroK8s, k3s, kubeadm) is running on the host, the gateway's embedded k3s inside Docker (started with --cgroupns=host) fights over /sys/fs/cgroup/kubepods with the host kubelet. Both kubelets see each other's pods as orphaned and kill them, resulting in rapid CrashLoopBackOff cycles.

Changes

setup-spark.sh:

  • Added detect_conflicting_kubelet() function that checks for:
    • Running kubelet/kubelite processes
    • Active MicroK8s installation (microk8s status)
    • Active k3s systemd service
  • Interactive: prompts user to stop the conflicting service or abort
  • Non-interactive: fails with clear instructions

setup.sh:

  • Added lightweight preflight warning (non-blocking) before openshell gateway start
  • Warns about detected kubelet but doesn't block — the user may know what they're doing

Why not just switch to --cgroupns=private?

That would be the ideal long-term fix (suggested in #431), but it requires changes in OpenShell's gateway container startup logic, which is outside this repo's scope. The preflight check is the safest fix we can ship now without upstream changes.

Fixes #431

AI-assisted: Built with Claude, reviewed by human.

Summary by CodeRabbit

  • New Features
    • Pre-flight detection of a running host Kubernetes (kubelet/k3s/MicroK8s) now warns about cgroup namespace conflicts that can cause pods to CrashLoopBackOff.
    • Warnings appear before restart/destroy actions; interactive runs prompt to continue, non-interactive runs fail with clear guidance to stop the conflicting Kubernetes or re-run interactively.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 19, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Added host kubelet detection and warning: new detect_kubelet_conflict() and warn_kubelet_conflict() in scripts/lib/runtime.sh. scripts/setup-spark.sh now sources the helpers and runs an interactive/non‑interactive preflight that may abort on conflict. scripts/setup.sh emits a non‑blocking warning when a host kubelet is detected.

Changes

Cohort / File(s) Summary
Runtime helpers
scripts/lib/runtime.sh
Added detect_kubelet_conflict() to detect host kubelet/kubelite/k3s, microk8s running status, and relevant systemd services; added warn_kubelet_conflict() and KUBELET_CONFLICT_DETAIL for standardized warnings.
Spark setup gating
scripts/setup-spark.sh
Now resolves script dir and sources the runtime helpers; runs detect_kubelet_conflict before changing default-cgroupns-mode; on conflict calls warn_kubelet_conflict "$KUBELET_CONFLICT_DETAIL" and either prompts and may abort in TTY, or fails fast in non‑TTY.
Gateway setup warning
scripts/setup.sh
Added an early, non‑blocking detect_kubelet_conflict invocation that emits warn_kubelet_conflict "$KUBELET_CONFLICT_DETAIL" prior to gateway restart/destroy steps; does not gate or change restart flow.

Sequence Diagram(s)

sequenceDiagram
  rect rgba(220,230,241,0.5)
    participant User as User (TTY / Non‑TTY)
    participant Script as Setup Script
    participant Host as Host (processes / systemd)
    participant Docker as Docker / container runtime
  end

  Script->>Host: detect_kubelet_conflict() — probe for kubelet/kubelite/k3s,\nmicrok8s status, systemd services
  alt conflict found
    Script->>User: warn_kubelet_conflict(detail) — emit multi-line guidance
    alt interactive (TTY)
      User-->>Script: reply (Y/N)
      alt confirmed (Y)
        Script->>Docker: proceed with cgroupns change / restart
      else not confirmed
        Script-->>User: abort/exit
      end
    else non-interactive
      Script-->>User: fail fast with stop/retry instructions
    end
  else no conflict
    Script->>Docker: proceed with cgroupns change / restart
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I sniffed the host for kubelets in the night,
Found tiny clashes that made containers fight.
I called out a warning, then waited for say,
Prompting in terminal or failing the way.
Hop lightly, restart, and keep clusters bright.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix: detect conflicting host kubelet before gateway start' accurately and concisely summarizes the main change: adding kubelet conflict detection before the gateway starts.
Linked Issues check ✅ Passed The PR implements mitigation #3 from issue #431: detecting conflicting host Kubernetes distributions (MicroK8s, k3s, kubeadm) and warning/blocking on detection with clear remediation instructions.
Out of Scope Changes check ✅ Passed All changes are directly related to adding kubelet conflict detection: two new utility functions in lib/runtime.sh and their integration into setup-spark.sh and setup.sh. No unrelated modifications present.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds host-side preflight detection of conflicting Kubernetes kubelets to reduce openshell gateway start CrashLoopBackOff scenarios caused by cgroupns=host cgroup contention (kubepods), per issue #431.

Changes:

  • Add a non-blocking kubelet/k3s warning in scripts/setup.sh before starting the gateway.
  • Add a blocking (with interactive override) conflicting-kubelet detector in scripts/setup-spark.sh before configuring Docker for cgroupns=host.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
scripts/setup.sh Adds a lightweight preflight warning if a host kubelet/k3s is detected prior to gateway start.
scripts/setup-spark.sh Adds a preflight detector that aborts (or prompts) when a conflicting host Kubernetes is running before applying Docker cgroupns changes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +88 to +91
# Check for k3s service
if systemctl is-active --quiet k3s 2>/dev/null; then
found="k3s service is active"
fi
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The k3s detection only checks systemctl is-active k3s, which can miss hosts running k3s as k3s-agent (common on agent-only nodes) or as a k3s process without that specific unit. Consider checking k3s-agent as well (and/or pgrep -x k3s) so the conflict warning triggers reliably for k3s installations.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — added pgrep -x k3s and systemctl is-active --quiet k3s-agent to both scripts.

warn ""

if [ -t 0 ]; then
read -rp "Continue anyway? [y/N] " reply
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the script runs with set -e, read -rp ... will cause an immediate exit on EOF/error (e.g., user hits Ctrl-D) before your intended abort message runs. Handle read failure explicitly (e.g., if ! read ...; then fail ...; fi) so the script exits with a clear, consistent error.

Suggested change
read -rp "Continue anyway? [y/N] " reply
if ! read -rp "Continue anyway? [y/N] " reply; then
fail "Aborted. Stop the conflicting Kubernetes service and retry."
fi

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed — read failure is now handled explicitly with if ! read ... so Ctrl-D/EOF gets a clean error message instead of a cryptic set -e exit.

scripts/setup.sh Outdated
Comment on lines +97 to +102
if pgrep -x kubelet > /dev/null 2>&1 || pgrep -x kubelite > /dev/null 2>&1 || systemctl is-active --quiet k3s 2>/dev/null; then
warn "⚠️ A Kubernetes kubelet is running on this host."
warn "The gateway's embedded k3s may conflict over cgroup paths (kubepods)."
warn "If the gateway fails with CrashLoopBackOff, stop the host Kubernetes first:"
warn " sudo microk8s stop / sudo systemctl stop k3s / sudo systemctl stop kubelet"
fi
Copy link

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This preflight check looks for kubelet/kubelite processes and an active k3s systemd unit, but it can miss k3s installations running as k3s-agent (and/or just a k3s process). Consider checking systemctl is-active --quiet k3s-agent as well (and/or pgrep -x k3s) to avoid false negatives when k3s is the conflicting host Kubernetes.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in the same commit — both pgrep -x k3s and k3s-agent systemd unit are now checked.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
scripts/setup.sh (1)

95-102: Consider centralizing kubelet-conflict detection to avoid drift.

This block now overlaps with scripts/setup-spark.sh but with different checks/behavior. Moving shared detection/message formatting into scripts/lib/runtime.sh would keep both setup paths consistent and easier to maintain.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/setup.sh` around lines 95 - 102, The kubelet-conflict detection block
in scripts/setup.sh is duplicated with different behavior in
scripts/setup-spark.sh; extract the logic into a shared function placed in
scripts/lib/runtime.sh (e.g., detect_kubelet_conflict() which runs the
pgrep/systemctl checks and returns status, and warn_kubelet_conflict() which
emits the standardized warn messages), then replace the inline checks in both
scripts (setup.sh and setup-spark.sh) with calls to detect_kubelet_conflict &&
warn_kubelet_conflict to keep detection and message formatting centralized and
consistent across both setup paths.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@scripts/setup.sh`:
- Around line 95-102: The kubelet-conflict detection block in scripts/setup.sh
is duplicated with different behavior in scripts/setup-spark.sh; extract the
logic into a shared function placed in scripts/lib/runtime.sh (e.g.,
detect_kubelet_conflict() which runs the pgrep/systemctl checks and returns
status, and warn_kubelet_conflict() which emits the standardized warn messages),
then replace the inline checks in both scripts (setup.sh and setup-spark.sh)
with calls to detect_kubelet_conflict && warn_kubelet_conflict to keep detection
and message formatting centralized and consistent across both setup paths.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7e8e7be6-e6bc-40c7-a32e-aa55b532778d

📥 Commits

Reviewing files that changed from the base of the PR and between ffdccaf and 0214255.

📒 Files selected for processing (2)
  • scripts/setup-spark.sh
  • scripts/setup.sh

@BenediktSchackenberg
Copy link
Copy Markdown
Contributor Author

Addressed CodeRabbit's nitpick — extracted detect_kubelet_conflict() and warn_kubelet_conflict() into scripts/lib/runtime.sh. Both setup.sh and setup-spark.sh now call the shared functions instead of duplicating the detection logic.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
scripts/setup-spark.sh (1)

67-87: Consider moving conflict preflight before user/group mutation.

Right now, a detected conflict can abort after usermod -aG docker has already changed host state. Running this preflight earlier would make failure paths fully fail-fast and side-effect free.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/setup-spark.sh` around lines 67 - 87, Move the Kubernetes conflict
preflight (the detect_kubelet_conflict check and its
warn_kubelet_conflict/interactive fail path) to run before any host user/group
mutation occurs (e.g., before the usermod -aG docker invocation); specifically,
relocate the entire if detect_kubelet_conflict ... fi block so it executes prior
to any calls that change system state, ensuring the script fails fast and leaves
the host unmodified when a conflict is detected.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/lib/runtime.sh`:
- Line 260: The assignment of local detail in warn_kubelet_conflict() uses
${1:-$KUBELET_CONFLICT_DETAIL} which can trigger an unbound-variable error under
set -u if KUBELET_CONFLICT_DETAIL is not defined; change the assignment to use a
nested default so it falls back to an empty value when both $1 and
KUBELET_CONFLICT_DETAIL are unset (i.e., set detail to first argument, otherwise
to KUBELET_CONFLICT_DETAIL, otherwise to an empty string) so the function is
defensive when called before detect_kubelet_conflict() initializes the variable.

---

Nitpick comments:
In `@scripts/setup-spark.sh`:
- Around line 67-87: Move the Kubernetes conflict preflight (the
detect_kubelet_conflict check and its warn_kubelet_conflict/interactive fail
path) to run before any host user/group mutation occurs (e.g., before the
usermod -aG docker invocation); specifically, relocate the entire if
detect_kubelet_conflict ... fi block so it executes prior to any calls that
change system state, ensuring the script fails fast and leaves the host
unmodified when a conflict is detected.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e6b4e37b-5c8b-48e1-ad11-68542573dd65

📥 Commits

Reviewing files that changed from the base of the PR and between 7561c3c and 8a1a285.

📒 Files selected for processing (3)
  • scripts/lib/runtime.sh
  • scripts/setup-spark.sh
  • scripts/setup.sh
✅ Files skipped from review due to trivial changes (1)
  • scripts/setup.sh

@BenediktSchackenberg
Copy link
Copy Markdown
Contributor Author

All bot feedback addressed across 4 commits:

  1. Initial fix — preflight kubelet detection in setup-spark.sh + setup.sh
  2. k3s-agent + pgrep k3s — Copilot caught missing k3s detection paths
  3. Shared lib — CodeRabbit nitpick: extracted into lib/runtime.sh to avoid duplication
  4. set -u safety — CodeRabbit caught unbound variable edge case

The final diff is clean:

  • scripts/lib/runtime.sh: +44 lines — detect_kubelet_conflict() + warn_kubelet_conflict()
  • scripts/setup-spark.sh: +24 lines — interactive blocking check before cgroupns config
  • scripts/setup.sh: +6 lines — non-blocking warning before gateway start

Detects: kubelet, kubelite, k3s processes + microk8s status + k3s/k3s-agent systemd units.

Ready for human review 🙂

@psorensen-nvidia psorensen-nvidia added bug Something isn't working Platform: DGX Spark Support for DGX Spark Platform: Ubuntu Support for Linux Ubuntu priority: high Important issue that should be resolved in the next release Docker Support for Docker containerization K8s Use this label to identify Kubernetes deployment issues with NemoClaw. labels Mar 19, 2026
@BenediktSchackenberg BenediktSchackenberg force-pushed the fix/kubelet-cgroup-conflict branch 3 times, most recently from 7303195 to c6cb2da Compare March 27, 2026 19:59
@BenediktSchackenberg
Copy link
Copy Markdown
Contributor Author

@cv Rebased onto latest main (clean, 4 commits → 3 files). All review comments from Copilot were already addressed in earlier commits (k3s-agent detection, read EOF handling). Would appreciate a CI trigger when you get a chance — first-time contributor. Thanks!

@BenediktSchackenberg
Copy link
Copy Markdown
Contributor Author

Rebased onto latest main — squashed to 1 clean commit with conventional format and Signed-off-by. Note: setup.sh was deleted in main (#1235), so the kubelet detection now lives purely in scripts/lib/runtime.sh (shared helper) and scripts/setup-spark.sh.

@BenediktSchackenberg BenediktSchackenberg force-pushed the fix/kubelet-cgroup-conflict branch from 6342c4e to 2a9c962 Compare April 3, 2026 17:57
Signed-off-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>
@BenediktSchackenberg BenediktSchackenberg force-pushed the fix/kubelet-cgroup-conflict branch from 2a9c962 to 6abdffe Compare April 3, 2026 18:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working Docker Support for Docker containerization K8s Use this label to identify Kubernetes deployment issues with NemoClaw. Platform: DGX Spark Support for DGX Spark Platform: Ubuntu Support for Linux Ubuntu priority: high Important issue that should be resolved in the next release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

openshell gateway start fails with CrashLoopBackOff when another Kubernetes distribution is running on the host

4 participants