fix: detect conflicting host kubelet before gateway start by BenediktSchackenberg · Pull Request #433 · NVIDIA/NemoClaw

BenediktSchackenberg · 2026-03-19T19:43:10Z

Summary

When another Kubernetes distribution (MicroK8s, k3s, kubeadm) is running on the host, the gateway's embedded k3s inside Docker (started with --cgroupns=host) fights over /sys/fs/cgroup/kubepods with the host kubelet. Both kubelets see each other's pods as orphaned and kill them, resulting in rapid CrashLoopBackOff cycles.

Changes

setup-spark.sh:

Added detect_conflicting_kubelet() function that checks for:
- Running kubelet/kubelite processes
- Active MicroK8s installation (microk8s status)
- Active k3s systemd service
Interactive: prompts user to stop the conflicting service or abort
Non-interactive: fails with clear instructions

setup.sh:

Added lightweight preflight warning (non-blocking) before openshell gateway start
Warns about detected kubelet but doesn't block — the user may know what they're doing

Why not just switch to `--cgroupns=private`?

That would be the ideal long-term fix (suggested in #431), but it requires changes in OpenShell's gateway container startup logic, which is outside this repo's scope. The preflight check is the safest fix we can ship now without upstream changes.

Fixes #431

AI-assisted: Built with Claude, reviewed by human.

Summary by CodeRabbit

New Features
- Pre-flight detection of a running host Kubernetes (kubelet/k3s/MicroK8s) now warns about cgroup namespace conflicts that can cause pods to CrashLoopBackOff.
- Warnings appear before restart/destroy actions; interactive runs prompt to continue, non-interactive runs fail with clear guidance to stop the conflicting Kubernetes or re-run interactively.

coderabbitai · 2026-03-19T19:43:45Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Added host kubelet detection and warning: new detect_kubelet_conflict() and warn_kubelet_conflict() in scripts/lib/runtime.sh. scripts/setup-spark.sh now sources the helpers and runs an interactive/non‑interactive preflight that may abort on conflict. scripts/setup.sh emits a non‑blocking warning when a host kubelet is detected.

Changes

Cohort / File(s)	Summary
Runtime helpers `scripts/lib/runtime.sh`	Added `detect_kubelet_conflict()` to detect host `kubelet`/`kubelite`/`k3s`, `microk8s` running status, and relevant systemd services; added `warn_kubelet_conflict()` and `KUBELET_CONFLICT_DETAIL` for standardized warnings.
Spark setup gating `scripts/setup-spark.sh`	Now resolves script dir and sources the runtime helpers; runs `detect_kubelet_conflict` before changing `default-cgroupns-mode`; on conflict calls `warn_kubelet_conflict "$KUBELET_CONFLICT_DETAIL"` and either prompts and may abort in TTY, or fails fast in non‑TTY.
Gateway setup warning `scripts/setup.sh`	Added an early, non‑blocking `detect_kubelet_conflict` invocation that emits `warn_kubelet_conflict "$KUBELET_CONFLICT_DETAIL"` prior to gateway restart/destroy steps; does not gate or change restart flow.

Sequence Diagram(s)

sequenceDiagram
  rect rgba(220,230,241,0.5)
    participant User as User (TTY / Non‑TTY)
    participant Script as Setup Script
    participant Host as Host (processes / systemd)
    participant Docker as Docker / container runtime
  end

  Script->>Host: detect_kubelet_conflict() — probe for kubelet/kubelite/k3s,\nmicrok8s status, systemd services
  alt conflict found
    Script->>User: warn_kubelet_conflict(detail) — emit multi-line guidance
    alt interactive (TTY)
      User-->>Script: reply (Y/N)
      alt confirmed (Y)
        Script->>Docker: proceed with cgroupns change / restart
      else not confirmed
        Script-->>User: abort/exit
      end
    else non-interactive
      Script-->>User: fail fast with stop/retry instructions
    end
  else no conflict
    Script->>Docker: proceed with cgroupns change / restart
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I sniffed the host for kubelets in the night,
Found tiny clashes that made containers fight.
I called out a warning, then waited for say,
Prompting in terminal or failing the way.
Hop lightly, restart, and keep clusters bright.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'fix: detect conflicting host kubelet before gateway start' accurately and concisely summarizes the main change: adding kubelet conflict detection before the gateway starts.
Linked Issues check	✅ Passed	The PR implements mitigation `#3` from issue `#431`: detecting conflicting host Kubernetes distributions (MicroK8s, k3s, kubeadm) and warning/blocking on detection with clear remediation instructions.
Out of Scope Changes check	✅ Passed	All changes are directly related to adding kubelet conflict detection: two new utility functions in lib/runtime.sh and their integration into setup-spark.sh and setup.sh. No unrelated modifications present.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Copilot

Pull request overview

Adds host-side preflight detection of conflicting Kubernetes kubelets to reduce openshell gateway start CrashLoopBackOff scenarios caused by cgroupns=host cgroup contention (kubepods), per issue #431.

Changes:

Add a non-blocking kubelet/k3s warning in scripts/setup.sh before starting the gateway.
Add a blocking (with interactive override) conflicting-kubelet detector in scripts/setup-spark.sh before configuring Docker for cgroupns=host.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
scripts/setup.sh	Adds a lightweight preflight warning if a host kubelet/k3s is detected prior to gateway start.
scripts/setup-spark.sh	Adds a preflight detector that aborts (or prompts) when a conflicting host Kubernetes is running before applying Docker cgroupns changes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-19T19:46:10Z

scripts/setup-spark.sh

+  # Check for k3s service
+  if systemctl is-active --quiet k3s 2>/dev/null; then
+    found="k3s service is active"
+  fi


The k3s detection only checks systemctl is-active k3s, which can miss hosts running k3s as k3s-agent (common on agent-only nodes) or as a k3s process without that specific unit. Consider checking k3s-agent as well (and/or pgrep -x k3s) so the conflict warning triggers reliably for k3s installations.

Good catch — added pgrep -x k3s and systemctl is-active --quiet k3s-agent to both scripts.

Copilot · 2026-03-19T19:46:11Z

scripts/setup-spark.sh

+    warn ""
+
+    if [ -t 0 ]; then
+      read -rp "Continue anyway? [y/N] " reply


Because the script runs with set -e, read -rp ... will cause an immediate exit on EOF/error (e.g., user hits Ctrl-D) before your intended abort message runs. Handle read failure explicitly (e.g., if ! read ...; then fail ...; fi) so the script exits with a clear, consistent error.

Suggested change

read -rp "Continue anyway? [y/N] " reply

if ! read -rp "Continue anyway? [y/N] " reply; then

fail "Aborted. Stop the conflicting Kubernetes service and retry."

fi

Fixed — read failure is now handled explicitly with if ! read ... so Ctrl-D/EOF gets a clean error message instead of a cryptic set -e exit.

Copilot · 2026-03-19T19:46:11Z

scripts/setup.sh

+if pgrep -x kubelet > /dev/null 2>&1 || pgrep -x kubelite > /dev/null 2>&1 || systemctl is-active --quiet k3s 2>/dev/null; then
+  warn "⚠️  A Kubernetes kubelet is running on this host."
+  warn "The gateway's embedded k3s may conflict over cgroup paths (kubepods)."
+  warn "If the gateway fails with CrashLoopBackOff, stop the host Kubernetes first:"
+  warn "  sudo microk8s stop / sudo systemctl stop k3s / sudo systemctl stop kubelet"
+fi


This preflight check looks for kubelet/kubelite processes and an active k3s systemd unit, but it can miss k3s installations running as k3s-agent (and/or just a k3s process). Consider checking systemctl is-active --quiet k3s-agent as well (and/or pgrep -x k3s) to avoid false negatives when k3s is the conflicting host Kubernetes.

Addressed in the same commit — both pgrep -x k3s and k3s-agent systemd unit are now checked.

coderabbitai

🧹 Nitpick comments (1)

scripts/setup.sh (1)
95-102: Consider centralizing kubelet-conflict detection to avoid drift.

This block now overlaps with scripts/setup-spark.sh but with different checks/behavior. Moving shared detection/message formatting into scripts/lib/runtime.sh would keep both setup paths consistent and easier to maintain.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/setup.sh` around lines 95 - 102, The kubelet-conflict detection block
in scripts/setup.sh is duplicated with different behavior in
scripts/setup-spark.sh; extract the logic into a shared function placed in
scripts/lib/runtime.sh (e.g., detect_kubelet_conflict() which runs the
pgrep/systemctl checks and returns status, and warn_kubelet_conflict() which
emits the standardized warn messages), then replace the inline checks in both
scripts (setup.sh and setup-spark.sh) with calls to detect_kubelet_conflict &&
warn_kubelet_conflict to keep detection and message formatting centralized and
consistent across both setup paths.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@scripts/setup.sh`:
- Around line 95-102: The kubelet-conflict detection block in scripts/setup.sh
is duplicated with different behavior in scripts/setup-spark.sh; extract the
logic into a shared function placed in scripts/lib/runtime.sh (e.g.,
detect_kubelet_conflict() which runs the pgrep/systemctl checks and returns
status, and warn_kubelet_conflict() which emits the standardized warn messages),
then replace the inline checks in both scripts (setup.sh and setup-spark.sh)
with calls to detect_kubelet_conflict && warn_kubelet_conflict to keep detection
and message formatting centralized and consistent across both setup paths.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7e8e7be6-e6bc-40c7-a32e-aa55b532778d

📥 Commits

Reviewing files that changed from the base of the PR and between ffdccaf and 0214255.

📒 Files selected for processing (2)

scripts/setup-spark.sh
scripts/setup.sh

BenediktSchackenberg · 2026-03-19T19:53:16Z

Addressed CodeRabbit's nitpick — extracted detect_kubelet_conflict() and warn_kubelet_conflict() into scripts/lib/runtime.sh. Both setup.sh and setup-spark.sh now call the shared functions instead of duplicating the detection logic.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

scripts/setup-spark.sh (1)
67-87: Consider moving conflict preflight before user/group mutation.

Right now, a detected conflict can abort after usermod -aG docker has already changed host state. Running this preflight earlier would make failure paths fully fail-fast and side-effect free.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/setup-spark.sh` around lines 67 - 87, Move the Kubernetes conflict
preflight (the detect_kubelet_conflict check and its
warn_kubelet_conflict/interactive fail path) to run before any host user/group
mutation occurs (e.g., before the usermod -aG docker invocation); specifically,
relocate the entire if detect_kubelet_conflict ... fi block so it executes prior
to any calls that change system state, ensuring the script fails fast and leaves
the host unmodified when a conflict is detected.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/lib/runtime.sh`:
- Line 260: The assignment of local detail in warn_kubelet_conflict() uses
${1:-$KUBELET_CONFLICT_DETAIL} which can trigger an unbound-variable error under
set -u if KUBELET_CONFLICT_DETAIL is not defined; change the assignment to use a
nested default so it falls back to an empty value when both $1 and
KUBELET_CONFLICT_DETAIL are unset (i.e., set detail to first argument, otherwise
to KUBELET_CONFLICT_DETAIL, otherwise to an empty string) so the function is
defensive when called before detect_kubelet_conflict() initializes the variable.

---

Nitpick comments:
In `@scripts/setup-spark.sh`:
- Around line 67-87: Move the Kubernetes conflict preflight (the
detect_kubelet_conflict check and its warn_kubelet_conflict/interactive fail
path) to run before any host user/group mutation occurs (e.g., before the
usermod -aG docker invocation); specifically, relocate the entire if
detect_kubelet_conflict ... fi block so it executes prior to any calls that
change system state, ensuring the script fails fast and leaves the host
unmodified when a conflict is detected.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e6b4e37b-5c8b-48e1-ad11-68542573dd65

📥 Commits

Reviewing files that changed from the base of the PR and between 7561c3c and 8a1a285.

📒 Files selected for processing (3)

scripts/lib/runtime.sh
scripts/setup-spark.sh
scripts/setup.sh

✅ Files skipped from review due to trivial changes (1)

scripts/setup.sh

scripts/lib/runtime.sh

BenediktSchackenberg · 2026-03-19T19:59:37Z

All bot feedback addressed across 4 commits:

Initial fix — preflight kubelet detection in setup-spark.sh + setup.sh
k3s-agent + pgrep k3s — Copilot caught missing k3s detection paths
Shared lib — CodeRabbit nitpick: extracted into lib/runtime.sh to avoid duplication
set -u safety — CodeRabbit caught unbound variable edge case

The final diff is clean:

scripts/lib/runtime.sh: +44 lines — detect_kubelet_conflict() + warn_kubelet_conflict()
scripts/setup-spark.sh: +24 lines — interactive blocking check before cgroupns config
scripts/setup.sh: +6 lines — non-blocking warning before gateway start

Detects: kubelet, kubelite, k3s processes + microk8s status + k3s/k3s-agent systemd units.

Ready for human review 🙂

BenediktSchackenberg · 2026-03-27T19:59:27Z

@cv Rebased onto latest main (clean, 4 commits → 3 files). All review comments from Copilot were already addressed in earlier commits (k3s-agent detection, read EOF handling). Would appreciate a CI trigger when you get a chance — first-time contributor. Thanks!

BenediktSchackenberg · 2026-04-02T08:24:31Z

Rebased onto latest main — squashed to 1 clean commit with conventional format and Signed-off-by. Note: setup.sh was deleted in main (#1235), so the kubelet detection now lives purely in scripts/lib/runtime.sh (shared helper) and scripts/setup-spark.sh.

Signed-off-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>

Copilot AI review requested due to automatic review settings March 19, 2026 19:43

BenediktSchackenberg mentioned this pull request Mar 19, 2026

openshell gateway start fails with CrashLoopBackOff when another Kubernetes distribution is running on the host #431

Open

Copilot started reviewing on behalf of BenediktSchackenberg March 19, 2026 19:43 View session

Copilot AI reviewed Mar 19, 2026

View reviewed changes

coderabbitai bot reviewed Mar 19, 2026

View reviewed changes

scripts/lib/runtime.sh Outdated Show resolved Hide resolved

BenediktSchackenberg force-pushed the fix/kubelet-cgroup-conflict branch 3 times, most recently from 7303195 to c6cb2da Compare March 27, 2026 19:59

BenediktSchackenberg mentioned this pull request Mar 28, 2026

fix(onboard): gateway start fails on first attempt due to k3s startup timeout #1050

Closed

BenediktSchackenberg force-pushed the fix/kubelet-cgroup-conflict branch from 51f3d32 to 6342c4e Compare April 2, 2026 08:24

BenediktSchackenberg force-pushed the fix/kubelet-cgroup-conflict branch from 6342c4e to 2a9c962 Compare April 3, 2026 17:57

fix(preflight): detect conflicting host kubelet before gateway start

6abdffe

Signed-off-by: Benedikt Schackenberg <6381261+BenediktSchackenberg@users.noreply.github.com>

BenediktSchackenberg force-pushed the fix/kubelet-cgroup-conflict branch from 2a9c962 to 6abdffe Compare April 3, 2026 18:44

-      read -rp "Continue anyway? [y/N] " reply
+      if ! read -rp "Continue anyway? [y/N] " reply; then
+        fail "Aborted. Stop the conflicting Kubernetes service and retry."
+      fi

Conversation

BenediktSchackenberg commented Mar 19, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Why not just switch to --cgroupns=private?

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

Uh oh!

BenediktSchackenberg Mar 19, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

Uh oh!

BenediktSchackenberg Mar 19, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 19, 2026

Choose a reason for hiding this comment

Uh oh!

BenediktSchackenberg Mar 19, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

BenediktSchackenberg commented Mar 19, 2026

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

BenediktSchackenberg commented Mar 19, 2026

Uh oh!

BenediktSchackenberg commented Mar 27, 2026

Uh oh!

BenediktSchackenberg commented Apr 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

BenediktSchackenberg commented Mar 19, 2026 •

edited by coderabbitai bot

Loading

Why not just switch to `--cgroupns=private`?

coderabbitai bot commented Mar 19, 2026 •

edited

Loading