Skip to content

fix: convert osmo-ctrl to native K8s sidecar to prevent upload data loss#765

Open
KeitaW wants to merge 3 commits into
NVIDIA:mainfrom
KeitaW:fix/native-sidecar-upload-drain
Open

fix: convert osmo-ctrl to native K8s sidecar to prevent upload data loss#765
KeitaW wants to merge 3 commits into
NVIDIA:mainfrom
KeitaW:fix/native-sidecar-upload-drain

Conversation

@KeitaW

@KeitaW KeitaW commented Mar 31, 2026

Copy link
Copy Markdown
Contributor

Summary

Converts osmo-ctrl from a regular sidecar container to a Kubernetes native sidecar (KEP-753) and adds SIGTERM upload drain logic, fixing a race condition where large output uploads are killed before completion.

  • Move osmo-ctrl to initContainers[] with restartPolicy: Always — ensures osmo-ctrl receives SIGTERM only after the user container exits (K8s native sidecar semantics)
  • Add SIGTERM drain handler in ctrl.go — waits up to 9 minutes for in-progress uploads to complete before exiting
  • Set terminationGracePeriodSeconds: 600 on task pods (was unset, defaulting to 30s)
  • Simplify default_ctrl pod template to only set terminationGracePeriodSeconds (placing osmo-ctrl in both task.py and the template causes Duplicate value K8s 422 errors via merge_lists_on_name)
  • Update check_running_pod_containers() to check init_container_statuses for native sidecar osmo-ctrl

Root Cause

Two issues combine to kill uploads:

  1. terminationGracePeriodSeconds is never set — task pods use the K8s default of 30 seconds
  2. The SIGTERM handler in ctrl.go calls os.Exit(1) immediately after FUSE unmount, killing any in-progress osmo data upload subprocess

Even increasing the grace period alone doesn't help because os.Exit(1) fires immediately on SIGTERM.

Files Changed

File Change
src/runtime/cmd/ctrl/ctrl.go SIGTERM drain: uploading atomic flag + uploadDone channel
src/utils/job/task.py Move control_container_spec to initContainers with restartPolicy: Always
src/service/core/config/objects.py default_ctrl template: only terminationGracePeriodSeconds: 600
src/operator/backend_listener.py check_running_pod_containers() checks both init and regular container statuses
src/operator/tests/test_listener.py Test for native sidecar osmo-ctrl error detection

Compatibility

  • Requires Kubernetes >= 1.29 (SidecarContainers feature gate beta, enabled by default)
  • GA in Kubernetes 1.33
  • No changes to workflow YAML or OSMO API

Fixes #764

Summary by CodeRabbit

  • Bug Fixes

    • Failure detection now includes init container termination states, improving error reporting.
  • New Features

    • Graceful shutdown waits up to 9 minutes for in-progress uploads and better drains remaining logs before exit.
    • Optional bypass of data access validation via env toggling.
  • Chores

    • Control-side container moved to run as an init container; default pod templates set a 600s termination grace period.
    • Resource discovery now considers init containers.
  • Tests

    • Added test covering init-container termination triggering failure detection.

osmo-ctrl is killed before large output uploads complete because:
(1) terminationGracePeriodSeconds defaults to 30s (never set by OSMO),
and (2) the SIGTERM handler calls os.Exit(1) immediately without
waiting for in-progress uploads to finish.

This converts osmo-ctrl from a regular container to a Kubernetes native
sidecar (KEP-753, init container with restartPolicy: Always), which
ensures it receives SIGTERM only after the user container exits. Combined
with a SIGTERM drain handler that waits for upload completion, this
eliminates the race condition.

Changes:
- task.py: Move control_container_spec from containers[] to
  initContainers[] with restartPolicy: Always; set
  terminationGracePeriodSeconds: 600
- ctrl.go: Add upload drain logic to SIGTERM handler — waits up to
  9 minutes for in-progress uploads before exiting
- objects.py: Simplify default_ctrl template to only set
  terminationGracePeriodSeconds (container placement is handled by
  task.py; duplicating in the template causes K8s 422 errors via
  merge_lists_on_name)
- backend_listener.py: Check init_container_statuses in addition to
  container_statuses so osmo-ctrl termination is detected as a native
  sidecar
- test_listener.py: Add test for native sidecar osmo-ctrl error
  detection

Requires Kubernetes >= 1.29 (SidecarContainers feature gate beta,
enabled by default).

Fixes NVIDIA#764
@KeitaW KeitaW requested a review from a team as a code owner March 31, 2026 16:19
@coderabbitai

coderabbitai Bot commented Mar 31, 2026

Copy link
Copy Markdown
Contributor
📝 Walkthrough

Walkthrough

osmo-ctrl is moved to an init-sidecar in task pods and pod termination grace is set to 600s. The ctrl process now waits for in-progress uploads to drain (with a timeout) on SIGINT/SIGTERM. Operator and connector code now inspect init container statuses/templates, and tests were added for init-sidecar termination detection.

Changes

Cohort / File(s) Summary
Pod spec / sidecar placement
src/utils/job/task.py, src/service/core/config/objects.py
Control container removed from regular spec.containers and added to spec.initContainers (restartPolicy: 'Always'); default ctrl pod template reduced to terminationGracePeriodSeconds: 600.
SIGTERM / upload drain logic
src/runtime/cmd/ctrl/ctrl.go
Add process-wide atomic uploading flag and uploadDone channel; SIGINT/SIGTERM handler waits for ongoing uploads (or ~9min timeout) before cleanup/exit; logging and log-queue drain on shutdown adjusted.
Operator status checks
src/operator/backend_listener.py
check_running_pod_containers() now combines pod.status.init_container_statuses and pod.status.container_statuses when inspecting container termination reasons.
Tests for init-sidecar termination
src/operator/tests/test_listener.py
New test helper builds pod status with osmo-ctrl as a terminated init container and a running user container; test asserts error detection for init-sidecar termination.
Postgres connector pod template parsing
src/utils/connectors/postgres.py
When scanning pod templates for the control container, inspect both spec.initContainers and spec.containers (combined) to locate osmo-ctrl and its resources.
User runtime logging
src/runtime/cmd/user/user.go
Add log statement when initial decoded control message has type messages.CtrlFailed.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant K8S as "Kubernetes SIGTERM"
    participant CTRL as "osmo-ctrl (process)"
    participant UP as "Upload subprocess"
    participant MNT as "Mount cleanup"

    K8S->>CTRL: send SIGTERM
    activate CTRL
    CTRL->>CTRL: check atomic `uploading` flag
    alt uploading == true
        CTRL->>CTRL: wait for `uploadDone` or 9min timeout
        par upload completes
            UP-->>CTRL: upload finished
            CTRL->>CTRL: close `uploadDone`
        and timeout not hit
        end
        CTRL->>MNT: cleanup mounts
        CTRL-->>K8S: exit(0)
    else uploading == false
        CTRL->>MNT: cleanup mounts
        CTRL-->>K8S: exit(1)
    end
    deactivate CTRL
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped in logs where big files roam,
The sidecar waits to carry them home,
With atomic paws I hold the gate,
Nine minutes more to finish the crate,
Then mounts unroll and all's back to home.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 45.45% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately describes the main objective: converting osmo-ctrl to a native K8s sidecar to prevent upload data loss, which directly addresses issue #764.
Linked Issues check ✅ Passed The PR implements all primary coding objectives from issue #764: converts osmo-ctrl to native sidecar (initContainers with restartPolicy), adds SIGTERM upload drain logic with timeout, sets terminationGracePeriodSeconds to 600s, updates pod templates, and detects native sidecar status in backend_listener.
Out of Scope Changes check ✅ Passed All changes are directly scoped to issue #764: K8s native sidecar conversion, SIGTERM drain handling, pod template updates, and operator status detection. Additional improvements like OSMO_SKIP_DATA_AUTH propagation and log draining are necessary runtime fixes discovered during testing.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
src/operator/tests/test_listener.py (1)

222-238: Model the native sidecar in pod.spec and assert through calculate_pod_status().

Line 226 still uses self.create_spec(), which keeps osmo-ctrl in spec.containers. That means this test proves only the status-list helper, not the real native-sidecar pod shape. Building the mock so osmo-ctrl lives under the pod's init-container spec and asserting calculate_pod_status() would cover the production path better.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/operator/tests/test_listener.py` around lines 222 - 238, The test
currently builds a pod via create_native_sidecar_ctrl_terminated_pod but still
calls create_spec() which places osmo-ctrl in spec.containers; modify the pod
model so osmo-ctrl is placed in pod.spec.init_containers (or via a helper that
returns a spec with initContainers) inside
create_native_sidecar_ctrl_terminated_pod, then call
backend_listener.calculate_pod_status(pod_event) and assert the returned
status/error reflects the osmo-ctrl termination instead of only calling
check_running_pod_containers; update references to create_spec(),
create_native_sidecar_ctrl_terminated_pod, calculate_pod_status, and
check_running_pod_containers accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/runtime/cmd/ctrl/ctrl.go`:
- Around line 1404-1414: The handler currently always calls os.Exit(1) after the
upload drain block, which causes a graceful drained SIGTERM (when
uploading.Load() is true and <-uploadDone completes) to be treated as a failure;
update the control flow in the SIGTERM handling so that after a successful
upload drain (the case where <-uploadDone returns) you call os.Exit(0) instead
of os.Exit(1), while preserving os.Exit(1) for the upload timeout/failure path;
locate the block using uploading.Load(), the uploadDone channel,
cleanupMounts(cmdArgs.DownloadType) and os.Exit(1) to make the change.
- Around line 1498-1503: The uploading flag is set too late causing a race where
SIGTERM can bypass uploads; move the uploading.Store(true) to before the exec
phase (i.e., set it before entering the execLogs/ExecFinished handling or
immediately before the signal handler is registered) and only call
uploading.Store(false) after uploadOutputs(...) completes; ensure the signal
handler continues to check uploading.Load() so it will wait for uploadOutputs to
finish instead of calling cleanupMounts()/os.Exit(1); add a unit/integration
test that sends SIGTERM immediately after an ExecFinished message to verify
uploads complete and no data is skipped.

In `@src/utils/job/task.py`:
- Around line 2808-2816: The control sidecar "osmo-ctrl" was moved into pod spec
"initContainers", but readers still only inspect spec['containers']; update the
connector that extracts control-container data (the postgres connector logic
that searches for "osmo-ctrl") to also examine spec.get('initContainers', []) in
addition to spec.get('containers', []) and merge/choose the found container when
reading resource limits and other metadata; ensure the lookup code (the function
in connectors.postgres that looks up "osmo-ctrl") checks both lists and prefers
the initContainer match if present so generated pod specs continue to provide
limits.

---

Nitpick comments:
In `@src/operator/tests/test_listener.py`:
- Around line 222-238: The test currently builds a pod via
create_native_sidecar_ctrl_terminated_pod but still calls create_spec() which
places osmo-ctrl in spec.containers; modify the pod model so osmo-ctrl is placed
in pod.spec.init_containers (or via a helper that returns a spec with
initContainers) inside create_native_sidecar_ctrl_terminated_pod, then call
backend_listener.calculate_pod_status(pod_event) and assert the returned
status/error reflects the osmo-ctrl termination instead of only calling
check_running_pod_containers; update references to create_spec(),
create_native_sidecar_ctrl_terminated_pod, calculate_pod_status, and
check_running_pod_containers accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 163d06c6-3141-460d-84a7-87142f945af9

📥 Commits

Reviewing files that changed from the base of the PR and between ecd7ad1 and 9da50af.

📒 Files selected for processing (5)
  • src/operator/backend_listener.py
  • src/operator/tests/test_listener.py
  • src/runtime/cmd/ctrl/ctrl.go
  • src/service/core/config/objects.py
  • src/utils/job/task.py

Comment thread src/runtime/cmd/ctrl/ctrl.go Outdated
Comment on lines +1498 to +1503
uploading.Store(true)
uploadOutputs(unixConn, cmdArgs.Outputs, cmdArgs.OutputPath, cmdArgs.MetadataFile,
uploadChan, metricChan, cmdArgs.RetryId, cmdArgs.GroupName, cmdArgs.LogSource,
cmdArgs.UserConfig, cmdArgs.ServiceConfig, cmdArgs.ConfigLoc)
uploading.Store(false)
close(uploadDone)

@coderabbitai coderabbitai Bot Mar 31, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, check file size and structure
wc -l src/runtime/cmd/ctrl/ctrl.go

Repository: NVIDIA/OSMO

Length of output: 88


🏁 Script executed:

# Read the declaration area (lines 74-76)
sed -n '70,85p' src/runtime/cmd/ctrl/ctrl.go

Repository: NVIDIA/OSMO

Length of output: 402


🏁 Script executed:

# Read the signal handler area (lines 1404-1412)
sed -n '1400,1420p' src/runtime/cmd/ctrl/ctrl.go

Repository: NVIDIA/OSMO

Length of output: 716


🏁 Script executed:

# Read the upload call area (lines 1498-1503 and surrounding context)
sed -n '1490,1510p' src/runtime/cmd/ctrl/ctrl.go

Repository: NVIDIA/OSMO

Length of output: 846


🏁 Script executed:

# Find the execLogs loop to understand when it exits
rg -n "execLogs" src/runtime/cmd/ctrl/ctrl.go -A 5 -B 5

Repository: NVIDIA/OSMO

Length of output: 922


🏁 Script executed:

# Search for tests of SIGTERM/signal handling
find . -name "*test*.go" -o -name "*_test.go" | xargs grep -l "SIGTERM\|sigintCatch\|uploading" 2>/dev/null | head -20

Repository: NVIDIA/OSMO

Length of output: 37


🏁 Script executed:

# Check if there are any other references to uploading flag
rg -n "uploading\." src/runtime/cmd/ctrl/ctrl.go

Repository: NVIDIA/OSMO

Length of output: 140


🏁 Script executed:

# Check the uploadOutputs function signature and blocking behavior
rg -n "func uploadOutputs" src/runtime/cmd/ctrl/ctrl.go -A 10

Repository: NVIDIA/OSMO

Length of output: 492


🏁 Script executed:

# Check if there's any initialization or prior setting of uploading before the loop
sed -n '1400,1500p' src/runtime/cmd/ctrl/ctrl.go | head -100

Repository: NVIDIA/OSMO

Length of output: 3345


Uploading flag must be set before the signal handler can bypass upload draining.

There is a genuine race condition at lines 1498–1502. The uploading flag is set only after the execLogs loop exits and before uploadOutputs() runs. In that window, SIGTERM can arrive, the signal handler checks uploading.Load() (which is still false), and exits immediately via cleanupMounts() and os.Exit(1) without waiting for uploads—causing the subsequent uploadOutputs() call never to execute.

In the native-sidecar lifecycle, the kubelet sends SIGTERM when the user container exits, which occurs precisely during this race window. The result is data loss: task outputs are skipped.

Set the uploading flag to true before entering the exec phase (or immediately before the signal handler setup if uploads can happen during exec), and clear it only after uploadOutputs() completes. Add a test that delivers SIGTERM right after an ExecFinished message to confirm the fix prevents the race.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/runtime/cmd/ctrl/ctrl.go` around lines 1498 - 1503, The uploading flag is
set too late causing a race where SIGTERM can bypass uploads; move the
uploading.Store(true) to before the exec phase (i.e., set it before entering the
execLogs/ExecFinished handling or immediately before the signal handler is
registered) and only call uploading.Store(false) after uploadOutputs(...)
completes; ensure the signal handler continues to check uploading.Load() so it
will wait for uploadOutputs to finish instead of calling
cleanupMounts()/os.Exit(1); add a unit/integration test that sends SIGTERM
immediately after an ExecFinished message to verify uploads complete and no data
is skipped.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We investigated this and believe the race is not practically exploitable in the native sidecar model.

With native sidecars (KEP-753), kubelet sends SIGTERM to the sidecar only after the user container has fully stopped. But ExecFinished arrives via the Unix socket before the user container process exits (it is the last message sent by the user process). So the execLogs loop exits and uploading.Store(true) is set well before kubelet initiates sidecar termination — osmo-ctrl gets a deterministic head start.

The window between break execLogs and uploading.Store(true) is a few Go instructions (nanoseconds), while kubelet's container stop → SIGTERM delivery path involves API calls and process teardown (milliseconds to seconds).

We opted not to move uploading.Store(true) earlier because setting it before the exec phase would cause the SIGTERM handler to wait for uploads even when the process is in the download or exec phase (where no upload is in progress and waiting would be incorrect).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KeitaW, thank you for the thorough explanation — the reasoning is sound.

The key insight is the lifecycle ordering guarantee from KEP-753: kubelet only delivers SIGTERM to the native sidecar after the user container's process has fully exited, and ExecFinished is sent over the Unix socket before that exit. The gap between break execLogs and uploading.Store(true) is therefore bounded by a few Go instructions, while the kubelet's container-stop-to-SIGTERM path (API call, cgroup accounting, etc.) takes orders of magnitude longer. Under that model the window is not realistically reachable.

The deliberate choice to keep uploading.Store(true) as late as possible also makes sense — hoisting it earlier would cause the drain logic to block on SIGTERM during the download or exec phase, where no upload is occurring and an indefinite wait would be incorrect.

I'll withdraw the concern for this design and note the intent for future reviews.


✏️ Learnings added
Learnt from: KeitaW
Repo: NVIDIA/OSMO PR: 765
File: src/runtime/cmd/ctrl/ctrl.go:1500-1505
Timestamp: 2026-03-31T17:46:05.985Z
Learning: In `src/runtime/cmd/ctrl/ctrl.go`, `uploading.Store(true)` is deliberately set immediately before `uploadOutputs()` (not before the exec phase). Under the Kubernetes native sidecar model (KEP-753, K8s >= 1.29), kubelet sends SIGTERM to osmo-ctrl only after the user container fully exits, which happens after `ExecFinished` is received and `uploading.Store(true)` is already set — so the apparent SIGTERM race window is practically unexploitable. Moving the flag earlier would incorrectly trigger upload-drain waiting during download or exec phases.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Comment thread src/utils/job/task.py
…n postgres.py

- ctrl.go: os.Exit(0) after successful upload drain so K8s reports
  reason=Completed instead of reason=Error (which caused the backend
  listener to mark the workflow FAILED despite successful upload)
- postgres.py: check_osmo_data_resource() now scans both containers
  and initContainers for osmo-ctrl resource limits, fixing a regression
  where capacity calculations silently returned empty after moving
  osmo-ctrl to initContainers

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
src/runtime/cmd/ctrl/ctrl.go (1)

1500-1505: ⚠️ Potential issue | 🔴 Critical

The drain state still has race windows at the upload boundaries.

Line 1500 is still after every break execLogs path, so a SIGTERM right after ExecFinished/EOF can still see uploading == false and exit before uploadOutputs() runs. On the other edge, Line 1504 clears the flag before Line 1505 closes uploadDone, so a signal in that gap exits 1 even though the upload already finished. Please protect the whole post-exec phase before leaving execLogs, then close uploadDone before clearing the flag, and add a regression test that sends SIGTERM immediately after exec completion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/runtime/cmd/ctrl/ctrl.go` around lines 1500 - 1505, The drain/race is
caused by toggling uploading around uploadOutputs so signals can see
uploading==false between exec completion and upload start, and also clearing it
before closing uploadDone; to fix, move the uploading.Store(true) to cover the
entire post-exec phase (set it before leaving execLogs/any ExecFinished/EOF path
so the process is considered draining as soon as exec ends), ensure
close(uploadDone) is called before calling uploading.Store(false) (so the done
signal races correctly), and keep uploadOutputs(...) called while uploading is
true; update the logic around execLogs/ExecFinished/EOF to guarantee the upload
window is fully protected and add a regression test that sends SIGTERM
immediately after exec completion to assert the process waits for uploadDone and
exits only after uploadOutputs finishes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/runtime/cmd/ctrl/ctrl.go`:
- Around line 1500-1505: The drain/race is caused by toggling uploading around
uploadOutputs so signals can see uploading==false between exec completion and
upload start, and also clearing it before closing uploadDone; to fix, move the
uploading.Store(true) to cover the entire post-exec phase (set it before leaving
execLogs/any ExecFinished/EOF path so the process is considered draining as soon
as exec ends), ensure close(uploadDone) is called before calling
uploading.Store(false) (so the done signal races correctly), and keep
uploadOutputs(...) called while uploading is true; update the logic around
execLogs/ExecFinished/EOF to guarantee the upload window is fully protected and
add a regression test that sends SIGTERM immediately after exec completion to
assert the process waits for uploadDone and exits only after uploadOutputs
finishes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6ba11185-ea19-4a1d-a9ac-1d7801c94cef

📥 Commits

Reviewing files that changed from the base of the PR and between 9da50af and eba8a0e.

📒 Files selected for processing (2)
  • src/runtime/cmd/ctrl/ctrl.go
  • src/utils/connectors/postgres.py

Five fixes discovered while testing the native sidecar conversion on
a production EKS cluster:

1. Honor OSMO_SKIP_DATA_AUTH in ctrl.go — the Python service sets this
   env var when credential_config.disable_data_validation includes "*"
   or "s3", but the Go runtime never checked it. Dataset output write
   validation failed (iam:SimulatePrincipalPolicy), causing silent task
   failure with exit 0.

2. Log CtrlFailed in user.go — osmo_exec silently returned exit 0 when
   receiving CtrlFailed, making validation failures invisible to users.

3. Flush log channels before os.Exit in SIGTERM handler — os.Exit()
   bypasses deferred functions, so logs were lost on task completion.

4. Drain sendLogs queue on stop — sendLogs returned immediately on
   stopChan without flushing remaining messages.

5. Flush logQueue directly before LogDone — the logger closes the
   websocket after receiving LogDone, so any user logs still in the
   queue at that point were lost. Now flushes from the main goroutine
   while the websocket is still open.

6. Propagate disable_data_validation to ctrl container env — adds
   OSMO_SKIP_DATA_AUTH=1 when the service config disables validation.
@KeitaW

KeitaW commented Apr 2, 2026

Copy link
Copy Markdown
Contributor Author

Additional fixes from production deployment testing

Pushed 7671bd00 with six fixes discovered while deploying the native sidecar conversion on a production EKS cluster (B200 GPUs, OSMO v6.2-rc6 base).

Fixes

1. Honor OSMO_SKIP_DATA_AUTH in ctrl.go (critical)
When credential_config.disable_data_validation includes "*" or "s3", the Python service (task.py) sets OSMO_SKIP_DATA_AUTH=1 in the ctrl container env. But ctrl.go always called ValidateInputsOutputsAccess unconditionally — it never checked this env var. The osmo dataset check --access-type WRITE command failed on deployments without full IAM SimulatePrincipalPolicy permissions, causing ctrl to panic and osmo_exec to exit 0 silently. All workflows with outputs: (dataset uploads) failed with no visible error.

2. Log CtrlFailed in user.go
When osmo_exec received CtrlFailed from ctrl (e.g., from the validation failure above), it returned with exit 0 and no log output — completely invisible to users. Now logs "Received CtrlFailed from osmo-ctrl — task cannot proceed".

3. SIGTERM handler flushes logs before exit
The SIGTERM handler called os.Exit(0) / os.Exit(1) directly, bypassing all deferred functions including log channel flushing. As a native sidecar, ctrl receives SIGTERM when the user container exits — if logs haven't been forwarded yet, they were lost. Now flushes stopPutLogs, stopSendLogs, and waitGoRoutines.Wait() before exiting. Also changed non-upload SIGTERM from exit(1) to exit(0) since SIGTERM after task completion is expected.

4. sendLogs drains queue on stop
sendLogs processed the log queue one message per ticker tick. When it received the stop signal, it returned immediately without flushing remaining messages.

5. Direct queue flush before LogDone
The logger closes the websocket after receiving LogDone. If user logs were still in the queue when LogDone was sent, they were permanently lost. Now flushes logQueue directly from the main goroutine (holding bufferMutex) before enqueueing LogDone.

6. Propagate disable_data_validation to ctrl container env (task.py)
Adds OSMO_SKIP_DATA_AUTH=1 to the ctrl container's environment when credential_config.disable_data_validation contains "*" or "s3". This is the Python-side counterpart to fix #1.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/runtime/cmd/ctrl/ctrl.go (1)

1421-1436: ⚠️ Potential issue | 🔴 Critical

Return a failure exit code when the drain timeout is hit.

Line 1427 detects the timeout, but Line 1436 still exits 0. That makes Kubernetes report Completed, so a timed-out upload drain becomes a silent success even though uploads were cut short. Keep 0 only for the <-uploadDone branch and exit non-zero on the timeout path.

💡 Minimal fix
 go func() {
 	<-sigintCatch
 	log.Println("SIGTERM received, starting graceful shutdown...")
+	exitCode := 0
 	if uploading.Load() {
 		log.Println("Upload in progress, waiting for completion...")
 		select {
 		case <-uploadDone:
 			log.Println("Upload completed after SIGTERM")
 		case <-time.After(9 * time.Minute):
 			log.Println("Upload drain timeout exceeded")
+			exitCode = 1
 		}
 	}
 	// Flush log channels before exiting — os.Exit bypasses defers
 	stopPutLogs <- true
 	stopSendLogs <- true
 	waitGoRoutines.Wait()
 	cleanupMounts(cmdArgs.DownloadType)
-	os.Exit(0)
+	os.Exit(exitCode)
 }()

Please add a regression test for the timeout branch too. As per coding guidelines, "Ensure tests are updated/added for behavior changes like sidecar/init-container detection and shutdown/drain logic."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/runtime/cmd/ctrl/ctrl.go` around lines 1421 - 1436, The SIGTERM handler
currently always calls os.Exit(0) even when the upload drain times out; update
the logic in the signal handling block that references uploading, uploadDone and
the select so that when the timeout case (<-time.After(9 * time.Minute)) is
taken you call os.Exit with a non-zero code (e.g., os.Exit(1)) while keeping
os.Exit(0) only in the <-uploadDone branch; ensure you still flush logs and call
stopPutLogs, stopSendLogs, waitGoRoutines.Wait() and
cleanupMounts(cmdArgs.DownloadType) before exiting in both branches, and add a
regression test that simulates the timeout branch to assert the process exits
non-zero when the drain timeout is hit.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/utils/job/task.py`:
- Around line 2759-2766: The current code collapses
workflow_config.credential_config.disable_data_validation into a global boolean
and sets OSMO_SKIP_DATA_AUTH, which loses per-scheme info; instead, serialize
the actual disable_data_validation list into an env var (e.g.,
OSMO_SKIP_DATA_SCHEMES) on control_container_spec['env'] when
disable_data_validation is present (preserving '*' if supplied) so osmo-ctrl can
decide per input/output; update osmo-ctrl/ctrl.go to read that env var
(comma-separated or wildcard) and apply skipping per-scheme rather than relying
on a single boolean flag.

---

Duplicate comments:
In `@src/runtime/cmd/ctrl/ctrl.go`:
- Around line 1421-1436: The SIGTERM handler currently always calls os.Exit(0)
even when the upload drain times out; update the logic in the signal handling
block that references uploading, uploadDone and the select so that when the
timeout case (<-time.After(9 * time.Minute)) is taken you call os.Exit with a
non-zero code (e.g., os.Exit(1)) while keeping os.Exit(0) only in the
<-uploadDone branch; ensure you still flush logs and call stopPutLogs,
stopSendLogs, waitGoRoutines.Wait() and cleanupMounts(cmdArgs.DownloadType)
before exiting in both branches, and add a regression test that simulates the
timeout branch to assert the process exits non-zero when the drain timeout is
hit.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3c5f8171-9d06-4916-a1da-c32b4b1d988b

📥 Commits

Reviewing files that changed from the base of the PR and between eba8a0e and 7671bd0.

📒 Files selected for processing (3)
  • src/runtime/cmd/ctrl/ctrl.go
  • src/runtime/cmd/user/user.go
  • src/utils/job/task.py

Comment thread src/utils/job/task.py
Comment on lines +2759 to +2766
# Propagate disable_data_validation to the ctrl sidecar so it skips
# osmo dataset check when the server-side config says validation is disabled.
disabled_data = workflow_config.credential_config.disable_data_validation
if disabled_data and ('*' in disabled_data or 's3' in disabled_data):
control_container_spec['env'].append({
'name': 'OSMO_SKIP_DATA_AUTH',
'value': '1',
})

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don't collapse disable_data_validation into a global skip flag.

Line 2762 turns a backend-scoped setting into an all-or-nothing env toggle. That gives you two bad cases: ['gs'] never propagates to osmo-ctrl, and ['s3'] disables validation for every input/output in ctrl.go, not just the S3 ones. Please pass the configured schemes through and let osmo-ctrl apply the skip per input/output instead of reducing the list to a single boolean.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/utils/job/task.py` around lines 2759 - 2766, The current code collapses
workflow_config.credential_config.disable_data_validation into a global boolean
and sets OSMO_SKIP_DATA_AUTH, which loses per-scheme info; instead, serialize
the actual disable_data_validation list into an env var (e.g.,
OSMO_SKIP_DATA_SCHEMES) on control_container_spec['env'] when
disable_data_validation is present (preserving '*' if supplied) so osmo-ctrl can
decide per input/output; update osmo-ctrl/ctrl.go to read that env var
(comma-separated or wildcard) and apply skipping per-scheme rather than relying
on a single boolean flag.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: osmo-ctrl sidecar killed before task output upload completes — data loss for large outputs

1 participant