fix: convert osmo-ctrl to native K8s sidecar to prevent upload data loss by KeitaW · Pull Request #765 · NVIDIA/OSMO

KeitaW · 2026-03-31T16:19:37Z

Summary

Converts osmo-ctrl from a regular sidecar container to a Kubernetes native sidecar (KEP-753) and adds SIGTERM upload drain logic, fixing a race condition where large output uploads are killed before completion.

Move osmo-ctrl to initContainers[] with restartPolicy: Always — ensures osmo-ctrl receives SIGTERM only after the user container exits (K8s native sidecar semantics)
Add SIGTERM drain handler in ctrl.go — waits up to 9 minutes for in-progress uploads to complete before exiting
Set terminationGracePeriodSeconds: 600 on task pods (was unset, defaulting to 30s)
Simplify default_ctrl pod template to only set terminationGracePeriodSeconds (placing osmo-ctrl in both task.py and the template causes Duplicate value K8s 422 errors via merge_lists_on_name)
Update check_running_pod_containers() to check init_container_statuses for native sidecar osmo-ctrl

Root Cause

Two issues combine to kill uploads:

terminationGracePeriodSeconds is never set — task pods use the K8s default of 30 seconds
The SIGTERM handler in ctrl.go calls os.Exit(1) immediately after FUSE unmount, killing any in-progress osmo data upload subprocess

Even increasing the grace period alone doesn't help because os.Exit(1) fires immediately on SIGTERM.

Files Changed

File	Change
`src/runtime/cmd/ctrl/ctrl.go`	SIGTERM drain: `uploading` atomic flag + `uploadDone` channel
`src/utils/job/task.py`	Move `control_container_spec` to `initContainers` with `restartPolicy: Always`
`src/service/core/config/objects.py`	`default_ctrl` template: only `terminationGracePeriodSeconds: 600`
`src/operator/backend_listener.py`	`check_running_pod_containers()` checks both init and regular container statuses
`src/operator/tests/test_listener.py`	Test for native sidecar osmo-ctrl error detection

Compatibility

Requires Kubernetes >= 1.29 (SidecarContainers feature gate beta, enabled by default)
GA in Kubernetes 1.33
No changes to workflow YAML or OSMO API

Fixes #764

Summary by CodeRabbit

Bug Fixes
- Failure detection now includes init container termination states, improving error reporting.
New Features
- Graceful shutdown waits up to 9 minutes for in-progress uploads and better drains remaining logs before exit.
- Optional bypass of data access validation via env toggling.
Chores
- Control-side container moved to run as an init container; default pod templates set a 600s termination grace period.
- Resource discovery now considers init containers.
Tests
- Added test covering init-container termination triggering failure detection.

osmo-ctrl is killed before large output uploads complete because: (1) terminationGracePeriodSeconds defaults to 30s (never set by OSMO), and (2) the SIGTERM handler calls os.Exit(1) immediately without waiting for in-progress uploads to finish. This converts osmo-ctrl from a regular container to a Kubernetes native sidecar (KEP-753, init container with restartPolicy: Always), which ensures it receives SIGTERM only after the user container exits. Combined with a SIGTERM drain handler that waits for upload completion, this eliminates the race condition. Changes: - task.py: Move control_container_spec from containers[] to initContainers[] with restartPolicy: Always; set terminationGracePeriodSeconds: 600 - ctrl.go: Add upload drain logic to SIGTERM handler — waits up to 9 minutes for in-progress uploads before exiting - objects.py: Simplify default_ctrl template to only set terminationGracePeriodSeconds (container placement is handled by task.py; duplicating in the template causes K8s 422 errors via merge_lists_on_name) - backend_listener.py: Check init_container_statuses in addition to container_statuses so osmo-ctrl termination is detected as a native sidecar - test_listener.py: Add test for native sidecar osmo-ctrl error detection Requires Kubernetes >= 1.29 (SidecarContainers feature gate beta, enabled by default). Fixes NVIDIA#764

coderabbitai · 2026-03-31T16:20:09Z

📝 Walkthrough

Walkthrough

osmo-ctrl is moved to an init-sidecar in task pods and pod termination grace is set to 600s. The ctrl process now waits for in-progress uploads to drain (with a timeout) on SIGINT/SIGTERM. Operator and connector code now inspect init container statuses/templates, and tests were added for init-sidecar termination detection.

Changes

Cohort / File(s)	Summary
Pod spec / sidecar placement `src/utils/job/task.py`, `src/service/core/config/objects.py`	Control container removed from regular `spec.containers` and added to `spec.initContainers` (`restartPolicy: 'Always'`); default ctrl pod template reduced to `terminationGracePeriodSeconds: 600`.
SIGTERM / upload drain logic `src/runtime/cmd/ctrl/ctrl.go`	Add process-wide atomic `uploading` flag and `uploadDone` channel; SIGINT/SIGTERM handler waits for ongoing uploads (or ~9min timeout) before cleanup/exit; logging and log-queue drain on shutdown adjusted.
Operator status checks `src/operator/backend_listener.py`	`check_running_pod_containers()` now combines `pod.status.init_container_statuses` and `pod.status.container_statuses` when inspecting container termination reasons.
Tests for init-sidecar termination `src/operator/tests/test_listener.py`	New test helper builds pod status with `osmo-ctrl` as a terminated init container and a running user container; test asserts error detection for init-sidecar termination.
Postgres connector pod template parsing `src/utils/connectors/postgres.py`	When scanning pod templates for the control container, inspect both `spec.initContainers` and `spec.containers` (combined) to locate `osmo-ctrl` and its resources.
User runtime logging `src/runtime/cmd/user/user.go`	Add log statement when initial decoded control message has type `messages.CtrlFailed`.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant K8S as "Kubernetes SIGTERM"
    participant CTRL as "osmo-ctrl (process)"
    participant UP as "Upload subprocess"
    participant MNT as "Mount cleanup"

    K8S->>CTRL: send SIGTERM
    activate CTRL
    CTRL->>CTRL: check atomic `uploading` flag
    alt uploading == true
        CTRL->>CTRL: wait for `uploadDone` or 9min timeout
        par upload completes
            UP-->>CTRL: upload finished
            CTRL->>CTRL: close `uploadDone`
        and timeout not hit
        end
        CTRL->>MNT: cleanup mounts
        CTRL-->>K8S: exit(0)
    else uploading == false
        CTRL->>MNT: cleanup mounts
        CTRL-->>K8S: exit(1)
    end
    deactivate CTRL

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I hopped in logs where big files roam,
The sidecar waits to carry them home,
With atomic paws I hold the gate,
Nine minutes more to finish the crate,
Then mounts unroll and all's back to home.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 45.45% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title accurately describes the main objective: converting osmo-ctrl to a native K8s sidecar to prevent upload data loss, which directly addresses issue `#764`.
Linked Issues check	✅ Passed	The PR implements all primary coding objectives from issue `#764`: converts osmo-ctrl to native sidecar (initContainers with restartPolicy), adds SIGTERM upload drain logic with timeout, sets terminationGracePeriodSeconds to 600s, updates pod templates, and detects native sidecar status in backend_listener.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to issue `#764`: K8s native sidecar conversion, SIGTERM drain handling, pod template updates, and operator status detection. Additional improvements like OSMO_SKIP_DATA_AUTH propagation and log draining are necessary runtime fixes discovered during testing.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (1)

src/operator/tests/test_listener.py (1)
222-238: Model the native sidecar in pod.spec and assert through calculate_pod_status().

Line 226 still uses self.create_spec(), which keeps osmo-ctrl in spec.containers. That means this test proves only the status-list helper, not the real native-sidecar pod shape. Building the mock so osmo-ctrl lives under the pod's init-container spec and asserting calculate_pod_status() would cover the production path better.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/operator/tests/test_listener.py` around lines 222 - 238, The test
currently builds a pod via create_native_sidecar_ctrl_terminated_pod but still
calls create_spec() which places osmo-ctrl in spec.containers; modify the pod
model so osmo-ctrl is placed in pod.spec.init_containers (or via a helper that
returns a spec with initContainers) inside
create_native_sidecar_ctrl_terminated_pod, then call
backend_listener.calculate_pod_status(pod_event) and assert the returned
status/error reflects the osmo-ctrl termination instead of only calling
check_running_pod_containers; update references to create_spec(),
create_native_sidecar_ctrl_terminated_pod, calculate_pod_status, and
check_running_pod_containers accordingly.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/runtime/cmd/ctrl/ctrl.go`:
- Around line 1404-1414: The handler currently always calls os.Exit(1) after the
upload drain block, which causes a graceful drained SIGTERM (when
uploading.Load() is true and <-uploadDone completes) to be treated as a failure;
update the control flow in the SIGTERM handling so that after a successful
upload drain (the case where <-uploadDone returns) you call os.Exit(0) instead
of os.Exit(1), while preserving os.Exit(1) for the upload timeout/failure path;
locate the block using uploading.Load(), the uploadDone channel,
cleanupMounts(cmdArgs.DownloadType) and os.Exit(1) to make the change.
- Around line 1498-1503: The uploading flag is set too late causing a race where
SIGTERM can bypass uploads; move the uploading.Store(true) to before the exec
phase (i.e., set it before entering the execLogs/ExecFinished handling or
immediately before the signal handler is registered) and only call
uploading.Store(false) after uploadOutputs(...) completes; ensure the signal
handler continues to check uploading.Load() so it will wait for uploadOutputs to
finish instead of calling cleanupMounts()/os.Exit(1); add a unit/integration
test that sends SIGTERM immediately after an ExecFinished message to verify
uploads complete and no data is skipped.

In `@src/utils/job/task.py`:
- Around line 2808-2816: The control sidecar "osmo-ctrl" was moved into pod spec
"initContainers", but readers still only inspect spec['containers']; update the
connector that extracts control-container data (the postgres connector logic
that searches for "osmo-ctrl") to also examine spec.get('initContainers', []) in
addition to spec.get('containers', []) and merge/choose the found container when
reading resource limits and other metadata; ensure the lookup code (the function
in connectors.postgres that looks up "osmo-ctrl") checks both lists and prefers
the initContainer match if present so generated pod specs continue to provide
limits.

---

Nitpick comments:
In `@src/operator/tests/test_listener.py`:
- Around line 222-238: The test currently builds a pod via
create_native_sidecar_ctrl_terminated_pod but still calls create_spec() which
places osmo-ctrl in spec.containers; modify the pod model so osmo-ctrl is placed
in pod.spec.init_containers (or via a helper that returns a spec with
initContainers) inside create_native_sidecar_ctrl_terminated_pod, then call
backend_listener.calculate_pod_status(pod_event) and assert the returned
status/error reflects the osmo-ctrl termination instead of only calling
check_running_pod_containers; update references to create_spec(),
create_native_sidecar_ctrl_terminated_pod, calculate_pod_status, and
check_running_pod_containers accordingly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 163d06c6-3141-460d-84a7-87142f945af9

📥 Commits

Reviewing files that changed from the base of the PR and between ecd7ad1 and 9da50af.

📒 Files selected for processing (5)

src/operator/backend_listener.py
src/operator/tests/test_listener.py
src/runtime/cmd/ctrl/ctrl.go
src/service/core/config/objects.py
src/utils/job/task.py

coderabbitai · 2026-03-31T16:30:54Z

+	uploading.Store(true)
 	uploadOutputs(unixConn, cmdArgs.Outputs, cmdArgs.OutputPath, cmdArgs.MetadataFile,
 		uploadChan, metricChan, cmdArgs.RetryId, cmdArgs.GroupName, cmdArgs.LogSource,
 		cmdArgs.UserConfig, cmdArgs.ServiceConfig, cmdArgs.ConfigLoc)
+	uploading.Store(false)
+	close(uploadDone)


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, check file size and structure wc -l src/runtime/cmd/ctrl/ctrl.go

Repository: NVIDIA/OSMO

Length of output: 88

🏁 Script executed:

# Read the declaration area (lines 74-76) sed -n '70,85p' src/runtime/cmd/ctrl/ctrl.go

Repository: NVIDIA/OSMO

Length of output: 402

🏁 Script executed:

# Read the signal handler area (lines 1404-1412) sed -n '1400,1420p' src/runtime/cmd/ctrl/ctrl.go

Repository: NVIDIA/OSMO

Length of output: 716

🏁 Script executed:

# Read the upload call area (lines 1498-1503 and surrounding context) sed -n '1490,1510p' src/runtime/cmd/ctrl/ctrl.go

Repository: NVIDIA/OSMO

Length of output: 846

🏁 Script executed:

# Find the execLogs loop to understand when it exits rg -n "execLogs" src/runtime/cmd/ctrl/ctrl.go -A 5 -B 5

Repository: NVIDIA/OSMO

Length of output: 922

🏁 Script executed:

# Search for tests of SIGTERM/signal handling find . -name "*test*.go" -o -name "*_test.go" | xargs grep -l "SIGTERM\|sigintCatch\|uploading" 2>/dev/null | head -20

Repository: NVIDIA/OSMO

Length of output: 37

🏁 Script executed:

# Check if there are any other references to uploading flag rg -n "uploading\." src/runtime/cmd/ctrl/ctrl.go

Repository: NVIDIA/OSMO

Length of output: 140

🏁 Script executed:

# Check the uploadOutputs function signature and blocking behavior rg -n "func uploadOutputs" src/runtime/cmd/ctrl/ctrl.go -A 10

Repository: NVIDIA/OSMO

Length of output: 492

🏁 Script executed:

# Check if there's any initialization or prior setting of uploading before the loop sed -n '1400,1500p' src/runtime/cmd/ctrl/ctrl.go | head -100

Repository: NVIDIA/OSMO

Length of output: 3345

Uploading flag must be set before the signal handler can bypass upload draining.

There is a genuine race condition at lines 1498–1502. The uploading flag is set only after the execLogs loop exits and before uploadOutputs() runs. In that window, SIGTERM can arrive, the signal handler checks uploading.Load() (which is still false), and exits immediately via cleanupMounts() and os.Exit(1) without waiting for uploads—causing the subsequent uploadOutputs() call never to execute.

In the native-sidecar lifecycle, the kubelet sends SIGTERM when the user container exits, which occurs precisely during this race window. The result is data loss: task outputs are skipped.

Set the uploading flag to true before entering the exec phase (or immediately before the signal handler setup if uploads can happen during exec), and clear it only after uploadOutputs() completes. Add a test that delivers SIGTERM right after an ExecFinished message to confirm the fix prevents the race.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/runtime/cmd/ctrl/ctrl.go` around lines 1498 - 1503, The uploading flag is set too late causing a race where SIGTERM can bypass uploads; move the uploading.Store(true) to before the exec phase (i.e., set it before entering the execLogs/ExecFinished handling or immediately before the signal handler is registered) and only call uploading.Store(false) after uploadOutputs(...) completes; ensure the signal handler continues to check uploading.Load() so it will wait for uploadOutputs to finish instead of calling cleanupMounts()/os.Exit(1); add a unit/integration test that sends SIGTERM immediately after an ExecFinished message to verify uploads complete and no data is skipped.

We investigated this and believe the race is not practically exploitable in the native sidecar model.

With native sidecars (KEP-753), kubelet sends SIGTERM to the sidecar only after the user container has fully stopped. But ExecFinished arrives via the Unix socket before the user container process exits (it is the last message sent by the user process). So the execLogs loop exits and uploading.Store(true) is set well before kubelet initiates sidecar termination — osmo-ctrl gets a deterministic head start.

The window between break execLogs and uploading.Store(true) is a few Go instructions (nanoseconds), while kubelet's container stop → SIGTERM delivery path involves API calls and process teardown (milliseconds to seconds).

We opted not to move uploading.Store(true) earlier because setting it before the exec phase would cause the SIGTERM handler to wait for uploads even when the process is in the download or exec phase (where no upload is in progress and waiting would be incorrect).

@KeitaW, thank you for the thorough explanation — the reasoning is sound.

The key insight is the lifecycle ordering guarantee from KEP-753: kubelet only delivers SIGTERM to the native sidecar after the user container's process has fully exited, and ExecFinished is sent over the Unix socket before that exit. The gap between break execLogs and uploading.Store(true) is therefore bounded by a few Go instructions, while the kubelet's container-stop-to-SIGTERM path (API call, cgroup accounting, etc.) takes orders of magnitude longer. Under that model the window is not realistically reachable.

The deliberate choice to keep uploading.Store(true) as late as possible also makes sense — hoisting it earlier would cause the drain logic to block on SIGTERM during the download or exec phase, where no upload is occurring and an indefinite wait would be incorrect.

I'll withdraw the concern for this design and note the intent for future reviews.

✏️ Learnings added

Learnt from: KeitaW Repo: NVIDIA/OSMO PR: 765 File: src/runtime/cmd/ctrl/ctrl.go:1500-1505 Timestamp: 2026-03-31T17:46:05.985Z Learning: In `src/runtime/cmd/ctrl/ctrl.go`, `uploading.Store(true)` is deliberately set immediately before `uploadOutputs()` (not before the exec phase). Under the Kubernetes native sidecar model (KEP-753, K8s >= 1.29), kubelet sends SIGTERM to osmo-ctrl only after the user container fully exits, which happens after `ExecFinished` is received and `uploading.Store(true)` is already set — so the apparent SIGTERM race window is practically unexploitable. Moving the flag earlier would incorrectly trigger upload-drain waiting during download or exec phases.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

…n postgres.py - ctrl.go: os.Exit(0) after successful upload drain so K8s reports reason=Completed instead of reason=Error (which caused the backend listener to mark the workflow FAILED despite successful upload) - postgres.py: check_osmo_data_resource() now scans both containers and initContainers for osmo-ctrl resource limits, fixing a regression where capacity calculations silently returned empty after moving osmo-ctrl to initContainers

coderabbitai

♻️ Duplicate comments (1)

src/runtime/cmd/ctrl/ctrl.go (1)
1500-1505: ⚠️ Potential issue | 🔴 Critical

The drain state still has race windows at the upload boundaries.

Line 1500 is still after every break execLogs path, so a SIGTERM right after ExecFinished/EOF can still see uploading == false and exit before uploadOutputs() runs. On the other edge, Line 1504 clears the flag before Line 1505 closes uploadDone, so a signal in that gap exits 1 even though the upload already finished. Please protect the whole post-exec phase before leaving execLogs, then close uploadDone before clearing the flag, and add a regression test that sends SIGTERM immediately after exec completion.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/runtime/cmd/ctrl/ctrl.go` around lines 1500 - 1505, The drain/race is
caused by toggling uploading around uploadOutputs so signals can see
uploading==false between exec completion and upload start, and also clearing it
before closing uploadDone; to fix, move the uploading.Store(true) to cover the
entire post-exec phase (set it before leaving execLogs/any ExecFinished/EOF path
so the process is considered draining as soon as exec ends), ensure
close(uploadDone) is called before calling uploading.Store(false) (so the done
signal races correctly), and keep uploadOutputs(...) called while uploading is
true; update the logic around execLogs/ExecFinished/EOF to guarantee the upload
window is fully protected and add a regression test that sends SIGTERM
immediately after exec completion to assert the process waits for uploadDone and
exits only after uploadOutputs finishes.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/runtime/cmd/ctrl/ctrl.go`:
- Around line 1500-1505: The drain/race is caused by toggling uploading around
uploadOutputs so signals can see uploading==false between exec completion and
upload start, and also clearing it before closing uploadDone; to fix, move the
uploading.Store(true) to cover the entire post-exec phase (set it before leaving
execLogs/any ExecFinished/EOF path so the process is considered draining as soon
as exec ends), ensure close(uploadDone) is called before calling
uploading.Store(false) (so the done signal races correctly), and keep
uploadOutputs(...) called while uploading is true; update the logic around
execLogs/ExecFinished/EOF to guarantee the upload window is fully protected and
add a regression test that sends SIGTERM immediately after exec completion to
assert the process waits for uploadDone and exits only after uploadOutputs
finishes.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6ba11185-ea19-4a1d-a9ac-1d7801c94cef

📥 Commits

Reviewing files that changed from the base of the PR and between 9da50af and eba8a0e.

📒 Files selected for processing (2)

src/runtime/cmd/ctrl/ctrl.go
src/utils/connectors/postgres.py

Five fixes discovered while testing the native sidecar conversion on a production EKS cluster: 1. Honor OSMO_SKIP_DATA_AUTH in ctrl.go — the Python service sets this env var when credential_config.disable_data_validation includes "*" or "s3", but the Go runtime never checked it. Dataset output write validation failed (iam:SimulatePrincipalPolicy), causing silent task failure with exit 0. 2. Log CtrlFailed in user.go — osmo_exec silently returned exit 0 when receiving CtrlFailed, making validation failures invisible to users. 3. Flush log channels before os.Exit in SIGTERM handler — os.Exit() bypasses deferred functions, so logs were lost on task completion. 4. Drain sendLogs queue on stop — sendLogs returned immediately on stopChan without flushing remaining messages. 5. Flush logQueue directly before LogDone — the logger closes the websocket after receiving LogDone, so any user logs still in the queue at that point were lost. Now flushes from the main goroutine while the websocket is still open. 6. Propagate disable_data_validation to ctrl container env — adds OSMO_SKIP_DATA_AUTH=1 when the service config disables validation.

KeitaW · 2026-04-02T17:30:08Z

Additional fixes from production deployment testing

Pushed 7671bd00 with six fixes discovered while deploying the native sidecar conversion on a production EKS cluster (B200 GPUs, OSMO v6.2-rc6 base).

Fixes

1. Honor OSMO_SKIP_DATA_AUTH in ctrl.go (critical)
When credential_config.disable_data_validation includes "*" or "s3", the Python service (task.py) sets OSMO_SKIP_DATA_AUTH=1 in the ctrl container env. But ctrl.go always called ValidateInputsOutputsAccess unconditionally — it never checked this env var. The osmo dataset check --access-type WRITE command failed on deployments without full IAM SimulatePrincipalPolicy permissions, causing ctrl to panic and osmo_exec to exit 0 silently. All workflows with outputs: (dataset uploads) failed with no visible error.

2. Log CtrlFailed in user.go
When osmo_exec received CtrlFailed from ctrl (e.g., from the validation failure above), it returned with exit 0 and no log output — completely invisible to users. Now logs "Received CtrlFailed from osmo-ctrl — task cannot proceed".

3. SIGTERM handler flushes logs before exit
The SIGTERM handler called os.Exit(0) / os.Exit(1) directly, bypassing all deferred functions including log channel flushing. As a native sidecar, ctrl receives SIGTERM when the user container exits — if logs haven't been forwarded yet, they were lost. Now flushes stopPutLogs, stopSendLogs, and waitGoRoutines.Wait() before exiting. Also changed non-upload SIGTERM from exit(1) to exit(0) since SIGTERM after task completion is expected.

4. sendLogs drains queue on stop
sendLogs processed the log queue one message per ticker tick. When it received the stop signal, it returned immediately without flushing remaining messages.

5. Direct queue flush before LogDone
The logger closes the websocket after receiving LogDone. If user logs were still in the queue when LogDone was sent, they were permanently lost. Now flushes logQueue directly from the main goroutine (holding bufferMutex) before enqueueing LogDone.

6. Propagate disable_data_validation to ctrl container env (task.py)
Adds OSMO_SKIP_DATA_AUTH=1 to the ctrl container's environment when credential_config.disable_data_validation contains "*" or "s3". This is the Python-side counterpart to fix #1.

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

src/runtime/cmd/ctrl/ctrl.go (1)
1421-1436: ⚠️ Potential issue | 🔴 Critical

Return a failure exit code when the drain timeout is hit.

Line 1427 detects the timeout, but Line 1436 still exits 0. That makes Kubernetes report Completed, so a timed-out upload drain becomes a silent success even though uploads were cut short. Keep 0 only for the <-uploadDone branch and exit non-zero on the timeout path.
💡 Minimal fix
 go func() {
 	<-sigintCatch
 	log.Println("SIGTERM received, starting graceful shutdown...")
+	exitCode := 0
 	if uploading.Load() {
 		log.Println("Upload in progress, waiting for completion...")
 		select {
 		case <-uploadDone:
 			log.Println("Upload completed after SIGTERM")
 		case <-time.After(9 * time.Minute):
 			log.Println("Upload drain timeout exceeded")
+			exitCode = 1
 		}
 	}
 	// Flush log channels before exiting — os.Exit bypasses defers
 	stopPutLogs <- true
 	stopSendLogs <- true
 	waitGoRoutines.Wait()
 	cleanupMounts(cmdArgs.DownloadType)
-	os.Exit(0)
+	os.Exit(exitCode)
 }()
Please add a regression test for the timeout branch too. As per coding guidelines, "Ensure tests are updated/added for behavior changes like sidecar/init-container detection and shutdown/drain logic."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/runtime/cmd/ctrl/ctrl.go` around lines 1421 - 1436, The SIGTERM handler
currently always calls os.Exit(0) even when the upload drain times out; update
the logic in the signal handling block that references uploading, uploadDone and
the select so that when the timeout case (<-time.After(9 * time.Minute)) is
taken you call os.Exit with a non-zero code (e.g., os.Exit(1)) while keeping
os.Exit(0) only in the <-uploadDone branch; ensure you still flush logs and call
stopPutLogs, stopSendLogs, waitGoRoutines.Wait() and
cleanupMounts(cmdArgs.DownloadType) before exiting in both branches, and add a
regression test that simulates the timeout branch to assert the process exits
non-zero when the drain timeout is hit.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/utils/job/task.py`:
- Around line 2759-2766: The current code collapses
workflow_config.credential_config.disable_data_validation into a global boolean
and sets OSMO_SKIP_DATA_AUTH, which loses per-scheme info; instead, serialize
the actual disable_data_validation list into an env var (e.g.,
OSMO_SKIP_DATA_SCHEMES) on control_container_spec['env'] when
disable_data_validation is present (preserving '*' if supplied) so osmo-ctrl can
decide per input/output; update osmo-ctrl/ctrl.go to read that env var
(comma-separated or wildcard) and apply skipping per-scheme rather than relying
on a single boolean flag.

---

Duplicate comments:
In `@src/runtime/cmd/ctrl/ctrl.go`:
- Around line 1421-1436: The SIGTERM handler currently always calls os.Exit(0)
even when the upload drain times out; update the logic in the signal handling
block that references uploading, uploadDone and the select so that when the
timeout case (<-time.After(9 * time.Minute)) is taken you call os.Exit with a
non-zero code (e.g., os.Exit(1)) while keeping os.Exit(0) only in the
<-uploadDone branch; ensure you still flush logs and call stopPutLogs,
stopSendLogs, waitGoRoutines.Wait() and cleanupMounts(cmdArgs.DownloadType)
before exiting in both branches, and add a regression test that simulates the
timeout branch to assert the process exits non-zero when the drain timeout is
hit.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3c5f8171-9d06-4916-a1da-c32b4b1d988b

📥 Commits

Reviewing files that changed from the base of the PR and between eba8a0e and 7671bd0.

📒 Files selected for processing (3)

src/runtime/cmd/ctrl/ctrl.go
src/runtime/cmd/user/user.go
src/utils/job/task.py

coderabbitai · 2026-04-02T17:40:15Z

+        # Propagate disable_data_validation to the ctrl sidecar so it skips
+        # osmo dataset check when the server-side config says validation is disabled.
+        disabled_data = workflow_config.credential_config.disable_data_validation
+        if disabled_data and ('*' in disabled_data or 's3' in disabled_data):
+            control_container_spec['env'].append({
+                'name': 'OSMO_SKIP_DATA_AUTH',
+                'value': '1',
+            })


⚠️ Potential issue | 🟠 Major

Don't collapse disable_data_validation into a global skip flag.

Line 2762 turns a backend-scoped setting into an all-or-nothing env toggle. That gives you two bad cases: ['gs'] never propagates to osmo-ctrl, and ['s3'] disables validation for every input/output in ctrl.go, not just the S3 ones. Please pass the configured schemes through and let osmo-ctrl apply the skip per input/output instead of reducing the list to a single boolean.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@src/utils/job/task.py` around lines 2759 - 2766, The current code collapses workflow_config.credential_config.disable_data_validation into a global boolean and sets OSMO_SKIP_DATA_AUTH, which loses per-scheme info; instead, serialize the actual disable_data_validation list into an env var (e.g., OSMO_SKIP_DATA_SCHEMES) on control_container_spec['env'] when disable_data_validation is present (preserving '*' if supplied) so osmo-ctrl can decide per input/output; update osmo-ctrl/ctrl.go to read that env var (comma-separated or wildcard) and apply skipping per-scheme rather than relying on a single boolean flag.

KeitaW requested a review from a team as a code owner March 31, 2026 16:19

coderabbitai Bot reviewed Mar 31, 2026

View reviewed changes

coderabbitai Bot reviewed Apr 2, 2026

View reviewed changes

Conversation

KeitaW commented Mar 31, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root Cause

Files Changed

Compatibility

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Mar 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

KeitaW Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 31, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

KeitaW commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Additional fixes from production deployment testing

Fixes

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

KeitaW commented Mar 31, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Mar 31, 2026 •

edited

Loading

coderabbitai Bot Mar 31, 2026 •

edited

Loading

KeitaW commented Apr 2, 2026 •

edited

Loading