fix: convert osmo-ctrl to native K8s sidecar to prevent upload data loss #765
KeitaW wants to merge 3 commits into
osmo-ctrl is killed before large output uploads complete because: (1) terminationGracePeriodSeconds defaults to 30s (never set by OSMO), and (2) the SIGTERM handler calls os.Exit(1) immediately without waiting for in-progress uploads to finish.

This converts osmo-ctrl from a regular container to a Kubernetes native sidecar (KEP-753, init container with restartPolicy: Always), which ensures it receives SIGTERM only after the user container exits. Combined with a SIGTERM drain handler that waits for upload completion, this eliminates the race condition.

Changes:
- task.py: Move control_container_spec from containers[] to initContainers[] with restartPolicy: Always; set terminationGracePeriodSeconds: 600
- ctrl.go: Add upload drain logic to the SIGTERM handler, which waits up to 9 minutes for in-progress uploads before exiting
- objects.py: Simplify default_ctrl template to only set terminationGracePeriodSeconds (container placement is handled by task.py; duplicating it in the template causes K8s 422 errors via merge_lists_on_name)
- backend_listener.py: Check init_container_statuses in addition to container_statuses so osmo-ctrl termination is detected as a native sidecar
- test_listener.py: Add test for native sidecar osmo-ctrl error detection

Requires Kubernetes >= 1.29 (SidecarContainers feature gate beta, enabled by default).

Fixes NVIDIA#764
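The pod-spec change described for task.py can be pictured with a small sketch. This is not the code in the PR; `to_native_sidecar` and the example images are hypothetical, and the pod spec is modeled as a plain dict. It only illustrates the shape: the control container leaves `containers`, gains `restartPolicy: Always` under `initContainers`, and the pod gets a 600-second grace period.

```python
import copy


def to_native_sidecar(pod_spec, name="osmo-ctrl", grace_seconds=600):
    """Return a copy of pod_spec with `name` moved to initContainers.

    Hypothetical helper for illustration; the real change lives in task.py.
    """
    spec = copy.deepcopy(pod_spec)
    containers = spec.get("containers", [])
    sidecar = next((c for c in containers if c.get("name") == name), None)
    if sidecar is None:
        return spec  # nothing to move

    # Keep every other container where it was.
    spec["containers"] = [c for c in containers if c.get("name") != name]

    # An init container with restartPolicy: Always is a native sidecar.
    sidecar["restartPolicy"] = "Always"
    spec.setdefault("initContainers", []).append(sidecar)

    # Give the sidecar time to drain uploads after SIGTERM.
    spec["terminationGracePeriodSeconds"] = grace_seconds
    return spec


pod_spec = {
    "containers": [
        {"name": "user", "image": "example/user"},      # placeholder image
        {"name": "osmo-ctrl", "image": "example/ctrl"},  # placeholder image
    ]
}
result = to_native_sidecar(pod_spec)
```

Working on a deep copy keeps the input spec untouched, which makes the before/after shapes easy to compare in a test.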
📝 Walkthrough
osmo-ctrl is moved to an init-sidecar in task pods and pod termination grace is set to 600s. The ctrl process now waits for in-progress uploads to drain (with a timeout) on SIGINT/SIGTERM. Operator and connector code now inspect init container statuses/templates, and tests were added for init-sidecar termination detection.
Sequence Diagram(s)
sequenceDiagram
autonumber
participant K8S as "Kubernetes SIGTERM"
participant CTRL as "osmo-ctrl (process)"
participant UP as "Upload subprocess"
participant MNT as "Mount cleanup"
K8S->>CTRL: send SIGTERM
activate CTRL
CTRL->>CTRL: check atomic `uploading` flag
alt uploading == true
CTRL->>CTRL: wait for `uploadDone` or 9min timeout
par upload completes
UP-->>CTRL: upload finished
CTRL->>CTRL: close `uploadDone`
and timeout not hit
end
CTRL->>MNT: cleanup mounts
CTRL-->>K8S: exit(0)
else uploading == false
CTRL->>MNT: cleanup mounts
CTRL-->>K8S: exit(1)
end
deactivate CTRL
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 3
🧹 Nitpick comments (1)
src/operator/tests/test_listener.py (1)
222-238: Model the native sidecar in `pod.spec` and assert through `calculate_pod_status()`.
Line 226 still uses `self.create_spec()`, which keeps `osmo-ctrl` in `spec.containers`. That means this test proves only the status-list helper, not the real native-sidecar pod shape. Building the mock so `osmo-ctrl` lives under the pod's init-container spec and asserting `calculate_pod_status()` would cover the production path better.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/operator/tests/test_listener.py` around lines 222 - 238, The test currently builds a pod via create_native_sidecar_ctrl_terminated_pod but still calls create_spec() which places osmo-ctrl in spec.containers; modify the pod model so osmo-ctrl is placed in pod.spec.init_containers (or via a helper that returns a spec with initContainers) inside create_native_sidecar_ctrl_terminated_pod, then call backend_listener.calculate_pod_status(pod_event) and assert the returned status/error reflects the osmo-ctrl termination instead of only calling check_running_pod_containers; update references to create_spec(), create_native_sidecar_ctrl_terminated_pod, calculate_pod_status, and check_running_pod_containers accordingly.
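A minimal sketch of the detection being tested, assuming a pod object whose `status` exposes `container_statuses` and `init_container_statuses` as in the Kubernetes Python client. The function `find_terminated` and the `SimpleNamespace` stand-ins are hypothetical and are not the repository's `check_running_pod_containers()`; they only show why both status lists must be searched once osmo-ctrl is a native sidecar.

```python
from types import SimpleNamespace


def find_terminated(pod, name="osmo-ctrl"):
    """Return the terminated state of container `name`, or None.

    Searches regular and init container statuses, because a native
    sidecar is reported under init_container_statuses.
    """
    status = pod.status
    all_statuses = list(status.container_statuses or []) + list(
        status.init_container_statuses or []
    )
    for container_status in all_statuses:
        terminated = container_status.state.terminated
        if container_status.name == name and terminated is not None:
            return terminated
    return None


def make_status(name, exit_code=None, reason=None):
    """Build a stand-in for a container status (test helper, hypothetical)."""
    terminated = None
    if exit_code is not None:
        terminated = SimpleNamespace(exit_code=exit_code, reason=reason)
    return SimpleNamespace(name=name, state=SimpleNamespace(terminated=terminated))


# Pod shaped like a native-sidecar task pod: user container is regular,
# osmo-ctrl is listed among the init containers and has terminated.
pod = SimpleNamespace(
    status=SimpleNamespace(
        container_statuses=[make_status("user")],
        init_container_statuses=[make_status("osmo-ctrl", exit_code=1, reason="Error")],
    )
)
ctrl_state = find_terminated(pod)
```

A test built this way exercises the init-container path directly instead of relying on a spec that still lists osmo-ctrl under `containers`.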
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/runtime/cmd/ctrl/ctrl.go`:
- Around line 1404-1414: The handler currently always calls os.Exit(1) after the
upload drain block, which causes a graceful drained SIGTERM (when
uploading.Load() is true and <-uploadDone completes) to be treated as a failure;
update the control flow in the SIGTERM handling so that after a successful
upload drain (the case where <-uploadDone returns) you call os.Exit(0) instead
of os.Exit(1), while preserving os.Exit(1) for the upload timeout/failure path;
locate the block using uploading.Load(), the uploadDone channel,
cleanupMounts(cmdArgs.DownloadType) and os.Exit(1) to make the change.
- Around line 1498-1503: The uploading flag is set too late causing a race where
SIGTERM can bypass uploads; move the uploading.Store(true) to before the exec
phase (i.e., set it before entering the execLogs/ExecFinished handling or
immediately before the signal handler is registered) and only call
uploading.Store(false) after uploadOutputs(...) completes; ensure the signal
handler continues to check uploading.Load() so it will wait for uploadOutputs to
finish instead of calling cleanupMounts()/os.Exit(1); add a unit/integration
test that sends SIGTERM immediately after an ExecFinished message to verify
uploads complete and no data is skipped.
In `@src/utils/job/task.py`:
- Around line 2808-2816: The control sidecar "osmo-ctrl" was moved into pod spec
"initContainers", but readers still only inspect spec['containers']; update the
connector that extracts control-container data (the postgres connector logic
that searches for "osmo-ctrl") to also examine spec.get('initContainers', []) in
addition to spec.get('containers', []) and merge/choose the found container when
reading resource limits and other metadata; ensure the lookup code (the function
in connectors.postgres that looks up "osmo-ctrl") checks both lists and prefers
the initContainer match if present so generated pod specs continue to provide
limits.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 163d06c6-3141-460d-84a7-87142f945af9
📒 Files selected for processing (5)
- src/operator/backend_listener.py
- src/operator/tests/test_listener.py
- src/runtime/cmd/ctrl/ctrl.go
- src/service/core/config/objects.py
- src/utils/job/task.py
uploading.Store(true)
uploadOutputs(unixConn, cmdArgs.Outputs, cmdArgs.OutputPath, cmdArgs.MetadataFile,
	uploadChan, metricChan, cmdArgs.RetryId, cmdArgs.GroupName, cmdArgs.LogSource,
	cmdArgs.UserConfig, cmdArgs.ServiceConfig, cmdArgs.ConfigLoc)
uploading.Store(false)
close(uploadDone)
🧩 Analysis chain
🏁 Script executed:
# First, check file size and structure
wc -l src/runtime/cmd/ctrl/ctrl.go
Repository: NVIDIA/OSMO
Length of output: 88
🏁 Script executed:
# Read the declaration area (lines 74-76)
sed -n '70,85p' src/runtime/cmd/ctrl/ctrl.go
Repository: NVIDIA/OSMO
Length of output: 402
🏁 Script executed:
# Read the signal handler area (lines 1404-1412)
sed -n '1400,1420p' src/runtime/cmd/ctrl/ctrl.go
Repository: NVIDIA/OSMO
Length of output: 716
🏁 Script executed:
# Read the upload call area (lines 1498-1503 and surrounding context)
sed -n '1490,1510p' src/runtime/cmd/ctrl/ctrl.go
Repository: NVIDIA/OSMO
Length of output: 846
🏁 Script executed:
# Find the execLogs loop to understand when it exits
rg -n "execLogs" src/runtime/cmd/ctrl/ctrl.go -A 5 -B 5
Repository: NVIDIA/OSMO
Length of output: 922
🏁 Script executed:
# Search for tests of SIGTERM/signal handling
find . -name "*test*.go" -o -name "*_test.go" | xargs grep -l "SIGTERM\|sigintCatch\|uploading" 2>/dev/null | head -20
Repository: NVIDIA/OSMO
Length of output: 37
🏁 Script executed:
# Check if there are any other references to uploading flag
rg -n "uploading\." src/runtime/cmd/ctrl/ctrl.go
Repository: NVIDIA/OSMO
Length of output: 140
🏁 Script executed:
# Check the uploadOutputs function signature and blocking behavior
rg -n "func uploadOutputs" src/runtime/cmd/ctrl/ctrl.go -A 10
Repository: NVIDIA/OSMO
Length of output: 492
🏁 Script executed:
# Check if there's any initialization or prior setting of uploading before the loop
sed -n '1400,1500p' src/runtime/cmd/ctrl/ctrl.go | head -100
Repository: NVIDIA/OSMO
Length of output: 3345
Uploading flag must be set before the signal handler can bypass upload draining.
There is a genuine race condition at lines 1498–1502. The uploading flag is set only after the execLogs loop exits and before uploadOutputs() runs. In that window, SIGTERM can arrive, the signal handler checks uploading.Load() (which is still false), and exits immediately via cleanupMounts() and os.Exit(1) without waiting for uploads—causing the subsequent uploadOutputs() call never to execute.
In the native-sidecar lifecycle, the kubelet sends SIGTERM when the user container exits, which occurs precisely during this race window. The result is data loss: task outputs are skipped.
Set the uploading flag to true before entering the exec phase (or immediately before the signal handler setup if uploads can happen during exec), and clear it only after uploadOutputs() completes. Add a test that delivers SIGTERM right after an ExecFinished message to confirm the fix prevents the race.
We investigated this and believe the race is not practically exploitable in the native sidecar model.
With native sidecars (KEP-753), kubelet sends SIGTERM to the sidecar only after the user container has fully stopped. But ExecFinished arrives via the Unix socket before the user container process exits (it is the last message sent by the user process). So the execLogs loop exits and uploading.Store(true) is set well before kubelet initiates sidecar termination — osmo-ctrl gets a deterministic head start.
The window between break execLogs and uploading.Store(true) is a few Go instructions (nanoseconds), while kubelet's container stop → SIGTERM delivery path involves API calls and process teardown (milliseconds to seconds).
We opted not to move uploading.Store(true) earlier because setting it before the exec phase would cause the SIGTERM handler to wait for uploads even when the process is in the download or exec phase (where no upload is in progress and waiting would be incorrect).
@KeitaW, thank you for the thorough explanation — the reasoning is sound.
The key insight is the lifecycle ordering guarantee from KEP-753: kubelet only delivers SIGTERM to the native sidecar after the user container's process has fully exited, and ExecFinished is sent over the Unix socket before that exit. The gap between break execLogs and uploading.Store(true) is therefore bounded by a few Go instructions, while the kubelet's container-stop-to-SIGTERM path (API call, cgroup accounting, etc.) takes orders of magnitude longer. Under that model the window is not realistically reachable.
The deliberate choice to keep uploading.Store(true) as late as possible also makes sense — hoisting it earlier would cause the drain logic to block on SIGTERM during the download or exec phase, where no upload is occurring and an indefinite wait would be incorrect.
I'll withdraw the concern for this design and note the intent for future reviews.
✏️ Learnings added
Learnt from: KeitaW
Repo: NVIDIA/OSMO PR: 765
File: src/runtime/cmd/ctrl/ctrl.go:1500-1505
Timestamp: 2026-03-31T17:46:05.985Z
Learning: In `src/runtime/cmd/ctrl/ctrl.go`, `uploading.Store(true)` is deliberately set immediately before `uploadOutputs()` (not before the exec phase). Under the Kubernetes native sidecar model (KEP-753, K8s >= 1.29), kubelet sends SIGTERM to osmo-ctrl only after the user container fully exits, which happens after `ExecFinished` is received and `uploading.Store(true)` is already set — so the apparent SIGTERM race window is practically unexploitable. Moving the flag earlier would incorrectly trigger upload-drain waiting during download or exec phases.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
…n postgres.py
- ctrl.go: os.Exit(0) after successful upload drain so K8s reports reason=Completed instead of reason=Error (which caused the backend listener to mark the workflow FAILED despite successful upload)
- postgres.py: check_osmo_data_resource() now scans both containers and initContainers for osmo-ctrl resource limits, fixing a regression where capacity calculations silently returned empty after moving osmo-ctrl to initContainers
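The lookup described for postgres.py can be sketched as follows. This is an illustration under assumptions, not the body of `check_osmo_data_resource()`: `find_container` and `container_limits` are hypothetical names, and the pod spec is modeled as a plain dict. The sketch prefers an `initContainers` match, as the review suggested, and falls back to `containers`.

```python
def find_container(pod_spec, name="osmo-ctrl"):
    """Return the container dict named `name`, or None.

    initContainers is searched first so a native sidecar wins over a
    regular container with the same name.
    """
    for key in ("initContainers", "containers"):
        for container in pod_spec.get(key) or []:
            if container.get("name") == name:
                return container
    return None


def container_limits(pod_spec, name="osmo-ctrl"):
    """Return the resource limits of `name`, or {} when it is absent."""
    container = find_container(pod_spec, name)
    if container is None:
        return {}
    return container.get("resources", {}).get("limits", {})


# Spec shaped like a pod after the native-sidecar conversion.
sidecar_spec = {
    "initContainers": [
        {
            "name": "osmo-ctrl",
            "restartPolicy": "Always",
            "resources": {"limits": {"cpu": "1", "memory": "1Gi"}},  # example values
        }
    ],
    "containers": [{"name": "user"}],
}
limits = container_limits(sidecar_spec)
```

Returning an empty dict for a missing container mirrors the silent-empty behavior the commit message describes, so callers that need to distinguish "absent" from "no limits" would want `find_container` directly.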
♻️ Duplicate comments (1)
src/runtime/cmd/ctrl/ctrl.go (1)
1500-1505: ⚠️ Potential issue | 🔴 Critical
The drain state still has race windows at the upload boundaries.
Line 1500 is still after every `break execLogs` path, so a SIGTERM right after `ExecFinished`/EOF can still see `uploading == false` and exit before `uploadOutputs()` runs. On the other edge, Line 1504 clears the flag before Line 1505 closes `uploadDone`, so a signal in that gap exits `1` even though the upload already finished. Please protect the whole post-exec phase before leaving `execLogs`, then close `uploadDone` before clearing the flag, and add a regression test that sends SIGTERM immediately after exec completion.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/runtime/cmd/ctrl/ctrl.go` around lines 1500 - 1505, The drain/race is caused by toggling uploading around uploadOutputs so signals can see uploading==false between exec completion and upload start, and also clearing it before closing uploadDone; to fix, move the uploading.Store(true) to cover the entire post-exec phase (set it before leaving execLogs/any ExecFinished/EOF path so the process is considered draining as soon as exec ends), ensure close(uploadDone) is called before calling uploading.Store(false) (so the done signal races correctly), and keep uploadOutputs(...) called while uploading is true; update the logic around execLogs/ExecFinished/EOF to guarantee the upload window is fully protected and add a regression test that sends SIGTERM immediately after exec completion to assert the process waits for uploadDone and exits only after uploadOutputs finishes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6ba11185-ea19-4a1d-a9ac-1d7801c94cef
📒 Files selected for processing (2)
- src/runtime/cmd/ctrl/ctrl.go
- src/utils/connectors/postgres.py
Six fixes discovered while testing the native sidecar conversion on a production EKS cluster:
1. Honor OSMO_SKIP_DATA_AUTH in ctrl.go: the Python service sets this env var when credential_config.disable_data_validation includes "*" or "s3", but the Go runtime never checked it. Dataset output write validation failed (iam:SimulatePrincipalPolicy), causing silent task failure with exit 0.
2. Log CtrlFailed in user.go: osmo_exec silently returned exit 0 when receiving CtrlFailed, making validation failures invisible to users.
3. Flush log channels before os.Exit in SIGTERM handler: os.Exit() bypasses deferred functions, so logs were lost on task completion.
4. Drain sendLogs queue on stop: sendLogs returned immediately on stopChan without flushing remaining messages.
5. Flush logQueue directly before LogDone: the logger closes the websocket after receiving LogDone, so any user logs still in the queue at that point were lost. Now flushes from the main goroutine while the websocket is still open.
6. Propagate disable_data_validation to ctrl container env: adds OSMO_SKIP_DATA_AUTH=1 when the service config disables validation.
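Additional fixes from production deployment testing
Pushed Fixes
1. Honor `OSMO_SKIP_DATA_AUTH` in ctrl.go
2. Log `CtrlFailed` in user.go
3. SIGTERM handler flushes logs before exit
4. `sendLogs` drains its queue on stop
5. Direct queue flush before `LogDone`
6. Propagate `disable_data_validation` to the ctrl container env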
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/runtime/cmd/ctrl/ctrl.go (1)
1421-1436: ⚠️ Potential issue | 🔴 Critical
Return a failure exit code when the drain timeout is hit.
Line 1427 detects the timeout, but Line 1436 still exits `0`. That makes Kubernetes report `Completed`, so a timed-out upload drain becomes a silent success even though uploads were cut short. Keep `0` only for the `<-uploadDone` branch and exit non-zero on the timeout path.
💡 Minimal fix
 go func() {
 	<-sigintCatch
 	log.Println("SIGTERM received, starting graceful shutdown...")
+	exitCode := 0
 	if uploading.Load() {
 		log.Println("Upload in progress, waiting for completion...")
 		select {
 		case <-uploadDone:
 			log.Println("Upload completed after SIGTERM")
 		case <-time.After(9 * time.Minute):
 			log.Println("Upload drain timeout exceeded")
+			exitCode = 1
 		}
 	}
 	// Flush log channels before exiting; os.Exit bypasses defers
 	stopPutLogs <- true
 	stopSendLogs <- true
 	waitGoRoutines.Wait()
 	cleanupMounts(cmdArgs.DownloadType)
-	os.Exit(0)
+	os.Exit(exitCode)
 }()
Please add a regression test for the timeout branch too. As per coding guidelines, "Ensure tests are updated/added for behavior changes like sidecar/init-container detection and shutdown/drain logic."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/runtime/cmd/ctrl/ctrl.go` around lines 1421 - 1436, The SIGTERM handler currently always calls os.Exit(0) even when the upload drain times out; update the logic in the signal handling block that references uploading, uploadDone and the select so that when the timeout case (<-time.After(9 * time.Minute)) is taken you call os.Exit with a non-zero code (e.g., os.Exit(1)) while keeping os.Exit(0) only in the <-uploadDone branch; ensure you still flush logs and call stopPutLogs, stopSendLogs, waitGoRoutines.Wait() and cleanupMounts(cmdArgs.DownloadType) before exiting in both branches, and add a regression test that simulates the timeout branch to assert the process exits non-zero when the drain timeout is hit.
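The exit-code rule requested here can be modeled outside of Go to make the two branches easy to test. The sketch below is a Python analogue, not the handler in ctrl.go: `drain_exit_code` is a hypothetical function, a `threading.Event` stands in for the `uploadDone` channel, and the timeout is a parameter rather than the fixed 9 minutes.

```python
import threading


def drain_exit_code(uploading, upload_done, timeout_seconds):
    """Decide the process exit code after SIGTERM.

    Returns 0 when no upload is running or the upload finishes in time,
    and 1 when the drain timeout is hit.
    """
    if not uploading:
        return 0
    # Event.wait returns True if the event was set, False on timeout.
    finished = upload_done.wait(timeout=timeout_seconds)
    return 0 if finished else 1


# Upload that completes shortly after SIGTERM arrives.
done_soon = threading.Event()
timer = threading.Timer(0.05, done_soon.set)
timer.start()
code_completed = drain_exit_code(True, done_soon, timeout_seconds=2.0)
timer.join()

# Upload that never completes within the drain window.
never_done = threading.Event()
code_timed_out = drain_exit_code(True, never_done, timeout_seconds=0.05)
```

Keeping the decision in a pure function with an injectable timeout is one way to cover the timeout branch in a regression test without waiting for the real drain window.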
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/utils/job/task.py`:
- Around line 2759-2766: The current code collapses
workflow_config.credential_config.disable_data_validation into a global boolean
and sets OSMO_SKIP_DATA_AUTH, which loses per-scheme info; instead, serialize
the actual disable_data_validation list into an env var (e.g.,
OSMO_SKIP_DATA_SCHEMES) on control_container_spec['env'] when
disable_data_validation is present (preserving '*' if supplied) so osmo-ctrl can
decide per input/output; update osmo-ctrl/ctrl.go to read that env var
(comma-separated or wildcard) and apply skipping per-scheme rather than relying
on a single boolean flag.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 3c5f8171-9d06-4916-a1da-c32b4b1d988b
📒 Files selected for processing (3)
- src/runtime/cmd/ctrl/ctrl.go
- src/runtime/cmd/user/user.go
- src/utils/job/task.py
# Propagate disable_data_validation to the ctrl sidecar so it skips
# osmo dataset check when the server-side config says validation is disabled.
disabled_data = workflow_config.credential_config.disable_data_validation
if disabled_data and ('*' in disabled_data or 's3' in disabled_data):
    control_container_spec['env'].append({
        'name': 'OSMO_SKIP_DATA_AUTH',
        'value': '1',
    })
Don't collapse disable_data_validation into a global skip flag.
Line 2762 turns a backend-scoped setting into an all-or-nothing env toggle. That gives you two bad cases: ['gs'] never propagates to osmo-ctrl, and ['s3'] disables validation for every input/output in ctrl.go, not just the S3 ones. Please pass the configured schemes through and let osmo-ctrl apply the skip per input/output instead of reducing the list to a single boolean.
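One way to picture the per-scheme behavior the reviewer asks for is sketched below. It is an assumption-laden illustration: the env var name `OSMO_SKIP_DATA_SCHEMES` is the example name from the review prompt and does not exist in the PR, both helper functions are hypothetical, and the consuming side is written in Python only to model a decision that would live in ctrl.go.

```python
from urllib.parse import urlparse

# Hypothetical env var name, taken from the reviewer's suggestion.
ENV_NAME = "OSMO_SKIP_DATA_SCHEMES"


def skip_schemes_env(disable_data_validation):
    """Serialize the configured schemes into one env entry, or None."""
    if not disable_data_validation:
        return None
    return {"name": ENV_NAME, "value": ",".join(disable_data_validation)}


def should_skip_validation(url, env_value):
    """Return True when validation is disabled for this URL's scheme."""
    schemes = {s.strip() for s in (env_value or "").split(",") if s.strip()}
    if "*" in schemes:
        return True  # wildcard disables validation for every backend
    return urlparse(url).scheme in schemes


env_entry = skip_schemes_env(["gs"])
skip_gs = should_skip_validation("gs://bucket/output", env_entry["value"])
skip_s3 = should_skip_validation("s3://bucket/output", env_entry["value"])
```

With the list passed through unchanged, `['gs']` reaches the sidecar and `['s3']` no longer switches off validation for other backends, which are the two cases the comment calls out.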
Summary
Converts osmo-ctrl from a regular sidecar container to a Kubernetes native sidecar (KEP-753) and adds SIGTERM upload drain logic, fixing a race condition where large output uploads are killed before completion.
- Move osmo-ctrl to `initContainers[]` with `restartPolicy: Always`, which ensures osmo-ctrl receives SIGTERM only after the user container exits (K8s native sidecar semantics)
- Set `terminationGracePeriodSeconds: 600` on task pods (was unset, defaulting to 30s)
- Simplify the `default_ctrl` pod template to only set `terminationGracePeriodSeconds` (placing osmo-ctrl in both `task.py` and the template causes `Duplicate value` K8s 422 errors via `merge_lists_on_name`)
- Update `check_running_pod_containers()` to check `init_container_statuses` for native sidecar osmo-ctrl

Root Cause
Two issues combine to kill uploads:
1. `terminationGracePeriodSeconds` is never set, so task pods use the K8s default of 30 seconds
2. The SIGTERM handler in `ctrl.go` calls `os.Exit(1)` immediately after FUSE unmount, killing any in-progress `osmo data upload` subprocess

Even increasing the grace period alone doesn't help because `os.Exit(1)` fires immediately on SIGTERM.

Files Changed
| File | Change |
| --- | --- |
| `src/runtime/cmd/ctrl/ctrl.go` | `uploading` atomic flag + `uploadDone` channel |
| `src/utils/job/task.py` | `control_container_spec` to `initContainers` with `restartPolicy: Always` |
| `src/service/core/config/objects.py` | `default_ctrl` template: only `terminationGracePeriodSeconds: 600` |
| `src/operator/backend_listener.py` | `check_running_pod_containers()` checks both init and regular container statuses |
| `src/operator/tests/test_listener.py` | Test for native sidecar osmo-ctrl error detection |

Compatibility
Fixes #764