-
Notifications
You must be signed in to change notification settings - Fork 39
fix: convert osmo-ctrl to native K8s sidecar to prevent upload data loss #765
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -2756,6 +2756,15 @@ def _build_file_mount(file: File) -> kb_objects.FileMount: | |
| ctrl_extra_args, workflow_config.backend_images.client, self.group_uuid, file_mounts, | ||
| task_spec.downloadType.value, task_spec.resources, user_cache_size) | ||
|
|
||
| # Propagate disable_data_validation to the ctrl sidecar so it skips | ||
| # osmo dataset check when the server-side config says validation is disabled. | ||
| disabled_data = workflow_config.credential_config.disable_data_validation | ||
| if disabled_data and ('*' in disabled_data or 's3' in disabled_data): | ||
| control_container_spec['env'].append({ | ||
| 'name': 'OSMO_SKIP_DATA_AUTH', | ||
| 'value': '1', | ||
| }) | ||
|
Comment on lines
+2759
to
+2766
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Don't collapse Line 2762 turns a backend-scoped setting into an all-or-nothing env toggle. That gives you two bad cases: 🤖 Prompt for AI Agents |
||
|
|
||
| using_gpu = bool(task_spec.resources.gpu and task_spec.resources.gpu > 0) | ||
| user_args += [ | ||
| '-socketPath', f'{kb_objects.DATA_LOCATION}/socket/data.sock', | ||
|
|
@@ -2805,14 +2814,16 @@ def _build_file_mount(file: File) -> kb_objects.FileMount: | |
| 'restartPolicy': 'Never', | ||
| 'imagePullSecrets': image_pull_secrets, | ||
| 'hostNetwork': task_spec.hostNetwork, | ||
| 'containers': [user_container_spec, control_container_spec], | ||
| 'containers': [user_container_spec], | ||
| 'initContainers': [ | ||
| k8s_factory.create_init_container( | ||
| login_file_mount.volume_mount(), | ||
| user_config_file_mount.volume_mount(), | ||
| init_extra_args, | ||
| ), | ||
| {**control_container_spec, 'restartPolicy': 'Always'}, | ||
| ], | ||
|
coderabbitai[bot] marked this conversation as resolved.
|
||
| 'terminationGracePeriodSeconds': 600, | ||
| 'volumes': [ | ||
| {'name': 'osmo'}, | ||
| {'name': 'osmo-data'}, | ||
|
|
||
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
# First, check file size and structure wc -l src/runtime/cmd/ctrl/ctrl.goRepository: NVIDIA/OSMO
Length of output: 88
🏁 Script executed:
Repository: NVIDIA/OSMO
Length of output: 402
🏁 Script executed:
Repository: NVIDIA/OSMO
Length of output: 716
🏁 Script executed:
Repository: NVIDIA/OSMO
Length of output: 846
🏁 Script executed:
Repository: NVIDIA/OSMO
Length of output: 922
🏁 Script executed:
Repository: NVIDIA/OSMO
Length of output: 37
🏁 Script executed:
Repository: NVIDIA/OSMO
Length of output: 140
🏁 Script executed:
Repository: NVIDIA/OSMO
Length of output: 492
🏁 Script executed:
Repository: NVIDIA/OSMO
Length of output: 3345
Uploading flag must be set before the signal handler can bypass upload draining.
There is a genuine race condition at lines 1498–1502. The
uploadingflag is set only after theexecLogsloop exits and beforeuploadOutputs()runs. In that window,SIGTERMcan arrive, the signal handler checksuploading.Load()(which is still false), and exits immediately viacleanupMounts()andos.Exit(1)without waiting for uploads—causing the subsequentuploadOutputs()call never to execute.In the native-sidecar lifecycle, the kubelet sends
SIGTERMwhen the user container exits, which occurs precisely during this race window. The result is data loss: task outputs are skipped.Set the
uploadingflag totruebefore entering the exec phase (or immediately before the signal handler setup if uploads can happen during exec), and clear it only afteruploadOutputs()completes. Add a test that deliversSIGTERMright after anExecFinishedmessage to confirm the fix prevents the race.🤖 Prompt for AI Agents
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We investigated this and believe the race is not practically exploitable in the native sidecar model.
With native sidecars (KEP-753), kubelet sends SIGTERM to the sidecar only after the user container has fully stopped. But
ExecFinishedarrives via the Unix socket before the user container process exits (it is the last message sent by the user process). So theexecLogsloop exits anduploading.Store(true)is set well before kubelet initiates sidecar termination — osmo-ctrl gets a deterministic head start.The window between
break execLogsanduploading.Store(true)is a few Go instructions (nanoseconds), while kubelet's container stop → SIGTERM delivery path involves API calls and process teardown (milliseconds to seconds).We opted not to move
uploading.Store(true)earlier because setting it before the exec phase would cause the SIGTERM handler to wait for uploads even when the process is in the download or exec phase (where no upload is in progress and waiting would be incorrect).There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@KeitaW, thank you for the thorough explanation — the reasoning is sound.The key insight is the lifecycle ordering guarantee from KEP-753: kubelet only delivers
SIGTERMto the native sidecar after the user container's process has fully exited, andExecFinishedis sent over the Unix socket before that exit. The gap betweenbreak execLogsanduploading.Store(true)is therefore bounded by a few Go instructions, while the kubelet's container-stop-to-SIGTERM path (API call, cgroup accounting, etc.) takes orders of magnitude longer. Under that model the window is not realistically reachable.The deliberate choice to keep
uploading.Store(true)as late as possible also makes sense — hoisting it earlier would cause the drain logic to block on SIGTERM during the download or exec phase, where no upload is occurring and an indefinite wait would be incorrect.I'll withdraw the concern for this design and note the intent for future reviews.
✏️ Learnings added