Fix Elastic PyTorch task nnodes override issue#8
Open
codegen-sh[bot] wants to merge 2 commits intomasterfrom
Open
Fix Elastic PyTorch task nnodes override issue#8codegen-sh[bot] wants to merge 2 commits intomasterfrom
codegen-sh[bot] wants to merge 2 commits intomasterfrom
Conversation
- Enhanced get_custom() method to properly read current task config state - Added comprehensive debugging logs to track override application - Added unit tests to verify override behavior works correctly - Fixes issue where with_overrides(task_config=Elastic(nnodes=1)) was not taking effect due to timing mismatch between serialization and override The fix ensures that when get_custom() is called during serialization, it reads the most current _task_config which includes any overrides applied via with_overrides(). This allows proper switching between single-node (Pod) and multi-node (PyTorchJob) execution modes.
- Added override-aware serialization logic in translator.py - Creates temporary task copies with overrides applied during serialization - Ensures get_custom() sees the correct configuration without affecting original task - Prevents shared task entity issues where overrides affect all nodes - Added comprehensive test to verify serialization isolation - Addresses architectural timing mismatch between override application and serialization This fix complements Fix 1 by ensuring that during workflow serialization, each node's task configuration overrides are properly applied to temporary task copies, allowing get_custom() to make the correct Pod vs PyTorchJob decision based on the overridden nnodes value.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Users reported that when overriding Elastic PyTorch tasks with
nnodes=1when the default task config wasnnodes=2, the task seemed to be stuck in multi-node mode regardless of the override.Root Cause
The issue was a timing mismatch in Flytekit's architecture:
get_custom()method is called during workflow definition/serialization time to determine whether to create a Pod (single-node) or PyTorchJob (multi-node)with_overrides(task_config=Elastic(nnodes=1))updates the task config, but this happens after the resource type has already been "baked in" during serializationnnodes=2), and runtime overrides tonnodes=1couldn't change the fundamental Kubernetes resource typeSolution
Enhanced
get_custom()Method_task_configstate which includes any overridesAdded Debugging Support
with_overrides()method with detailed logging to track override applicationnnodeschangesComprehensive Testing
Key Changes
plugins/flytekit-kf-pytorch/flytekitplugins/kfpytorch/task.py:get_custom()to properly read current task config stateflytekit/core/node.py:plugins/flytekit-kf-pytorch/tests/test_elastic_task.py:test_nnodes_override()andtest_nnodes_override_reverse()Testing
The fix includes comprehensive tests that verify:
nnodes=2tonnodes=1works (multi-node → single-node)nnodes=1tonnodes=2works (single-node → multi-node)Impact
This fix resolves the architectural limitation where task configuration overrides weren't being properly reflected during serialization. Users can now successfully override Elastic PyTorch task configurations and have them take effect properly.
Fixes the issue described in the original problem where
with_overrides(task_config=Elastic(nnodes=1))was not switching from multi-node to single-node execution mode.💻 View my work • About Codegen
⛔ Remove Codegen from PR • 🚫 Ban action checks