fix: Pass DGXC to ft_launcher #402
Open: ko3n1g wants to merge 24 commits into main from ko3n1g/fix/ft-dgxc
Changes from 2 commits
Commits
7ea62d8 pass dgxc to ft_launcher (ko3n1g)
02efa9f feat: Add FT to DGXC (ko3n1g)
0be1693 torchrun_job (ko3n1g)
232cef7 fix (ko3n1g)
cc1a276 format (ko3n1g)
f259248 fix (ko3n1g)
4e15269 fix (ko3n1g)
0875083 fix (ko3n1g)
db84ae0 revert (ko3n1g)
d197943 fix (ko3n1g)
c507dda fix (ko3n1g)
620e073 fix (ko3n1g)
eb98cb5 cleanup (ko3n1g)
096e0af change template (ko3n1g)
c7ab843 fix (ko3n1g)
bd000a4 fix (ko3n1g)
5f2fb8a test (ko3n1g)
c9b1755 retries (ko3n1g)
6d3c34f TORCHX_MAX_RETRIES (ko3n1g)
cdffd23 cleanup (ko3n1g)
b5f0a1a fix (ko3n1g)
9fc020c bump FT interface (ko3n1g)
8cd9ca3 --ft-use-infra-group-rank=False (ko3n1g)
1d5a101 fix (ko3n1g)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,47 @@ | ||
| {%- import "ft_launcher_k8s.j2" as fault_tolerance -%} | ||
| #!/bin/bash | ||
| # | ||
| # Generated by NeMo Run for Kubernetes (PyTorchJob) | ||
| # | ||
| | ||
| # 1. Basic Shell Setup | ||
| set -evx # Exit on error and print commands; errexit is suspended around the training command below | ||
| export PYTHONUNBUFFERED=1 | ||
| export TORCHX_MAX_RETRIES={{max_retries}} | ||
| | ||
| # 2. Environment Variables | ||
| # These are strictly user-defined vars (e.g. HYDRA_FULL_ERROR). | ||
| # Note: MASTER_ADDR, WORLD_SIZE, RANK are injected automatically by the PyTorchJob operator. | ||
| {%- for env_var in env_vars %} | ||
| {{env_var}} | ||
| {%- endfor %} | ||
| | ||
| # 3. Fault Tolerance: SETUP (Check-in) | ||
| # Checks if we are resuming or if we are already finished. | ||
| {%- if ft_enabled %} | ||
| {{ fault_tolerance.ft_launcher_setup(fault_tol_cfg_path, fault_tol_finished_flag_file, fault_tol_job_results_file) }} | ||
| {%- endif %} | ||
| | ||
| # 4. Main Execution | ||
| # In PyTorchJob, we usually have exactly one main command (torchrun). | ||
| # We assume the variable 'training_command' contains the full torchrun string. | ||
| | ||
| echo "Starting training command..." | ||
| set +e # Turn off auto-exit so we can capture the code | ||
| # --------------------------------------------------------- | ||
| {{ training_command }} | ||
| # --------------------------------------------------------- | ||
| exitcode=$? | ||
| set -e | ||
| | ||
| echo "Main command exited with code $exitcode" | ||
| | ||
| # 5. Fault Tolerance: TEARDOWN (Check-out) | ||
| # Decides if we should exit 0 (complete) or exit 1 (retry via K8s backoffLimit). | ||
| {%- if ft_enabled %} | ||
| {{ fault_tolerance.ft_launcher_teardown() }} | ||
| {%- else %} | ||
| # If FT is disabled, simply pass the exit code through. | ||
| # K8s will restart if exitcode != 0 and backoffLimit > 0. | ||
| exit $exitcode | ||
| {%- endif %} |
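
The exit-code capture in section 4 of the template can be tried in isolation. The following is a minimal sketch and not part of the PR: `run_captured` is a hypothetical helper name, and `false` / `true` stand in for the rendered `{{ training_command }}`.

```bash
#!/bin/bash
# Sketch only: run_captured is a hypothetical helper, not part of the PR.
# It runs a command with errexit suspended and stores the status in the
# global variable 'exitcode', mirroring section 4 of the template.
run_captured() {
    set +e          # suspend errexit so a failing command does not end the script
    "$@"
    exitcode=$?     # must be read immediately after the command
    set -e          # restore errexit for the rest of the script
}

set -e
run_captured false   # stand-in for a failing training command
failed_code=$exitcode
run_captured true    # stand-in for a successful training command
ok_code=$exitcode
echo "failed=$failed_code ok=$ok_code"
```

Without the `set +e` / `set -e` bracket, the `set -evx` at the top of the template would end the script on the first non-zero status, and the teardown step would never see `exitcode`.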
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,69 @@ | ||
| {% macro ft_launcher_setup(fault_tol_cfg_path, fault_tol_finished_flag_file, fault_tol_job_results_file) -%} | ||
| # ------------------------------------------------------------------------- | ||
| # K8s Fault Tolerance Setup (The "Check-In" Desk) | ||
| # ------------------------------------------------------------------------- | ||
| | ||
| # 1. Export Paths | ||
| # IMPORTANT: These paths must reside on a ReadWriteMany (RWX) Persistent Volume | ||
| # mounted to all Pods so state is preserved across pod restarts/rescheduling. | ||
| export FAULT_TOL_CFG_PATH="{{fault_tol_cfg_path}}" | ||
| export FAULT_TOL_FINISHED_FLAG_FILE="{{fault_tol_finished_flag_file}}" | ||
| export FAULT_TOL_JOB_RESULTS_FILE="{{fault_tol_job_results_file}}" | ||
| | ||
| # 2. Define Helper Functions | ||
| is_training_finished() { | ||
| test -f "$FAULT_TOL_FINISHED_FLAG_FILE" | ||
| } | ||
| | ||
| # 3. Check for Previous Success | ||
| # In K8s, a Pod might be restarted due to node maintenance even if the job | ||
| # logic was done. If the flag file exists, we exit immediately with 0. | ||
| if is_training_finished ; then | ||
| echo "[FT-Setup] Found finished flag at $FAULT_TOL_FINISHED_FLAG_FILE." | ||
| echo "[FT-Setup] Training is already complete. Exiting successfully." | ||
| exit 0 | ||
| fi | ||
| | ||
| # 4. Logging Start | ||
| # We use the hostname (usually the pod name) as the identifier, since there is no SLURM_JOB_ID on Kubernetes. | ||
| # We append 'X' (Running/Unknown) to the log. | ||
| echo "[FT-Setup] Starting training on $(hostname)..." | ||
| # Optional: Log attempt to shared file (Using X for Running) | ||
| # Note: In high-scale K8s, writing to a single file from 1000 pods can cause lock contention. | ||
| # If scale is small, this is fine. | ||
| if [ -n "$FAULT_TOL_JOB_RESULTS_FILE" ]; then | ||
| echo "$(hostname) $(date +%s) X" >> "$FAULT_TOL_JOB_RESULTS_FILE" | ||
| fi | ||
| | ||
| {%- endmacro %} | ||
| | ||
| {% macro ft_launcher_teardown() -%} | ||
| # ------------------------------------------------------------------------- | ||
| # K8s Fault Tolerance Teardown (The "Check-Out" Desk) | ||
| # ------------------------------------------------------------------------- | ||
| | ||
| # 1. Analyze Exit Code from the Main Command | ||
| # 'exitcode' is captured in the main script before calling this macro. | ||
| if [ "$exitcode" -eq "0" ]; then | ||
| RESULT_STATUS="S" # Success | ||
| else | ||
| RESULT_STATUS="F" # Failure | ||
| fi | ||
| | ||
| # 2. Update Log (Optional but helpful for debugging) | ||
| if [ -n "$FAULT_TOL_JOB_RESULTS_FILE" ]; then | ||
| # Record the final status for this host (S or F) as a new line rather than rewriting its X entry. | ||
| # Note: 'sed -i' on a shared PVC can be risky with concurrency. | ||
| # Appending a new status line is safer in K8s. | ||
| echo "$(hostname) $(date +%s) $RESULT_STATUS" >> "$FAULT_TOL_JOB_RESULTS_FILE" | ||
| fi | ||
| | ||
| # 3. The Requeue Decision Logic | ||
| if [ "$exitcode" -eq "0" ]; then | ||
| # Case A: Script exited successfully. | ||
| # Verification: Did it actually finish (create the flag file)? | ||
| if is_training_finished; then | ||
| echo "[FT-Teardown] Job finished successfully and flag file exists." | ||
| exit 0 | ||
| else | ||
| # Edge Case: The python script exited 0, but didn't write the flag |
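
The visible parts of the two macros (the finished-flag check and the append-only results log) can be exercised locally. This is a minimal sketch under stated assumptions: a temporary directory replaces the RWX volume, `pod-0` is a made-up host name, `touch` stands in for the training job writing the flag, and `latest_status` is a hypothetical helper that does not appear in the PR. The branch that is cut off in the diff above is deliberately left out.

```bash
#!/bin/bash
# Sketch only: paths live in a temporary directory instead of an RWX volume,
# and latest_status is a hypothetical helper that is not part of the PR.
workdir="$(mktemp -d)"
FAULT_TOL_FINISHED_FLAG_FILE="$workdir/finished.flag"
FAULT_TOL_JOB_RESULTS_FILE="$workdir/job_results.log"

# Same check as in ft_launcher_setup: the flag file marks completed training.
is_training_finished() {
    test -f "$FAULT_TOL_FINISHED_FLAG_FILE"
}

# Print the most recent status recorded for host $1 in the append-only log.
latest_status() {
    awk -v host="$1" '$1 == host { status = $3 } END { print status }' \
        "$FAULT_TOL_JOB_RESULTS_FILE"
}

# Check-in: one "X" line per attempt.
echo "pod-0 $(date +%s) X" >> "$FAULT_TOL_JOB_RESULTS_FILE"
if is_training_finished; then before="finished"; else before="pending"; fi

# Check-out: append the final status instead of editing the "X" line.
echo "pod-0 $(date +%s) S" >> "$FAULT_TOL_JOB_RESULTS_FILE"
touch "$FAULT_TOL_FINISHED_FLAG_FILE"   # stand-in for the training job writing the flag
if is_training_finished; then after="finished"; else after="pending"; fi

status="$(latest_status pod-0)"
echo "before=$before after=$after status=$status"
rm -rf "$workdir"
```

Because every status change is a new line, a reader of the results file has to take the last line per host as the current state, which is what the `awk` helper does.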
File renamed without changes.