## Commits (25)
- `1974665` Add ConfigMapPackager for Kubernetes file staging (jskswamy, Jul 28, 2025)
- `439bbad` Add KubeflowExecutor for distributed training on Kubernetes (jskswamy, Jul 28, 2025)
- `7094674` Add KubeflowExecutor documentation to execution guide (jskswamy, Jul 29, 2025)
- `9f68ebf` Add Kubernetes name sanitization and improve ConfigMap key handling (jskswamy, Jul 30, 2025)
- `a577910` Fix Kubeflow executor ConfigMap naming and resource management (jskswamy, Jul 30, 2025)
- `70f3c4e` Implement ConfigMapPackager integration for KubeflowExecutor (jskswamy, Aug 1, 2025)
- `4554fd0` Add comprehensive ConfigMapPackager integration tests (jskswamy, Aug 4, 2025)
- `1c5f514` Add resource management tests for KubeflowExecutor (jskswamy, Aug 4, 2025)
- `715a2ca` Implement ClusterTrainingRuntime CRD creation and cleanup (jskswamy, Aug 4, 2025)
- `6b96f3b` Add CLI integration for KubeflowExecutor (jskswamy, Aug 4, 2025)
- `22a8e11` Fix lint issues (jskswamy, Aug 4, 2025)
- `5cf880d` Implement Inline Script Execution in Kubeflow Executor (jskswamy, Aug 18, 2025)
- `0cae8eb` Refactor KubeflowExecutor for Enhanced Configuration (jskswamy, Aug 21, 2025)
- `09a9e3e` Refactor KubeflowExecutor for Improved Task Handling (jskswamy, Sep 3, 2025)
- `713b976` Update KubeflowExecutor to use CommandTrainer (jskswamy, Sep 12, 2025)
- `61e1242` Update Kubeflow ClusterTrainingRuntime template (jskswamy, Sep 14, 2025)
- `a5a3b20` Enhance command and args handling for KubeflowExecutor (jskswamy, Sep 15, 2025)
- `97cebe1` Implement StorageMount class for PVC management (jskswamy, Sep 16, 2025)
- `e0c9331` Implement Launcher for Script and Partial Tasks (jskswamy, Sep 16, 2025)
- `8c3ffa2` Refactor volume mount path to workspace mount path (jskswamy, Sep 16, 2025)
- `57c9ea2` Implement Additional Packages Configuration for Executor (jskswamy, Sep 17, 2025)
- `4532d7b` Ensure Environment Secret Creation in Kubeflow (jskswamy, Sep 17, 2025)
- `3e58613` Update ConfigMap key sanitization to allow underscores (jskswamy, Sep 18, 2025)
- `777cecf` Match KubeflowExecutor parameters to match with other Executors (jskswamy, Sep 18, 2025)
- `ad9c75e` Update Kubeflow template to use variables for nodes (jskswamy, Sep 18, 2025)
### README.md (21 additions, 0 deletions)
@@ -15,6 +15,7 @@ To learn more, click on each link. This represents the typical order that NeMo R
- [Why Use NeMo Run?](#why-use-nemo-run)
- [Install NeMo Run](#install-nemo-run)
- [Get Started](#get-started)
- [Supported Executors](#supported-executors)
- [Design Philosophy and Inspiration](#design-philosophy-and-inspiration)
- [Pythonic](#pythonic)
- [Modular](#modular)
@@ -36,6 +37,12 @@ To install the project, use the following command:
pip install git+https://github.com/NVIDIA-NeMo/Run.git
```

For Kubeflow support, install with the `kubernetes` optional dependency:

```bash
pip install "git+https://github.com/NVIDIA-NeMo/Run.git[kubernetes]"
```

Make sure you have `pip` installed and configured properly.

## Get Started
@@ -59,6 +66,20 @@ local_executor = run.LocalExecutor()
run.run(partial_func, executor=local_executor, name="llama3_8b_pretraining")
```

## Supported Executors

NeMo Run supports multiple executors for different computing environments:

- **LocalExecutor**: Execute tasks locally on your machine
- **DockerExecutor**: Execute tasks in Docker containers
- **SlurmExecutor**: Execute tasks on Slurm clusters
- **SkypilotExecutor**: Execute tasks on cloud platforms via Skypilot
- **DGXCloudExecutor**: Execute tasks on NVIDIA DGX Cloud
- **LeptonExecutor**: Execute tasks on NVIDIA DGX Cloud Lepton clusters
- **KubeflowExecutor**: Execute tasks on Kubernetes using Kubeflow Trainer

For detailed information about each executor, see the [Execution Guide](./docs/source/guides/execution.md).

## Design Philosophy and Inspiration
In building NeMo Run, we drew inspiration from and relied on the following primary libraries. We would like to extend our gratitude for their work.

### docs/source/guides/execution.md (250 additions, 2 deletions)

Large diffs are not rendered by default.

### examples/kubeflow/README.md (114 additions, 0 deletions)
@@ -0,0 +1,114 @@
# KubeflowExecutor Example

This example demonstrates how to use NeMo Run's `KubeflowExecutor` to run distributed training jobs on Kubernetes using Kubeflow Trainer.

## Overview

The `KubeflowExecutor` enables distributed training on Kubernetes clusters using Kubeflow Trainer. This example includes CLI factory functions that make it easy to configure and use `KubeflowExecutor` from the command line.

## Files

- `hello_kubeflow.py` - Complete example with CLI integration
- `README.md` - This documentation file

## CLI Integration

The example includes CLI factory functions for easy configuration:

### Available Factories

#### `kubeflow_gpu`

GPU training configuration with default settings:

- 2 nodes, 8 GPUs per node
- 16 CPU cores, 64Gi memory per node
- NVIDIA PyTorch container image

#### `kubeflow_cpu`

CPU training configuration:

- 1 node, no GPUs
- 8 CPU cores, 32Gi memory per node
- NVIDIA PyTorch container image

### Usage Examples

```bash
# Use default GPU configuration
python hello_kubeflow.py executor=kubeflow_gpu

# Customize GPU configuration
python hello_kubeflow.py executor=kubeflow_gpu executor.nodes=4 executor.gpus=16

# Use CPU configuration
python hello_kubeflow.py executor=kubeflow_cpu

# Use the CLI entrypoint
python hello_kubeflow.py train_with_kubeflow executor=kubeflow_gpu epochs=20
```
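The `executor.nodes=4`-style arguments above follow a simple dotted `key=value` override convention. Purely to illustrate that convention, here is a minimal sketch of how such overrides map onto nested configuration; the function name is hypothetical, and NeMo Run's real CLI does far more (type coercion, factory resolution, validation):

```python
def parse_overrides(args: list[str]) -> dict:
    """Parse 'a.b=value' CLI arguments into a nested dict.

    Illustrative only -- this is not NeMo Run's actual parser.
    """
    tree: dict = {}
    for arg in args:
        key, _, value = arg.partition("=")
        parts = key.split(".")
        node = tree
        # Walk/create intermediate dicts for each dotted segment.
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return tree

print(parse_overrides(["executor.nodes=4", "executor.gpus=16", "epochs=20"]))
# {'executor': {'nodes': '4', 'gpus': '16'}, 'epochs': '20'}
```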

## Prerequisites

1. **Kubernetes cluster** with Kubeflow Trainer installed
2. **ClusterTrainingRuntime** named "torch-distributed-nemo" configured
3. **kubectl** configured to access your cluster
4. **NeMo Run** with KubeflowExecutor support
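
For orientation, a `ClusterTrainingRuntime` is a cluster-scoped Kubeflow Trainer resource. The sketch below shows roughly what a runtime named `torch-distributed-nemo` might look like; it is an illustrative assumption, not the runtime shipped with this PR, the field names follow the Kubeflow Trainer v2 `trainer.kubeflow.org/v1alpha1` API and may differ across versions, and the image tag is a placeholder:

```yaml
# Hypothetical sketch -- not the actual runtime used by this example.
apiVersion: trainer.kubeflow.org/v1alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed-nemo
spec:
  mlPolicy:
    numNodes: 2
    torch:
      numProcPerNode: auto
  template:
    spec:
      replicatedJobs:
        - name: node
          template:
            spec:
              template:
                spec:
                  containers:
                    - name: node
                      image: nvcr.io/nvidia/pytorch:24.07-py3  # placeholder tag
```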

## Running the Example

1. **Ensure prerequisites are met**:

```bash
# Check kubectl access
kubectl get nodes

# Check ClusterTrainingRuntime
kubectl get clustertrainingruntime torch-distributed-nemo
```

2. **Run the example**:

```bash
cd examples/kubeflow
python hello_kubeflow.py
```

3. **Use CLI integration**:

```bash
# GPU training
python hello_kubeflow.py executor=kubeflow_gpu

# CPU training
python hello_kubeflow.py executor=kubeflow_cpu

# CLI entrypoint
python hello_kubeflow.py train_with_kubeflow executor=kubeflow_gpu epochs=20
```

## Key Features

- **CLI Integration**: Factory functions for easy configuration
- **Resource Management**: GPU and CPU training configurations
- **Distributed Training**: Multi-node training support
- **File Staging**: Automatic file packaging via ConfigMapPackager

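Because staged files become ConfigMap entries, their names must satisfy Kubernetes' key rules: keys may only contain alphanumerics, `-`, `_`, and `.`, and are limited to 253 characters (the PR's commit history notes that underscores are explicitly allowed). As a rough, hypothetical sketch of such sanitization, not the `ConfigMapPackager`'s actual implementation:

```python
import re

def sanitize_configmap_key(path: str) -> str:
    """Map a staged file path to a valid ConfigMap key.

    Illustrative sketch only -- not NeMo Run's code. Kubernetes
    ConfigMap keys must match [-._a-zA-Z0-9]+ and be at most 253
    characters long.
    """
    # Replace path separators so nested files stay distinguishable.
    key = path.replace("/", "_")
    # Replace any remaining invalid character with '-'.
    key = re.sub(r"[^-._a-zA-Z0-9]", "-", key)
    # Enforce the 253-character key limit.
    return key[:253]

print(sanitize_configmap_key("scripts/train_model.py"))  # scripts_train_model.py
```
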
## Troubleshooting

### Common Issues

1. **ClusterTrainingRuntime not found**:

```bash
kubectl get clustertrainingruntime
```

2. **Kubeflow Trainer not installed**:

```bash
kubectl get pods -n kubeflow-system
```

3. **Resource allocation**: Ensure your cluster has enough allocatable GPUs, CPUs, and memory to satisfy the per-node requests (the default GPU configuration asks for 8 GPUs, 16 CPU cores, and 64Gi memory per node).