
Commit de68d68

mcgrof authored and chucklever committed
terraform: Document tier-based GPU selection for Lambda Labs
Add comprehensive documentation for the tier-based GPU selection feature to the Lambda Labs README. This includes documentation for the capacity checking and tier selection scripts, the available tier groups for both single GPU and multi-GPU configurations, and quick start examples.

The documentation covers how tier-based selection works, with automatic fallback from higher to lower GPU tiers when capacity is unavailable. It also updates the defconfigs table and scripts reference to include the new tier-based options.

Generated-by: Claude AI
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
1 parent 8a5387d commit de68d68

1 file changed

Lines changed: 103 additions & 0 deletions

File tree

terraform/lambdalabs/README.md

@@ -8,6 +8,7 @@ This directory contains the Terraform configuration for deploying kdevops infras
- [Prerequisites](#prerequisites)
- [Quick Start](#quick-start)
- [Dynamic Configuration](#dynamic-configuration)
- [Tier-Based GPU Selection](#tier-based-gpu-selection)
- [SSH Key Security](#ssh-key-security)
- [Configuration Options](#configuration-options)
- [Provider Limitations](#provider-limitations)
@@ -111,6 +112,101 @@ scripts/lambda-cli --output json pricing list

For more details on the dynamic configuration system, see [Dynamic Cloud Kconfig Documentation](../../docs/dynamic-cloud-kconfig.md).

## Tier-Based GPU Selection

Lambda Labs supports tier-based GPU selection with automatic fallback. Instead of specifying a single instance type, you can specify a maximum tier and kdevops will automatically select the highest available GPU within that tier.

### How It Works

1. **Specify Maximum Tier**: Choose a tier group like `H100_OR_LESS`
2. **Capacity Check**: The system queries the Lambda Labs API for available instances
3. **Tier Fallback**: Tries each tier from highest to lowest until one is available
4. **Auto-Provision**: Deploys to the first region with available capacity

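The steps above amount to a simple fallback loop. The sketch below illustrates the idea only; it is not the actual kdevops implementation, and apart from `gpu_1x_h100_sxm5` (which appears later in this README) the instance-type strings and the `capacity` mapping are assumptions made up for the example.

```python
# Sketch of the tier-fallback idea (not the real kdevops code).
# `capacity` maps instance type -> regions with free capacity; in the
# real flow this data would come from the Lambda Labs capacity check.

# Hypothetical fallback order for an H100_OR_LESS-style group,
# ordered highest to lowest tier.
H100_OR_LESS = [
    "gpu_1x_h100_sxm5",
    "gpu_1x_a100",
    "gpu_1x_a6000",
    "gpu_1x_a10",
]

def select_tier(tiers, capacity):
    """Return (instance_type, region) for the highest tier with capacity."""
    for instance_type in tiers:
        regions = capacity.get(instance_type, [])
        if regions:
            return instance_type, regions[0]
    return None  # no tier in the group currently has capacity

# H100 has no capacity here, so selection falls back to the A100 entry.
capacity = {"gpu_1x_a100": ["us-west-1"], "gpu_1x_a10": ["us-east-1"]}
print(select_tier(H100_OR_LESS, capacity))
```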
### Single GPU Tier Groups

| Tier Group | Fallback Order | Use Case |
|------------|----------------|----------|
| `GH200_OR_LESS` | GH200 → H100-SXM → H100-PCIe → A100-SXM → A100 → A6000 → RTX6000 → A10 | Maximum performance |
| `H100_OR_LESS` | H100-SXM → H100-PCIe → A100-SXM → A100 → A6000 → RTX6000 → A10 | High performance |
| `A100_OR_LESS` | A100-SXM → A100 → A6000 → RTX6000 → A10 | Cost-effective |
| `A6000_OR_LESS` | A6000 → RTX6000 → A10 | Budget-friendly |

### Multi-GPU (8x) Tier Groups

| Tier Group | Fallback Order | Use Case |
|------------|----------------|----------|
| `8X_B200_OR_LESS` | 8x B200 → 8x H100 → 8x A100-80 → 8x A100 → 8x V100 | Maximum multi-GPU |
| `8X_H100_OR_LESS` | 8x H100 → 8x A100-80 → 8x A100 → 8x V100 | High-end multi-GPU |
| `8X_A100_OR_LESS` | 8x A100-80 → 8x A100 → 8x V100 | Cost-effective multi-GPU |

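A useful way to read the two tables: each larger group is the next-smaller group with higher tiers prepended, so a single ordered list per group fully describes the fallback behavior. A minimal sketch of that structure (tier labels taken from the tables; this mapping is illustrative, not the project's actual data layout):

```python
# The fallback orders above as ordered lists: each larger group extends
# the next-smaller one with higher tiers at the front. Labels follow
# the tables; the dict layout itself is illustrative only.
TIER_GROUPS = {}
TIER_GROUPS["A6000_OR_LESS"] = ["a6000", "rtx6000", "a10"]
TIER_GROUPS["A100_OR_LESS"] = ["a100-sxm", "a100"] + TIER_GROUPS["A6000_OR_LESS"]
TIER_GROUPS["H100_OR_LESS"] = ["h100-sxm", "h100-pcie"] + TIER_GROUPS["A100_OR_LESS"]
TIER_GROUPS["GH200_OR_LESS"] = ["gh200"] + TIER_GROUPS["H100_OR_LESS"]

TIER_GROUPS["8X_A100_OR_LESS"] = ["8x-a100-80", "8x-a100", "8x-v100"]
TIER_GROUPS["8X_H100_OR_LESS"] = ["8x-h100"] + TIER_GROUPS["8X_A100_OR_LESS"]
TIER_GROUPS["8X_B200_OR_LESS"] = ["8x-b200"] + TIER_GROUPS["8X_H100_OR_LESS"]

# Every single-GPU group bottoms out at the same cheapest tier (a10),
# and every 8x group bottoms out at 8x-v100.
print(TIER_GROUPS["GH200_OR_LESS"])
```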
### Quick Start with Tier Selection

```bash
# Single GPU - best available up to H100
make defconfig-lambdalabs-h100-or-less
make bringup

# Single GPU - best available up to GH200
make defconfig-lambdalabs-gh200-or-less
make bringup

# 8x GPU - best available up to H100
make defconfig-lambdalabs-8x-h100-or-less
make bringup
```

### Checking Capacity

Before deploying, you can check current GPU availability:

```bash
# Check all available GPU instances
python3 scripts/lambdalabs_check_capacity.py

# Check specific instance type
python3 scripts/lambdalabs_check_capacity.py --instance-type gpu_1x_h100_sxm5

# JSON output for scripting
python3 scripts/lambdalabs_check_capacity.py --json
```

### Tier Selection Script

The tier selection script finds the best available GPU:

```bash
# Find best single GPU up to H100
python3 scripts/lambdalabs_select_tier.py h100-or-less --verbose

# Find best 8x GPU up to H100
python3 scripts/lambdalabs_select_tier.py 8x-h100-or-less --verbose

# List all available tier groups
python3 scripts/lambdalabs_select_tier.py --list-tiers
```

Example output:

```
Checking tier group: h100-or-less
Tiers to check (highest to lowest): h100-sxm, h100-pcie, a100-sxm, a100, a6000, rtx6000, a10

Checking tier 'h100-sxm': gpu_1x_h100_sxm5
Checking gpu_1x_h100_sxm5... ✓ AVAILABLE in us-west-1

Selected: gpu_1x_h100_sxm5 in us-west-1 (tier: h100-sxm)
gpu_1x_h100_sxm5 us-west-1
```
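Because the script's final stdout line is the bare `TYPE REGION` pair shown above, it is straightforward to consume from shell automation. A small sketch (the captured string is hard-coded here for illustration; in practice it would come from running the script and taking the last line):

```shell
# Split the final "TYPE REGION" line into variables. Hard-coded sample;
# in real use something like:
#   selection=$(python3 scripts/lambdalabs_select_tier.py h100-or-less | tail -n1)
selection="gpu_1x_h100_sxm5 us-west-1"
itype=${selection%% *}    # text before the first space
region=${selection##* }   # text after the last space
echo "instance=$itype region=$region"
```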

### Benefits of Tier-Based Selection

- **Higher Success Rate**: Automatically falls back to available GPUs
- **No Manual Intervention**: System handles capacity changes
- **Best Performance**: Always gets the highest tier available
- **Simple Configuration**: One defconfig covers multiple GPU types

## SSH Key Security

### Automatic Unique Keys (Default - Recommended)
@@ -168,6 +264,11 @@ The default configuration automatically:
|--------|-------------|----------|
| `defconfig-lambdalabs` | Smart instance + unique SSH keys | Production (recommended) |
| `defconfig-lambdalabs-shared-key` | Smart instance + shared SSH key | Legacy/testing |
| `defconfig-lambdalabs-gh200-or-less` | Best single GPU up to GH200 | Maximum performance |
| `defconfig-lambdalabs-h100-or-less` | Best single GPU up to H100 | High performance |
| `defconfig-lambdalabs-a100-or-less` | Best single GPU up to A100 | Cost-effective |
| `defconfig-lambdalabs-8x-b200-or-less` | Best 8-GPU up to B200 | Maximum multi-GPU |
| `defconfig-lambdalabs-8x-h100-or-less` | Best 8-GPU up to H100 | High-end multi-GPU |

### Manual Configuration

@@ -274,6 +375,8 @@ The Lambda Labs Terraform provider (elct9620/lambdalabs v0.3.0) has significant
|--------|---------|
| `lambdalabs_api.py` | Main API integration, generates Kconfig |
| `lambdalabs_smart_inference.py` | Smart instance/region selection |
| `lambdalabs_check_capacity.py` | Check GPU availability across regions |
| `lambdalabs_select_tier.py` | Tier-based GPU selection with fallback |
| `lambdalabs_ssh_keys.py` | SSH key management |
| `lambdalabs_list_instances.py` | List running instances |
| `lambdalabs_credentials.py` | Manage API credentials |
