Add tier-based GPU selection for Lambda Labs #70
Merged
chucklever merged 4 commits into main on Dec 16, 2025
Conversation
Add support for tier-based GPU instance selection for Lambda Labs, similar to the existing DataCrunch implementation. This allows users to specify a maximum GPU tier, and the system will automatically select the highest available GPU within that tier.

The implementation adds capacity-checking and tier-selection scripts that query the Lambda Labs API to find available instances. Single-GPU tier groups fall back from GH200 to H100 to A100 to A6000 to A10. Multi-GPU tier groups fall back from 8x B200 to 8x H100 to 8x A100 to 8x V100.

New Kconfig options provide tier-based selections like H100_OR_LESS and 8X_H100_OR_LESS. The terraform ansible tasks detect these wildcard types and invoke the tier-selection script to find available capacity before provisioning. Defconfigs are provided for common tier combinations to simplify usage. Users can now run commands like make defconfig-lambdalabs-h100-or-less to get the best available single GPU up to the H100 tier.

Generated-by: Claude AI
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
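The fallback behavior described above can be sketched as a walk down an ordered tier list, returning the first instance type that has capacity somewhere. This is a minimal illustration, not the project's actual script; the tier names and the capacity-map shape are assumptions.

```python
# Hypothetical single-GPU fallback order, highest tier first.
# The real tier identifiers used by the project may differ.
SINGLE_GPU_TIERS = [
    "gpu_1x_gh200", "gpu_1x_h100", "gpu_1x_a100", "gpu_1x_a6000", "gpu_1x_a10",
]

def select_best_tier(max_tier, capacity_map, tiers=SINGLE_GPU_TIERS):
    """Return (instance_type, region) for the highest available GPU at or
    below max_tier, or None if nothing in the tier group has capacity.

    capacity_map is assumed to map instance type -> list of regions with
    available capacity, as built from the provider's API response.
    """
    try:
        start = tiers.index(max_tier)
    except ValueError:
        return None  # unknown tier name
    for instance_type in tiers[start:]:
        regions = capacity_map.get(instance_type, [])
        if regions:
            return instance_type, regions[0]
    return None
```

With an H100_OR_LESS-style request, a capacity map that only has A100 availability would yield the A100 entry, matching the fallback chain in the description.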
Address review feedback regarding inconsistent error handling in the check_availability function. The function contract implies returning int values for both success and failure, but error paths were calling sys.exit() directly. Change the error handling to return non-zero integers instead of calling sys.exit(), making the function consistent and easier to test.

Remove the unused instance_data binding from get_instance_types_with_capacity() since only capacity_map is used. Add exception handling around the API call so the function never raises unhandled exceptions.

Generated-by: Claude AI
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
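The pattern this commit moves to can be sketched as follows: the helper reports failures through return codes, and only the top-level caller translates those codes into a process exit. The function body and error codes here are illustrative assumptions, not the actual patched code.

```python
import sys

def check_availability(capacity_map, instance_type):
    """Return 0 if instance_type has capacity, non-zero otherwise.

    Error paths return distinct codes instead of calling sys.exit(),
    so the function honors its int-returning contract and can be unit
    tested without trapping SystemExit. Codes are hypothetical.
    """
    if capacity_map is None:
        # Upstream API query failed; report the condition, don't exit here.
        print("error: could not fetch capacity data", file=sys.stderr)
        return 2
    if not capacity_map.get(instance_type):
        return 1  # no capacity for the requested type
    return 0

def main():
    # Stand-in for the result of an API query; may be None on failure.
    capacity_map = {"gpu_1x_h100": ["us-east-1"]}
    # The caller, not the helper, decides the process exit status.
    sys.exit(check_availability(capacity_map, "gpu_1x_h100"))
```

Keeping sys.exit() at the edge of the program is what makes the helper easy to test: each error path is observable as a plain return value.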
Address review feedback about duplicate code in check_availability(). The logic to build a region_map dictionary from gpu_instances appeared twice identically, violating the DRY principle. Extract this common pattern into a private _build_region_map() helper function that takes gpu_instances and returns the region-to-instance-type mapping. Both the JSON output and text output code paths now call this helper instead of duplicating the iteration logic.

Generated-by: Claude AI
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
The tier selection script outputs "instance_type region", which is then parsed by splitting on whitespace and accessing indices [0] and [1]. If the script produces unexpected output, such as an empty line or a single word, the split operation produces a list with fewer than two elements, causing Ansible to fail with a cryptic index error.

Add an explicit validation task using ansible.builtin.assert to verify the output contains exactly two whitespace-separated values before attempting to parse it. This provides a clear error message showing the actual output when the format is invalid, making debugging easier.

Generated-by: Claude AI
Signed-off-by: Chuck Lever <cel@kernel.org>
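The validation this commit adds on the Ansible side can be illustrated in Python: check the field count before indexing, and surface the offending output in the error message. The function name and error text below are hypothetical.

```python
def parse_tier_output(output):
    """Parse the 'instance_type region' line emitted by the tier
    selection script.

    Validates the field count before indexing, so a malformed line
    (empty, or a single word) produces a clear error that includes
    the actual output, instead of a cryptic IndexError.
    """
    parts = output.split()
    if len(parts) != 2:
        raise ValueError(
            f"expected 'instance_type region', got: {output!r}"
        )
    return parts[0], parts[1]
```

This mirrors what the ansible.builtin.assert task does in the playbook: fail early with the raw output attached, rather than letting a downstream index lookup blow up.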