Linode GPU Kubernetes Infrastructure

OpenTofu infrastructure code for deploying production-ready, GPU-enabled Kubernetes clusters on Linode Kubernetes Engine (LKE), optimized for AI/ML workloads.

Overview

This repository provides automated infrastructure deployment for GPU-accelerated Kubernetes clusters with comprehensive monitoring, designed to serve as a foundation for AI/ML platforms and workloads.

Key Features:

  • GPU Compute: NVIDIA RTX 4000 Ada GPU nodes with automated driver installation
  • GPU Operator: NVIDIA GPU Operator for automated GPU management and monitoring
  • Metrics API: Kubernetes Metrics Server for resource monitoring and HPA
  • Monitoring Stack: Complete observability with Prometheus, Grafana, and Alertmanager
  • High Availability: Managed control plane with HA option
  • Autoscaling: Automatic node scaling (1-5 nodes)
  • Security: Configurable firewall rules and network policies
  • Automation: One-command deployment and management

Designed as an infrastructure foundation for AI/ML platforms such as Kubeflow, Ray, and MLflow, as well as custom ML workloads.

Quick Start

# Configure Linode API token (paste your Personal Access Token)
export LINODE_TOKEN="YOUR_PERSONAL_ACCESS_TOKEN"

# Initialize and deploy
cd tofu
tofu init
tofu plan
tofu apply

# Access cluster (kubeconfig automatically merged to ~/.kube/config)
kubectl get nodes
kubectl top nodes

# Access Grafana dashboard
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Then visit: http://localhost:3000 (admin/admin)

Deployment time:

  • Basic cluster: ~5 minutes
  • With GPU operator: ~15-20 minutes
  • With full monitoring stack: ~20-30 minutes

Prerequisites

  • OpenTofu >= 1.6 - Infrastructure as code tool
  • linode-cli - Linode API client (configured with token)
  • kubectl - Kubernetes command-line tool

macOS Installation

brew install opentofu kubectl
pip3 install linode-cli
linode-cli configure
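
Verify the tools are available before proceeding:

tofu version
kubectl version --client
linode-cli --version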

Project Structure

.
├── README.md              # This file
├── LICENSE                # MIT License
└── tofu/                  # OpenTofu infrastructure code
    ├── main.tf            # Core resources
    ├── variables.tf       # Configuration variables
    ├── outputs.tf         # Output values
    ├── tofu.tfvars.example # Configuration template
    └── modules/           # Reusable modules
        ├── gpu-operator/  # NVIDIA GPU Operator
        ├── metrics-server/ # Kubernetes Metrics Server
        └── kube-prometheus-stack/ # Monitoring stack

Workflow

Common OpenTofu actions:

# From repo root
cd tofu

# Initialize providers and modules
tofu init

# Review and apply changes
tofu plan && tofu apply

# Format and validate configuration
tofu fmt -recursive && tofu validate

# Destroy infrastructure when no longer needed
tofu destroy

For detailed module documentation, see tofu/modules/README.md.
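
After a successful apply, you can inspect the stack's outputs. The exact output names are defined in tofu/outputs.tf; cluster_id below is illustrative:

# List all outputs defined in tofu/outputs.tf
tofu output

# Read a single output value (name is illustrative — see outputs.tf for the real ones)
tofu output -raw cluster_id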

Configuration

The default configuration deploys to Chicago (us-ord) with the GPU operator and monitoring enabled. Customize by creating tofu/tofu.tfvars:

# Basic cluster configuration
# cluster_name_prefix = "my-cluster"  # Optional: defaults to your username
region              = "us-ord"
kubernetes_version  = "1.34"
gpu_node_type       = "g2-gpu-rtx4000a1-s"  # RTX 4000 Ada
gpu_node_count      = 1
autoscaler_min      = 1
autoscaler_max      = 5

# High availability
ha_control_plane = true

# GPU Operator (automated NVIDIA driver installation)
install_gpu_operator  = true
enable_gpu_monitoring = true

# Metrics Server (kubectl top, HPA support)
install_metrics_server = true

# Monitoring Stack (Prometheus + Grafana + Alertmanager)
install_monitoring      = true
grafana_admin_password  = "admin"  # Change in production!
prometheus_retention    = "15d"
prometheus_storage_size = "50Gi"
grafana_storage_size    = "10Gi"

See tofu/tofu.tfvars.example for all available configuration options.
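
Note: OpenTofu auto-loads terraform.tfvars and *.auto.tfvars files; depending on your OpenTofu version, a file named tofu.tfvars may need to be passed explicitly:

tofu plan -var-file=tofu.tfvars
tofu apply -var-file=tofu.tfvars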

Cluster Specifications

Component     Specification
-----------   --------------------------------
Platform      Linode Kubernetes Engine (LKE)
Region        Chicago, IL (us-ord)
Kubernetes    v1.34 (configurable)
GPU           NVIDIA RTX 4000 Ada (1 per node)
CPU           4 vCPU per node
Memory        16 GB per node
Storage       512 GB SSD per node
Nodes         1 by default, autoscaling 1-5

Cost Estimation

GPU Nodes: $1.50-2.00/hour per node (~$1,080-1,440/month per node at ~720 billable hours/month)

  • Monthly cost (1 node): ~$1,080-1,440
  • Monthly cost (2 nodes): ~$2,160-2,880
  • Control plane: free (standard); the HA option incurs an additional charge
  • Storage: included in node pricing; persistent volumes billed separately

Costs are approximate. Check Linode Pricing for current rates.

Security

  • API token read from the LINODE_TOKEN environment variable
  • Kubeconfig excluded from git tracking (auto-merged to ~/.kube/config)
  • Configurable firewall rules for kubectl and monitoring access
  • Support for Kubernetes RBAC and Network Policies
  • Grafana admin password (configurable, sensitive)

For production deployments, restrict access by IP:

allowed_kubectl_ips    = ["YOUR_IP/32"]
allowed_monitoring_ips = ["YOUR_IP/32"]
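
As a concrete example of the network policies mentioned above, a minimal default-deny ingress policy for a namespace might look like this (a sketch — adjust the namespace and selectors to your workloads):

cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: default
spec:
  podSelector: {}    # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress        # with no ingress rules listed, all inbound traffic is denied
EOF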

Cluster Management

Scale nodes:

# Edit tofu/tofu.tfvars
# gpu_node_count = 2

cd tofu && tofu apply

Update Kubernetes version:

# Edit tofu/tofu.tfvars
# kubernetes_version = "1.35"

cd tofu && tofu apply
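
LKE typically recycles worker nodes during a version upgrade; you can watch them come back on the new version:

kubectl get nodes -w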

Access Grafana:

kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Visit: http://localhost:3000 (default: admin/admin)
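
If you change grafana_admin_password (recommended), the kube-prometheus-stack chart stores the credential in a Kubernetes secret. Assuming the chart's default secret name (derived from the release name, as with the service above), it can be read back with:

kubectl get secret -n monitoring kube-prometheus-stack-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d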

Access Prometheus:

kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Visit: http://localhost:9090
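
With the port-forward running, a quick check against the Prometheus HTTP API shows how many targets are reporting the up metric (assumes curl and jq are installed):

curl -s 'http://localhost:9090/api/v1/query?query=up' | jq '.data.result | length'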

Check GPU availability:

kubectl get nodes -o json | jq '.items[].status.capacity."nvidia.com/gpu"'
kubectl get pods -n gpu-operator
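
To verify that pods can actually be scheduled onto a GPU, a minimal smoke test runs nvidia-smi in a CUDA container (a sketch — the image tag is one published by NVIDIA; pick one compatible with your driver version):

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # request one GPU from the device plugin
EOF

kubectl logs gpu-smoke-test     # should print the nvidia-smi table
kubectl delete pod gpu-smoke-test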

Check resource usage:

kubectl top nodes
kubectl top pods -A
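
Because the Metrics Server is installed, the Horizontal Pod Autoscaler works out of the box. For example, to autoscale a hypothetical deployment named my-app on CPU usage:

kubectl autoscale deployment my-app --cpu-percent=70 --min=1 --max=5
kubectl get hpa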

Destroy cluster:

cd tofu && tofu destroy

Features

Infrastructure

  • LKE cluster with GPU nodes (NVIDIA RTX 4000 Ada)
  • NVIDIA GPU Operator with automated driver installation
  • High availability control plane
  • Autoscaling configuration (1-5 nodes)
  • Firewall rules and network policies
  • OpenTofu-based automation
  • Kubeconfig auto-merge to ~/.kube/config (no local files)

Observability

  • Kubernetes Metrics Server (resource metrics API)
  • Prometheus (metrics collection and storage)
  • Grafana (visualization and dashboards)
  • Alertmanager (alert management)
  • Node Exporter (hardware and OS metrics)
  • Kube State Metrics (Kubernetes object metrics)
  • DCGM Exporter (GPU metrics integration)
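
Once the DCGM exporter is scraping, GPU metrics are queryable in Prometheus and Grafana. Two of its default metric names, usable directly as PromQL queries:

DCGM_FI_DEV_GPU_UTIL    # GPU utilization (%)
DCGM_FI_DEV_FB_USED     # framebuffer memory used (MiB)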

GPU Support

  • NVIDIA GPU Operator (automated driver management)
  • GPU device plugin (resource scheduling)
  • GPU monitoring with DCGM exporter
  • GPU metrics integration with Prometheus
  • Support for CUDA workloads

Use Cases

This infrastructure is designed for:

  • ML Platform Deployment: Foundation for Kubeflow, MLflow, Ray, etc.
  • AI Model Training: Distributed training with GPU acceleration
  • AI Model Serving: Inference workloads with GPU support
  • Data Science Workflows: Jupyter notebooks with GPU access
  • Custom ML Applications: Any containerized AI/ML workload
  • Development & Testing: GPU-enabled development environments

Support

For issues and questions, please open a GitHub issue.

Contributing

Contributions are welcome! Open an issue or pull request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Ihor Dvoretskyi (@idvoretskyi)
