OpenTofu infrastructure code for deploying production-ready, GPU-enabled Kubernetes clusters on Linode Kubernetes Engine (LKE) optimized for AI/ML workloads.
This repository provides automated infrastructure deployment for GPU-accelerated Kubernetes clusters with comprehensive monitoring, designed to serve as a foundation for AI/ML platforms and workloads.
Key Features:
- GPU Compute: NVIDIA RTX 4000 Ada GPU nodes with automated driver installation
- GPU Operator: NVIDIA GPU Operator for automated GPU management and monitoring
- Metrics API: Kubernetes Metrics Server for resource monitoring and HPA
- Monitoring Stack: Complete observability with Prometheus, Grafana, and Alertmanager
- High Availability: Managed control plane with HA option
- Autoscaling: Automatic node scaling (1-5 nodes)
- Security: Configurable firewall rules and network policies
- Automation: One-command deployment and management
Designed as an infrastructure foundation for AI/ML platforms like Kubeflow, Ray, and MLflow, as well as custom ML workloads.
# Configure Linode API token (paste your Personal Access Token)
export LINODE_TOKEN="YOUR_PERSONAL_ACCESS_TOKEN"
# Initialize and deploy
cd tofu
tofu init
tofu plan
tofu apply
# Access cluster (kubeconfig automatically merged to ~/.kube/config)
kubectl get nodes
kubectl top nodes
# Access Grafana dashboard
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Then visit: http://localhost:3000 (admin/admin)

Deployment time:
- Basic cluster: ~5 minutes
- With GPU operator: ~15-20 minutes
- With full monitoring stack: ~20-30 minutes
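
While the stack comes up, you can watch pods across namespaces. The deployment name below is an assumption based on the default kube-prometheus-stack Helm release used in this repo:

# Watch pods in all namespaces until everything is Running or Completed
kubectl get pods -A --watch
# Or block until Grafana finishes rolling out
kubectl -n monitoring rollout status deployment/kube-prometheus-stack-grafana

Prerequisites: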
- OpenTofu >= 1.6 - Infrastructure as code tool
- linode-cli - Linode API client (configured with token)
- kubectl - Kubernetes command-line tool
brew install opentofu kubectl
pip3 install linode-cli
linode-cli configure
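
A quick sanity check that the tools and token are working (any read-only API call will do):

# Verify the CLI can reach the Linode API with your token
linode-cli regions list
# Verify tofu and kubectl are on PATH
tofu version
kubectl version --client

Repository layout: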
├── README.md # This file
├── LICENSE # MIT License
└── tofu/ # OpenTofu infrastructure code
├── main.tf # Core resources
├── variables.tf # Configuration variables
├── outputs.tf # Output values
├── tofu.tfvars.example # Configuration template
└── modules/ # Reusable modules
├── gpu-operator/ # NVIDIA GPU Operator
├── metrics-server/ # Kubernetes Metrics Server
└── kube-prometheus-stack/ # Monitoring stack
Common OpenTofu actions:
# From repo root
cd tofu
# Initialize providers and modules
tofu init
# Review and apply changes
tofu plan && tofu apply
# Format and validate configuration
tofu fmt -recursive && tofu validate
# Destroy infrastructure when no longer needed
tofu destroy

For detailed module documentation, see tofu/modules/README.md.
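
To iterate on a single module, you can scope plan and apply with -target; the module address below is an assumption based on the directory layout, so check the module block names in main.tf:

# Plan only the GPU operator module (address assumed from module layout)
tofu plan -target=module.gpu-operator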
The default configuration deploys to Chicago (us-ord) with the GPU operator and monitoring enabled. Customize by creating tofu/tofu.tfvars:
# Basic cluster configuration
# cluster_name_prefix = "my-cluster" # Optional: defaults to your username
region = "us-ord"
kubernetes_version = "1.34"
gpu_node_type = "g2-gpu-rtx4000a1-s" # RTX 4000 Ada
gpu_node_count = 1
autoscaler_min = 1
autoscaler_max = 5
# High availability
ha_control_plane = true
# GPU Operator (automated NVIDIA driver installation)
install_gpu_operator = true
enable_gpu_monitoring = true
# Metrics Server (kubectl top, HPA support)
install_metrics_server = true
# Monitoring Stack (Prometheus + Grafana + Alertmanager)
install_monitoring = true
grafana_admin_password = "admin" # Change in production!
prometheus_retention = "15d"
prometheus_storage_size = "50Gi"
grafana_storage_size = "10Gi"

See tofu/tofu.tfvars.example for all available configuration options.
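
If your OpenTofu version does not pick up tofu.tfvars automatically, pass it explicitly:

cd tofu
tofu plan -var-file=tofu.tfvars
tofu apply -var-file=tofu.tfvars

Cluster specifications: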
| Component | Specification |
|---|---|
| Platform | Linode Kubernetes Engine (LKE) |
| Region | Chicago, IL (us-ord) |
| Kubernetes | v1.34 (configurable) |
| GPU | NVIDIA RTX 4000 Ada (1 per node) |
| CPU | 4 vCPU per node |
| Memory | 16 GB per node |
| Storage | 512 GB SSD per node |
| Nodes | 1 default, autoscaling 1-5 |
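
Once the cluster is up, you can check these numbers against a live node (substitute a name from kubectl get nodes):

# Show capacity (CPU, memory, nvidia.com/gpu) reported by a node
kubectl describe node <node-name> | grep -A8 'Capacity:'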
Estimated costs:
- GPU nodes: $1.50-2.00/hour per node (~$1,080-1,440/month per node, assuming ~720 hours/month)
- Monthly cost (1 node): ~$1,080-1,440
- Monthly cost (2 nodes): ~$2,160-2,880
- Control plane: free (standard) or an additional charge (HA)
- Storage: included in node pricing; persistent volumes billed separately
Costs are approximate. Check Linode Pricing for current rates.
Security:
- API token read from the LINODE_TOKEN environment variable
- Kubeconfig excluded from git tracking (auto-merged to ~/.kube/config)
- Configurable firewall rules for kubectl and monitoring access
- Support for Kubernetes RBAC and Network Policies (see the example below)
- Grafana admin password (configurable, sensitive)
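
As a sketch of the Network Policy support, a minimal default-deny ingress policy looks like this (the namespace name is illustrative):

kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: ml-workloads   # illustrative namespace
spec:
  podSelector: {}           # selects every pod in the namespace
  policyTypes:
    - Ingress               # no ingress rules are listed, so all ingress is denied
EOF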
For production deployments, restrict access by IP:
allowed_kubectl_ips = ["YOUR_IP/32"]
allowed_monitoring_ips = ["YOUR_IP/32"]
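
One way to find your current public IP (any "what's my IP" service works):

curl -s https://ifconfig.me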

Scale nodes:
# Edit tofu/tofu.tfvars
# gpu_node_count = 2
cd tofu && tofu apply

Update Kubernetes version:
# Edit tofu/tofu.tfvars
# kubernetes_version = "1.35"
cd tofu && tofu apply

Access Grafana:
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Visit: http://localhost:3000 (default: admin/admin)
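
If you changed grafana_admin_password, you can read it back from the chart-managed secret (the secret name assumes the default kube-prometheus-stack release name):

kubectl get secret -n monitoring kube-prometheus-stack-grafana \
  -o jsonpath='{.data.admin-password}' | base64 --decode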

Access Prometheus:
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Visit: http://localhost:9090

Check GPU availability:
kubectl get nodes -o json | jq '.items[].status.capacity."nvidia.com/gpu"'
kubectl get pods -n gpu-operator

Check resource usage:
kubectl top nodes
kubectl top pods -A
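
With the Metrics Server installed, the same metrics can drive a Horizontal Pod Autoscaler; the deployment name here is illustrative:

# Scale an example deployment between 1 and 4 replicas at 70% average CPU
kubectl autoscale deployment my-inference-server --cpu-percent=70 --min=1 --max=4
kubectl get hpa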

Destroy cluster:
cd tofu && tofu destroy

Deployed components:
- LKE cluster with GPU nodes (NVIDIA RTX 4000 Ada)
- NVIDIA GPU Operator with automated driver installation
- High availability control plane
- Autoscaling configuration (1-5 nodes)
- Firewall rules and network policies
- OpenTofu-based automation
- Kubeconfig auto-merge to ~/.kube/config (no local files)
- Kubernetes Metrics Server (resource metrics API)
- Prometheus (metrics collection and storage)
- Grafana (visualization and dashboards)
- Alertmanager (alert management)
- Node Exporter (hardware and OS metrics)
- Kube State Metrics (Kubernetes object metrics)
- DCGM Exporter (GPU metrics integration)
- NVIDIA GPU Operator (automated driver management)
- GPU device plugin (resource scheduling)
- GPU monitoring with DCGM exporter
- GPU metrics integration with Prometheus
- Support for CUDA workloads
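
To confirm that CUDA workloads actually schedule onto the GPU, a throwaway pod can request one GPU and run nvidia-smi (the CUDA image tag is just an example; any CUDA base image works):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # example CUDA base image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/gpu-smoke-test --timeout=180s
kubectl logs gpu-smoke-test    # should list the RTX 4000 Ada
kubectl delete pod gpu-smoke-test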
This infrastructure is designed for:
- ML Platform Deployment: Foundation for Kubeflow, MLflow, Ray, etc.
- AI Model Training: Distributed training with GPU acceleration
- AI Model Serving: Inference workloads with GPU support
- Data Science Workflows: Jupyter notebooks with GPU access
- Custom ML Applications: Any containerized AI/ML workload
- Development & Testing: GPU-enabled development environments
Resources:
- Linode Kubernetes Engine Documentation
- OpenTofu Documentation
- Kubernetes Documentation
- NVIDIA GPU Operator
- Prometheus Documentation
- Grafana Documentation
For issues and questions:
- Review the troubleshooting commands in the sections above
- Check tofu/modules/README.md for module-specific troubleshooting
- Visit the Linode Community Forums
- Consult Kubernetes documentation
- Open an issue on GitHub
Contributions are welcome! Open an issue or pull request.
This project is licensed under the MIT License - see the LICENSE file for details.
Author: Ihor Dvoretskyi (@idvoretskyi)
Acknowledgments:
- Akamai/Linode for the cloud platform
- OpenTofu community for infrastructure-as-code tooling
- Kubernetes community
- NVIDIA for GPU support and documentation
- Prometheus and Grafana communities