Linode GPU Kubernetes Infrastructure

OpenTofu infrastructure code for deploying production-ready, GPU-enabled Kubernetes clusters on Linode Kubernetes Engine (LKE), optimized for AI/ML workloads.

Overview

This repository provides automated infrastructure deployment for GPU-accelerated Kubernetes clusters with comprehensive monitoring, designed to serve as a foundation for AI/ML platforms and workloads.

Key Features:

  • GPU Compute: NVIDIA RTX 4000 Ada GPU nodes with automated driver installation
  • GPU Operator: NVIDIA GPU Operator for automated GPU management and monitoring
  • Metrics API: Kubernetes Metrics Server for resource monitoring and HPA
  • Monitoring Stack: Complete observability with Prometheus, Grafana, and Alertmanager
  • High Availability: Managed control plane with HA option
  • Autoscaling: Automatic node scaling (1-5 nodes)
  • Security: Configurable firewall rules and network policies
  • Automation: One-command deployment and management

Designed as an infrastructure foundation for AI/ML platforms such as Kubeflow, Ray, and MLflow, as well as custom ML workloads.

Quick Start

# Configure Linode API token (paste your Personal Access Token)
export LINODE_TOKEN="YOUR_PERSONAL_ACCESS_TOKEN"

# Initialize and deploy
cd tofu
tofu init
tofu plan
tofu apply

# Access cluster (kubeconfig automatically merged to ~/.kube/config)
kubectl get nodes
kubectl top nodes

# Access Grafana dashboard
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Then visit: http://localhost:3000 (admin/admin)

Deployment time:

  • Basic cluster: ~5 minutes
  • With GPU operator: ~15-20 minutes
  • With full monitoring stack: ~20-30 minutes

Prerequisites

  • OpenTofu >= 1.6 - Infrastructure as code tool
  • linode-cli - Linode API client (configured with token)
  • kubectl - Kubernetes command-line tool

macOS Installation

brew install opentofu kubectl
pip3 install linode-cli
linode-cli configure
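
Verify the tools are available before proceeding:

tofu version
kubectl version --client
linode-cli --version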

Project Structure

.
├── README.md              # This file
├── LICENSE                # MIT License
└── tofu/                  # OpenTofu infrastructure code
    ├── main.tf            # Core resources
    ├── variables.tf       # Configuration variables
    ├── outputs.tf         # Output values
    ├── tofu.tfvars.example # Configuration template
    └── modules/           # Reusable modules
        ├── gpu-operator/  # NVIDIA GPU Operator
        ├── metrics-server/ # Kubernetes Metrics Server
        └── kube-prometheus-stack/ # Monitoring stack

Workflow

Common OpenTofu actions:

# From repo root
cd tofu

# Initialize providers and modules
tofu init

# Review and apply changes
tofu plan && tofu apply

# Format and validate configuration
tofu fmt -recursive && tofu validate

# Destroy infrastructure when no longer needed
tofu destroy

For detailed module documentation, see tofu/modules/README.md.
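
After a successful apply, you can inspect the stack's outputs. The exact output names are defined in tofu/outputs.tf; cluster_id below is illustrative:

# List all outputs defined in tofu/outputs.tf
tofu output

# Read a single output value (name is illustrative — see outputs.tf for the real ones)
tofu output -raw cluster_id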

Configuration

The default configuration deploys to Chicago (us-ord) with the GPU operator and monitoring enabled. Customize by creating tofu/tofu.tfvars:

# Basic cluster configuration
# cluster_name_prefix = "my-cluster"  # Optional: defaults to your username
region              = "us-ord"
kubernetes_version  = "1.34"
gpu_node_type       = "g2-gpu-rtx4000a1-s"  # RTX 4000 Ada
gpu_node_count      = 1
autoscaler_min      = 1
autoscaler_max      = 5

# High availability
ha_control_plane = true

# GPU Operator (automated NVIDIA driver installation)
install_gpu_operator  = true
enable_gpu_monitoring = true

# Metrics Server (kubectl top, HPA support)
install_metrics_server = true

# Monitoring Stack (Prometheus + Grafana + Alertmanager)
install_monitoring      = true
grafana_admin_password  = "admin"  # Change in production!
prometheus_retention    = "15d"
prometheus_storage_size = "50Gi"
grafana_storage_size    = "10Gi"

See tofu/tofu.tfvars.example for all available configuration options.
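
Note: OpenTofu auto-loads terraform.tfvars and *.auto.tfvars files; depending on your OpenTofu version, a file named tofu.tfvars may need to be passed explicitly:

tofu plan -var-file=tofu.tfvars
tofu apply -var-file=tofu.tfvars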

Cluster Specifications

Component     Specification
-----------   --------------------------------
Platform      Linode Kubernetes Engine (LKE)
Region        Chicago, IL (us-ord)
Kubernetes    v1.34 (configurable)
GPU           NVIDIA RTX 4000 Ada (1 per node)
CPU           4 vCPU per node
Memory        16 GB per node
Storage       512 GB SSD per node
Nodes         1 by default, autoscaling 1-5

Cost Estimation

GPU Nodes: $1.50-2.00/hour per node (~$1,080-1,440/month per node at ~720 billable hours/month)

  • Monthly cost (1 node): ~$1,080-1,440
  • Monthly cost (2 nodes): ~$2,160-2,880
  • Control plane: free (standard); the HA option incurs an additional charge
  • Storage: included in node pricing; persistent volumes billed separately

Costs are approximate. Check Linode Pricing for current rates.

Security

  • API token read from the LINODE_TOKEN environment variable
  • Kubeconfig excluded from git tracking (auto-merged to ~/.kube/config)
  • Configurable firewall rules for kubectl and monitoring access
  • Support for Kubernetes RBAC and Network Policies
  • Grafana admin password (configurable, sensitive)

For production deployments, restrict access by IP:

allowed_kubectl_ips    = ["YOUR_IP/32"]
allowed_monitoring_ips = ["YOUR_IP/32"]
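
As a concrete example of the network policies mentioned above, a minimal default-deny ingress policy for a namespace might look like this (a sketch — adjust the namespace and selectors to your workloads):

cat <<'EOF' | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: default
spec:
  podSelector: {}    # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress        # with no ingress rules listed, all inbound traffic is denied
EOF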

Cluster Management

Scale nodes:

# Edit tofu/tofu.tfvars
# gpu_node_count = 2

cd tofu && tofu apply

Update Kubernetes version:

# Edit tofu/tofu.tfvars
# kubernetes_version = "1.35"

cd tofu && tofu apply
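
LKE typically recycles worker nodes during a version upgrade; you can watch them come back on the new version:

kubectl get nodes -w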

Access Grafana:

kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
# Visit: http://localhost:3000 (default: admin/admin)
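
If you change grafana_admin_password (recommended), the kube-prometheus-stack chart stores the credential in a Kubernetes secret. Assuming the chart's default secret name (derived from the release name, as with the service above), it can be read back with:

kubectl get secret -n monitoring kube-prometheus-stack-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d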

Access Prometheus:

kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090
# Visit: http://localhost:9090
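
With the port-forward running, a quick check against the Prometheus HTTP API shows how many targets are reporting the up metric (assumes curl and jq are installed):

curl -s 'http://localhost:9090/api/v1/query?query=up' | jq '.data.result | length'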

Check GPU availability:

kubectl get nodes -o json | jq '.items[].status.capacity."nvidia.com/gpu"'
kubectl get pods -n gpu-operator
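
To verify that pods can actually be scheduled onto a GPU, a minimal smoke test runs nvidia-smi in a CUDA container (a sketch — the image tag is one published by NVIDIA; pick one compatible with your driver version):

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # request one GPU from the device plugin
EOF

kubectl logs gpu-smoke-test     # should print the nvidia-smi table
kubectl delete pod gpu-smoke-test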

Check resource usage:

kubectl top nodes
kubectl top pods -A
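
Because the Metrics Server is installed, the Horizontal Pod Autoscaler works out of the box. For example, to autoscale a hypothetical deployment named my-app on CPU usage:

kubectl autoscale deployment my-app --cpu-percent=70 --min=1 --max=5
kubectl get hpa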

Destroy cluster:

cd tofu && tofu destroy

Features

Infrastructure

  • LKE cluster with GPU nodes (NVIDIA RTX 4000 Ada)
  • NVIDIA GPU Operator with automated driver installation
  • High availability control plane
  • Autoscaling configuration (1-5 nodes)
  • Firewall rules and network policies
  • OpenTofu-based automation
  • Kubeconfig auto-merge to ~/.kube/config (no local files)

Observability

  • Kubernetes Metrics Server (resource metrics API)
  • Prometheus (metrics collection and storage)
  • Grafana (visualization and dashboards)
  • Alertmanager (alert management)
  • Node Exporter (hardware and OS metrics)
  • Kube State Metrics (Kubernetes object metrics)
  • DCGM Exporter (GPU metrics integration)
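
Once the DCGM exporter is scraping, GPU metrics are queryable in Prometheus and Grafana. Two of its default metric names, usable directly as PromQL queries:

DCGM_FI_DEV_GPU_UTIL    # GPU utilization (%)
DCGM_FI_DEV_FB_USED     # framebuffer memory used (MiB)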

GPU Support

  • NVIDIA GPU Operator (automated driver management)
  • GPU device plugin (resource scheduling)
  • GPU monitoring with DCGM exporter
  • GPU metrics integration with Prometheus
  • Support for CUDA workloads

Use Cases

This infrastructure is designed for:

  • ML Platform Deployment: Foundation for Kubeflow, MLflow, Ray, etc.
  • AI Model Training: Distributed training with GPU acceleration
  • AI Model Serving: Inference workloads with GPU support
  • Data Science Workflows: Jupyter notebooks with GPU access
  • Custom ML Applications: Any containerized AI/ML workload
  • Development & Testing: GPU-enabled development environments

Support

For issues and questions, please open a GitHub issue.

Contributing

Contributions are welcome! Open an issue or pull request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Author

Ihor Dvoretskyi (@idvoretskyi)
