google · copybara-service · May 15, 2026
diff --git a/skills/cloud/gke-upgrades/SKILL.md b/skills/cloud/gke-upgrades/SKILL.md
@@ -0,0 +1,110 @@
+---
+name: gke-upgrades
+description: >
+  Plans, executes, and validates Google Kubernetes Engine (GKE) cluster upgrades
+  and maintenance operations for both Standard and Autopilot clusters. Produces
+  upgrade plans, pre/post-upgrade checklists, maintenance runbooks with gcloud
+  commands, release channel strategy, and troubleshooting guides. Handles node
+  pool upgrade strategies (surge, blue-green), version compatibility, PDB
+  management, and workload-specific concerns (stateful, GPU, operators). Use this
+  skill whenever the user mentions GKE upgrades, Kubernetes version bumps, node
+  pool maintenance, GKE patching, cluster version management, release channel
+  selection, maintenance windows, surge upgrades, stuck upgrades, or any GKE
+  lifecycle management task — even casual mentions like "we need to upgrade our
+  clusters" or "plan our next GKE maintenance" or "our upgrade is stuck."
+---
+
+# GKE Upgrades & Maintenance
+
+Produce clear, actionable documents — upgrade plans, runbooks, or checklists — tailored to the user's environment. Output should be specific to their cluster mode, release channel, version, and workload types rather than generic advice.
+
+Always frame guidance around the auto-upgrade model: auto-upgrade with maintenance windows and exclusions is the preferred control mechanism. Manual upgrades are for exceptions.
+
+## Context Gathering
+
+Before producing any upgrade artifact, establish:
+- **Cluster mode** — Standard or Autopilot? (Autopilot has no node pool management, mandatory resource requests, no SSH)
+- **Current and target versions** — Node version skew must be within 2 minor versions of control plane.
+- **Release channel** — Rapid, Regular, Stable, or Extended.
+- **Environment topology** — Single vs multi-cluster, dev/staging/prod tiers.
+- **Workload sensitivity** — StatefulSets, databases, GPU, long-running batch need special handling.
+
+If the user provides these upfront, skip straight to the deliverable. If they're vague, fill in reasonable defaults and flag assumptions.
+
+## Core Principles
+
+1. **Sequential control plane, skip-level node pools** -- Control plane upgrades are sequential (N → N+1 → N+2). Node pools support skip-level (N+2) upgrades. GKE supports a 2-step CP minor upgrade where step 1 is rollbackable.
+2. **Control plane first** -- Control plane must be upgraded before node pools. Nodes can trail by up to 2 minor versions.
+3. **Environment progression** -- Always upgrade dev/staging before production. Use release channels to enforce this: Rapid → Regular → Stable.
+4. **Workload-aware** -- Upgrade strategy depends on what's running (stateless, stateful, GPU, batch).
+5. **Release channels first** -- Always recommend release channels with maintenance exclusions. Never recommend "No channel" as a first option.
+6. **Rollback** -- CP patch downgrades are customer-doable. CP minor downgrades require GKE support. Node pools can be re-created at a different version.
+
+## Release Channels
+
+| Channel | Best for | SLA | Support |
+|---------|----------|-----|---------|
+| **Rapid** | Dev/test, early feature access | No upgrade stability SLA | 14 months |
+| **Regular** (default) | Most production | Full SLA | 14 months |
+| **Stable** | Mission-critical, stability-first | Full SLA | 14 months |
+| **Extended** | Compliance, EoS enforcement control | Full SLA | Up to 24 months (extra cost) |
+
+Common multi-environment strategy: Dev→Rapid, Staging→Regular, Prod→Stable or Regular.
+
+## Maintenance Windows & Exclusions
+
+Configure maintenance windows to control auto-upgrade timing.
+
+**Exclusion types:**
+- **"No upgrades"**: Blocks everything for up to 30 days (BFCM, freezes).
+- **"No minor or node upgrades"**: Blocks minor and node upgrades, allows CP patches. Up to EoS.
+- **"No minor upgrades"**: Blocks minor upgrades, allows patches and node upgrades. Up to EoS.
+
+Recommend cluster-level exclusions to prevent skew. Use `--add-maintenance-exclusion-until-end-of-support` for persistent exclusions.
+
+## Upgrade Planning
+
+When asked to plan an upgrade, produce a structured document covering:
+- Version compatibility (breaking changes, deprecated APIs)
+- Upgrade path (sequential minor version upgrades)
+- Node pool upgrade strategy (Standard only)
+- Workload readiness (PDBs, resource requests)
+
+### Node Pool Strategy (Standard Only)
+
+Recommend surge upgrade as the default, with per-pool settings:
+- **Stateless**: Higher `maxSurge` (2-3) for speed, `maxUnavailable=0` for safety.
+- **Stateful/DB**: `maxSurge=1, maxUnavailable=0` (conservative).
+- **GPU (fixed reservation)**: `maxSurge=0, maxUnavailable=1` (no surge capacity).
+- **Large (50+ nodes)**: `maxSurge=20, maxUnavailable=0` (max parallelism).
+
+Recommend blue-green upgrade for mission-critical apps needing fast rollback or strict validation. Use autoscaled blue-green for long-running batch or disruption-sensitive workloads.
+
+For standard command sequences and runbook templates, see `references/runbook-template.md`.
+
+### Large-Scale AI/ML Clusters (GPU/TPU)
+
+- GPU VMs do not support live migration — upgrades force pod restart.
+- H100/A100 typically use fixed reservations with no surge capacity. Use `maxSurge=0, maxUnavailable=1`.
+- GPU driver is coupled with target node version; verify CUDA compatibility.
+- Use maintenance exclusions during active training campaigns. Cordon GPU nodes and wait for jobs to complete.
+- TPU slices are recreated atomically (not rolling); maintenance on one slice restarts all slices in the environment.
+
+## Checklists
+
+Produce checklists as copyable markdown with checkboxes. See `references/checklists.md` for the full pre-upgrade and post-upgrade checklist templates. Adapt them to the user's environment.
+
+## Maintenance runbooks
+
+Produce step-by-step runbooks with actual `gcloud` and `kubectl` commands. See `references/runbook-template.md` for the standard command sequences.
+
+## Troubleshooting
+
+When a user reports a stuck or failing upgrade, walk through diagnosis systematically in this order:
+1. PDB blocking drain (check `kubectl get pdb -A`)
+2. Resource constraints (pods pending, increase maxSurge)
+3. Bare pods (must delete or wrap in controllers)
+4. Admission webhooks rejecting pod creation
+5. PVC attachment issues (volume migration)
+
+For a detailed diagnostic flowchart and fix procedures, see `references/troubleshooting.md`.
diff --git a/skills/cloud/gke-upgrades/references/checklists.md b/skills/cloud/gke-upgrades/references/checklists.md
@@ -0,0 +1,74 @@
+# Checklist Templates
+
+Adapt these to the user's environment. Fill in cluster names, versions, and remove items that don't apply.
+
+## Pre-Upgrade Checklist
+
+```
+Pre-Upgrade Checklist
+- [ ] Cluster: ___ | Mode: Standard / Autopilot | Channel: ___
+- [ ] Current version: ___ | Target version: ___
+
+Compatibility
+- [ ] Target version available in release channel (`gcloud container get-server-config --zone ZONE --format="yaml(channels)"`)
+- [ ] No deprecated API usage (check GKE deprecation insights dashboard)
+- [ ] GKE release notes reviewed for breaking changes between current → target
+- [ ] Node version skew within 2 minor versions of control plane
+- [ ] Third-party operators/controllers compatible with target version
+- [ ] Admission webhooks tested against target version
+
+Workload Readiness
+- [ ] PDBs configured for critical workloads (not overly restrictive)
+- [ ] No bare pods — all managed by controllers
+- [ ] terminationGracePeriodSeconds adequate for graceful shutdown
+- [ ] StatefulSet PV backups completed, reclaim policies verified
+- [ ] Resource requests/limits set on all containers (mandatory for Autopilot)
+- [ ] GPU driver compatibility confirmed with target node image (if applicable)
+- [ ] Postgres/database operator compatibility verified (if applicable)
+
+Infrastructure (Standard only)
+- [ ] Node pool upgrade strategy chosen (surge / blue-green)
+- [ ] Surge settings configured per pool: maxSurge=___ maxUnavailable=___
+- [ ] Sufficient compute quota for surge nodes
+- [ ] Maintenance window configured (off-peak hours)
+- [ ] Maintenance exclusions set for freeze periods (if applicable)
+
+Ops Readiness
+- [ ] Monitoring and alerting active (Cloud Monitoring / Prometheus)
+- [ ] Baseline metrics captured (error rates, latency, throughput)
+- [ ] Upgrade window communicated to stakeholders
+- [ ] Rollback plan documented
+- [ ] On-call team aware and available
+```
+
+## Post-Upgrade Checklist
+
+```
+Post-Upgrade Checklist
+
+Cluster Health
+- [ ] Control plane at target version: `gcloud container clusters describe CLUSTER --zone ZONE --format="value(currentMasterVersion)"`
+- [ ] All node pools at target version: `gcloud container node-pools list --cluster CLUSTER --zone ZONE`
+- [ ] All nodes Ready: `kubectl get nodes`
+- [ ] System pods healthy: `kubectl get pods -n kube-system`
+- [ ] No stuck PDBs: `kubectl get pdb --all-namespaces`
+
+Workload Health
+- [ ] All deployments at desired replica count: `kubectl get deployments -A`
+- [ ] No CrashLoopBackOff or Pending pods: `kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded`
+- [ ] StatefulSets fully ready: `kubectl get statefulsets -A`
+- [ ] Ingress/load balancers responding
+- [ ] Application health checks and smoke tests passing
+
+Observability
+- [ ] Metrics pipeline active, no collection gaps
+- [ ] Logs flowing to aggregation
+- [ ] Error rates within pre-upgrade baseline
+- [ ] Latency (p50/p95/p99) within pre-upgrade baseline
+
+Cleanup
+- [ ] Old node pools removed (if blue-green)
+- [ ] Surge quota released (automatic for surge upgrades)
+- [ ] Upgrade documented in changelog
+- [ ] Lessons learned captured
+```
diff --git a/skills/cloud/gke-upgrades/references/runbook-template.md b/skills/cloud/gke-upgrades/references/runbook-template.md
@@ -0,0 +1,102 @@
+# Runbook Command Templates
+
+Standard command sequences for GKE upgrades. Replace placeholders: `CLUSTER_NAME`, `ZONE`, `TARGET_VERSION`, `NODE_POOL_NAME`.
+
+## Pre-flight
+
+```bash
+# Current versions
+gcloud container clusters describe CLUSTER_NAME \
+  --zone ZONE \
+  --format="table(name, currentMasterVersion, nodePools[].version)"
+
+# Available versions for channel
+gcloud container get-server-config --zone ZONE \
+  --format="yaml(channels)"
+
+# Deprecated API usage
+kubectl get --raw /metrics | grep apiserver_request_total | grep deprecated
+
+# Cluster health
+kubectl get nodes
+kubectl get pods -A | grep -v Running | grep -v Completed
+```
+
+## Control plane upgrade
+
+```bash
+gcloud container clusters upgrade CLUSTER_NAME \
+  --zone ZONE \
+  --master \
+  --cluster-version TARGET_VERSION
+
+# Verify (wait ~10-15 min)
+gcloud container clusters describe CLUSTER_NAME \
+  --zone ZONE \
+  --format="value(currentMasterVersion)"
+
+kubectl get pods -n kube-system
+```
+
+## Node pool upgrade (Standard only)
+
+```bash
+# Configure surge settings
+gcloud container node-pools update NODE_POOL_NAME \
+  --cluster CLUSTER_NAME \
+  --zone ZONE \
+  --max-surge-upgrade MAX_SURGE \
+  --max-unavailable-upgrade MAX_UNAVAILABLE
+
+# Upgrade
+gcloud container node-pools upgrade NODE_POOL_NAME \
+  --cluster CLUSTER_NAME \
+  --zone ZONE \
+  --cluster-version TARGET_VERSION
+
+# Monitor progress
+watch 'kubectl get nodes -o wide -L cloud.google.com/gke-nodepool'
+
+# Verify
+gcloud container node-pools list --cluster CLUSTER_NAME --zone ZONE
+kubectl get pods -A | grep -v Running | grep -v Completed
+```
+
+## Maintenance window configuration
+
+```bash
+# Set recurring maintenance window
+gcloud container clusters update CLUSTER_NAME \
+  --zone ZONE \
+  --maintenance-window-start YYYY-MM-DDTHH:MM:SSZ \
+  --maintenance-window-end YYYY-MM-DDTHH:MM:SSZ \
+  --maintenance-window-recurrence "FREQ=WEEKLY;BYDAY=SA"
+
+# Add maintenance exclusion (up to 30 days)
+gcloud container clusters update CLUSTER_NAME \
+  --zone ZONE \
+  --add-maintenance-exclusion-name "EXCLUSION_NAME" \
+  --add-maintenance-exclusion-start-time START_TIME \
+  --add-maintenance-exclusion-end-time END_TIME
+```
+
+## Rollback guidance
+
+Control plane downgrade is rare and not recommended without GKE support involvement. Node pool downgrades require creating a new pool at the old version and migrating workloads.
+
+```bash
+# Cancel in-progress node pool upgrade (if needed)
+# GKE will finish the current node and stop
+gcloud container operations list --cluster CLUSTER_NAME --zone ZONE
+
+# Create replacement node pool at previous version (if rollback needed)
+gcloud container node-pools create NODE_POOL_NAME-rollback \
+  --cluster CLUSTER_NAME \
+  --zone ZONE \
+  --cluster-version PREVIOUS_VERSION \
+  --num-nodes NUM_NODES \
+  --machine-type MACHINE_TYPE
+
+# Cordon old pool and migrate workloads
+kubectl cordon -l cloud.google.com/gke-nodepool=NODE_POOL_NAME
+```