From 18ba1c18c9fc60b10e216738876c963af6c42cfe Mon Sep 17 00:00:00 2001
From: Cloud IX Team <cloud-ix-copybara@google.com>
Date: Fri, 15 May 2026 07:49:49 -0700
Subject: [PATCH] Add GKE Upgrades agent skill

PiperOrigin-RevId: 916002403
---
 skills/cloud/gke-upgrades/SKILL.md            | 110 ++++++++++++++++
 .../gke-upgrades/references/checklists.md     |  74 +++++++++++
 .../references/runbook-template.md            | 102 +++++++++++++++
 .../references/troubleshooting.md             | 118 ++++++++++++++++++
 4 files changed, 404 insertions(+)
 create mode 100644 skills/cloud/gke-upgrades/SKILL.md
 create mode 100644 skills/cloud/gke-upgrades/references/checklists.md
 create mode 100644 skills/cloud/gke-upgrades/references/runbook-template.md
 create mode 100644 skills/cloud/gke-upgrades/references/troubleshooting.md

diff --git a/skills/cloud/gke-upgrades/SKILL.md b/skills/cloud/gke-upgrades/SKILL.md
new file mode 100644
index 0000000000..6c8d25592a
--- /dev/null
+++ b/skills/cloud/gke-upgrades/SKILL.md
@@ -0,0 +1,110 @@
+---
+name: gke-upgrades
+description: >
+  Plans, executes, and validates Google Kubernetes Engine (GKE) cluster upgrades
+  and maintenance operations for both Standard and Autopilot clusters. Produces
+  upgrade plans, pre/post-upgrade checklists, maintenance runbooks with gcloud
+  commands, release channel strategy, and troubleshooting guides. Handles node
+  pool upgrade strategies (surge, blue-green), version compatibility, PDB
+  management, and workload-specific concerns (stateful, GPU, operators). Use this
+  skill whenever the user mentions GKE upgrades, Kubernetes version bumps, node
+  pool maintenance, GKE patching, cluster version management, release channel
+  selection, maintenance windows, surge upgrades, stuck upgrades, or any GKE
+  lifecycle management task — even casual mentions like "we need to upgrade our
+  clusters" or "plan our next GKE maintenance" or "our upgrade is stuck."
+---
+
+# GKE Upgrades & Maintenance
+
+Produce clear, actionable documents — upgrade plans, runbooks, or checklists — tailored to the user's environment. Output should be specific to their cluster mode, release channel, version, and workload types rather than generic advice.
+
+Always frame guidance around the auto-upgrade model: auto-upgrade with maintenance windows and exclusions is the preferred control mechanism. Manual upgrades are for exceptions.
+
+## Context Gathering
+
+Before producing any upgrade artifact, establish:
+- **Cluster mode** — Standard or Autopilot? (Autopilot has no node pool management, mandatory resource requests, no SSH)
+- **Current and target versions** — Node version skew must be within 2 minor versions of control plane.
+- **Release channel** — Rapid, Regular, Stable, or Extended.
+- **Environment topology** — Single vs multi-cluster, dev/staging/prod tiers.
+- **Workload sensitivity** — StatefulSets, databases, GPU, long-running batch need special handling.
+
+If the user provides these upfront, skip straight to the deliverable. If they're vague, fill in reasonable defaults and flag assumptions.
+
+## Core Principles
+
+1. **Sequential control plane, skip-level node pools** -- Control plane minor upgrades are sequential (N → N+1 → N+2). Node pools support skip-level upgrades (N → N+2). GKE supports a 2-step CP minor upgrade where step 1 can be rolled back.
+2. **Control plane first** -- Control plane must be upgraded before node pools. Nodes can trail by up to 2 minor versions.
+3. **Environment progression** -- Always upgrade dev/staging before production. Use release channels to enforce this: Rapid → Regular → Stable.
+4. **Workload-aware** -- Upgrade strategy depends on what's running (stateless, stateful, GPU, batch).
+5. **Release channels first** -- Always recommend release channels with maintenance exclusions. Never recommend "No channel" as a first option.
+6. **Rollback** -- CP patch downgrades are customer-doable. CP minor downgrades require GKE support. Node pools can be re-created at a different version.
+
+## Release Channels
+
+| Channel | Best for | SLA | Support |
+|---------|----------|-----|---------|
+| **Rapid** | Dev/test, early feature access | No upgrade stability SLA | 14 months |
+| **Regular** (default) | Most production | Full SLA | 14 months |
+| **Stable** | Mission-critical, stability-first | Full SLA | 14 months |
+| **Extended** | Compliance, EoS enforcement control | Full SLA | Up to 24 months (extra cost) |
+
+Common multi-environment strategy: Dev→Rapid, Staging→Regular, Prod→Stable or Regular.
+
+## Maintenance Windows & Exclusions
+
+Configure maintenance windows to control auto-upgrade timing.
+
+**Exclusion types:**
+- **"No upgrades"**: Blocks all upgrades for up to 30 days (for example, Black Friday/Cyber Monday or code freezes).
+- **"No minor or node upgrades"**: Blocks minor and node upgrades, allows CP patches. Up to EoS.
+- **"No minor upgrades"**: Blocks minor upgrades, allows patches and node upgrades. Up to EoS.
+
+Recommend cluster-level exclusions to prevent skew. Use `--add-maintenance-exclusion-until-end-of-support` for persistent exclusions.
+
+## Upgrade Planning
+
+When asked to plan an upgrade, produce a structured document covering:
+- Version compatibility (breaking changes, deprecated APIs)
+- Upgrade path (sequential minor version upgrades)
+- Node pool upgrade strategy (Standard only)
+- Workload readiness (PDBs, resource requests)
+
+### Node Pool Strategy (Standard Only)
+
+Recommend surge upgrade as the default, with per-pool settings:
+- **Stateless**: Higher `maxSurge` (2-3) for speed, `maxUnavailable=0` for safety.
+- **Stateful/DB**: `maxSurge=1, maxUnavailable=0` (conservative).
+- **GPU (fixed reservation)**: `maxSurge=0, maxUnavailable=1` (no surge capacity).
+- **Large (50+ nodes)**: `maxSurge=20, maxUnavailable=0` (max parallelism).
+
+Recommend blue-green upgrade for mission-critical apps needing fast rollback or strict validation. Use autoscaled blue-green for long-running batch or disruption-sensitive workloads.
+
+For standard command sequences and runbook templates, see `references/runbook-template.md`.
+
+### Large-Scale AI/ML Clusters (GPU/TPU)
+
+- GPU VMs do not support live migration — upgrades force pod restart.
+- H100/A100 typically use fixed reservations with no surge capacity. Use `maxSurge=0, maxUnavailable=1`.
+- GPU driver is coupled with target node version; verify CUDA compatibility.
+- Use maintenance exclusions during active training campaigns. Cordon GPU nodes and wait for jobs to complete.
+- TPU slices are recreated atomically (not rolling); in a multislice environment, maintenance on one slice restarts all slices.
+
+## Checklists
+
+Produce checklists as copyable markdown with checkboxes. See `references/checklists.md` for the full pre-upgrade and post-upgrade checklist templates. Adapt them to the user's environment.
+
+## Maintenance Runbooks
+
+Produce step-by-step runbooks with actual `gcloud` and `kubectl` commands. See `references/runbook-template.md` for the standard command sequences.
+
+## Troubleshooting
+
+When a user reports a stuck or failing upgrade, walk through diagnosis systematically in this order:
+1. PDB blocking drain (check `kubectl get pdb -A`)
+2. Resource constraints (pods pending, increase maxSurge)
+3. Bare pods (must delete or wrap in controllers)
+4. Admission webhooks rejecting pod creation
+5. PVC attachment issues (volume migration)
+
+For a detailed diagnostic flowchart and fix procedures, see `references/troubleshooting.md`.
diff --git a/skills/cloud/gke-upgrades/references/checklists.md b/skills/cloud/gke-upgrades/references/checklists.md
new file mode 100644
index 0000000000..4131f905ad
--- /dev/null
+++ b/skills/cloud/gke-upgrades/references/checklists.md
@@ -0,0 +1,74 @@
+# Checklist Templates
+
+Adapt these to the user's environment. Fill in cluster names, versions, and remove items that don't apply.
+
+## Pre-Upgrade Checklist
+
+```
+Pre-Upgrade Checklist
+- [ ] Cluster: ___ | Mode: Standard / Autopilot | Channel: ___
+- [ ] Current version: ___ | Target version: ___
+
+Compatibility
+- [ ] Target version available in release channel (`gcloud container get-server-config --zone ZONE --format="yaml(channels)"`)
+- [ ] No deprecated API usage (check GKE deprecation insights dashboard)
+- [ ] GKE release notes reviewed for breaking changes between current → target
+- [ ] Node version skew within 2 minor versions of control plane
+- [ ] Third-party operators/controllers compatible with target version
+- [ ] Admission webhooks tested against target version
+
+Workload Readiness
+- [ ] PDBs configured for critical workloads (not overly restrictive)
+- [ ] No bare pods — all managed by controllers
+- [ ] terminationGracePeriodSeconds adequate for graceful shutdown
+- [ ] StatefulSet PV backups completed, reclaim policies verified
+- [ ] Resource requests/limits set on all containers (mandatory for Autopilot)
+- [ ] GPU driver compatibility confirmed with target node image (if applicable)
+- [ ] Postgres/database operator compatibility verified (if applicable)
+
+Infrastructure (Standard only)
+- [ ] Node pool upgrade strategy chosen (surge / blue-green)
+- [ ] Surge settings configured per pool: maxSurge=___ maxUnavailable=___
+- [ ] Sufficient compute quota for surge nodes
+- [ ] Maintenance window configured (off-peak hours)
+- [ ] Maintenance exclusions set for freeze periods (if applicable)
+
+Ops Readiness
+- [ ] Monitoring and alerting active (Cloud Monitoring / Prometheus)
+- [ ] Baseline metrics captured (error rates, latency, throughput)
+- [ ] Upgrade window communicated to stakeholders
+- [ ] Rollback plan documented
+- [ ] On-call team aware and available
+```
+
+## Post-Upgrade Checklist
+
+```
+Post-Upgrade Checklist
+
+Cluster Health
+- [ ] Control plane at target version: `gcloud container clusters describe CLUSTER --zone ZONE --format="value(currentMasterVersion)"`
+- [ ] All node pools at target version: `gcloud container node-pools list --cluster CLUSTER --zone ZONE`
+- [ ] All nodes Ready: `kubectl get nodes`
+- [ ] System pods healthy: `kubectl get pods -n kube-system`
+- [ ] No stuck PDBs: `kubectl get pdb --all-namespaces`
+
+Workload Health
+- [ ] All deployments at desired replica count: `kubectl get deployments -A`
+- [ ] No CrashLoopBackOff or Pending pods: `kubectl get pods -A | grep -vE "Running|Completed"` (a phase field selector misses CrashLoopBackOff, because those pods still report phase Running)
+- [ ] StatefulSets fully ready: `kubectl get statefulsets -A`
+- [ ] Ingress/load balancers responding
+- [ ] Application health checks and smoke tests passing
+
+Observability
+- [ ] Metrics pipeline active, no collection gaps
+- [ ] Logs flowing to aggregation
+- [ ] Error rates within pre-upgrade baseline
+- [ ] Latency (p50/p95/p99) within pre-upgrade baseline
+
+Cleanup
+- [ ] Old node pools removed (if blue-green)
+- [ ] Surge quota released (automatic for surge upgrades)
+- [ ] Upgrade documented in changelog
+- [ ] Lessons learned captured
+```
diff --git a/skills/cloud/gke-upgrades/references/runbook-template.md b/skills/cloud/gke-upgrades/references/runbook-template.md
new file mode 100644
index 0000000000..47e4296730
--- /dev/null
+++ b/skills/cloud/gke-upgrades/references/runbook-template.md
@@ -0,0 +1,102 @@
+# Runbook Command Templates
+
+Standard command sequences for GKE upgrades. Replace placeholders: `CLUSTER_NAME`, `ZONE`, `TARGET_VERSION`, `NODE_POOL_NAME`.
+
+## Pre-flight
+
+```bash
+# Current versions
+gcloud container clusters describe CLUSTER_NAME \
+  --zone ZONE \
+  --format="table(name, currentMasterVersion, nodePools[].version)"
+
+# Available versions for channel
+gcloud container get-server-config --zone ZONE \
+  --format="yaml(channels)"
+
+# Deprecated API usage (this gauge is only emitted for deprecated APIs that have been requested)
+kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis
+
+# Cluster health
+kubectl get nodes
+kubectl get pods -A | grep -v Running | grep -v Completed
+```
+
+## Control plane upgrade
+
+```bash
+gcloud container clusters upgrade CLUSTER_NAME \
+  --zone ZONE \
+  --master \
+  --cluster-version TARGET_VERSION
+
+# Verify (wait ~10-15 min)
+gcloud container clusters describe CLUSTER_NAME \
+  --zone ZONE \
+  --format="value(currentMasterVersion)"
+
+kubectl get pods -n kube-system
+```
+
+## Node pool upgrade (Standard only)
+
+```bash
+# Configure surge settings
+gcloud container node-pools update NODE_POOL_NAME \
+  --cluster CLUSTER_NAME \
+  --zone ZONE \
+  --max-surge-upgrade MAX_SURGE \
+  --max-unavailable-upgrade MAX_UNAVAILABLE
+
+# Upgrade
+gcloud container node-pools upgrade NODE_POOL_NAME \
+  --cluster CLUSTER_NAME \
+  --zone ZONE \
+  --cluster-version TARGET_VERSION
+
+# Monitor progress
+watch 'kubectl get nodes -o wide -L cloud.google.com/gke-nodepool'
+
+# Verify
+gcloud container node-pools list --cluster CLUSTER_NAME --zone ZONE
+kubectl get pods -A | grep -v Running | grep -v Completed
+```
+
+## Maintenance window configuration
+
+```bash
+# Set recurring maintenance window
+gcloud container clusters update CLUSTER_NAME \
+  --zone ZONE \
+  --maintenance-window-start YYYY-MM-DDTHH:MM:SSZ \
+  --maintenance-window-end YYYY-MM-DDTHH:MM:SSZ \
+  --maintenance-window-recurrence "FREQ=WEEKLY;BYDAY=SA"
+
+# Add maintenance exclusion (up to 30 days)
+gcloud container clusters update CLUSTER_NAME \
+  --zone ZONE \
+  --add-maintenance-exclusion-name "EXCLUSION_NAME" \
+  --add-maintenance-exclusion-start-time START_TIME \
+  --add-maintenance-exclusion-end-time END_TIME
+```
+
+## Rollback guidance
+
+Control plane downgrades are rare. Patch downgrades can be performed by the customer; minor version downgrades require GKE support. Node pool downgrades require creating a new pool at the old version and migrating workloads.
+
+```bash
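+# Cancel in-progress node pool upgrade (if needed): look up OPERATION_ID, then cancel. GKE will finish the current node and stop
+gcloud container operations list --zone ZONE --filter="targetLink~CLUSTER_NAME"
+gcloud container operations cancel OPERATION_ID --zone ZONE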
+
+# Create replacement node pool at previous version (if rollback needed)
+gcloud container node-pools create NODE_POOL_NAME-rollback \
+  --cluster CLUSTER_NAME \
+  --zone ZONE \
+  --cluster-version PREVIOUS_VERSION \
+  --num-nodes NUM_NODES \
+  --machine-type MACHINE_TYPE
+
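+# Cordon the old pool, then drain it so workloads move to the replacement pool (adjust drain flags to your workloads)
+kubectl cordon -l cloud.google.com/gke-nodepool=NODE_POOL_NAME && kubectl drain -l cloud.google.com/gke-nodepool=NODE_POOL_NAME --ignore-daemonsets --delete-emptydir-data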
+```
diff --git a/skills/cloud/gke-upgrades/references/troubleshooting.md b/skills/cloud/gke-upgrades/references/troubleshooting.md
new file mode 100644
index 0000000000..4bab7c51af
--- /dev/null
+++ b/skills/cloud/gke-upgrades/references/troubleshooting.md
@@ -0,0 +1,118 @@
+# Troubleshooting GKE Upgrade Issues
+
+## Diagnostic flowchart
+
+When an upgrade is stuck or failing, work through these checks in order. Each section has the diagnosis command, what to look for, and the fix.
+
+## 1. PDB blocking drain (most common)
+
+**Diagnose:**
+```bash
+kubectl get pdb -A -o wide
+# Look for ALLOWED DISRUPTIONS = 0
+kubectl describe pdb PDB_NAME -n NAMESPACE
+```
+
+**Fix — temporarily relax the PDB:**
+```bash
+# Option A: Allow all disruptions temporarily
+kubectl patch pdb PDB_NAME -n NAMESPACE \
+  -p '{"spec":{"minAvailable":null,"maxUnavailable":"100%"}}'
+
+# Option B: Back up, then edit a copy so the backup stays intact for the restore
+kubectl get pdb PDB_NAME -n NAMESPACE -o yaml > pdb-backup.yaml
+# Copy to pdb-edited.yaml, edit minAvailable/maxUnavailable there, then:
+kubectl apply -f pdb-edited.yaml
+```
+
+Restore original PDB after upgrade completes.
+
+## 2. Resource constraints (no room for pods)
+
+**Diagnose:**
+```bash
+kubectl get pods -A | grep Pending
+kubectl get events -A --field-selector reason=FailedScheduling
+kubectl top nodes
+kubectl describe nodes | grep -A 5 "Allocated resources"
+```
+
+**Fix — increase surge capacity:**
+```bash
+gcloud container node-pools update NODE_POOL_NAME \
+  --cluster CLUSTER_NAME \
+  --zone ZONE \
+  --max-surge-upgrade 2 \
+  --max-unavailable-upgrade 0
+```
+
+Or scale down non-critical workloads temporarily.
+
+## 3. Bare pods blocking drain
+
+**Diagnose:**
+```bash
+kubectl get pods -A -o json | \
+  jq -r '.items[] | select(.metadata.ownerReferences | length == 0) | "\(.metadata.namespace)/\(.metadata.name)"'
+```
+
+**Fix:** Delete bare pods (they won't reschedule anyway) or wrap in Deployments.
+
+## 4. Admission webhooks rejecting pod creation
+
+**Diagnose:**
+```bash
+kubectl get validatingwebhookconfigurations
+kubectl get mutatingwebhookconfigurations
+# Check for webhooks matching broad API groups
+kubectl describe validatingwebhookconfigurations WEBHOOK_NAME
+```
+
+**Fix — temporarily disable problematic webhook:**
+```bash
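+# Back up the configuration, then delete it temporarily (alternatively, set the webhook's failurePolicy field to Ignore)
+kubectl get validatingwebhookconfigurations WEBHOOK_NAME -o yaml > webhook-backup.yaml && kubectl delete validatingwebhookconfigurations WEBHOOK_NAME
+# Re-create after upgrade: kubectl apply -f webhook-backup.yaml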
+```
+
+## 5. PVC attachment issues
+
+**Diagnose:**
+```bash
+kubectl get pvc -A | grep -v Bound
+kubectl get events -A --field-selector reason=FailedAttachVolume
+```
+
+**Fix:** Check if volumes are zone-locked. A zonal persistent disk can only attach to a node in the same zone, so in regional clusters the replacement node must be in the PV's zone. Consider migrating workloads to already-upgraded nodes in that zone.
+
+## 6. Long termination grace periods
+
+**Diagnose:**
+```bash
+kubectl get pods -A -o json | \
+  jq '.items[] | select(.spec.terminationGracePeriodSeconds > 120) | {ns:.metadata.namespace, name:.metadata.name, grace:.spec.terminationGracePeriodSeconds}'
+```
+
+**Fix:** Reduce `terminationGracePeriodSeconds` in the workload spec if possible. GKE waits up to 1 hour for pod eviction during surge upgrades.
+
+## 7. Upgrade operation stuck at GKE level
+
+**Diagnose:**
+```bash
+gcloud container operations list --zone ZONE --filter="operationType=UPGRADE_NODES AND targetLink~CLUSTER_NAME"
+```
+
+**Fix:** If the operation shows no progress for >2 hours after resolving pod-level issues, contact GKE support with cluster name, zone, and operation ID.
+
+## Validation after applying a fix
+
+```bash
+# Monitor node upgrade progress (replace CURRENT_VERSION and TARGET_VERSION with the actual version strings)
+watch 'kubectl get nodes -o wide | grep -E "NAME|CURRENT_VERSION|TARGET_VERSION"'
+
+# Check no pods stuck
+kubectl get pods -A | grep -E "Terminating|Pending"
+
+# Confirm upgrade resuming (most recent operation first)
+gcloud container operations list --zone ZONE --filter="targetLink~CLUSTER_NAME" --sort-by=~startTime --limit=1
+```