From 18ba1c18c9fc60b10e216738876c963af6c42cfe Mon Sep 17 00:00:00 2001
From: Cloud IX Team
Date: Fri, 15 May 2026 07:49:49 -0700
Subject: [PATCH] Add GKE Upgrades agent skill

PiperOrigin-RevId: 916002403
---
 skills/cloud/gke-upgrades/SKILL.md            | 110 ++++++++++++++++
 .../gke-upgrades/references/checklists.md     |  74 +++++++++++
 .../references/runbook-template.md            | 102 +++++++++++++++
 .../references/troubleshooting.md             | 118 ++++++++++++++++++
 4 files changed, 404 insertions(+)
 create mode 100644 skills/cloud/gke-upgrades/SKILL.md
 create mode 100644 skills/cloud/gke-upgrades/references/checklists.md
 create mode 100644 skills/cloud/gke-upgrades/references/runbook-template.md
 create mode 100644 skills/cloud/gke-upgrades/references/troubleshooting.md

diff --git a/skills/cloud/gke-upgrades/SKILL.md b/skills/cloud/gke-upgrades/SKILL.md
new file mode 100644
index 0000000000..6c8d25592a
--- /dev/null
+++ b/skills/cloud/gke-upgrades/SKILL.md
@@ -0,0 +1,110 @@
+---
+name: gke-upgrades
+description: >
+  Plans, executes, and validates Google Kubernetes Engine (GKE) cluster upgrades
+  and maintenance operations for both Standard and Autopilot clusters. Produces
+  upgrade plans, pre/post-upgrade checklists, maintenance runbooks with gcloud
+  commands, release channel strategy, and troubleshooting guides. Handles node
+  pool upgrade strategies (surge, blue-green), version compatibility, PDB
+  management, and workload-specific concerns (stateful, GPU, operators). Use this
+  skill whenever the user mentions GKE upgrades, Kubernetes version bumps, node
+  pool maintenance, GKE patching, cluster version management, release channel
+  selection, maintenance windows, surge upgrades, stuck upgrades, or any GKE
+  lifecycle management task, even casual mentions like "we need to upgrade our
+  clusters" or "plan our next GKE maintenance" or "our upgrade is stuck."
+---
+
+# GKE Upgrades & Maintenance
+
+Produce clear, actionable documents (upgrade plans, runbooks, or checklists) tailored to the user's environment. Output should be specific to their cluster mode, release channel, version, and workload types rather than generic advice.
+
+Always frame guidance around the auto-upgrade model: auto-upgrade with maintenance windows and exclusions is the preferred control mechanism. Manual upgrades are for exceptions.
+
+## Context Gathering
+
+Before producing any upgrade artifact, establish:
+- **Cluster mode** -- Standard or Autopilot? (Autopilot has no node pool management, mandatory resource requests, no SSH)
+- **Current and target versions** -- Nodes must be within 2 minor versions of the control plane.
+- **Release channel** -- Rapid, Regular, Stable, or Extended.
+- **Environment topology** -- Single vs multi-cluster, dev/staging/prod tiers.
+- **Workload sensitivity** -- StatefulSets, databases, GPU, and long-running batch workloads need special handling.
+
+If the user provides these upfront, skip straight to the deliverable. If they're vague, fill in reasonable defaults and flag assumptions.
+
+## Core Principles
+
+1. **Sequential control plane, skip-level node pools** -- Control plane upgrades are sequential (N → N+1 → N+2). Node pools support skip-level (N+2) upgrades. GKE supports a 2-step control plane minor upgrade in which step 1 can be rolled back.
+2. **Control plane first** -- The control plane must be upgraded before node pools. Nodes can trail by up to 2 minor versions.
+3. **Environment progression** -- Always upgrade dev/staging before production. Use release channels to enforce this: Rapid → Regular → Stable.
+4. **Workload-aware** -- Upgrade strategy depends on what's running (stateless, stateful, GPU, batch).
+5. **Release channels first** -- Always recommend release channels with maintenance exclusions. Never recommend "No channel" as a first option.
+6. **Rollback** -- Customers can perform control plane patch downgrades themselves. Control plane minor downgrades require GKE support. Node pools can be re-created at a different version.
+
+## Release Channels
+
+| Channel | Best for | SLA | Support |
+|---------|----------|-----|---------|
+| **Rapid** | Dev/test, early feature access | No upgrade stability SLA | 14 months |
+| **Regular** (default) | Most production | Full SLA | 14 months |
+| **Stable** | Mission-critical, stability-first | Full SLA | 14 months |
+| **Extended** | Compliance, control over end-of-support (EoS) enforcement | Full SLA | Up to 24 months (extra cost) |
+
+Common multi-environment strategy: Dev→Rapid, Staging→Regular, Prod→Stable or Regular.
+
+## Maintenance Windows & Exclusions
+
+Configure maintenance windows to control auto-upgrade timing.
+
+**Exclusion types:**
+- **"No upgrades"**: Blocks everything for up to 30 days (for example Black Friday/Cyber Monday or code freezes).
+- **"No minor or node upgrades"**: Blocks minor and node upgrades, allows control plane patches. Can last until EoS.
+- **"No minor upgrades"**: Blocks minor upgrades, allows patches and node upgrades. Can last until EoS.
+
+Recommend cluster-level exclusions to prevent skew. Use `--add-maintenance-exclusion-until-end-of-support` for persistent exclusions.
+
+## Upgrade Planning
+
+When asked to plan an upgrade, produce a structured document covering:
+- Version compatibility (breaking changes, deprecated APIs)
+- Upgrade path (sequential minor version upgrades)
+- Node pool upgrade strategy (Standard only)
+- Workload readiness (PDBs, resource requests)
+
+### Node Pool Strategy (Standard Only)
+
+Recommend surge upgrade as the default, with per-pool settings:
+- **Stateless**: Higher `maxSurge` (2-3) for speed, `maxUnavailable=0` for safety.
+- **Stateful/DB**: `maxSurge=1, maxUnavailable=0` (conservative).
+- **GPU (fixed reservation)**: `maxSurge=0, maxUnavailable=1` (no surge capacity).
+- **Large (50+ nodes)**: `maxSurge=20, maxUnavailable=0` (max parallelism).
+
+Recommend blue-green upgrade for mission-critical apps needing fast rollback or strict validation. Use autoscaled blue-green for long-running batch or disruption-sensitive workloads.
+
+For standard command sequences and runbook templates, see `references/runbook-template.md`.
+
+### Large-Scale AI/ML Clusters (GPU/TPU)
+
+- GPU VMs do not support live migration, so upgrades force pod restarts.
+- H100/A100 pools typically use fixed reservations with no surge capacity. Use `maxSurge=0, maxUnavailable=1`.
+- The GPU driver version is coupled to the target node version; verify CUDA compatibility.
+- Use maintenance exclusions during active training campaigns. Cordon GPU nodes and wait for jobs to complete.
+- TPU slices are recreated atomically (not rolling); maintenance on one slice restarts all slices in the environment.
+
+## Checklists
+
+Produce checklists as copyable markdown with checkboxes. See `references/checklists.md` for the full pre-upgrade and post-upgrade checklist templates. Adapt them to the user's environment.
+
+## Maintenance Runbooks
+
+Produce step-by-step runbooks with actual `gcloud` and `kubectl` commands. See `references/runbook-template.md` for the standard command sequences.
+
+## Troubleshooting
+
+When a user reports a stuck or failing upgrade, walk through diagnosis systematically in this order:
+1. PDB blocking drain (check `kubectl get pdb -A`)
+2. Resource constraints (pods pending, increase maxSurge)
+3. Bare pods (must delete or wrap in controllers)
+4. Admission webhooks rejecting pod creation
+5. PVC attachment issues (volume migration)
+
+For a detailed diagnostic flowchart and fix procedures, see `references/troubleshooting.md`.
diff --git a/skills/cloud/gke-upgrades/references/checklists.md b/skills/cloud/gke-upgrades/references/checklists.md
new file mode 100644
index 0000000000..4131f905ad
--- /dev/null
+++ b/skills/cloud/gke-upgrades/references/checklists.md
@@ -0,0 +1,74 @@
+# Checklist Templates
+
+Adapt these to the user's environment. Fill in cluster names and versions, and remove items that don't apply.
+
+## Pre-Upgrade Checklist
+
+```
+Pre-Upgrade Checklist
+- [ ] Cluster: ___ | Mode: Standard / Autopilot | Channel: ___
+- [ ] Current version: ___ | Target version: ___
+
+Compatibility
+- [ ] Target version available in release channel (`gcloud container get-server-config --zone ZONE --format="yaml(channels)"`)
+- [ ] No deprecated API usage (check GKE deprecation insights dashboard)
+- [ ] GKE release notes reviewed for breaking changes between current → target
+- [ ] Node version skew within 2 minor versions of control plane
+- [ ] Third-party operators/controllers compatible with target version
+- [ ] Admission webhooks tested against target version
+
+Workload Readiness
+- [ ] PDBs configured for critical workloads (not overly restrictive)
+- [ ] No bare pods (all pods managed by controllers)
+- [ ] terminationGracePeriodSeconds adequate for graceful shutdown
+- [ ] StatefulSet PV backups completed, reclaim policies verified
+- [ ] Resource requests/limits set on all containers (mandatory for Autopilot)
+- [ ] GPU driver compatibility confirmed with target node image (if applicable)
+- [ ] Postgres/database operator compatibility verified (if applicable)
+
+Infrastructure (Standard only)
+- [ ] Node pool upgrade strategy chosen (surge / blue-green)
+- [ ] Surge settings configured per pool: maxSurge=___ maxUnavailable=___
+- [ ] Sufficient compute quota for surge nodes
+- [ ] Maintenance window configured (off-peak hours)
+- [ ] Maintenance exclusions set for freeze periods (if applicable)
+
+Ops Readiness
+- [ ] Monitoring and alerting active (Cloud Monitoring / Prometheus)
+- [ ] Baseline metrics captured (error rates, latency, throughput)
+- [ ] Upgrade window communicated to stakeholders
+- [ ] Rollback plan documented
+- [ ] On-call team aware and available
+```
+
+## Post-Upgrade Checklist
+
+```
+Post-Upgrade Checklist
+
+Cluster Health
+- [ ] Control plane at target version: `gcloud container clusters describe CLUSTER --zone ZONE --format="value(currentMasterVersion)"`
+- [ ] All node pools at target version: `gcloud container node-pools list --cluster CLUSTER --zone ZONE`
+- [ ] All nodes Ready: `kubectl get nodes`
+- [ ] System pods healthy: `kubectl get pods -n kube-system`
+- [ ] No PDBs left at zero allowed disruptions: `kubectl get pdb --all-namespaces`
+
+Workload Health
+- [ ] All deployments at desired replica count: `kubectl get deployments -A`
+- [ ] No CrashLoopBackOff or Pending pods: `kubectl get pods -A | grep -v Running | grep -v Completed`
+- [ ] StatefulSets fully ready: `kubectl get statefulsets -A`
+- [ ] Ingress/load balancers responding
+- [ ] Application health checks and smoke tests passing
+
+Observability
+- [ ] Metrics pipeline active, no collection gaps
+- [ ] Logs flowing to aggregation
+- [ ] Error rates within pre-upgrade baseline
+- [ ] Latency (p50/p95/p99) within pre-upgrade baseline
+
+Cleanup
+- [ ] Old node pools removed (if blue-green)
+- [ ] Surge quota released (automatic for surge upgrades)
+- [ ] Upgrade documented in changelog
+- [ ] Lessons learned captured
+```
diff --git a/skills/cloud/gke-upgrades/references/runbook-template.md b/skills/cloud/gke-upgrades/references/runbook-template.md
new file mode 100644
index 0000000000..47e4296730
--- /dev/null
+++ b/skills/cloud/gke-upgrades/references/runbook-template.md
@@ -0,0 +1,102 @@
+# Runbook Command Templates
+
+Standard command sequences for GKE upgrades. Replace the upper-case placeholders, such as `CLUSTER_NAME`, `ZONE`, `TARGET_VERSION`, and `NODE_POOL_NAME`.
+
+## Pre-flight
+
+```bash
+# Current versions
+gcloud container clusters describe CLUSTER_NAME \
+  --zone ZONE \
+  --format="table(name, currentMasterVersion, nodePools[].version)"
+
+# Available versions for channel
+gcloud container get-server-config --zone ZONE \
+  --format="yaml(channels)"
+
+# Deprecated API usage (this gauge lists deprecated APIs that have been requested)
+kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis
+
+# Cluster health
+kubectl get nodes
+kubectl get pods -A | grep -v Running | grep -v Completed
+```
+
+## Control plane upgrade
+
+```bash
+gcloud container clusters upgrade CLUSTER_NAME \
+  --zone ZONE \
+  --master \
+  --cluster-version TARGET_VERSION
+
+# Verify (wait ~10-15 min)
+gcloud container clusters describe CLUSTER_NAME \
+  --zone ZONE \
+  --format="value(currentMasterVersion)"
+
+kubectl get pods -n kube-system
+```
+
+## Node pool upgrade (Standard only)
+
+```bash
+# Configure surge settings
+gcloud container node-pools update NODE_POOL_NAME \
+  --cluster CLUSTER_NAME \
+  --zone ZONE \
+  --max-surge-upgrade MAX_SURGE \
+  --max-unavailable-upgrade MAX_UNAVAILABLE
+
+# Upgrade (node pools are upgraded through `clusters upgrade --node-pool`)
+gcloud container clusters upgrade CLUSTER_NAME \
+  --node-pool NODE_POOL_NAME \
+  --zone ZONE \
+  --cluster-version TARGET_VERSION
+
+# Monitor progress
+watch 'kubectl get nodes -o wide -L cloud.google.com/gke-nodepool'
+
+# Verify
+gcloud container node-pools list --cluster CLUSTER_NAME --zone ZONE
+kubectl get pods -A | grep -v Running | grep -v Completed
+```
+
+## Maintenance window configuration
+
+```bash
+# Set recurring maintenance window
+gcloud container clusters update CLUSTER_NAME \
+  --zone ZONE \
+  --maintenance-window-start YYYY-MM-DDTHH:MM:SSZ \
+  --maintenance-window-end YYYY-MM-DDTHH:MM:SSZ \
+  --maintenance-window-recurrence "FREQ=WEEKLY;BYDAY=SA"
+
+# Add maintenance exclusion (up to 30 days)
+gcloud container clusters update CLUSTER_NAME \
+  --zone ZONE \
+  --add-maintenance-exclusion-name "EXCLUSION_NAME" \
+  --add-maintenance-exclusion-start START_TIME \
+  --add-maintenance-exclusion-end END_TIME
+```
+
+## Rollback guidance
+
+Control plane minor downgrades are rare and require GKE support involvement. Node pool downgrades require creating a new pool at the old version and migrating workloads.
+
+```bash
+# Cancel in-progress node pool upgrade (if needed); GKE finishes the current node, then stops
+gcloud container operations list --zone ZONE --filter="operationType=UPGRADE_NODES AND targetLink~CLUSTER_NAME"
+gcloud container operations cancel OPERATION_ID --zone ZONE
+
+# Create replacement node pool at previous version (if rollback needed)
+gcloud container node-pools create NODE_POOL_NAME-rollback \
+  --cluster CLUSTER_NAME \
+  --zone ZONE \
+  --node-version PREVIOUS_VERSION \
+  --num-nodes NUM_NODES \
+  --machine-type MACHINE_TYPE
+
+# Cordon old pool and migrate workloads
+kubectl cordon -l cloud.google.com/gke-nodepool=NODE_POOL_NAME
+```
diff --git a/skills/cloud/gke-upgrades/references/troubleshooting.md b/skills/cloud/gke-upgrades/references/troubleshooting.md
new file mode 100644
index 0000000000..4bab7c51af
--- /dev/null
+++ b/skills/cloud/gke-upgrades/references/troubleshooting.md
@@ -0,0 +1,118 @@
+# Troubleshooting GKE Upgrade Issues
+
+## Diagnostic flowchart
+
+When an upgrade is stuck or failing, work through these checks in order. Each section has the diagnosis command, what to look for, and the fix.
+
+## 1. PDB blocking drain (most common)
+
+**Diagnose:**
+```bash
+kubectl get pdb -A -o wide
+# Look for ALLOWED DISRUPTIONS = 0
+kubectl describe pdb PDB_NAME -n NAMESPACE
+```
+
+**Fix: temporarily relax the PDB**
+```bash
+# Option A: Allow all disruptions temporarily
+kubectl patch pdb PDB_NAME -n NAMESPACE \
+  -p '{"spec":{"minAvailable":null,"maxUnavailable":"100%"}}'
+
+# Option B: Back up, then edit the live object
+kubectl get pdb PDB_NAME -n NAMESPACE -o yaml > pdb-backup.yaml
+# Relax minAvailable/maxUnavailable in the editor; leave pdb-backup.yaml untouched as the record of the original
+kubectl edit pdb PDB_NAME -n NAMESPACE
+```
+
+Restore the original PDB after the upgrade completes.
+
+## 2. Resource constraints (no room for pods)
+
+**Diagnose:**
+```bash
+kubectl get pods -A | grep Pending
+kubectl get events -A --field-selector reason=FailedScheduling
+kubectl top nodes
+kubectl describe nodes | grep -A 5 "Allocated resources"
+```
+
+**Fix: increase surge capacity**
+```bash
+gcloud container node-pools update NODE_POOL_NAME \
+  --cluster CLUSTER_NAME \
+  --zone ZONE \
+  --max-surge-upgrade 2 \
+  --max-unavailable-upgrade 0
+```
+
+Or scale down non-critical workloads temporarily.
+
+## 3. Bare pods blocking drain
+
+**Diagnose:**
+```bash
+kubectl get pods -A -o json | \
+  jq -r '.items[] | select(.metadata.ownerReferences | length == 0) | "\(.metadata.namespace)/\(.metadata.name)"'
+```
+
+**Fix:** Delete bare pods (they won't reschedule anyway) or wrap them in Deployments.
+
+## 4. Admission webhooks rejecting pod creation
+
+**Diagnose:**
+```bash
+kubectl get validatingwebhookconfigurations
+kubectl get mutatingwebhookconfigurations
+# Check for webhooks matching broad API groups
+kubectl describe validatingwebhookconfigurations WEBHOOK_NAME
+```
+
+**Fix: temporarily disable the problematic webhook**
+```bash
+# Set failurePolicy to Ignore on the webhook, or (after saving its manifest) delete it temporarily
+kubectl delete validatingwebhookconfigurations WEBHOOK_NAME
+# Re-create after upgrade
+```
+
+## 5. PVC attachment issues
+
+**Diagnose:**
+```bash
+kubectl get pvc -A | grep -v Bound
+kubectl get events -A --field-selector reason=FailedAttachVolume
+```
+
+**Fix:** Check whether volumes are zone-locked. In regional clusters, a PV may need to be in the same zone as the new node. Consider migrating workloads to already-upgraded nodes.
+
+## 6. Long termination grace periods
+
+**Diagnose:**
+```bash
+kubectl get pods -A -o json | \
+  jq '.items[] | select(.spec.terminationGracePeriodSeconds > 120) | {ns:.metadata.namespace, name:.metadata.name, grace:.spec.terminationGracePeriodSeconds}'
+```
+
+**Fix:** Reduce `terminationGracePeriodSeconds` in the workload spec if possible. GKE waits up to 1 hour for pod eviction during surge upgrades.
+
+## 7. Upgrade operation stuck at GKE level
+
+**Diagnose:**
+```bash
+gcloud container operations list --zone ZONE --filter="operationType=UPGRADE_NODES AND targetLink~CLUSTER_NAME"
+```
+
+**Fix:** If the operation shows no progress for more than 2 hours after pod-level issues are resolved, contact GKE support with the cluster name, zone, and operation ID.
+
+## Validation after applying a fix
+
+```bash
+# Monitor node upgrade progress (replace CURRENT_VERSION and TARGET_VERSION with the node versions)
+watch 'kubectl get nodes -o wide | grep -E "NAME|CURRENT_VERSION|TARGET_VERSION"'
+
+# Check no pods stuck
+kubectl get pods -A | grep -E "Terminating|Pending"
+
+# Confirm upgrade resuming
+gcloud container operations list --zone ZONE --filter="targetLink~CLUSTER_NAME" --sort-by="~startTime" --limit=1
+```
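
The pre-upgrade checklist asks that node version skew stay within 2 minor versions of the control plane. The following is a minimal sketch of how that check might be scripted, not part of the patch itself. The function names `minor_of`, `minor_skew`, and `skew_ok` are hypothetical helpers, and the sketch assumes both versions share the same major version and follow the `MAJOR.MINOR.PATCH-gke.BUILD` shape.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the "node version skew" checklist item.
# They are plain shell functions, not gcloud or kubectl features.

# Print the minor component of a version such as "1.31.5-gke.1023000".
minor_of() {
  local version="$1"
  local rest="${version#*.}"    # drop "MAJOR."  -> "31.5-gke.1023000"
  printf '%s\n' "${rest%%.*}"   # keep up to the next dot -> "31"
}

# Print how many minor versions the node trails the control plane.
# Assumes both versions have the same major version.
minor_skew() {
  local cp_minor node_minor
  cp_minor="$(minor_of "$1")"
  node_minor="$(minor_of "$2")"
  printf '%s\n' "$(( cp_minor - node_minor ))"
}

# Succeed only when the node is not newer than the control plane
# and trails it by at most 2 minor versions.
skew_ok() {
  local skew
  skew="$(minor_skew "$1" "$2")"
  [ "$skew" -ge 0 ] && [ "$skew" -le 2 ]
}
```

One way to use it would be to pass the control plane version printed by the `gcloud container clusters describe ... --format="value(currentMasterVersion)"` command as the first argument and a node pool version as the second, for example `skew_ok "$CP_VERSION" "$NODE_VERSION" || echo "skew too large"`, where both variables are placeholders you fill in.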