110 changes: 110 additions & 0 deletions skills/cloud/gke-upgrades/SKILL.md
@@ -0,0 +1,110 @@
---
name: gke-upgrades
description: >
Plans, executes, and validates Google Kubernetes Engine (GKE) cluster upgrades
and maintenance operations for both Standard and Autopilot clusters. Produces
upgrade plans, pre/post-upgrade checklists, maintenance runbooks with gcloud
commands, release channel strategy, and troubleshooting guides. Handles node
pool upgrade strategies (surge, blue-green), version compatibility, PDB
management, and workload-specific concerns (stateful, GPU, operators). Use this
skill whenever the user mentions GKE upgrades, Kubernetes version bumps, node
pool maintenance, GKE patching, cluster version management, release channel
selection, maintenance windows, surge upgrades, stuck upgrades, or any GKE
lifecycle management task — even casual mentions like "we need to upgrade our
clusters" or "plan our next GKE maintenance" or "our upgrade is stuck."
---

# GKE Upgrades & Maintenance

Produce clear, actionable documents — upgrade plans, runbooks, or checklists — tailored to the user's environment. Output should be specific to their cluster mode, release channel, version, and workload types rather than generic advice.

Always frame guidance around the auto-upgrade model: auto-upgrade with maintenance windows and exclusions is the preferred control mechanism. Manual upgrades are for exceptions.

## Context Gathering

Before producing any upgrade artifact, establish:
- **Cluster mode** — Standard or Autopilot? (Autopilot has no node pool management, mandatory resource requests, no SSH)
- **Current and target versions** — Node version skew must be within 2 minor versions of control plane.
- **Release channel** — Rapid, Regular, Stable, or Extended.
- **Environment topology** — Single vs multi-cluster, dev/staging/prod tiers.
- **Workload sensitivity** — StatefulSets, databases, GPU, long-running batch need special handling.

If the user provides these upfront, skip straight to the deliverable. If they're vague, fill in reasonable defaults and flag assumptions.

## Core Principles

1. **Sequential control plane, skip-level node pools** -- Control plane upgrades are sequential (N → N+1 → N+2). Node pools support skip-level (N+2) upgrades. GKE supports a 2-step CP minor upgrade where step 1 is rollbackable.
2. **Control plane first** -- Control plane must be upgraded before node pools. Nodes can trail by up to 2 minor versions.
3. **Environment progression** -- Always upgrade dev/staging before production. Use release channels to enforce this: Rapid → Regular → Stable.
4. **Workload-aware** -- Upgrade strategy depends on what's running (stateless, stateful, GPU, batch).
5. **Release channels first** -- Always recommend release channels with maintenance exclusions. Never recommend "No channel" as a first option.
6. **Rollback** -- CP patch downgrades are customer-doable. CP minor downgrades require GKE support. Node pools can be re-created at a different version.
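
The skew rule in principle 2 can be checked mechanically before any node pool work starts. The following is a minimal sketch, assuming GKE version strings of the form `MAJOR.MINOR.PATCH-gke.N`; `minor_of`, `skew_ok`, and the `MAX_SKEW` variable are hypothetical names introduced for illustration, not part of any GKE tooling.

```bash
# Hypothetical helpers (illustrative names, not GKE tooling).

# Extract the minor number from a version such as 1.31.4-gke.1183000.
minor_of() {
  local version="$1" rest
  rest="${version#*.}"   # drop the major component: "31.4-gke.1183000"
  echo "${rest%%.*}"     # keep what precedes the next dot: "31"
}

# Succeed when the nodes are not newer than the control plane and trail it
# by at most MAX_SKEW minor versions (default 2, per the principle above).
skew_ok() {
  local cp_minor node_minor skew max_skew="${MAX_SKEW:-2}"
  cp_minor="$(minor_of "$1")"
  node_minor="$(minor_of "$2")"
  skew=$(( cp_minor - node_minor ))
  [ "$skew" -ge 0 ] && [ "$skew" -le "$max_skew" ]
}
```

A call such as `skew_ok "$CONTROL_PLANE_VERSION" "$NODE_VERSION" || echo "upgrade nodes first"` could gate a runbook step; the two variables would come from `gcloud container clusters describe`.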

## Release Channels

| Channel | Best for | SLA | Support |
|---------|----------|-----|---------|
| **Rapid** | Dev/test, early feature access | No upgrade stability SLA | 14 months |
| **Regular** (default) | Most production | Full SLA | 14 months |
| **Stable** | Mission-critical, stability-first | Full SLA | 14 months |
| **Extended** | Compliance, EoS enforcement control | Full SLA | Up to 24 months (extra cost) |

Common multi-environment strategy: Dev→Rapid, Staging→Regular, Prod→Stable or Regular.
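
As an illustration of the tier-to-channel strategy above, the sketch below maps an environment tier to a channel and prints the matching enrollment command so it can be reviewed before anyone runs it. `channel_for_env` and `enroll_cmd` are hypothetical helper names, and the tier labels `dev`, `staging`, and `prod` are assumptions; production is mapped to Stable here, although Regular is an equally valid choice.

```bash
# Hypothetical helper: map an environment tier to a release channel
# (Dev -> Rapid, Staging -> Regular, Prod -> Stable).
channel_for_env() {
  case "$1" in
    dev)     echo rapid ;;
    staging) echo regular ;;
    prod)    echo stable ;;
    *)       echo "unknown environment tier: $1" >&2; return 1 ;;
  esac
}

# Print (rather than run) the enrollment command for review.
enroll_cmd() {
  local cluster="$1" zone="$2" tier="$3" channel
  channel="$(channel_for_env "$tier")" || return 1
  echo "gcloud container clusters update $cluster --zone $zone --release-channel $channel"
}
```

For example, `enroll_cmd CLUSTER_NAME ZONE staging` prints a `gcloud container clusters update` command with `--release-channel regular`.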

## Maintenance Windows & Exclusions

Configure maintenance windows to control auto-upgrade timing.

**Exclusion types:**
- **"No upgrades"**: Blocks everything for up to 30 days (for example Black Friday/Cyber Monday or code freezes).
- **"No minor or node upgrades"**: Blocks minor and node upgrades, allows CP patches. Up to EoS.
- **"No minor upgrades"**: Blocks minor upgrades, allows patches and node upgrades. Up to EoS.

Recommend cluster-level exclusions to prevent skew. Use `--add-maintenance-exclusion-until-end-of-support` for persistent exclusions.
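
The three exclusion types correspond to values of the `--add-maintenance-exclusion-scope` flag. The sketch below is one way to turn the type names used above into a reviewable command for a time-bounded exclusion; `exclusion_scope` and `exclusion_cmd` are hypothetical helper names, and the timestamps are assumed to be RFC 3339 strings supplied by the caller.

```bash
# Hypothetical helper: translate an exclusion type name into the value
# passed to --add-maintenance-exclusion-scope.
exclusion_scope() {
  case "$1" in
    "No upgrades")               echo no_upgrades ;;
    "No minor or node upgrades") echo no_minor_or_node_upgrades ;;
    "No minor upgrades")         echo no_minor_upgrades ;;
    *) echo "unknown exclusion type: $1" >&2; return 1 ;;
  esac
}

# Print the command for a time-bounded exclusion instead of running it.
exclusion_cmd() {
  local cluster="$1" zone="$2" name="$3" kind="$4" start="$5" end="$6" scope
  scope="$(exclusion_scope "$kind")" || return 1
  echo "gcloud container clusters update $cluster --zone $zone" \
    "--add-maintenance-exclusion-name $name" \
    "--add-maintenance-exclusion-scope $scope" \
    "--add-maintenance-exclusion-start $start" \
    "--add-maintenance-exclusion-end $end"
}
```

The helper does not enforce the 30-day limit on "No upgrades" exclusions; that check is left to the caller or to GKE itself.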

## Upgrade Planning

When asked to plan an upgrade, produce a structured document covering:
- Version compatibility (breaking changes, deprecated APIs)
- Upgrade path (sequential minor version upgrades)
- Node pool upgrade strategy (Standard only)
- Workload readiness (PDBs, resource requests)
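
To make the sequential upgrade path concrete, a plan can enumerate every minor version the control plane passes through. This is a minimal sketch, assuming both arguments are plain `MAJOR.MINOR` strings with the same major version; `upgrade_path` is a hypothetical helper name.

```bash
# Hypothetical helper: list each minor version the control plane must pass
# through, one per line. Example call: upgrade_path 1.28 1.31
upgrade_path() {
  local major="${1%%.*}" from="${1#*.}" to="${2#*.}" minor
  if [ "$to" -le "$from" ]; then
    echo "target must be newer than current" >&2
    return 1
  fi
  minor=$(( from + 1 ))
  while [ "$minor" -le "$to" ]; do
    echo "$major.$minor"
    minor=$(( minor + 1 ))
  done
}
```

Each printed line corresponds to one control plane upgrade step; the exact patch version for each step still has to be taken from the release channel's available versions.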

### Node Pool Strategy (Standard Only)

Recommend surge upgrade as the default, with per-pool settings:
- **Stateless**: Higher `maxSurge` (2-3) for speed, `maxUnavailable=0` for safety.
- **Stateful/DB**: `maxSurge=1, maxUnavailable=0` (conservative).
- **GPU (fixed reservation)**: `maxSurge=0, maxUnavailable=1` (no surge capacity).
- **Large (50+ nodes)**: `maxSurge=20, maxUnavailable=0` (max parallelism).
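
The per-pool defaults above can be expressed as a small lookup that emits the corresponding `gcloud container node-pools update` flags. This is a sketch only: `surge_settings` and the profile names `stateless`, `stateful`, `gpu`, and `large` are hypothetical, and the stateless profile uses `maxSurge=2` as one value from the suggested 2-3 range.

```bash
# Hypothetical helper: print surge flags for a workload profile, following
# the per-pool recommendations listed above.
surge_settings() {
  case "$1" in
    stateless) echo "--max-surge-upgrade 2 --max-unavailable-upgrade 0" ;;
    stateful)  echo "--max-surge-upgrade 1 --max-unavailable-upgrade 0" ;;
    gpu)       echo "--max-surge-upgrade 0 --max-unavailable-upgrade 1" ;;
    large)     echo "--max-surge-upgrade 20 --max-unavailable-upgrade 0" ;;
    *)         echo "unknown workload profile: $1" >&2; return 1 ;;
  esac
}
```

In a runbook the output could be appended to the update command, for example `gcloud container node-pools update NODE_POOL_NAME --cluster CLUSTER_NAME --zone ZONE $(surge_settings gpu)`.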

Recommend blue-green upgrade for mission-critical apps needing fast rollback or strict validation. Use autoscaled blue-green for long-running batch or disruption-sensitive workloads.

For standard command sequences and runbook templates, see `references/runbook-template.md`.

### Large-Scale AI/ML Clusters (GPU/TPU)

- GPU VMs do not support live migration — upgrades force pod restart.
- H100/A100 typically use fixed reservations with no surge capacity. Use `maxSurge=0, maxUnavailable=1`.
- GPU driver is coupled with target node version; verify CUDA compatibility.
- Use maintenance exclusions during active training campaigns. Cordon GPU nodes and wait for jobs to complete.
- TPU slices are recreated atomically (not rolling); maintenance on one slice restarts all slices in the environment.

## Checklists

Produce checklists as copyable markdown with checkboxes. See `references/checklists.md` for the full pre-upgrade and post-upgrade checklist templates. Adapt them to the user's environment.

## Maintenance Runbooks

Produce step-by-step runbooks with actual `gcloud` and `kubectl` commands. See `references/runbook-template.md` for the standard command sequences.

## Troubleshooting

When a user reports a stuck or failing upgrade, walk through diagnosis systematically in this order:
1. PDB blocking drain (check `kubectl get pdb -A`)
2. Resource constraints (pods pending, increase maxSurge)
3. Bare pods (must delete or wrap in controllers)
4. Admission webhooks rejecting pod creation
5. PVC attachment issues (volume migration)

For a detailed diagnostic flowchart and fix procedures, see `references/troubleshooting.md`.
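
For the first diagnostic step, the PDBs most likely to block a drain are those that currently allow zero disruptions. The filter below is a sketch that reads `kubectl get pdb -A --no-headers` output on standard input; `blocking_pdbs` is a hypothetical helper name, and the column order (NAMESPACE, NAME, MIN AVAILABLE, MAX UNAVAILABLE, ALLOWED DISRUPTIONS, AGE) is an assumption about kubectl's default table output.

```bash
# Hypothetical helper: print NAMESPACE/NAME for each PDB whose
# ALLOWED DISRUPTIONS column (assumed to be field 5) is 0.
blocking_pdbs() {
  awk 'NF >= 5 && $5 == 0 { print $1 "/" $2 }'
}
```

Typical usage would be `kubectl get pdb -A --no-headers | blocking_pdbs`. An empty result suggests moving on to the next item in the diagnosis order.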
74 changes: 74 additions & 0 deletions skills/cloud/gke-upgrades/references/checklists.md
@@ -0,0 +1,74 @@
# Checklist Templates

Adapt these to the user's environment. Fill in cluster names, versions, and remove items that don't apply.

## Pre-Upgrade Checklist

```
Pre-Upgrade Checklist
- [ ] Cluster: ___ | Mode: Standard / Autopilot | Channel: ___
- [ ] Current version: ___ | Target version: ___

Compatibility
- [ ] Target version available in release channel (`gcloud container get-server-config --zone ZONE --format="yaml(channels)"`)
- [ ] No deprecated API usage (check GKE deprecation insights dashboard)
- [ ] GKE release notes reviewed for breaking changes between current → target
- [ ] Node version skew within 2 minor versions of control plane
- [ ] Third-party operators/controllers compatible with target version
- [ ] Admission webhooks tested against target version

Workload Readiness
- [ ] PDBs configured for critical workloads (not overly restrictive)
- [ ] No bare pods — all managed by controllers
- [ ] terminationGracePeriodSeconds adequate for graceful shutdown
- [ ] StatefulSet PV backups completed, reclaim policies verified
- [ ] Resource requests/limits set on all containers (mandatory for Autopilot)
- [ ] GPU driver compatibility confirmed with target node image (if applicable)
- [ ] Postgres/database operator compatibility verified (if applicable)

Infrastructure (Standard only)
- [ ] Node pool upgrade strategy chosen (surge / blue-green)
- [ ] Surge settings configured per pool: maxSurge=___ maxUnavailable=___
- [ ] Sufficient compute quota for surge nodes
- [ ] Maintenance window configured (off-peak hours)
- [ ] Maintenance exclusions set for freeze periods (if applicable)

Ops Readiness
- [ ] Monitoring and alerting active (Cloud Monitoring / Prometheus)
- [ ] Baseline metrics captured (error rates, latency, throughput)
- [ ] Upgrade window communicated to stakeholders
- [ ] Rollback plan documented
- [ ] On-call team aware and available
```

## Post-Upgrade Checklist

```
Post-Upgrade Checklist

Cluster Health
- [ ] Control plane at target version: `gcloud container clusters describe CLUSTER --zone ZONE --format="value(currentMasterVersion)"`
- [ ] All node pools at target version: `gcloud container node-pools list --cluster CLUSTER --zone ZONE`
- [ ] All nodes Ready: `kubectl get nodes`
- [ ] System pods healthy: `kubectl get pods -n kube-system`
- [ ] No stuck PDBs: `kubectl get pdb --all-namespaces`

Workload Health
- [ ] All deployments at desired replica count: `kubectl get deployments -A`
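- [ ] No Pending, Failed, or Unknown pods: `kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded` (CrashLoopBackOff pods report phase Running; check the STATUS column of `kubectl get pods -A` for them)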
- [ ] StatefulSets fully ready: `kubectl get statefulsets -A`
- [ ] Ingress/load balancers responding
- [ ] Application health checks and smoke tests passing

Observability
- [ ] Metrics pipeline active, no collection gaps
- [ ] Logs flowing to aggregation
- [ ] Error rates within pre-upgrade baseline
- [ ] Latency (p50/p95/p99) within pre-upgrade baseline

Cleanup
- [ ] Old node pools removed (if blue-green)
- [ ] Surge quota released (automatic for surge upgrades)
- [ ] Upgrade documented in changelog
- [ ] Lessons learned captured
```
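
The first Cluster Health items can be partly automated by filtering node output. The sketch below reads `kubectl get nodes --no-headers` on standard input and prints any node that is not plainly `Ready` or whose kubelet version differs from the target; `nodes_needing_attention` is a hypothetical helper name, and the column order (NAME, STATUS, ROLES, AGE, VERSION) and the `v` prefix on the VERSION column are assumptions about kubectl's default table output.

```bash
# Hypothetical helper: print NAME, STATUS, and VERSION for nodes that are
# not exactly "Ready" or that are not running the target version.
# Argument: target version without the leading "v".
nodes_needing_attention() {
  awk -v want="v$1" 'NF >= 5 && ($2 != "Ready" || $5 != want) { print $1, $2, $5 }'
}
```

Typical usage would be `kubectl get nodes --no-headers | nodes_needing_attention TARGET_VERSION`. Cordoned nodes (`Ready,SchedulingDisabled`) are reported as well, which is usually what a post-upgrade check wants.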
102 changes: 102 additions & 0 deletions skills/cloud/gke-upgrades/references/runbook-template.md
@@ -0,0 +1,102 @@
# Runbook Command Templates

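Standard command sequences for GKE upgrades. Replace the uppercase placeholders (`CLUSTER_NAME`, `ZONE`, `TARGET_VERSION`, `NODE_POOL_NAME`, and the others that appear in individual commands) before running. For regional clusters, use `--region REGION` in place of `--zone ZONE`.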

## Pre-flight

```bash
# Current versions
gcloud container clusters describe CLUSTER_NAME \
--zone ZONE \
--format="table(name, currentMasterVersion, nodePools[].version)"

# Available versions for channel
gcloud container get-server-config --zone ZONE \
--format="yaml(channels)"

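# Deprecated API usage (apiserver_requested_deprecated_apis is set for each
# deprecated API group/version/resource that has been requested)
kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis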

# Cluster health
kubectl get nodes
kubectl get pods -A | grep -v Running | grep -v Completed
```

## Control plane upgrade

```bash
gcloud container clusters upgrade CLUSTER_NAME \
--zone ZONE \
--master \
--cluster-version TARGET_VERSION

# Verify (wait ~10-15 min)
gcloud container clusters describe CLUSTER_NAME \
--zone ZONE \
--format="value(currentMasterVersion)"

kubectl get pods -n kube-system
```

## Node pool upgrade (Standard only)

```bash
# Configure surge settings
gcloud container node-pools update NODE_POOL_NAME \
--cluster CLUSTER_NAME \
--zone ZONE \
--max-surge-upgrade MAX_SURGE \
--max-unavailable-upgrade MAX_UNAVAILABLE

# Upgrade
gcloud container node-pools upgrade NODE_POOL_NAME \
--cluster CLUSTER_NAME \
--zone ZONE \
--cluster-version TARGET_VERSION

# Monitor progress
watch 'kubectl get nodes -o wide -L cloud.google.com/gke-nodepool'

# Verify
gcloud container node-pools list --cluster CLUSTER_NAME --zone ZONE
kubectl get pods -A | grep -v Running | grep -v Completed
```

## Maintenance window configuration

```bash
# Set recurring maintenance window
gcloud container clusters update CLUSTER_NAME \
--zone ZONE \
--maintenance-window-start YYYY-MM-DDTHH:MM:SSZ \
--maintenance-window-end YYYY-MM-DDTHH:MM:SSZ \
--maintenance-window-recurrence "FREQ=WEEKLY;BYDAY=SA"

# Add maintenance exclusion (a "no upgrades" exclusion can last up to 30 days)
gcloud container clusters update CLUSTER_NAME \
--zone ZONE \
--add-maintenance-exclusion-name "EXCLUSION_NAME" \
--add-maintenance-exclusion-start START_TIME \
--add-maintenance-exclusion-end END_TIME
```

## Rollback guidance

Control plane downgrades are rare. Patch downgrades can be performed by the customer; minor-version downgrades require GKE support involvement. Node pool downgrades require creating a new pool at the old version and migrating workloads.

```bash
# Cancel in-progress node pool upgrade (if needed)
# GKE will finish the current node and stop
# Find the running operation's ID, then cancel it
gcloud container operations list --zone ZONE --filter="status=RUNNING"
gcloud container operations cancel OPERATION_ID --zone ZONE

# Create replacement node pool at previous version (if rollback needed)
gcloud container node-pools create NODE_POOL_NAME-rollback \
--cluster CLUSTER_NAME \
--zone ZONE \
--node-version PREVIOUS_VERSION \
--num-nodes NUM_NODES \
--machine-type MACHINE_TYPE

# Cordon old pool, then drain it to migrate workloads
kubectl cordon -l cloud.google.com/gke-nodepool=NODE_POOL_NAME
kubectl drain -l cloud.google.com/gke-nodepool=NODE_POOL_NAME \
--ignore-daemonsets --delete-emptydir-data
```