NVIDIA · mchmarny · May 14, 2026 · May 14, 2026
@@ -28,7 +28,6 @@ brew install aicr
 # Or use the install script
 curl -sfL https://raw.githubusercontent.com/NVIDIA/aicr/main/install | bash -s --
 
-
 # Capture your cluster's current state
 aicr snapshot --output snapshot.yaml
 

@@ -9,7 +9,9 @@ Runbooks for testing and demonstrating AICR end-to-end workflows on live cluster
 | [cuj1-eks.md](cuj1-eks.md) | CUJ1 - EKS cluster setup |
 | [cuj1-gke.md](cuj1-gke.md) | CUJ1 - GKE cluster setup |
 | [cuj2.md](cuj2.md) | CUJ2 - EKS inference with Dynamo |
+| [cuj2-demo.md](cuj2-demo.md) | CUJ2 - Annotated demo walkthrough (training vs inference) |
 | [cuj2-eks.md](cuj2-eks.md) | CUJ2 - EKS variant |
+| [cuj2-gke.md](cuj2-gke.md) | CUJ2 - GKE variant |
 | [e2e.md](e2e.md) | End-to-end CLI demo |
 | [valid.md](valid.md) | Validation demo |
 | [data.md](data.md) | External data directory demo |

@@ -3,7 +3,7 @@
 ## Assumptions
 
 * Assuming user is already authenticated to an EKS cluster with 2+ H100 node.
-* Values used in `--accelerated-node-selector`, `--accelerated-node-toleration`, and `--system-node-toleration` flags are only for example purposes. Assuming user will update these to match their cluster.
+* Values used in the `--accelerated-node-selector`, `--accelerated-node-toleration`, `--system-node-selector`, and `--system-node-toleration` flags are for example purposes only. Assuming user will update these to match their cluster.
 
 ## Snapshot
 
@@ -41,13 +41,20 @@ aicr validate \
 
 ## Generate Bundle
 
+The selector and toleration values below mirror AICR's reference EKS clusters
+(`aicr1`, `aicr2`): nodes carry the label `nodeGroup={system-worker,gpu-worker}`
+and taints under a different key, `dedicated={system-workload,worker-workload}:{NoSchedule,NoExecute}`.
+The selector key (`nodeGroup`) and toleration key (`dedicated`) intentionally differ:
+the selector chooses which nodes pods target, and the tolerations let pods schedule onto the tainted nodes.
+Adjust both pairs to your cluster's actual labels and taints.
+
 ```shell
 aicr bundle \
   --recipe recipe.yaml \
   --accelerated-node-selector nodeGroup=gpu-worker \
   --accelerated-node-toleration dedicated=worker-workload:NoSchedule \
   --accelerated-node-toleration dedicated=worker-workload:NoExecute \
-  --system-node-selector dedicated=system-workload \
+  --system-node-selector nodeGroup=system-worker \
   --system-node-toleration dedicated=system-workload:NoSchedule \
   --system-node-toleration dedicated=system-workload:NoExecute \
   --storage-class <storage-class> \

@@ -46,8 +46,8 @@
   │    ├── agentgateway-crds/        (Gateway API + inference CRDs)        │
   │    ├── agentgateway/             (inference gateway controller)        │
   │    ├── nvsentinel/               (security/compliance)                 │
-  │    ├── nodewright-operator/         (node configuration)                  │
-  │    ├── nodewright-customizations/   (H100 tuning)                         │
+  │    ├── nodewright-operator/      (node configuration)                  │
+  │    ├── nodewright-customizations/ (H100 tuning)                        │
   │    ├── aws-ebs-csi-driver/       (EBS storage)                         │
   │    ├── aws-efa/                  (Elastic Fabric Adapter)              │
   │    ├── dynamo-crds/              (Dynamo CRDs)                         │
@@ -125,7 +125,7 @@
 │  h100-eks-ubuntu-training.yaml      │  (Ubuntu constraints)               │
 │  (Ubuntu constraints)               │      │                              │
 │      │                              │  h100-eks-ubuntu-inference-dynamo   │
-│  h100-eks-ubuntu-training-kubeflow  │  ├── gpu-operator (v25.3.4, CDI)    │
+│  h100-eks-ubuntu-training-kubeflow  │  ├── gpu-operator (v26.3.1, CDI)    │
 │  └── kubeflow-trainer       ◀── NEW │  ├── nvidia-dra-driver (gpuRes)◀─NEW│
 │                                     │  ├── dynamo-crds             ◀─ NEW │
 │                                     │  └── dynamo-platform         ◀─ NEW │

@@ -33,11 +33,11 @@ aicr bundle \
   --recipe recipe.yaml \
   --output bundle \
   --accelerated-node-selector [key]=[value] \
-  --accelerated-node-toleration [key]=[value]:[operation] \
+  --accelerated-node-toleration [key]=[value]:[effect] \
   --storage-class [storage-class]
 ```
 
-Replace the values for `--accelerated-node-selector` and `--accelerated-node-toleration` with the appropriate ones to match your gpu pool(s). You do not want optimizations and inference workloads to run across all nodes. Both options allow for comma delimination to supply multiple values. See the [aicr bundle](../docs/user/cli-reference.md#aicr-bundle) section for more information.
+Replace the values for `--accelerated-node-selector` and `--accelerated-node-toleration` with the appropriate ones to match your GPU pool(s). You do not want optimizations and inference workloads to run across all nodes. Both options accept comma-separated lists to supply multiple values. See the [aicr bundle](../docs/user/cli-reference.md#aicr-bundle) section for more information.
 
 Set `--storage-class` to a StorageClass that exists on the target cluster (check with `kubectl get storageclass`). Cloud overlays configure `kube-prometheus-stack` with a `volumeClaimTemplate` but no `storageClassName`; without this flag the PVC falls to the cluster's default StorageClass, and if no default is configured the deploy hangs on a Pending PVC.
 

@@ -188,7 +188,7 @@ Order in `dependencyRefs`:
 2. `gpu-operator` (depends on cert-manager)
 3. Other components...
 
-> Asymmetric rule matching based on [Kahn's algorithm](https://www.geeksforgeeks.org/dsa/topological-sorting-indegree-based-solution/).
+> Dependency-driven ordering based on [Kahn's algorithm](https://www.geeksforgeeks.org/dsa/topological-sorting-indegree-based-solution/) for topological sort.
 
 ## API Access
 

@@ -31,7 +31,7 @@ aicr recipe --service eks --accelerator gb200 | yq .
 From criteria file:
 
 ```shell
-cat > /tmp/criteria.yaml << 'EOF'
+cat > "${TMPDIR:-/tmp}/criteria.yaml" << 'EOF'
 kind: RecipeCriteria
 apiVersion: aicr.nvidia.com/v1alpha1
 metadata:
@@ -48,15 +48,15 @@ EOF
 Generate recipe from criteria file
 
 ```shell
-aicr recipe --criteria /tmp/criteria.yaml --output recipe.yaml
+aicr recipe --criteria "${TMPDIR:-/tmp}/criteria.yaml" --output recipe.yaml
 ```
 
 > Metadata overlays: `components=11 overlays=7`
 
 CLI flags override criteria file values
 
 ```shell
-aicr recipe --criteria /tmp/criteria.yaml --service gke | yq .
+aicr recipe --criteria "${TMPDIR:-/tmp}/criteria.yaml" --service gke | yq .
 ```
 
 > Metadata overlays: `components=7 overlays=2`
@@ -238,7 +238,7 @@ aicr recipe \
 ```
 
 Output shows:
-* `18` embedded + `1` external = `19` merged components
+* `<N>` embedded + `<M>` external = `<N+M>` merged components
 * `dgxc-teleport` appears as Kustomize component
 
 Now `dgxc-teleport` is included in `componentRefs` and `deploymentOrder`

@@ -1,5 +1,7 @@
 # CUJ2 Test Report (Run 2) - EKS Inference with Dynamo
 
+**Historical capture.** This report was generated before PR #871 migrated `kgateway` → `agentgateway`. Current bundles install `agentgateway` charts in the `agentgateway-system` namespace; the log lines below that show `kgateway-*` releases are obsolete.
+
 **Date:** 2026-03-13
 **Branch:** `fix/cuj2-timeout-issue` (includes PR #397 fix, rebased on main)
 **AICR Version:** built from source (fix/cuj2-timeout-issue)

@@ -65,7 +65,7 @@ aicr bundle \
 Expected:
 - Completes in <10s
 - `bundle` created
-- Nodewright Warning about workload selector
+- Nodewright emits a warning about the workload selector
 
 ## Validate (dry run, no-cluster)
 

@@ -8,7 +8,7 @@ NVIDIA AI Cluster Runtime (AICR) is a suite of tooling designed to automate the
 |------|-------------|
 | **Snapshot** | A captured state of a system including OS, kernel, Kubernetes, GPU, and SystemD configuration. Created by `aicr snapshot` or the Kubernetes agent. |
 | **Recipe** | A generated configuration recommendation containing component references, constraints, and deployment order. Created by `aicr recipe` based on criteria or snapshot analysis. |
-| **Criteria** | Query parameters that define the target environment: `service` (eks/gke/aks/oke/kind/lke), `accelerator` (h100/gb200/b200/a100/l40/rtx-pro-6000), `intent` (training/inference), `os` (ubuntu/rhel/cos/amazonlinux/talos), `platform` (kubeflow), and `nodes`. |
+| **Criteria** | Query parameters that define the target environment: `service` (eks/gke/aks/oke/kind/lke), `accelerator` (h100/gb200/b200/a100/l40/rtx-pro-6000), `intent` (training/inference), `os` (ubuntu/rhel/cos/amazonlinux/talos), `platform` (dynamo, kubeflow, nim, slurm), and `nodes`. |
 | **Overlay** | A recipe metadata file that extends the base recipe for specific environments. Overlays are matched against criteria using asymmetric matching. |
 | **Mixin** | A composable recipe fragment (`kind: RecipeMixin`) that carries only `constraints` and `componentRefs`. Mixins live in `recipes/mixins/`, are excluded from overlay discovery, and are referenced by leaf overlays via `spec.mixins` to share orthogonal content (e.g., OS constraints, platform components) without duplication. See [ADR-005](design/005-overlay-refactoring.md). |
 | **Bundle** | Deployment artifacts generated from a recipe: Helm values files, Kubernetes manifests, installation scripts, and checksums. |
@@ -40,7 +40,7 @@ Previously, administrators relied on static documentation and manual installatio
 
 ### The Solution: Automated Approach
 
-AICR replaces manual interpretation of documentation with a **automated approach**. It treats infrastructure configuration as code, providing a deterministic engine that generates the exact artifacts needed for a specific environment.
+AICR replaces manual interpretation of documentation with an **automated approach**. It treats infrastructure configuration as code, providing a deterministic engine that generates the exact artifacts needed for a specific environment.
 
 **Key Benefits:**
 1.  **Deterministic & Validated:** The system guarantees that the inputs (your system state) always produce the same valid outputs, tested against NVIDIA hardware.
@@ -159,6 +159,10 @@ For engineers integrating AICR into CI/CD pipelines, GitOps workflows, or larger
 | [Kubernetes Deployment](integrator/kubernetes-deployment.md) | Self-hosted API server deployment |
 | [Recipe Development](integrator/recipe-development.md) | Adding and modifying recipe metadata |
 | [Validator Extension](integrator/validator-extension.md) | Custom validators via `--data` |
+| [AKS GPU Setup](integrator/aks-gpu-setup.md) | Azure Kubernetes Service GPU node setup |
+| [EKS Dynamo Networking](integrator/eks-dynamo-networking.md) | EKS networking for Dynamo workloads |
+| [GKE TCPXO Networking](integrator/gke-tcpxo-networking.md) | GKE TCPXO networking integration |
+| [Talos Integration](integrator/talos-integration.md) | Running AICR on Talos Linux |
 
 ## Quick Start
 

@@ -37,12 +37,18 @@ docs/conformance/cncf/
             ├── pod-autoscaling.md
             └── cluster-autoscaling.md
 
-pkg/evidence/scripts/                 # Evidence collection script + test manifests
-├── collect-evidence.sh
-└── manifests/
-    ├── dra-gpu-test.yaml
-    ├── gang-scheduling-test.yaml
-    └── hpa-gpu-test.yaml
+pkg/evidence/cncf/                    # CNCF evidence collector package
+├── collector.go                      # Feature registry, alias mapping
+├── renderer.go                       # Evidence file rendering
+├── requirements.go                   # CNCF requirement ID mapping
+├── templates.go                      # Evidence templates
+├── types.go                          # Shared types
+└── scripts/                          # Evidence collection script + test manifests
+    ├── collect-evidence.sh
+    └── manifests/
+        ├── dra-gpu-test.yaml
+        ├── gang-scheduling-test.yaml
+        └── hpa-gpu-test.yaml
 ```
 
 ## Usage
@@ -75,10 +81,13 @@ aicr validate --phase conformance \
   --evidence-dir ./evidence --cncf-submission -f dra -f hpa
 ```
 
-Alternatively, run the evidence collection script directly:
+Alternatively, run the evidence collection script directly. Valid section
+names are `dra`, `gang`, `secure`, `accelerator-metrics`, `service-metrics`,
+`gateway`, `operator`, `hpa`, `cluster-autoscaling`, or `all`:
+
 ```bash
-./pkg/evidence/scripts/collect-evidence.sh all
-./pkg/evidence/scripts/collect-evidence.sh dra
+./pkg/evidence/cncf/scripts/collect-evidence.sh all
+./pkg/evidence/cncf/scripts/collect-evidence.sh dra
 ```
 
 > **Note:** The `--cncf-submission` flag deploys GPU workloads and takes ~5-10

@@ -267,6 +267,7 @@ Supported content types:
 | `gpu` | AcceleratorType | Alias for accelerator | `gpu=h100` |
 | `intent` | IntentType | Enum: training, inference, any | `intent=training` |
 | `os` | OSType | Enum: ubuntu, rhel, cos, amazonlinux, talos, any | `os=ubuntu` |
+| `platform` | PlatformType | Enum: dynamo, kubeflow, nim, slurm, any | `platform=kubeflow` |
 | `nodes` | int | >= 0 | `nodes=8` |
 
 ### Recipe Builder: `pkg/recipe/builder.go`
@@ -369,7 +370,7 @@ Endpoints `GET /v1/recipe` (query parameters) and `POST /v1/recipe` (criteria bo
 - `X-RateLimit-Limit` - Total requests allowed per second
 - `X-RateLimit-Remaining` - Requests remaining in current window
 - `X-RateLimit-Reset` - Unix timestamp when window resets
-- `Cache-Control` - Caching policy (public, max-age=300)
+- `Cache-Control` - Caching policy (public, max-age=600)
 
 ### Health Check
 
@@ -438,7 +439,9 @@ aicr_panic_recoveries_total 0
   "service": "aicrd",
   "version": "v1.0.0",
   "routes": [
-    "/v1/recipe"
+    "/v1/recipe",
+    "/v1/query",
+    "/v1/bundle"
   ]
 }
 ```
@@ -737,7 +740,7 @@ spec:
 ### Caching Strategy
 
 - **Recipe Store**: Loaded once per process, cached globally
-- **Client-Side**: 5-minute cache via Cache-Control header
+- **Client-Side**: 10-minute cache via Cache-Control header (`defaults.RecipeCacheTTL`)
 - **CDN**: Recommended for public-facing deployments
 
 ## Error Handling
@@ -836,7 +839,7 @@ When a request uses a criteria value not in the configured allowlist:
 - Request ID tracking for distributed tracing
 - Structured logging for debugging
 
-## Monitoring & Observability
+## Monitoring and Observability
 
 ### Prometheus Metrics
 
@@ -975,7 +978,7 @@ func TestRecipeHandler(t *testing.T) {
 - `pkg/serializer` - JSON response formatting
 - `pkg/logging` - Logging configuration
 
-## Build & Deployment
+## Build and Deployment
 
 ### Automated CI/CD Pipeline
 
@@ -1002,7 +1005,7 @@ export TAG=$(curl -s https://api.github.com/repos/NVIDIA/aicr/releases/latest |
 gh attestation verify oci://ghcr.io/nvidia/aicrd:${TAG} --owner nvidia
 ```
 
-For detailed CI/CD architecture, see [CONTRIBUTING.md](https://github.com/NVIDIA/aicr/blob/main/CONTRIBUTING.md#github-actions--cicd) and [Architecture Overview](index.md#cicd-architecture).
+For detailed CI/CD architecture, see [CONTRIBUTING.md](https://github.com/NVIDIA/aicr/blob/main/CONTRIBUTING.md#github-actions--cicd) and the [Architecture Overview](index.md).
 
 ### Local Build Configuration
 
@@ -1070,6 +1073,7 @@ export AICR_ALLOWED_SERVICES=eks,gke
 - The `any` value is always allowed regardless of allowlist
 - Both `/v1/recipe` and `/v1/bundle` endpoints enforce allowlists
 - CLI (`aicr`) is not affected by allowlists
+- The `platform` criteria field has no allowlist environment variable today; all valid platform enum values are accepted, and the criteria parser rejects invalid enum values with HTTP 400
 
 ## Extension and Operating Patterns
 
@@ -1097,7 +1101,7 @@ See [API Server: Extension and Operating Patterns](api-server-extending.md).
 - [Google SRE Book](https://sre.google/sre-book/table-of-contents/) - Site reliability engineering  
 - [Release Engineering](https://sre.google/workbook/release-engineering/) - Deployment best practices
 
-### HTTP & APIs
+### HTTP and APIs
 
 - [HTTP/2 in Go](https://go.dev/blog/h2push) - HTTP/2 server push  
 - [RESTful API Design](https://cloud.google.com/apis/design) - Google Cloud API design guide  

@@ -291,8 +291,9 @@ flowchart TD
     B --> B2["SystemDCollector<br/>(containerd, docker, kubelet)"]
     B --> B3["KubernetesCollector<br/>(server, images, policies)"]
     B --> B4["GPUCollector<br/>(nvidia-smi data)"]
+    B --> B5["NodeTopologyCollector<br/>(cluster-wide taints, labels)"]
 
-    B1 & B2 & B3 & B4 --> C[NodeSnapshotter.Measure]
+    B1 & B2 & B3 & B4 & B5 --> C[NodeSnapshotter.Measure]
 
     C --> D["Parallel Collection<br/>(errgroup)"]
 
@@ -301,17 +302,20 @@ flowchart TD
     D --> D3["Go Routine 3: SystemD<br/>• containerd.service<br/>• docker.service<br/>• kubelet.service"]
     D --> D4["Go Routine 4: OS Config<br/>• GRUB parameters<br/>• Kernel modules<br/>• Sysctl parameters"]
     D --> D5["Go Routine 5: GPU<br/>• nvidia-smi properties<br/>• driver, CUDA, etc."]
+    D --> D6["Go Routine 6: NodeTopology<br/>• cluster-wide taints<br/>• cluster-wide labels"]
 
-    D1 & D2 & D3 & D4 & D5 --> E["All goroutines complete<br/>or first error returns"]
+    D1 & D2 & D3 & D4 & D5 & D6 --> E["All goroutines complete<br/>or first error returns"]
 
-    E --> F["Snapshot Structure<br/>kind: Snapshot<br/>apiVersion: aicr.nvidia.com/v1alpha1<br/>measurements: [k8s, systemd, os, gpu]"]
+    E --> F["Snapshot Structure<br/>kind: Snapshot<br/>apiVersion: aicr.nvidia.com/v1alpha1<br/>measurements: [K8s, SystemD, OS, GPU, NodeTopology]"]
 
     F --> G[serializer.NewFileWriterOrStdout]
 
     G --> G1["Format: JSON/YAML/Table"]
     G --> G2["Output: stdout or file"]
 ```
 
+Snapshot measurement types: `K8s`, `SystemD`, `OS`, `GPU`, and `NodeTopology` (cluster-wide node taints and labels). See `pkg/measurement/types.go` for the canonical constants.
+
 #### Usage Examples
 
 ```bash
@@ -979,11 +983,8 @@ INFO  generating bundle recipeFilePath=recipe.yaml outputDir=./bundles bundlerTy
 INFO  starting bundle generation bundler_count=1 output_dir=./bundles
 INFO  bundler completed bundler_type=gpu-operator files=5 size_bytes=12458 duration=45ms
 INFO  bundle generation complete summary="Generated 5 files (12 KB) in 45ms. Success: 1/1 bundlers."
-INFO  bundle generation completed success=1 errors=0 duration_sec=0.045 summary="Generated 5 files (12 KB) in 45ms. Success: 1/1 bundlers."
 ```
 
-**Common Errors**:
-
 ## Shared Infrastructure
 
 ### Collector Factory Pattern
@@ -999,6 +1000,7 @@ type Factory interface {
     CreateOSCollector() Collector
     CreateKubernetesCollector() Collector
     CreateGPUCollector() Collector
+    CreateNodeTopologyCollector() Collector
 }
 ```
 
@@ -1053,7 +1055,7 @@ type Reading struct {
 ```bash
 # Invalid accelerator type
 $ aicr recipe --accelerator invalid-gpu
-[cli] command failed: error=[INTERNAL] error building recipe: [INVALID_REQUEST] error parsing criteria: [INVALID_REQUEST] failed to apply criteria option: [INVALID_REQUEST] failed to parse accelerator type: [INVALID_REQUEST] invalid accelerator type: invalid-gpu exitCode=8
+[cli] command failed: error=[INVALID_REQUEST] error parsing criteria: [INVALID_REQUEST] invalid accelerator type: invalid-gpu exitCode=2
 
 # Unknown output format
 $ aicr snapshot --format xml
@@ -1384,7 +1386,7 @@ spec:
             effect: NoSchedule
           containers:
           - name: aicr
-            image: ghcr.io/nvidia/aicr:v0.6.4
+            image: ghcr.io/nvidia/aicr:<release-tag>  # replace with the AICR release you target
             command:
               - /bin/sh
               - -c