From 50e6eb3d647e8608ddd13f73f3fef147770bbd38 Mon Sep 17 00:00:00 2001
From: Yuan Chen
Date: Thu, 14 May 2026 13:26:18 -0700
Subject: [PATCH] docs: fix audit findings across docs, demos, examples
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Sweep of doc-quality issues uncovered by an audit of docs/, demos/, and
examples/ against current main. Fixes break into four buckets:

ERROR (breaking or contradicts code):
- examples/recipes/eks-training.yaml: deploymentOrder listed
  kube-prometheus-stack with no matching componentRefs entry, so bundle
  generation rejected the recipe. Added the missing entry.
- examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml: same
  defect; performance phase referenced nonexistent checks
  (nccl-bandwidth-test, fabric-health-check) and an unknown
  `infrastructure: nccl-doctor` line; replaced with the real
  nccl-all-reduce-bw-net / -nvls checks from the gb200-eks-training
  overlay. Also bumped gpu-operator (v25.3.3 → v26.3.1) and
  nvidia-dra-driver-gpu (25.8.1 → 25.12.0) to registry defaults.
- docs/contributor/validator.md: validators.Context struct field
  `Recipe *recipe.RecipeResult` (no longer exists) →
  `ValidationInput *v1.ValidationInput`; gated skip example updated to
  match. Added an RBAC subsection documenting per-run aicr-validator-
  naming introduced in PR #888.
- docs/contributor/api-server.md: Cache-Control max-age 300 → 600 to
  match RecipeCacheTTL (10 min); root endpoint `routes` list expanded
  from [/v1/recipe] to all three registered routes; Query parameter
  table gained the missing `platform` row; broken anchor
  index.md#cicd-architecture replaced with a working link.
- docs/contributor/cli.md: Factory interface snippet went from 4 to 5
  methods (CreateNodeTopologyCollector was missing); snapshot
  measurements list gained `topology`; bogus [INTERNAL]-wrapped
  invalid-accelerator example corrected to [INVALID_REQUEST]; duplicate
  log line and empty section removed.
- docs/user/cli-reference.md: added missing exit code 5 (TIMEOUT) to the
  validate exit-code table; `--best-effort` (a real flag) is no longer
  used as the "rejected typo" example.
- docs/user/api-reference.md: Bundle Components table gained the three
  missing components (nfd, slinky-slurm-operator,
  slinky-slurm-operator-crds) and was re-sorted alphabetically.
- docs/conformance/cncf/index.md: corrected script path
  pkg/evidence/scripts/collect-evidence.sh →
  pkg/evidence/cncf/scripts/collect-evidence.sh and refreshed the
  directory tree; section-name list updated to match the script's case
  block.

INCONSISTENCY / drift:
- platform enum (kubeflow) → full list (dynamo, kubeflow, nim, slurm) in
  docs/README.md, docs/contributor/validations.md.
- OS enum (ubuntu, rhel) → full list (ubuntu, rhel, cos, amazonlinux,
  talos) in docs/integrator/data-flow.md.
- Stale gpu-operator illustrative pins (v25.3.x) bumped to v26.3.1 in
  docs/integrator/{recipe-development,automation,data-flow}.md.
- demos/cuj1-eks.md: --system-node-selector dedicated=system-workload
  (taint key as a label selector) → nodeGroup=system-worker; matching
  prose adjustment.
- demos/cuj2.md: toleration grammar field [operation] → [effect].
- demos/cuj2-demo.md: gpu-operator pin v25.3.4 → v26.3.1; ASCII box
  borders re-aligned.
- demos/e2e.md: /tmp/criteria.yaml → ${TMPDIR:-/tmp}/criteria.yaml per
  project artifact-location convention; stale "18 + 1 = 19" component
  count replaced with placeholders.
- demos/data.md: "Asymmetric rule matching" → "Dependency-driven
  ordering based on Kahn's algorithm".
- demos/README.md: index table gained the missing CUJ pages.
- docs/integrator/index.md and docs/README.md: integrator-pages tables
  gained missing rows.
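
For reviewers unfamiliar with the demos/data.md wording fix: Kahn's algorithm orders components so that each one comes after everything it depends on. The following is a minimal sketch under stated assumptions, not AICR's actual implementation. The function name `topoOrder`, the dependency map, and the `kube-prometheus-stack` edge are hypothetical and chosen only for illustration (demos/data.md itself only states that `gpu-operator` depends on `cert-manager`).

```go
package main

import (
	"fmt"
	"sort"
)

// topoOrder is a hypothetical helper (not AICR code). deps maps a component
// name to the names it depends on. It returns an order in which every
// component appears after all of its dependencies (Kahn's algorithm), or an
// error if the dependency graph contains a cycle.
func topoOrder(deps map[string][]string) ([]string, error) {
	indegree := make(map[string]int)        // unmet dependency count per component
	dependents := make(map[string][]string) // reverse edges: dependency -> dependents

	for name, ds := range deps {
		if _, ok := indegree[name]; !ok {
			indegree[name] = 0
		}
		for _, d := range ds {
			if _, ok := indegree[d]; !ok {
				indegree[d] = 0
			}
			indegree[name]++
			dependents[d] = append(dependents[d], name)
		}
	}

	// Seed the queue with components that have no dependencies.
	// Sorting keeps the result deterministic despite map iteration order.
	var ready []string
	for name, n := range indegree {
		if n == 0 {
			ready = append(ready, name)
		}
	}
	sort.Strings(ready)

	var order []string
	for len(ready) > 0 {
		next := ready[0]
		ready = ready[1:]
		order = append(order, next)

		unlocked := dependents[next]
		sort.Strings(unlocked)
		for _, dep := range unlocked {
			indegree[dep]--
			if indegree[dep] == 0 {
				ready = append(ready, dep)
			}
		}
	}

	// Any component never emitted still has an unmet dependency: a cycle.
	if len(order) != len(indegree) {
		return nil, fmt.Errorf("dependency cycle: ordered %d of %d components",
			len(order), len(indegree))
	}
	return order, nil
}

func main() {
	// Illustrative dependency map; only the gpu-operator edge comes from the docs.
	deps := map[string][]string{
		"cert-manager":          nil,
		"gpu-operator":          {"cert-manager"},
		"kube-prometheus-stack": {"cert-manager"},
	}
	order, err := topoOrder(deps)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(order)
}
```

The sketch treats every name that appears only as a dependency as a component in its own right, so an undeclared dependency still lands in the order rather than being rejected; a real recipe validator would likely cross-check such names against the declared component list instead.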

STYLE / slug hygiene (CLAUDE.md doc-style rules):
- docs/user/cli-reference.md: stripped trailing (--flag) parentheticals
  from three headings (Storage Class, Deployment Methods, Value
  Overrides); the Deploy/Undeploy variants are left alone in this PR
  because their slugs have many inbound references in bundle templates
  and golden-files which would need a coordinated follow-up.
- docs/contributor/api-server.md: three headings rewritten with "and" in
  place of "&" to fix double-hyphen slugs.
- docs/integrator/aks-gpu-setup.md: dropped trailing "(Important)" from
  a heading.

TYPO / NIT:
- docs/README.md "a automated" → "an automated".
- docs/integrator/recipe-development.md "end to end" → "end-to-end".
- demos/cuj2.md "comma delimination" → "comma-separated".
- demos/valid.md sentence fragment completed.
- docs/user/component-catalog.md `***Note:***` → `**Note:**`.
- docs/user/{installation,agent-deployment}.md github.com/nvidia/aicr →
  github.com/NVIDIA/aicr.
- docs/user/agent-deployment.md and docs/user/cli-reference.md literal
  `app.kubernetes.io/version: v0.17.0` → placeholder.
- docs/contributor/cli.md CronJob example pin v0.6.4 → placeholder.
- examples/recipes/README.md driver version 580.82.07 → 580.105.08 to
  match current gpu-operator pin.
- examples/recipes/{eks-training,eks-gb200-...,kind}.yaml
  metadata.version v0.26.7-next → dev (hand-written-example convention
  shared with aks-training.yaml).
- demos/examples/CUJ2-Test-Report.md: top-of-file note flags the log as
  a pre-#871 historical capture (kgateway → agentgateway migration).

Also adds the missing CNCF feature list to docs/user/cli-reference.md
--feature row and a "Scoping conformance to specific features" section
in docs/user/validation.md, since the 9 ValidFeatures names from
pkg/evidence/cncf/collector.go were not documented for users.

Skipped (out of scope or risk-deferred):
- ADR-002 cross-references to per-run RBAC names beyond the existing
  2026-05-14 implementation note: ADRs are frozen historical records.
- Deploy Script Behavior / Undeploy Script Behavior heading
  parentheticals: ~23 inbound anchor links in bundle templates and
  golden-files would need a coordinated PR.
- demos/cuj{1,2}-{eks,gke}.md podTemplateOverrides cleanup: both files
  already explicitly note "No podTemplateOverrides / runtimePatches
  needed" since the torch-distributed runtime bakes scheduling at bundle
  time.
- pre-existing site/docs/* renders (gitignored, auto-generated).

Doc-only PR; full `make qualify` skipped per CLAUDE.local.md ("doc-only
/ infra-only change ... cheap checks are enough"). Ran `make lint`
(yamllint, gofmt, license-check, sidebar check, agents sync, chart-pin
verification); all green.
---
 README.md                                     |  1 -
 demos/README.md                               |  2 ++
 demos/cuj1-eks.md                             | 11 ++++--
 demos/cuj2-demo.md                            |  6 ++--
 demos/cuj2.md                                 |  4 +--
 demos/data.md                                 |  2 +-
 demos/e2e.md                                  |  8 ++---
 demos/examples/CUJ2-Test-Report.md            |  2 ++
 demos/valid.md                                |  2 +-
 docs/README.md                                |  8 +++--
 docs/conformance/cncf/index.md                | 27 +++++++++-----
 docs/contributor/api-server.md                | 18 ++++++----
 docs/contributor/cli.md                       | 18 +++++-----
 docs/contributor/index.md                     |  3 +-
 docs/contributor/validations.md               |  2 +-
 docs/contributor/validator.md                 | 36 ++++++++++++-------
 docs/integrator/aks-gpu-setup.md              |  2 +-
 docs/integrator/automation.md                 |  2 +-
 docs/integrator/data-flow.md                  |  4 +--
 docs/integrator/index.md                      |  2 ++
 docs/integrator/recipe-development.md         | 10 +++---
 docs/user/agent-deployment.md                 |  6 ++--
 docs/user/api-reference.md                    | 35 +++++++++---------
 docs/user/cli-reference.md                    | 17 ++++-----
 docs/user/component-catalog.md                |  2 +-
 docs/user/installation.md                     |  2 +-
 docs/user/validation.md                       | 33 +++++++++++++++++
 examples/recipes/README.md                    |  2 +-
 ...gb200-ubuntu-training-with-validation.yaml | 35 +++++++++++++-----
 examples/recipes/eks-training.yaml            |  9 ++++-
examples/recipes/kind.yaml | 2 +- 31 files changed, 208 insertions(+), 105 deletions(-) diff --git a/README.md b/README.md index 1679d6f8c..0b513f0ae 100755 --- a/README.md +++ b/README.md @@ -28,7 +28,6 @@ brew install aicr # Or use the install script curl -sfL https://raw.githubusercontent.com/NVIDIA/aicr/main/install | bash -s -- - # Capture your cluster's current state aicr snapshot --output snapshot.yaml diff --git a/demos/README.md b/demos/README.md index 29e51663b..d50f3c927 100644 --- a/demos/README.md +++ b/demos/README.md @@ -9,7 +9,9 @@ Runbooks for testing and demonstrating AICR end-to-end workflows on live cluster | [cuj1-eks.md](cuj1-eks.md) | CUJ1 - EKS cluster setup | | [cuj1-gke.md](cuj1-gke.md) | CUJ1 - GKE cluster setup | | [cuj2.md](cuj2.md) | CUJ2 - EKS inference with Dynamo | +| [cuj2-demo.md](cuj2-demo.md) | CUJ2 - Annotated demo walkthrough (training vs inference) | | [cuj2-eks.md](cuj2-eks.md) | CUJ2 - EKS variant | +| [cuj2-gke.md](cuj2-gke.md) | CUJ2 - GKE variant | | [e2e.md](e2e.md) | End-to-end CLI demo | | [valid.md](valid.md) | Validation demo | | [data.md](data.md) | External data directory demo | diff --git a/demos/cuj1-eks.md b/demos/cuj1-eks.md index a78b9786c..65f97b94f 100644 --- a/demos/cuj1-eks.md +++ b/demos/cuj1-eks.md @@ -3,7 +3,7 @@ ## Assumptions * Assuming user is already authenticated to an EKS cluster with 2+ H100 node. -* Values used in `--accelerated-node-selector`, `--accelerated-node-toleration`, and `--system-node-toleration` flags are only for example purposes. Assuming user will update these to match their cluster. +* Values used in `--accelerated-node-selector`, `--accelerated-node-toleration`, `--system-node-selector`, and `--system-node-toleration` flags are only for example purposes. Assuming user will update these to match their cluster. 
## Snapshot @@ -41,13 +41,20 @@ aicr validate \ ## Generate Bundle +The selector and toleration values below mirror AICR's reference EKS clusters +(`aicr1`, `aicr2`): nodes carry the label `nodeGroup={system-worker,gpu-worker}` +and the unrelated taint key `dedicated={system-workload,worker-workload}:{NoSchedule,NoExecute}`. +The selector key (`nodeGroup`) and toleration key (`dedicated`) intentionally +differ — the label drives scheduling targeting and the taint drives admission. +Adjust both pairs to your cluster's actual labels and taints. + ```shell aicr bundle \ --recipe recipe.yaml \ --accelerated-node-selector nodeGroup=gpu-worker \ --accelerated-node-toleration dedicated=worker-workload:NoSchedule \ --accelerated-node-toleration dedicated=worker-workload:NoExecute \ - --system-node-selector dedicated=system-workload \ + --system-node-selector nodeGroup=system-worker \ --system-node-toleration dedicated=system-workload:NoSchedule \ --system-node-toleration dedicated=system-workload:NoExecute \ --storage-class \ diff --git a/demos/cuj2-demo.md b/demos/cuj2-demo.md index be2c0787e..925870165 100644 --- a/demos/cuj2-demo.md +++ b/demos/cuj2-demo.md @@ -46,8 +46,8 @@ │ ├── agentgateway-crds/ (Gateway API + inference CRDs) │ │ ├── agentgateway/ (inference gateway controller) │ │ ├── nvsentinel/ (security/compliance) │ - │ ├── nodewright-operator/ (node configuration) │ - │ ├── nodewright-customizations/ (H100 tuning) │ + │ ├── nodewright-operator/ (node configuration) │ + │ ├── nodewright-customizations/ (H100 tuning) │ │ ├── aws-ebs-csi-driver/ (EBS storage) │ │ ├── aws-efa/ (Elastic Fabric Adapter) │ │ ├── dynamo-crds/ (Dynamo CRDs) │ @@ -125,7 +125,7 @@ │ h100-eks-ubuntu-training.yaml │ (Ubuntu constraints) │ │ (Ubuntu constraints) │ │ │ │ │ │ h100-eks-ubuntu-inference-dynamo │ -│ h100-eks-ubuntu-training-kubeflow │ ├── gpu-operator (v25.3.4, CDI) │ +│ h100-eks-ubuntu-training-kubeflow │ ├── gpu-operator (v26.3.1, CDI) │ │ └── kubeflow-trainer ◀── NEW │ ├── 
nvidia-dra-driver (gpuRes)◀─NEW│ │ │ ├── dynamo-crds ◀─ NEW │ │ │ └── dynamo-platform ◀─ NEW │ diff --git a/demos/cuj2.md b/demos/cuj2.md index fe7e179f9..6959e99ea 100644 --- a/demos/cuj2.md +++ b/demos/cuj2.md @@ -33,11 +33,11 @@ aicr bundle \ --recipe recipe.yaml \ --output bundle \ --accelerated-node-selector [key]=[value] \ - --accelerated-node-toleration [key]=[value]:[operation] \ + --accelerated-node-toleration [key]=[value]:[effect] \ --storage-class [storage-class] ``` -Replace the values for `--accelerated-node-selector` and `--accelerated-node-toleration` with the appropriate ones to match your gpu pool(s). You do not want optimizations and inference workloads to run across all nodes. Both options allow for comma delimination to supply multiple values. See the [aicr bundle](../docs/user/cli-reference.md#aicr-bundle) section for more information. +Replace the values for `--accelerated-node-selector` and `--accelerated-node-toleration` with the appropriate ones to match your gpu pool(s). You do not want optimizations and inference workloads to run across all nodes. Both options allow for comma-separated values to supply multiple values. See the [aicr bundle](../docs/user/cli-reference.md#aicr-bundle) section for more information. Set `--storage-class` to a StorageClass that exists on the target cluster (check with `kubectl get storageclass`). Cloud overlays configure `kube-prometheus-stack` with a `volumeClaimTemplate` but no `storageClassName`; without this flag the PVC falls to the cluster's default StorageClass, and if no default is configured the deploy hangs on a Pending PVC. diff --git a/demos/data.md b/demos/data.md index 90eaf5f50..801d2b93e 100644 --- a/demos/data.md +++ b/demos/data.md @@ -188,7 +188,7 @@ Order in `dependencyRefs`: 2. `gpu-operator` (depends on cert-manager) 3. Other components... -> Asymmetric rule matching based on [Kahn's algorithm](https://www.geeksforgeeks.org/dsa/topological-sorting-indegree-based-solution/). 
+> Dependency-driven ordering based on [Kahn's algorithm](https://www.geeksforgeeks.org/dsa/topological-sorting-indegree-based-solution/) for topological sort. ## API Access diff --git a/demos/e2e.md b/demos/e2e.md index 79eb66c0c..a2b3801df 100644 --- a/demos/e2e.md +++ b/demos/e2e.md @@ -31,7 +31,7 @@ aicr recipe --service eks --accelerator gb200 | yq . From criteria file: ```shell -cat > /tmp/criteria.yaml << 'EOF' +cat > "${TMPDIR:-/tmp}/criteria.yaml" << 'EOF' kind: RecipeCriteria apiVersion: aicr.nvidia.com/v1alpha1 metadata: @@ -48,7 +48,7 @@ EOF Generate recipe from criteria file ```shell -aicr recipe --criteria /tmp/criteria.yaml --output recipe.yaml +aicr recipe --criteria "${TMPDIR:-/tmp}/criteria.yaml" --output recipe.yaml ``` > Metadata overlays: `components=11 overlays=7` @@ -56,7 +56,7 @@ aicr recipe --criteria /tmp/criteria.yaml --output recipe.yaml CLI flags override criteria file values ```shell -aicr recipe --criteria /tmp/criteria.yaml --service gke | yq . +aicr recipe --criteria "${TMPDIR:-/tmp}/criteria.yaml" --service gke | yq . ``` > Metadata overlays: `components=7 overlays=2` @@ -238,7 +238,7 @@ aicr recipe \ ``` Output shows: -* `18` embedded + `1` external = `19` merged components +* `` embedded + `` external = `` merged components * `dgxc-teleport` appears as Kustomize component Now `dgxc-teleport` is included in `componentRefs` and `deploymentOrder` diff --git a/demos/examples/CUJ2-Test-Report.md b/demos/examples/CUJ2-Test-Report.md index cd15454c8..8d05cae13 100644 --- a/demos/examples/CUJ2-Test-Report.md +++ b/demos/examples/CUJ2-Test-Report.md @@ -1,5 +1,7 @@ # CUJ2 Test Report (Run 2) - EKS Inference with Dynamo +**Historical capture.** This report was generated before PR #871 migrated `kgateway` → `agentgateway`. Current bundles install `agentgateway` charts in namespace `agentgateway-system`; the log lines below reflecting `kgateway-*` releases are obsolete. 
+ **Date:** 2026-03-13 **Branch:** `fix/cuj2-timeout-issue` (includes PR #397 fix, rebased on main) **AICR Version:** built from source (fix/cuj2-timeout-issue) diff --git a/demos/valid.md b/demos/valid.md index 88807e61e..4d515d6e7 100644 --- a/demos/valid.md +++ b/demos/valid.md @@ -65,7 +65,7 @@ aicr bundle \ Expected: - Completes in <10s - `bundle` created -- Nodewright Warning about workload selector +- Nodewright emits a warning about the workload selector ## Validate (dry run, no-cluster) diff --git a/docs/README.md b/docs/README.md index bccf2a1fc..d7755ad32 100644 --- a/docs/README.md +++ b/docs/README.md @@ -8,7 +8,7 @@ NVIDIA AI Cluster Runtime (AICR) is a suite of tooling designed to automate the |------|-------------| | **Snapshot** | A captured state of a system including OS, kernel, Kubernetes, GPU, and SystemD configuration. Created by `aicr snapshot` or the Kubernetes agent. | | **Recipe** | A generated configuration recommendation containing component references, constraints, and deployment order. Created by `aicr recipe` based on criteria or snapshot analysis. | -| **Criteria** | Query parameters that define the target environment: `service` (eks/gke/aks/oke/kind/lke), `accelerator` (h100/gb200/b200/a100/l40/rtx-pro-6000), `intent` (training/inference), `os` (ubuntu/rhel/cos/amazonlinux/talos), `platform` (kubeflow), and `nodes`. | +| **Criteria** | Query parameters that define the target environment: `service` (eks/gke/aks/oke/kind/lke), `accelerator` (h100/gb200/b200/a100/l40/rtx-pro-6000), `intent` (training/inference), `os` (ubuntu/rhel/cos/amazonlinux/talos), `platform` (dynamo, kubeflow, nim, slurm), and `nodes`. | | **Overlay** | A recipe metadata file that extends the base recipe for specific environments. Overlays are matched against criteria using asymmetric matching. | | **Mixin** | A composable recipe fragment (`kind: RecipeMixin`) that carries only `constraints` and `componentRefs`. 
Mixins live in `recipes/mixins/`, are excluded from overlay discovery, and are referenced by leaf overlays via `spec.mixins` to share orthogonal content (e.g., OS constraints, platform components) without duplication. See [ADR-005](design/005-overlay-refactoring.md). | | **Bundle** | Deployment artifacts generated from a recipe: Helm values files, Kubernetes manifests, installation scripts, and checksums. | @@ -40,7 +40,7 @@ Previously, administrators relied on static documentation and manual installatio ### The Solution: Automated Approach -AICR replaces manual interpretation of documentation with a **automated approach**. It treats infrastructure configuration as code, providing a deterministic engine that generates the exact artifacts needed for a specific environment. +AICR replaces manual interpretation of documentation with an **automated approach**. It treats infrastructure configuration as code, providing a deterministic engine that generates the exact artifacts needed for a specific environment. **Key Benefits:** 1. **Deterministic & Validated:** The system guarantees that the inputs (your system state) always produce the same valid outputs, tested against NVIDIA hardware. 
@@ -159,6 +159,10 @@ For engineers integrating AICR into CI/CD pipelines, GitOps workflows, or larger | [Kubernetes Deployment](integrator/kubernetes-deployment.md) | Self-hosted API server deployment | | [Recipe Development](integrator/recipe-development.md) | Adding and modifying recipe metadata | | [Validator Extension](integrator/validator-extension.md) | Custom validators via `--data` | +| [AKS GPU Setup](integrator/aks-gpu-setup.md) | Azure Kubernetes Service GPU node setup | +| [EKS Dynamo Networking](integrator/eks-dynamo-networking.md) | EKS networking for Dynamo workloads | +| [GKE TCPXO Networking](integrator/gke-tcpxo-networking.md) | GKE TCPXO networking integration | +| [Talos Integration](integrator/talos-integration.md) | Running AICR on Talos Linux | ## Quick Start diff --git a/docs/conformance/cncf/index.md b/docs/conformance/cncf/index.md index bee8027e2..3bf9840c8 100644 --- a/docs/conformance/cncf/index.md +++ b/docs/conformance/cncf/index.md @@ -37,12 +37,18 @@ docs/conformance/cncf/ ├── pod-autoscaling.md └── cluster-autoscaling.md -pkg/evidence/scripts/ # Evidence collection script + test manifests -├── collect-evidence.sh -└── manifests/ - ├── dra-gpu-test.yaml - ├── gang-scheduling-test.yaml - └── hpa-gpu-test.yaml +pkg/evidence/cncf/ # CNCF evidence collector package +├── collector.go # Feature registry, alias mapping +├── renderer.go # Evidence file rendering +├── requirements.go # CNCF requirement ID mapping +├── templates.go # Evidence templates +├── types.go # Shared types +└── scripts/ # Evidence collection script + test manifests + ├── collect-evidence.sh + └── manifests/ + ├── dra-gpu-test.yaml + ├── gang-scheduling-test.yaml + └── hpa-gpu-test.yaml ``` ## Usage @@ -75,10 +81,13 @@ aicr validate --phase conformance \ --evidence-dir ./evidence --cncf-submission -f dra -f hpa ``` -Alternatively, run the evidence collection script directly: +Alternatively, run the evidence collection script directly. 
Valid section +names are `dra`, `gang`, `secure`, `accelerator-metrics`, `service-metrics`, +`gateway`, `operator`, `hpa`, `cluster-autoscaling`, or `all`: + ```bash -./pkg/evidence/scripts/collect-evidence.sh all -./pkg/evidence/scripts/collect-evidence.sh dra +./pkg/evidence/cncf/scripts/collect-evidence.sh all +./pkg/evidence/cncf/scripts/collect-evidence.sh dra ``` > **Note:** The `--cncf-submission` flag deploys GPU workloads and takes ~5-10 diff --git a/docs/contributor/api-server.md b/docs/contributor/api-server.md index 3dd4ae065..34dc24119 100644 --- a/docs/contributor/api-server.md +++ b/docs/contributor/api-server.md @@ -267,6 +267,7 @@ Supported content types: | `gpu` | AcceleratorType | Alias for accelerator | `gpu=h100` | | `intent` | IntentType | Enum: training, inference, any | `intent=training` | | `os` | OSType | Enum: ubuntu, rhel, cos, amazonlinux, talos, any | `os=ubuntu` | +| `platform` | PlatformType | Enum: dynamo, kubeflow, nim, slurm, any | `platform=kubeflow` | | `nodes` | int | >= 0 | `nodes=8` | ### Recipe Builder: `pkg/recipe/builder.go` @@ -369,7 +370,7 @@ Endpoints `GET /v1/recipe` (query parameters) and `POST /v1/recipe` (criteria bo - `X-RateLimit-Limit` - Total requests allowed per second - `X-RateLimit-Remaining` - Requests remaining in current window - `X-RateLimit-Reset` - Unix timestamp when window resets -- `Cache-Control` - Caching policy (public, max-age=300) +- `Cache-Control` - Caching policy (public, max-age=600) ### Health Check @@ -438,7 +439,9 @@ aicr_panic_recoveries_total 0 "service": "aicrd", "version": "v1.0.0", "routes": [ - "/v1/recipe" + "/v1/recipe", + "/v1/query", + "/v1/bundle" ] } ``` @@ -737,7 +740,7 @@ spec: ### Caching Strategy - **Recipe Store**: Loaded once per process, cached globally -- **Client-Side**: 5-minute cache via Cache-Control header +- **Client-Side**: 10-minute cache via Cache-Control header (`defaults.RecipeCacheTTL`) - **CDN**: Recommended for public-facing deployments ## Error Handling 
@@ -836,7 +839,7 @@ When a request uses a criteria value not in the configured allowlist: - Request ID tracking for distributed tracing - Structured logging for debugging -## Monitoring & Observability +## Monitoring and Observability ### Prometheus Metrics @@ -975,7 +978,7 @@ func TestRecipeHandler(t *testing.T) { - `pkg/serializer` - JSON response formatting - `pkg/logging` - Logging configuration -## Build & Deployment +## Build and Deployment ### Automated CI/CD Pipeline @@ -1002,7 +1005,7 @@ export TAG=$(curl -s https://api.github.com/repos/NVIDIA/aicr/releases/latest | gh attestation verify oci://ghcr.io/nvidia/aicrd:${TAG} --owner nvidia ``` -For detailed CI/CD architecture, see [CONTRIBUTING.md](https://github.com/NVIDIA/aicr/blob/main/CONTRIBUTING.md#github-actions--cicd) and [Architecture Overview](index.md#cicd-architecture). +For detailed CI/CD architecture, see [CONTRIBUTING.md](https://github.com/NVIDIA/aicr/blob/main/CONTRIBUTING.md#github-actions--cicd) and the [Architecture Overview](index.md). ### Local Build Configuration @@ -1070,6 +1073,7 @@ export AICR_ALLOWED_SERVICES=eks,gke - The `any` value is always allowed regardless of allowlist - Both `/v1/recipe` and `/v1/bundle` endpoints enforce allowlists - CLI (`aicr`) is not affected by allowlists +- The `platform` criteria field has no allowlist env var today; all valid platform enum values are accepted (invalid enum values are rejected with HTTP 400 by the criteria parser) ## Extension and Operating Patterns @@ -1097,7 +1101,7 @@ See [API Server: Extension and Operating Patterns](api-server-extending.md). 
- [Google SRE Book](https://sre.google/sre-book/table-of-contents/) - Site reliability engineering - [Release Engineering](https://sre.google/workbook/release-engineering/) - Deployment best practices -### HTTP & APIs +### HTTP and APIs - [HTTP/2 in Go](https://go.dev/blog/h2push) - HTTP/2 server push - [RESTful API Design](https://cloud.google.com/apis/design) - Google Cloud API design guide diff --git a/docs/contributor/cli.md b/docs/contributor/cli.md index 598a5dbea..da3994f14 100644 --- a/docs/contributor/cli.md +++ b/docs/contributor/cli.md @@ -291,8 +291,9 @@ flowchart TD B --> B2["SystemDCollector
(containerd, docker, kubelet)"] B --> B3["KubernetesCollector
(server, images, policies)"] B --> B4["GPUCollector
(nvidia-smi data)"] + B --> B5["NodeTopologyCollector
(cluster-wide taints, labels)"] - B1 & B2 & B3 & B4 --> C[NodeSnapshotter.Measure] + B1 & B2 & B3 & B4 & B5 --> C[NodeSnapshotter.Measure] C --> D["Parallel Collection
(errgroup)"] @@ -301,10 +302,11 @@ flowchart TD D --> D3["Go Routine 3: SystemD
• containerd.service
• docker.service
• kubelet.service"] D --> D4["Go Routine 4: OS Config
• GRUB parameters
• Kernel modules
• Sysctl parameters"] D --> D5["Go Routine 5: GPU
• nvidia-smi properties
• driver, CUDA, etc."] + D --> D6["Go Routine 6: NodeTopology
• cluster-wide taints
• cluster-wide labels"] - D1 & D2 & D3 & D4 & D5 --> E["All goroutines complete
or first error returns"] + D1 & D2 & D3 & D4 & D5 & D6 --> E["All goroutines complete
or first error returns"] - E --> F["Snapshot Structure
kind: Snapshot
apiVersion: aicr.nvidia.com/v1alpha1
measurements: [k8s, systemd, os, gpu]"] + E --> F["Snapshot Structure
kind: Snapshot
apiVersion: aicr.nvidia.com/v1alpha1
measurements: [K8s, SystemD, OS, GPU, NodeTopology]"] F --> G[serializer.NewFileWriterOrStdout] @@ -312,6 +314,8 @@ flowchart TD G --> G2["Output: stdout or file"] ``` +Snapshot measurement types: `K8s`, `SystemD`, `OS`, `GPU`, `NodeTopology` (cluster-wide node taints and labels — see `pkg/measurement/types.go` for the canonical constants). + #### Usage Examples ```bash @@ -979,11 +983,8 @@ INFO generating bundle recipeFilePath=recipe.yaml outputDir=./bundles bundlerTy INFO starting bundle generation bundler_count=1 output_dir=./bundles INFO bundler completed bundler_type=gpu-operator files=5 size_bytes=12458 duration=45ms INFO bundle generation complete summary="Generated 5 files (12 KB) in 45ms. Success: 1/1 bundlers." -INFO bundle generation completed success=1 errors=0 duration_sec=0.045 summary="Generated 5 files (12 KB) in 45ms. Success: 1/1 bundlers." ``` -**Common Errors**: - ## Shared Infrastructure ### Collector Factory Pattern @@ -999,6 +1000,7 @@ type Factory interface { CreateOSCollector() Collector CreateKubernetesCollector() Collector CreateGPUCollector() Collector + CreateNodeTopologyCollector() Collector } ``` @@ -1053,7 +1055,7 @@ type Reading struct { ```bash # Invalid accelerator type $ aicr recipe --accelerator invalid-gpu -[cli] command failed: error=[INTERNAL] error building recipe: [INVALID_REQUEST] error parsing criteria: [INVALID_REQUEST] failed to apply criteria option: [INVALID_REQUEST] failed to parse accelerator type: [INVALID_REQUEST] invalid accelerator type: invalid-gpu exitCode=8 +[cli] command failed: error=[INVALID_REQUEST] error parsing criteria: [INVALID_REQUEST] invalid accelerator type: invalid-gpu exitCode=2 # Unknown output format $ aicr snapshot --format xml @@ -1384,7 +1386,7 @@ spec: effect: NoSchedule containers: - name: aicr - image: ghcr.io/nvidia/aicr:v0.6.4 + image: ghcr.io/nvidia/aicr: # replace with the AICR release you target command: - /bin/sh - -c diff --git a/docs/contributor/index.md 
b/docs/contributor/index.md index 101c41162..920d3c621 100644 --- a/docs/contributor/index.md +++ b/docs/contributor/index.md @@ -39,7 +39,7 @@ AICR is not a deployment engine. It does not: - Orchestrate cross-component dependencies at runtime These responsibilities belong to the deployment tool that consumes -AICR's artifacts (e.g. Helm, Argo CD, Flux). These tools own release reconciliation and lifecycle. +AICR's artifacts (e.g. Helm, Argo CD, Flux). These tools own release reconciliation and lifecycle. A note on terminology: code under `pkg/bundler` includes things we call *deployers*. They are **output adapters** that emit artifacts in @@ -157,6 +157,7 @@ running outside the cluster. Detail per stage lives in the | `pkg/bundler` | Per-component bundle generation, output adapters | [component.md](component.md) | | `pkg/component` | Bundler utilities and test helpers | [component.md](component.md) | | `pkg/collector` | System state collection (parallel via errgroup) | — | +| `pkg/collector/topology` | Cluster-wide node taint/label topology collection | — | | `pkg/snapshotter` | Orchestrates collector execution and aggregates measurements | — | | `pkg/validator` | Constraint evaluation; container-per-validator | [validator.md](validator.md), [validations.md](validations.md) | | `pkg/k8s/client` | Singleton Kubernetes clientset (in-cluster + kubeconfig) | — | diff --git a/docs/contributor/validations.md b/docs/contributor/validations.md index 2fbc9956c..2f06aa33a 100644 --- a/docs/contributor/validations.md +++ b/docs/contributor/validations.md @@ -96,7 +96,7 @@ conditions: - `service`: Kubernetes service (eks, gke, aks, oke, kind, lke) - `accelerator`: GPU type (h100, gb200, b200, a100, l40, rtx-pro-6000) - `os`: Operating system (ubuntu, rhel, cos, amazonlinux, talos) -- `platform`: Platform/framework (kubeflow) +- `platform`: Platform/framework (dynamo, kubeflow, nim, slurm) ### Example: Nodewright Customizations Validations diff --git 
a/docs/contributor/validator.md b/docs/contributor/validator.md index dc43f4ac5..b928f27d7 100644 --- a/docs/contributor/validator.md +++ b/docs/contributor/validator.md @@ -119,6 +119,12 @@ Every validator container must follow this contract: | **stderr** | Debug/progress logs (`slog` output) | Streamed live to user terminal | | `/dev/termination-log` | Failure reason (max 4096 bytes) | CTRF report on failure | +### RBAC + +The validator engine creates a per-run ServiceAccount and ClusterRoleBinding for every `aicr validate` invocation. Both are named `aicr-validator-` where `` is the unique identifier generated at the start of the run (see `pkg/validator/job/rbac.go`). Per-run naming prevents concurrent validation runs from clobbering each other's RBAC and ensures cleanup at run end deletes only the resources owned by that run. + +External tooling that needs to match validator RBAC (e.g., for monitoring or cleanup) should select by the `app.kubernetes.io/name=aicr-validator` label rather than by literal resource name, since the suffix changes every run. 
+ ### Mounted Data The validator engine mounts snapshot and recipe data as ConfigMaps: @@ -147,16 +153,16 @@ The `validators.Context` struct provides all dependencies a check needs: ```go type Context struct { - Ctx context.Context // Parent context with timeout - Cancel context.CancelFunc // Release resources (caller must defer) - Clientset kubernetes.Interface // Typed K8s client - RESTConfig *rest.Config // For exec, port-forward, dynamic client - DynamicClient dynamic.Interface // For CRD access - Snapshot *snapshotter.Snapshot // Captured cluster state - Recipe *recipe.RecipeResult // Recipe with validation config - Namespace string // Validation namespace - NodeSelector map[string]string // User-provided node selector override (nil = use defaults) - Tolerations []corev1.Toleration // User-provided toleration override (nil = use defaults) + Ctx context.Context // Parent context with timeout + Cancel context.CancelFunc // Release resources (caller must defer) + Clientset kubernetes.Interface // Typed K8s client + RESTConfig *rest.Config // For exec, port-forward, dynamic client + DynamicClient dynamic.Interface // For CRD access + Snapshot *snapshotter.Snapshot // Captured cluster state + ValidationInput *v1.ValidationInput // Validation specification (config + context) + Namespace string // Validation namespace + NodeSelector map[string]string // User-provided node selector override (nil = use defaults) + Tolerations []corev1.Toleration // User-provided toleration override (nil = use defaults) } ``` @@ -199,8 +205,8 @@ pods, err := ctx.Clientset.CoreV1().Pods(ns).List(subCtx, opts) ```go func checkFeatureX(ctx *validators.Context) error { - if ctx.Recipe.Validation == nil { - return validators.Skip("no validation section in recipe") + if ctx.ValidationInput == nil { + return validators.Skip("no validation input provided") } // ... actual check logic ... 
return nil @@ -274,13 +280,17 @@ validation: ## Performance Validators -Two performance checks ship today, both registered in `validators/performance/main.go`: +Four performance checks ship today (see [`recipes/validators/catalog.yaml`](https://github.com/NVIDIA/aicr/blob/main/recipes/validators/catalog.yaml) for the authoritative list), registered in `validators/performance/main.go`: | Check | Intent | Workload | Constraints | |-------|--------|----------|-------------| | `nccl-all-reduce-bw` | training | NCCL `all_reduce_perf` under a Kubeflow `TrainJob` | `nccl-all-reduce-bw >= N GB/s` | +| `nccl-all-reduce-bw-net` | training | NCCL `all_reduce_perf` over network fabric | `nccl-all-reduce-bw-net >= N GB/s` | +| `nccl-all-reduce-bw-nvls` | training | NCCL `all_reduce_perf` with NVLink SHARP | `nccl-all-reduce-bw-nvls >= N GB/s` | | `inference-perf` | inference+Dynamo | `DynamoGraphDeployment` (vLLM, Qwen/Qwen3-0.6B) + AIPerf Job | `inference-throughput >= N tok/s`, `inference-ttft-p99 <= N ms` | +> **Constraint-name contract.** Each NCCL variant looks up a constraint with the *exact* same name as the check (`constraintNameForVariant` in `validators/performance/nccl_all_reduce_bw.go`). A recipe that runs the `-net` or `-nvls` variant **must** declare a same-named constraint; a generic `nccl-all-reduce-bw` constraint only satisfies the legacy default variant and the variant checks will Skip when it's the only one present. + Both follow a consistent lifecycle: 1. **Deploy** a fresh benchmark workload. `inference-perf` always provisions its own `DynamoGraphDeployment` into a per-run namespace (`aicr-inference-perf-`) derived from `AICR_RUN_ID`, so two concurrent runs cannot collide and a prior run's leftovers cannot be silently adopted. An earlier design sketch had a "discover existing frontend" path — it was intentionally dropped because it admitted ambiguity about which service was being benchmarked on shared clusters. 
diff --git a/docs/integrator/aks-gpu-setup.md b/docs/integrator/aks-gpu-setup.md index ff270902a..9d079bbf7 100644 --- a/docs/integrator/aks-gpu-setup.md +++ b/docs/integrator/aks-gpu-setup.md @@ -71,7 +71,7 @@ aicr bundle -r recipe.yaml \ --set dradriver:resources.gpus.enabled=false ``` -### Device Plugin vs DRA (Important) +### Device Plugin vs DRA Both device-plugin and DRA are enabled by default, but **only one should be used per node**. Using both concurrently causes GPU over-admission — both systems diff --git a/docs/integrator/automation.md b/docs/integrator/automation.md index fdd5d074b..ccb8fe7ff 100644 --- a/docs/integrator/automation.md +++ b/docs/integrator/automation.md @@ -328,7 +328,7 @@ spec: # Helm chart from upstream - repoURL: https://helm.ngc.nvidia.com/nvidia chart: gpu-operator - targetRevision: v25.3.3 + targetRevision: v26.3.1 helm: valueFiles: - $values/gpu-operator/values.yaml diff --git a/docs/integrator/data-flow.md b/docs/integrator/data-flow.md index f48538571..beea1e3aa 100644 --- a/docs/integrator/data-flow.md +++ b/docs/integrator/data-flow.md @@ -370,7 +370,7 @@ Constraints use fully qualified paths: `{Type}.{Subtype}.{Key}` | Path | Description | |------|-------------| | `K8s.server.version` | Kubernetes server version | -| `OS.release.ID` | Operating system family (ubuntu, rhel) | +| `OS.release.ID` | Operating system family (ubuntu, rhel, cos, amazonlinux, talos) | | `OS.release.VERSION_ID` | OS version (22.04, 24.04) | | `OS.sysctl./proc/sys/kernel/osrelease` | Kernel version | | `GPU.driver.version` | NVIDIA driver version | @@ -711,7 +711,7 @@ spec: sources: # Helm chart from upstream - repoURL: https://helm.ngc.nvidia.com/nvidia - targetRevision: v25.3.3 + targetRevision: v26.3.1 chart: gpu-operator helm: valueFiles: diff --git a/docs/integrator/index.md b/docs/integrator/index.md index 1e75bcccf..a47caf947 100644 --- a/docs/integrator/index.md +++ b/docs/integrator/index.md @@ -20,8 +20,10 @@ This section is for integrators 
who: | [EKS Dynamo Networking](eks-dynamo-networking.md) | Security group prerequisites for Dynamo overlays on EKS | | [GKE TCPXO Networking](gke-tcpxo-networking.md) | GPUDirect TCPXO prerequisites for GKE training overlays | | [AKS GPU Setup](aks-gpu-setup.md) | AKS prerequisites: Kubernetes 1.34+ (DRA GA), GPU driver setup, DRA configuration | +| [Talos Integration](talos-integration.md) | Running AICR on Talos Linux | | [Recipe Development](recipe-development.md) | Creating and modifying recipe metadata for custom environments | | [Validator Extension](validator-extension.md) | Adding custom validators and overriding embedded ones via `--data` | +| [NodeWright Component](components/nodewright.md) | NodeWright component reference and configuration | ## Quick Start diff --git a/docs/integrator/recipe-development.md b/docs/integrator/recipe-development.md index a2520dcb9..752564708 100644 --- a/docs/integrator/recipe-development.md +++ b/docs/integrator/recipe-development.md @@ -23,7 +23,7 @@ spec: intent: training componentRefs: - name: gpu-operator - version: v25.3.4 + version: v26.3.1 valuesFile: components/gpu-operator/eks-gb200-training.yaml overrides: driver: @@ -33,7 +33,7 @@ spec: **3. Run tests** ([details](#testing-and-validation)) ```bash make test # Validates schema, criteria, references, constraints -make qualify # Includes end to end tests before submitting +make qualify # Includes end-to-end tests before submitting ``` **4. 
Open PR** ([best practices](#best-practices)) @@ -153,7 +153,7 @@ Only use this pattern when the content is truly uniform across the wildcard dime componentRefs: - name: gpu-operator type: Helm - version: v25.3.4 + version: v26.3.1 valuesFile: components/gpu-operator/values.yaml overrides: driver: @@ -337,7 +337,7 @@ spec: intent: training componentRefs: - name: gpu-operator - version: v25.3.4 + version: v26.3.1 valuesFile: components/gpu-operator/eks-gb200-training.yaml ``` @@ -348,7 +348,7 @@ spec: # Update component version componentRefs: - name: gpu-operator - version: v25.3.4 # Changed from v25.3.3 + version: v26.3.1 # Changed from v26.3.0 ``` **Adding components:** diff --git a/docs/user/agent-deployment.md b/docs/user/agent-deployment.md index fa2279335..0f809c427 100644 --- a/docs/user/agent-deployment.md +++ b/docs/user/agent-deployment.md @@ -44,7 +44,7 @@ metadata: labels: app.kubernetes.io/name: aicr app.kubernetes.io/component: snapshot - app.kubernetes.io/version: v0.17.0 + app.kubernetes.io/version: data: snapshot.yaml: | # Complete snapshot YAML apiVersion: aicr.nvidia.com/v1alpha1 @@ -185,8 +185,8 @@ aicr snapshot --image ghcr.io/nvidia/aicr:v0.8.0 ``` **Finding versions:** -- [GitHub Releases](https://github.com/nvidia/aicr/releases) -- Container registry: [ghcr.io/nvidia/aicr](https://github.com/nvidia/aicr/pkgs/container/aicr) +- [GitHub Releases](https://github.com/NVIDIA/aicr/releases) +- Container registry: [ghcr.io/nvidia/aicr](https://github.com/NVIDIA/aicr/pkgs/container/aicr) ## Post-Deployment diff --git a/docs/user/api-reference.md b/docs/user/api-reference.md index 16eb016e4..efb4454b7 100644 --- a/docs/user/api-reference.md +++ b/docs/user/api-reference.md @@ -343,27 +343,30 @@ Bundler names correspond to component names in [`recipes/registry.yaml`](https:/ | Component | Description | |-----------|-------------| -| `gpu-operator` | NVIDIA GPU Operator — driver and runtime lifecycle | -| `network-operator` | NVIDIA Network Operator — 
RDMA, SR-IOV, host networking | -| `gke-nccl-tcpxo` | NCCL TCPxO network plugin for optimized collective communication (GKE) | +| `agentgateway` | Kubernetes Gateway API implementation for AI/ML inference (InferencePool routing) | +| `agentgateway-crds` | Kubernetes Gateway API CRDs for AI/ML inference (Gateway API + Inference Extension) | +| `aws-ebs-csi-driver` | Amazon EBS CSI driver (EKS) | | `aws-efa` | AWS Elastic Fabric Adapter device plugin (EKS) | | `cert-manager` | TLS certificate management | -| `nodewright-operator` | OS-level node tuning and kernel configuration | -| `nodewright-customizations` | Environment-specific node tuning profiles | -| `nvsentinel` | GPU health monitoring and automated remediation | -| `nvidia-dra-driver-gpu` | Dynamic Resource Allocation driver for GPUs | -| `kube-prometheus-stack` | Prometheus, Grafana, Alertmanager monitoring stack | -| `prometheus-adapter` | Custom metrics for HPA scaling | -| `aws-ebs-csi-driver` | Amazon EBS CSI driver (EKS) | -| `k8s-ephemeral-storage-metrics` | Ephemeral storage usage metrics | -| `kai-scheduler` | DRA-aware gang scheduler with topology-aware placement | -| `grove` | Dynamo pod lifecycle management | | `dynamo-platform` | NVIDIA Dynamo inference serving platform | -| `agentgateway-crds` | Kubernetes Gateway API CRDs for AI/ML inference (Gateway API + Inference Extension) | -| `agentgateway` | Kubernetes Gateway API implementation for AI/ML inference (InferencePool routing) | +| `gke-nccl-tcpxo` | NCCL TCPxO network plugin for optimized collective communication (GKE) | +| `gpu-operator` | NVIDIA GPU Operator — driver and runtime lifecycle | +| `grove` | Dynamo pod lifecycle management | +| `k8s-ephemeral-storage-metrics` | Ephemeral storage usage metrics | | `k8s-nim-operator` | NVIDIA NIM Operator for inference microservice deployments | -| `kueue` | Kubernetes-native job queuing for batch and AI workloads | +| `kai-scheduler` | DRA-aware gang scheduler with topology-aware placement | +| 
`kube-prometheus-stack` | Prometheus, Grafana, Alertmanager monitoring stack | | `kubeflow-trainer` | Kubeflow Training Operator for distributed training | +| `kueue` | Kubernetes-native job queuing for batch and AI workloads | +| `network-operator` | NVIDIA Network Operator — RDMA, SR-IOV, host networking | +| `nfd` | Node Feature Discovery — labels nodes with hardware features; publishes per-node `NodeResourceTopology` CRDs on production GPU recipes | +| `nodewright-customizations` | Environment-specific node tuning profiles | +| `nodewright-operator` | OS-level node tuning and kernel configuration | +| `nvidia-dra-driver-gpu` | Dynamic Resource Allocation driver for GPUs | +| `nvsentinel` | GPU health monitoring and automated remediation | +| `prometheus-adapter` | Custom metrics for HPA scaling | +| `slinky-slurm-operator` | SchedMD Slinky Slurm operator and admission webhook | +| `slinky-slurm-operator-crds` | CRDs for the SchedMD Slinky Slurm operator (`slinky.slurm.net`) | **Examples:** diff --git a/docs/user/cli-reference.md b/docs/user/cli-reference.md index 82317fe0b..aa7b859a5 100644 --- a/docs/user/cli-reference.md +++ b/docs/user/cli-reference.md @@ -232,7 +232,7 @@ metadata: labels: app.kubernetes.io/name: aicr app.kubernetes.io/component: snapshot - app.kubernetes.io/version: v0.17.0 + app.kubernetes.io/version: data: snapshot.yaml: | # Full snapshot content @@ -593,7 +593,7 @@ aicr validate [flags] | `--no-cluster` | | bool | false | Skip cluster access (test mode): skips RBAC and Job deployment, reports checks as skipped | | `--evidence-dir` | | string | | Directory to write conformance evidence artifacts | | `--cncf-submission` | | bool | false | Generate CNCF conformance submission artifacts | -| `--feature` | `-f` | string[] | | Feature flags for validation (repeatable) | +| `--feature` | `-f` | string[] | | CNCF evidence-collection feature(s) to scope (repeatable). 
Valid names: `dra-support`, `gang-scheduling`, `secure-access`, `accelerator-metrics`, `ai-service-metrics`, `inference-gateway`, `robust-operator`, `pod-autoscaling`, `cluster-autoscaling`. Empty selects all features. | | `--emit-attestation` | | string | | Directory to write a recipe-evidence v1 attestation bundle (signed when `--push` is set). See [ADR-007](../design/007-recipe-evidence.md). | | `--bom` | | string | | Path to a CycloneDX BOM (`bom.cdx.json`) to embed. Optional with `--emit-attestation`; when omitted, aicr synthesizes a recipe-bound BOM from the recipe's component refs + validator catalog images. Pass `make bom`'s output for an exhaustive BOM. | | `--push` | | string | | OCI registry reference (e.g. `ghcr.io/myorg/aicr-evidence`) to push the signed summary bundle to. Triggers Sigstore keyless signing via the precedence chain documented under `--identity-token`. | @@ -928,6 +928,7 @@ Results are output in CTRF (Common Test Report Format) — an industry-standard |------|-------------| | `0` | All checks passed | | `2` | Invalid input (bad flags, missing recipe) | +| `5` | Timeout (validator section or context deadline exceeded) | | `8` | One or more checks failed (when `--fail-on-error` is set) | --- @@ -1118,7 +1119,7 @@ This results in: - Each component creates a subdirectory in the output directory - Components are deployed in the order specified by `deploymentOrder` in the recipe -#### Storage Class (`--storage-class`) +#### Storage Class The `--storage-class` flag injects a Kubernetes StorageClass name into components at bundle time. StorageClass is a cluster infrastructure detail — the right value depends on what the target cluster has provisioned, not on the recipe. @@ -1140,7 +1141,7 @@ aicr bundle --recipe recipe.yaml \ When `--storage-class` is not set, any `storageClassName` values already defined in the recipe overlays are preserved as defaults. 
When it is set, `--set :=` on the same path still wins — `--storage-class` only fills in paths that were not explicitly overridden. -#### Deployment Methods (`--deployer`) +#### Deployment Methods The `--deployer` flag controls how deployment artifacts are generated: @@ -1161,7 +1162,7 @@ All deployers respect the `deploymentOrder` field from the recipe, ensuring comp - **Argo CD**: Uses `argocd.argoproj.io/sync-wave` annotation (0 = first, 1 = second, etc.) - **Flux**: Uses `dependsOn` references in HelmRelease/Kustomization CRs (each component depends on its predecessor) -#### Value Overrides (`--set`) +#### Value Overrides Override any value in the generated bundle files using dot notation: @@ -1766,7 +1767,7 @@ The deploy script installs components in the order specified by `deploymentOrder | `--best-effort` | Continue past individual component failures instead of exiting | | `--retries N` | Retry failed helm/kubectl operations N times with exponential backoff (default: 5) | -Unknown flags are rejected with an error to catch typos (e.g., `--best-effort`). +Unknown flags are rejected with an error to catch typos (e.g., `--bes-effort` or `--retires N`). > **Note on install completion vs. workload readiness.** By default, `deploy.sh` waits on Helm chart readiness where AICR uses `helm --wait`. Some components are intentionally installed without Helm chart-level waiting, and the script does not wait for bundle-level workload readiness such as Nodewright node tuning, GPU operator operand rollout (driver, toolkit, device-plugin DaemonSets), or NVIDIA DRA kubelet plugin registration. Those continue asynchronously after the script exits. When `--best-effort` is used, the script may also finish with non-fatal component failures; check warning lines and logs before treating the install/apply pass as fully successful. `--no-wait` only skips the Helm chart-level wait where AICR uses it; it does not affect bundle-level convergence. 
@@ -2251,8 +2252,8 @@ aicr recipe --os ubuntu --gpu h100 # Verify recipe file cat recipe.yaml -# Check bundler is valid -aicr bundle --help # Shows available bundlers +# List available flags +aicr bundle --help # Run with debug aicr --debug bundle -r recipe.yaml diff --git a/docs/user/component-catalog.md b/docs/user/component-catalog.md index b6392ee84..56a790c20 100644 --- a/docs/user/component-catalog.md +++ b/docs/user/component-catalog.md @@ -2,7 +2,7 @@ AICR recipes are composed of components — the individual software packages that make up a GPU-accelerated Kubernetes runtime. This page lists every component that can appear in a recipe. -> ***Note:*** Components are included as appropriate in recipes. Not every component listed here will appear in a recipe. +> **Note:** Components are included as appropriate in recipes. Not every component listed here will appear in a recipe. The source of truth is [`recipes/registry.yaml`](https://github.com/NVIDIA/aicr/blob/main/recipes/registry.yaml). Each entry in the registry defines the component's Helm chart (or Kustomize source), default version, namespace, and node scheduling configuration. If a component is not listed there, it cannot appear in a recipe. diff --git a/docs/user/installation.md b/docs/user/installation.md index f28d11d04..c8a39559f 100644 --- a/docs/user/installation.md +++ b/docs/user/installation.md @@ -50,7 +50,7 @@ This script: 1. 
**Download the latest release** -Visit the [releases page](https://github.com/nvidia/aicr/releases/latest) and download the appropriate binary for your platform: +Visit the [releases page](https://github.com/NVIDIA/aicr/releases/latest) and download the appropriate binary for your platform: - **macOS ARM64** (M1/M2/M3): `aicr__darwin_arm64.tar.gz` - **macOS Intel**: `aicr__darwin_amd64.tar.gz` diff --git a/docs/user/validation.md b/docs/user/validation.md index 02d203a02..11342a455 100644 --- a/docs/user/validation.md +++ b/docs/user/validation.md @@ -208,6 +208,39 @@ aicr validate --recipe recipe.yaml --snapshot snapshot.yaml Phases run sequentially. If any phase fails, subsequent phases are skipped. +## Scoping CNCF submission evidence to specific features + +The `--feature` flag scopes which CNCF AI conformance features get behavioral +evidence collected. It only applies to the CNCF-submission evidence collector +and is rejected by the CLI unless `--cncf-submission` is also set (which in +turn requires `--evidence-dir`). It does **not** scope the regular +`--phase conformance` validator run — that one always evaluates every check +defined in the recipe. + +```bash +aicr validate --recipe recipe.yaml --snapshot snapshot.yaml \ + --phase conformance \ + --cncf-submission \ + --evidence-dir ./evidence \ + --feature dra-support --feature gang-scheduling +``` + +Empty `--feature` (the default) collects evidence for every feature. 
+ +Valid feature names (from `pkg/evidence/cncf/collector.go`): + +| Name | What it checks | +|------|----------------| +| `dra-support` | Dynamic Resource Allocation driver and ResourceSlices | +| `gang-scheduling` | Gang-scheduler presence and PodGroup support | +| `secure-access` | Cluster authn/authz posture for AI workloads | +| `accelerator-metrics` | GPU metrics exporter and Prometheus scrape config | +| `ai-service-metrics` | Inference-service metrics via custom-metrics API | +| `inference-gateway` | Gateway API + Inference Extension installation | +| `robust-operator` | Operator readiness and leader-election posture | +| `pod-autoscaling` | HPA / custom-metrics-driven pod autoscaling | +| `cluster-autoscaling` | Karpenter (preferred) or EKS managed node-group autoscaling fallback | + ## Input modes Snapshot and recipe can come from a file, an HTTPS URL, or a Kubernetes ConfigMap: diff --git a/examples/recipes/README.md b/examples/recipes/README.md index cdf3bd732..760572522 100644 --- a/examples/recipes/README.md +++ b/examples/recipes/README.md @@ -18,7 +18,7 @@ aicr bundle --recipe eks-gb200-ubuntu-training-with-validation.yaml --output ./b # With value overrides aicr bundle \ --recipe eks-gb200-ubuntu-training-with-validation.yaml \ - --set gpuoperator:driver.version=580.82.07 \ + --set gpuoperator:driver.version=580.105.08 \ --output ./bundles # Validate against snapshot (default phase: readiness) diff --git a/examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml b/examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml index fabe6701d..8b8bf032f 100644 --- a/examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml +++ b/examples/recipes/eks-gb200-ubuntu-training-with-validation.yaml @@ -18,7 +18,7 @@ kind: RecipeResult apiVersion: aicr.nvidia.com/v1alpha1 metadata: - version: v0.26.7-next + version: dev appliedOverlays: - base - eks @@ -61,18 +61,27 @@ validation: deployment: constraints: - name: gpu-operator.version - value: 
"== v25.10.1" + value: "== v26.3.1" severity: warning - remediation: "Update GPU Operator to v25.10.1" + remediation: "Update GPU Operator to v26.3.1" checks: - expected-resources - # Performance phase: Validate system performance + # Performance phase: Validate system performance. + # + # Each NCCL variant requires an exact-named constraint — a generic + # `nccl-all-reduce-bw` only satisfies the legacy default variant, so the + # `-net` and `-nvls` variants Skip when their named constraints are absent. + # Thresholds below mirror recipes/overlays/gb200-eks-training.yaml. performance: - infrastructure: nccl-doctor checks: - - nccl-bandwidth-test - - fabric-health-check + - nccl-all-reduce-bw-net + - nccl-all-reduce-bw-nvls + constraints: + - name: nccl-all-reduce-bw-net + value: ">= 40" + - name: nccl-all-reduce-bw-nvls + value: ">= 500" # Conformance phase: Validate workload-specific requirements # (not configured in this example - will be skipped) @@ -91,7 +100,7 @@ componentRefs: chart: gpu-operator type: Helm source: https://helm.ngc.nvidia.com/nvidia - version: v25.3.3 + version: v26.3.1 valuesFile: components/gpu-operator/values-eks-training.yaml expectedResources: - kind: Deployment @@ -121,7 +130,7 @@ componentRefs: chart: nvidia-dra-driver-gpu type: Helm source: https://helm.ngc.nvidia.com/nvidia - version: 25.8.1 + version: 25.12.0 valuesFile: components/nvidia-dra-driver-gpu/values.yaml dependencyRefs: - gpu-operator @@ -146,6 +155,14 @@ componentRefs: overrides: customization: ubuntu + - name: kube-prometheus-stack + namespace: monitoring + chart: kube-prometheus-stack + type: Helm + source: https://prometheus-community.github.io/helm-charts + version: 84.4.0 + valuesFile: components/kube-prometheus-stack/values.yaml + deploymentOrder: - cert-manager - kube-prometheus-stack diff --git a/examples/recipes/eks-training.yaml b/examples/recipes/eks-training.yaml index 7495c450f..cd9387e58 100644 --- a/examples/recipes/eks-training.yaml +++ 
b/examples/recipes/eks-training.yaml @@ -15,7 +15,7 @@ kind: RecipeResult apiVersion: aicr.nvidia.com/v1alpha1 metadata: - version: v0.26.7-next + version: dev appliedOverlays: - base - eks @@ -61,6 +61,13 @@ componentRefs: source: https://helm.ngc.nvidia.com/nvidia/skyhook version: v0.15.1 valuesFile: components/nodewright-operator/values.yaml + - name: kube-prometheus-stack + namespace: monitoring + chart: kube-prometheus-stack + type: Helm + source: https://prometheus-community.github.io/helm-charts + version: 84.4.0 + valuesFile: components/kube-prometheus-stack/values.yaml deploymentOrder: - cert-manager - kube-prometheus-stack diff --git a/examples/recipes/kind.yaml b/examples/recipes/kind.yaml index 4b56f5da6..39a117fb1 100644 --- a/examples/recipes/kind.yaml +++ b/examples/recipes/kind.yaml @@ -15,7 +15,7 @@ kind: RecipeResult apiVersion: aicr.nvidia.com/v1alpha1 metadata: - version: v0.26.7-next + version: dev criteria: service: kind accelerator: any