Merged
1 change: 0 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,6 @@ brew install aicr
# Or use the install script
curl -sfL https://raw.githubusercontent.com/NVIDIA/aicr/main/install | bash -s --


# Capture your cluster's current state
aicr snapshot --output snapshot.yaml

Expand Down
2 changes: 2 additions & 0 deletions demos/README.md
Expand Up @@ -9,7 +9,9 @@ Runbooks for testing and demonstrating AICR end-to-end workflows on live cluster
| [cuj1-eks.md](cuj1-eks.md) | CUJ1 - EKS cluster setup |
| [cuj1-gke.md](cuj1-gke.md) | CUJ1 - GKE cluster setup |
| [cuj2.md](cuj2.md) | CUJ2 - EKS inference with Dynamo |
| [cuj2-demo.md](cuj2-demo.md) | CUJ2 - Annotated demo walkthrough (training vs inference) |
| [cuj2-eks.md](cuj2-eks.md) | CUJ2 - EKS variant |
| [cuj2-gke.md](cuj2-gke.md) | CUJ2 - GKE variant |
| [e2e.md](e2e.md) | End-to-end CLI demo |
| [valid.md](valid.md) | Validation demo |
| [data.md](data.md) | External data directory demo |
Expand Down
11 changes: 9 additions & 2 deletions demos/cuj1-eks.md
Expand Up @@ -3,7 +3,7 @@
## Assumptions

* Assuming user is already authenticated to an EKS cluster with 2+ H100 nodes.
* Values used in `--accelerated-node-selector`, `--accelerated-node-toleration`, and `--system-node-toleration` flags are only for example purposes. Assuming user will update these to match their cluster.
* Values used in `--accelerated-node-selector`, `--accelerated-node-toleration`, `--system-node-selector`, and `--system-node-toleration` flags are only for example purposes. Assuming user will update these to match their cluster.

## Snapshot

Expand Down Expand Up @@ -41,13 +41,20 @@ aicr validate \

## Generate Bundle

The selector and toleration values below mirror AICR's reference EKS clusters
(`aicr1`, `aicr2`): nodes carry the label `nodeGroup={system-worker,gpu-worker}`
and the unrelated taint key `dedicated={system-workload,worker-workload}:{NoSchedule,NoExecute}`.
The selector key (`nodeGroup`) and toleration key (`dedicated`) intentionally
differ — the label drives scheduling targeting and the taint drives admission.
Adjust both pairs to your cluster's actual labels and taints.
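
To find the values to substitute, the labels and taints on the nodes can be listed with `kubectl`. The following is a minimal sketch, assuming `kubectl` already points at the target cluster. The helper names `list_node_groups` and `list_node_taints` are hypothetical and not part of AICR.

```shell
# Hypothetical helpers (not part of AICR) for inspecting node labels and taints.

# List each node with the value of one label as an extra column.
# Defaults to the nodeGroup label used by the reference clusters.
list_node_groups() {
  kubectl get nodes -L "${1:-nodeGroup}"
}

# List each node with the keys and effects of its taints.
list_node_taints() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,TAINT_KEYS:.spec.taints[*].key,EFFECTS:.spec.taints[*].effect'
}
```

The label column feeds the `--*-node-selector` flags, and the taint keys and effects feed the `--*-node-toleration` flags.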

```shell
aicr bundle \
--recipe recipe.yaml \
--accelerated-node-selector nodeGroup=gpu-worker \
--accelerated-node-toleration dedicated=worker-workload:NoSchedule \
--accelerated-node-toleration dedicated=worker-workload:NoExecute \
--system-node-selector dedicated=system-workload \
--system-node-selector nodeGroup=system-worker \
--system-node-toleration dedicated=system-workload:NoSchedule \
--system-node-toleration dedicated=system-workload:NoExecute \
--storage-class <storage-class> \
Expand Down
6 changes: 3 additions & 3 deletions demos/cuj2-demo.md
Expand Up @@ -46,8 +46,8 @@
│ ├── agentgateway-crds/ (Gateway API + inference CRDs) │
│ ├── agentgateway/ (inference gateway controller) │
│ ├── nvsentinel/ (security/compliance) │
│ ├── nodewright-operator/ (node configuration) │
│ ├── nodewright-customizations/ (H100 tuning)
│ ├── nodewright-operator/ (node configuration) │
│ ├── nodewright-customizations/ (H100 tuning) │
│ ├── aws-ebs-csi-driver/ (EBS storage) │
│ ├── aws-efa/ (Elastic Fabric Adapter) │
│ ├── dynamo-crds/ (Dynamo CRDs) │
Expand Down Expand Up @@ -125,7 +125,7 @@
│ h100-eks-ubuntu-training.yaml │ (Ubuntu constraints) │
│ (Ubuntu constraints) │ │ │
│ │ │ h100-eks-ubuntu-inference-dynamo │
│ h100-eks-ubuntu-training-kubeflow │ ├── gpu-operator (v25.3.4, CDI) │
│ h100-eks-ubuntu-training-kubeflow │ ├── gpu-operator (v26.3.1, CDI) │
│ └── kubeflow-trainer ◀── NEW │ ├── nvidia-dra-driver (gpuRes)◀─NEW│
│ │ ├── dynamo-crds ◀─ NEW │
│ │ └── dynamo-platform ◀─ NEW │
Expand Down
4 changes: 2 additions & 2 deletions demos/cuj2.md
Expand Up @@ -33,11 +33,11 @@ aicr bundle \
--recipe recipe.yaml \
--output bundle \
--accelerated-node-selector [key]=[value] \
--accelerated-node-toleration [key]=[value]:[operation] \
--accelerated-node-toleration [key]=[value]:[effect] \
--storage-class [storage-class]
```

Replace the values for `--accelerated-node-selector` and `--accelerated-node-toleration` with the appropriate ones to match your gpu pool(s). You do not want optimizations and inference workloads to run across all nodes. Both options allow for comma delimination to supply multiple values. See the [aicr bundle](../docs/user/cli-reference.md#aicr-bundle) section for more information.
Replace the values for `--accelerated-node-selector` and `--accelerated-node-toleration` with the appropriate ones to match your GPU pool(s). You do not want optimizations and inference workloads to run across all nodes. Both options accept a comma-separated list to supply multiple values. See the [aicr bundle](../docs/user/cli-reference.md#aicr-bundle) section for more information.
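
A malformed toleration value is easier to catch before calling `aicr bundle`. The sketch below checks a string against the `[key]=[value]:[effect]` shape shown above, assuming the effect is one of the three Kubernetes taint effects. `check_toleration` is a hypothetical helper, and AICR's own parser may accept other forms.

```shell
# Hypothetical pre-flight check (not part of AICR): succeeds only for
# key=value:effect strings whose effect is a Kubernetes taint effect.
check_toleration() {
  case "$1" in
    ?*=?*:NoSchedule | ?*=?*:NoExecute | ?*=?*:PreferNoSchedule) return 0 ;;
    *) printf 'unexpected toleration: %s\n' "$1" >&2; return 1 ;;
  esac
}

check_toleration dedicated=worker-workload:NoSchedule && echo ok
```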

Set `--storage-class` to a StorageClass that exists on the target cluster (check with `kubectl get storageclass`). Cloud overlays configure `kube-prometheus-stack` with a `volumeClaimTemplate` but no `storageClassName`; without this flag the PVC falls to the cluster's default StorageClass, and if no default is configured the deploy hangs on a Pending PVC.
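
Whether the cluster has a default StorageClass can be read from the `kubectl get storageclass` table, where the default class is printed with a `(default)` marker after its name. A minimal sketch, assuming that table layout; `default_storage_class` is a hypothetical helper name.

```shell
# Hypothetical helper: reads the `kubectl get storageclass` table on stdin,
# prints the name of each class marked "(default)", and returns non-zero
# when no class carries the marker.
default_storage_class() {
  awk 'NR > 1 && $2 == "(default)" { print $1; found = 1 } END { exit !found }'
}

# Usage:
#   kubectl get storageclass | default_storage_class
```

A non-zero exit status here corresponds to the case described above, where omitting `--storage-class` leaves the PVC Pending.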

Expand Down
2 changes: 1 addition & 1 deletion demos/data.md
Expand Up @@ -188,7 +188,7 @@ Order in `dependencyRefs`:
2. `gpu-operator` (depends on cert-manager)
3. Other components...

> Asymmetric rule matching based on [Kahn's algorithm](https://www.geeksforgeeks.org/dsa/topological-sorting-indegree-based-solution/).
> Dependency-driven ordering based on [Kahn's algorithm](https://www.geeksforgeeks.org/dsa/topological-sorting-indegree-based-solution/) for topological sort.
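
For readers unfamiliar with Kahn's algorithm, here is a minimal sketch in Bash. It is not AICR's implementation: the `component [dependency]` input format and the `kahn_order` and `example-component` names are illustrative assumptions. The idea is to emit components that have no unmet dependencies, then release whatever was waiting on them.

```bash
# Kahn's algorithm sketch (requires Bash 4+ for associative arrays).
# stdin: one "component [dependency]" pair per line. This input format is
# illustrative and is not AICR's recipe schema.
kahn_order() {
  local -A indegree=() dependents=()
  local -a queue=()
  local name dep node emitted=0

  # Build in-degree counts and reverse edges (dependency -> dependents).
  while read -r name dep; do
    if [ -z "$name" ]; then continue; fi
    if [ -z "${indegree[$name]+set}" ]; then indegree[$name]=0; fi
    if [ -n "$dep" ]; then
      if [ -z "${indegree[$dep]+set}" ]; then indegree[$dep]=0; fi
      indegree[$name]=$(( ${indegree[$name]} + 1 ))
      dependents[$dep]+="$name "
    fi
  done

  # Seed the queue with components that depend on nothing (sorted for stable output).
  while read -r name; do
    if [ -n "$name" ] && [ "${indegree[$name]}" -eq 0 ]; then queue+=("$name"); fi
  done < <(printf '%s\n' "${!indegree[@]}" | sort)

  # Emit a ready component, then release the components that waited on it.
  while [ "${#queue[@]}" -gt 0 ]; do
    node=${queue[0]}
    queue=("${queue[@]:1}")
    printf '%s\n' "$node"
    emitted=$(( emitted + 1 ))
    for name in ${dependents[$node]:-}; do
      indegree[$name]=$(( ${indegree[$name]} - 1 ))
      if [ "${indegree[$name]}" -eq 0 ]; then queue+=("$name"); fi
    done
  done

  # Anything left unemitted sits on a dependency cycle.
  [ "$emitted" -eq "${#indegree[@]}" ]
}

printf '%s\n' 'gpu-operator cert-manager' 'example-component gpu-operator' | kahn_order
```

With this input the function prints `cert-manager`, then `gpu-operator`, then the hypothetical `example-component`, and it returns a non-zero status when the input contains a cycle.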

## API Access

Expand Down
8 changes: 4 additions & 4 deletions demos/e2e.md
Expand Up @@ -31,7 +31,7 @@ aicr recipe --service eks --accelerator gb200 | yq .
From criteria file:

```shell
cat > /tmp/criteria.yaml << 'EOF'
cat > "${TMPDIR:-/tmp}/criteria.yaml" << 'EOF'
kind: RecipeCriteria
apiVersion: aicr.nvidia.com/v1alpha1
metadata:
Expand All @@ -48,15 +48,15 @@ EOF
Generate recipe from criteria file

```shell
aicr recipe --criteria /tmp/criteria.yaml --output recipe.yaml
aicr recipe --criteria "${TMPDIR:-/tmp}/criteria.yaml" --output recipe.yaml
```
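
The `${TMPDIR:-/tmp}` expansion used in these commands resolves to `$TMPDIR` when that variable is set and non-empty, and to `/tmp` otherwise. A small sketch of the behavior; `criteria_path` is a hypothetical helper, not an AICR command.

```shell
# Hypothetical helper: resolve the criteria file path with the same
# expansion, using $TMPDIR when set and non-empty and /tmp otherwise.
criteria_path() {
  printf '%s/criteria.yaml\n' "${TMPDIR:-/tmp}"
}

criteria_path
```

On systems where `TMPDIR` ends with a slash, the result contains a doubled slash, which path resolution treats as a single separator.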

> Metadata overlays: `components=11 overlays=7`

CLI flags override criteria file values

```shell
aicr recipe --criteria /tmp/criteria.yaml --service gke | yq .
aicr recipe --criteria "${TMPDIR:-/tmp}/criteria.yaml" --service gke | yq .
```

> Metadata overlays: `components=7 overlays=2`
Expand Down Expand Up @@ -238,7 +238,7 @@ aicr recipe \
```

Output shows:
* `18` embedded + `1` external = `19` merged components
* `<N>` embedded + `<M>` external = `<N+M>` merged components
* `dgxc-teleport` appears as a Kustomize component

Now `dgxc-teleport` is included in `componentRefs` and `deploymentOrder`
Expand Down
2 changes: 2 additions & 0 deletions demos/examples/CUJ2-Test-Report.md
@@ -1,5 +1,7 @@
# CUJ2 Test Report (Run 2) - EKS Inference with Dynamo

**Historical capture.** This report was generated before PR #871 migrated `kgateway` → `agentgateway`. Current bundles install `agentgateway` charts in namespace `agentgateway-system`; the log lines below reflecting `kgateway-*` releases are obsolete.

**Date:** 2026-03-13
**Branch:** `fix/cuj2-timeout-issue` (includes PR #397 fix, rebased on main)
**AICR Version:** built from source (fix/cuj2-timeout-issue)
Expand Down
2 changes: 1 addition & 1 deletion demos/valid.md
Expand Up @@ -65,7 +65,7 @@ aicr bundle \
Expected:
- Completes in <10s
- `bundle` created
- Nodewright Warning about workload selector
- Nodewright emits a warning about the workload selector

## Validate (dry run, no-cluster)

Expand Down
8 changes: 6 additions & 2 deletions docs/README.md
Expand Up @@ -8,7 +8,7 @@ NVIDIA AI Cluster Runtime (AICR) is a suite of tooling designed to automate the
|------|-------------|
| **Snapshot** | A captured state of a system including OS, kernel, Kubernetes, GPU, and SystemD configuration. Created by `aicr snapshot` or the Kubernetes agent. |
| **Recipe** | A generated configuration recommendation containing component references, constraints, and deployment order. Created by `aicr recipe` based on criteria or snapshot analysis. |
| **Criteria** | Query parameters that define the target environment: `service` (eks/gke/aks/oke/kind/lke), `accelerator` (h100/gb200/b200/a100/l40/rtx-pro-6000), `intent` (training/inference), `os` (ubuntu/rhel/cos/amazonlinux/talos), `platform` (kubeflow), and `nodes`. |
| **Criteria** | Query parameters that define the target environment: `service` (eks/gke/aks/oke/kind/lke), `accelerator` (h100/gb200/b200/a100/l40/rtx-pro-6000), `intent` (training/inference), `os` (ubuntu/rhel/cos/amazonlinux/talos), `platform` (dynamo, kubeflow, nim, slurm), and `nodes`. |
| **Overlay** | A recipe metadata file that extends the base recipe for specific environments. Overlays are matched against criteria using asymmetric matching. |
| **Mixin** | A composable recipe fragment (`kind: RecipeMixin`) that carries only `constraints` and `componentRefs`. Mixins live in `recipes/mixins/`, are excluded from overlay discovery, and are referenced by leaf overlays via `spec.mixins` to share orthogonal content (e.g., OS constraints, platform components) without duplication. See [ADR-005](design/005-overlay-refactoring.md). |
| **Bundle** | Deployment artifacts generated from a recipe: Helm values files, Kubernetes manifests, installation scripts, and checksums. |
Expand Down Expand Up @@ -40,7 +40,7 @@ Previously, administrators relied on static documentation and manual installatio

### The Solution: Automated Approach

AICR replaces manual interpretation of documentation with a **automated approach**. It treats infrastructure configuration as code, providing a deterministic engine that generates the exact artifacts needed for a specific environment.
AICR replaces manual interpretation of documentation with an **automated approach**. It treats infrastructure configuration as code, providing a deterministic engine that generates the exact artifacts needed for a specific environment.

**Key Benefits:**
1. **Deterministic & Validated:** The system guarantees that the inputs (your system state) always produce the same valid outputs, tested against NVIDIA hardware.
Expand Down Expand Up @@ -159,6 +159,10 @@ For engineers integrating AICR into CI/CD pipelines, GitOps workflows, or larger
| [Kubernetes Deployment](integrator/kubernetes-deployment.md) | Self-hosted API server deployment |
| [Recipe Development](integrator/recipe-development.md) | Adding and modifying recipe metadata |
| [Validator Extension](integrator/validator-extension.md) | Custom validators via `--data` |
| [AKS GPU Setup](integrator/aks-gpu-setup.md) | Azure Kubernetes Service GPU node setup |
| [EKS Dynamo Networking](integrator/eks-dynamo-networking.md) | EKS networking for Dynamo workloads |
| [GKE TCPXO Networking](integrator/gke-tcpxo-networking.md) | GKE TCPXO networking integration |
| [Talos Integration](integrator/talos-integration.md) | Running AICR on Talos Linux |

## Quick Start

Expand Down
27 changes: 18 additions & 9 deletions docs/conformance/cncf/index.md
Expand Up @@ -37,12 +37,18 @@ docs/conformance/cncf/
├── pod-autoscaling.md
└── cluster-autoscaling.md

pkg/evidence/scripts/ # Evidence collection script + test manifests
├── collect-evidence.sh
└── manifests/
├── dra-gpu-test.yaml
├── gang-scheduling-test.yaml
└── hpa-gpu-test.yaml
pkg/evidence/cncf/ # CNCF evidence collector package
├── collector.go # Feature registry, alias mapping
├── renderer.go # Evidence file rendering
├── requirements.go # CNCF requirement ID mapping
├── templates.go # Evidence templates
├── types.go # Shared types
└── scripts/ # Evidence collection script + test manifests
├── collect-evidence.sh
└── manifests/
├── dra-gpu-test.yaml
├── gang-scheduling-test.yaml
└── hpa-gpu-test.yaml
```

## Usage
Expand Down Expand Up @@ -75,10 +81,13 @@ aicr validate --phase conformance \
--evidence-dir ./evidence --cncf-submission -f dra -f hpa
```

Alternatively, run the evidence collection script directly:
Alternatively, run the evidence collection script directly. Valid section
names are `dra`, `gang`, `secure`, `accelerator-metrics`, `service-metrics`,
`gateway`, `operator`, `hpa`, `cluster-autoscaling`, or `all`:

```bash
./pkg/evidence/scripts/collect-evidence.sh all
./pkg/evidence/scripts/collect-evidence.sh dra
./pkg/evidence/cncf/scripts/collect-evidence.sh all
./pkg/evidence/cncf/scripts/collect-evidence.sh dra
```

> **Note:** The `--cncf-submission` flag deploys GPU workloads and takes ~5-10
Expand Down
18 changes: 11 additions & 7 deletions docs/contributor/api-server.md
Expand Up @@ -267,6 +267,7 @@ Supported content types:
| `gpu` | AcceleratorType | Alias for accelerator | `gpu=h100` |
| `intent` | IntentType | Enum: training, inference, any | `intent=training` |
| `os` | OSType | Enum: ubuntu, rhel, cos, amazonlinux, talos, any | `os=ubuntu` |
| `platform` | PlatformType | Enum: dynamo, kubeflow, nim, slurm, any | `platform=kubeflow` |
| `nodes` | int | >= 0 | `nodes=8` |

### Recipe Builder: `pkg/recipe/builder.go`
Expand Down Expand Up @@ -369,7 +370,7 @@ Endpoints `GET /v1/recipe` (query parameters) and `POST /v1/recipe` (criteria bo
- `X-RateLimit-Limit` - Total requests allowed per second
- `X-RateLimit-Remaining` - Requests remaining in current window
- `X-RateLimit-Reset` - Unix timestamp when window resets
- `Cache-Control` - Caching policy (public, max-age=300)
- `Cache-Control` - Caching policy (public, max-age=600)
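
The caching policy can be confirmed from a live response by extracting `max-age` from the headers. A minimal sketch, assuming `curl` is available; `max_age` is a hypothetical helper, and the host and port in the usage comment are placeholders for wherever `aicrd` is reachable.

```bash
# Hypothetical helper: print the max-age value (in seconds) found in the
# Cache-Control line of HTTP response headers read from stdin.
max_age() {
  tr -d '\r' | sed -n 's/^[Cc]ache-[Cc]ontrol:.*max-age=\([0-9][0-9]*\).*/\1/p'
}

# Usage (placeholder host and port):
#   curl -s -D - -o /dev/null 'http://localhost:8080/v1/recipe?intent=training&os=ubuntu' | max_age
```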

### Health Check

Expand Down Expand Up @@ -438,7 +439,9 @@ aicr_panic_recoveries_total 0
"service": "aicrd",
"version": "v1.0.0",
"routes": [
"/v1/recipe"
"/v1/recipe",
"/v1/query",
"/v1/bundle"
]
}
```
Expand Down Expand Up @@ -737,7 +740,7 @@ spec:
### Caching Strategy

- **Recipe Store**: Loaded once per process, cached globally
- **Client-Side**: 5-minute cache via Cache-Control header
- **Client-Side**: 10-minute cache via Cache-Control header (`defaults.RecipeCacheTTL`)
- **CDN**: Recommended for public-facing deployments

## Error Handling
Expand Down Expand Up @@ -836,7 +839,7 @@ When a request uses a criteria value not in the configured allowlist:
- Request ID tracking for distributed tracing
- Structured logging for debugging

## Monitoring & Observability
## Monitoring and Observability

### Prometheus Metrics

Expand Down Expand Up @@ -975,7 +978,7 @@ func TestRecipeHandler(t *testing.T) {
- `pkg/serializer` - JSON response formatting
- `pkg/logging` - Logging configuration

## Build & Deployment
## Build and Deployment

### Automated CI/CD Pipeline

Expand All @@ -1002,7 +1005,7 @@ export TAG=$(curl -s https://api.github.com/repos/NVIDIA/aicr/releases/latest |
gh attestation verify oci://ghcr.io/nvidia/aicrd:${TAG} --owner nvidia
```

For detailed CI/CD architecture, see [CONTRIBUTING.md](https://github.com/NVIDIA/aicr/blob/main/CONTRIBUTING.md#github-actions--cicd) and [Architecture Overview](index.md#cicd-architecture).
For detailed CI/CD architecture, see [CONTRIBUTING.md](https://github.com/NVIDIA/aicr/blob/main/CONTRIBUTING.md#github-actions--cicd) and the [Architecture Overview](index.md).

### Local Build Configuration

Expand Down Expand Up @@ -1070,6 +1073,7 @@ export AICR_ALLOWED_SERVICES=eks,gke
- The `any` value is always allowed regardless of allowlist
- Both `/v1/recipe` and `/v1/bundle` endpoints enforce allowlists
- CLI (`aicr`) is not affected by allowlists
- The `platform` criteria field has no allowlist env var today; all valid platform enum values are accepted (invalid enum values are rejected with HTTP 400 by the criteria parser)

## Extension and Operating Patterns

Expand Down Expand Up @@ -1097,7 +1101,7 @@ See [API Server: Extension and Operating Patterns](api-server-extending.md).
- [Google SRE Book](https://sre.google/sre-book/table-of-contents/) - Site reliability engineering
- [Release Engineering](https://sre.google/workbook/release-engineering/) - Deployment best practices

### HTTP & APIs
### HTTP and APIs

- [HTTP/2 in Go](https://go.dev/blog/h2push) - HTTP/2 server push
- [RESTful API Design](https://cloud.google.com/apis/design) - Google Cloud API design guide
Expand Down
18 changes: 10 additions & 8 deletions docs/contributor/cli.md
Expand Up @@ -291,8 +291,9 @@ flowchart TD
B --> B2["SystemDCollector<br/>(containerd, docker, kubelet)"]
B --> B3["KubernetesCollector<br/>(server, images, policies)"]
B --> B4["GPUCollector<br/>(nvidia-smi data)"]
B --> B5["NodeTopologyCollector<br/>(cluster-wide taints, labels)"]

B1 & B2 & B3 & B4 --> C[NodeSnapshotter.Measure]
B1 & B2 & B3 & B4 & B5 --> C[NodeSnapshotter.Measure]

C --> D["Parallel Collection<br/>(errgroup)"]

Expand All @@ -301,17 +302,20 @@ flowchart TD
D --> D3["Go Routine 3: SystemD<br/>• containerd.service<br/>• docker.service<br/>• kubelet.service"]
D --> D4["Go Routine 4: OS Config<br/>• GRUB parameters<br/>• Kernel modules<br/>• Sysctl parameters"]
D --> D5["Go Routine 5: GPU<br/>• nvidia-smi properties<br/>• driver, CUDA, etc."]
D --> D6["Go Routine 6: NodeTopology<br/>• cluster-wide taints<br/>• cluster-wide labels"]

D1 & D2 & D3 & D4 & D5 --> E["All goroutines complete<br/>or first error returns"]
D1 & D2 & D3 & D4 & D5 & D6 --> E["All goroutines complete<br/>or first error returns"]

E --> F["Snapshot Structure<br/>kind: Snapshot<br/>apiVersion: aicr.nvidia.com/v1alpha1<br/>measurements: [k8s, systemd, os, gpu]"]
E --> F["Snapshot Structure<br/>kind: Snapshot<br/>apiVersion: aicr.nvidia.com/v1alpha1<br/>measurements: [K8s, SystemD, OS, GPU, NodeTopology]"]

F --> G[serializer.NewFileWriterOrStdout]

G --> G1["Format: JSON/YAML/Table"]
G --> G2["Output: stdout or file"]
```

Snapshot measurement types: `K8s`, `SystemD`, `OS`, `GPU`, `NodeTopology` (cluster-wide node taints and labels — see `pkg/measurement/types.go` for the canonical constants).

#### Usage Examples

```bash
Expand Down Expand Up @@ -979,11 +983,8 @@ INFO generating bundle recipeFilePath=recipe.yaml outputDir=./bundles bundlerTy
INFO starting bundle generation bundler_count=1 output_dir=./bundles
INFO bundler completed bundler_type=gpu-operator files=5 size_bytes=12458 duration=45ms
INFO bundle generation complete summary="Generated 5 files (12 KB) in 45ms. Success: 1/1 bundlers."
INFO bundle generation completed success=1 errors=0 duration_sec=0.045 summary="Generated 5 files (12 KB) in 45ms. Success: 1/1 bundlers."
```

**Common Errors**:

## Shared Infrastructure

### Collector Factory Pattern
Expand All @@ -999,6 +1000,7 @@ type Factory interface {
CreateOSCollector() Collector
CreateKubernetesCollector() Collector
CreateGPUCollector() Collector
CreateNodeTopologyCollector() Collector
}
```

Expand Down Expand Up @@ -1053,7 +1055,7 @@ type Reading struct {
```bash
# Invalid accelerator type
$ aicr recipe --accelerator invalid-gpu
[cli] command failed: error=[INTERNAL] error building recipe: [INVALID_REQUEST] error parsing criteria: [INVALID_REQUEST] failed to apply criteria option: [INVALID_REQUEST] failed to parse accelerator type: [INVALID_REQUEST] invalid accelerator type: invalid-gpu exitCode=8
[cli] command failed: error=[INVALID_REQUEST] error parsing criteria: [INVALID_REQUEST] invalid accelerator type: invalid-gpu exitCode=2

# Unknown output format
$ aicr snapshot --format xml
Expand Down Expand Up @@ -1384,7 +1386,7 @@ spec:
effect: NoSchedule
containers:
- name: aicr
image: ghcr.io/nvidia/aicr:v0.6.4
image: ghcr.io/nvidia/aicr:<release-tag> # replace with the AICR release you target
command:
- /bin/sh
- -c
Expand Down