22 changes: 12 additions & 10 deletions AI-ML/microbenchmark/README.md
@@ -26,37 +26,39 @@ The specific implementation of the CCL benchmark depends on the proposed hardware
* RCCL tests: https://github.com/ROCm/rocm-systems/tree/develop/projects/rccl-tests
* Intel OneCCL tests: https://www.intel.com/content/www/us/en/docs/oneccl/benchmark-user-guide/2021-14/benchmark.html

If an open-source implementation of CCL tests is not available, then the offeror may provide another implementation, but must report exactly how it was built & run, including the source code and any relevant scripts. The implementation must follow the rules outlined in the "baseline/ported/optimized" definitions in the technical specifications. Specifically, the implementation cannot use unknown or unpublished libraries, and any language interface or architecture-specific language constructs used must be well-documented and publicly available at the time of machine arrival.

### Tests

Two types of runs are requested to satisfy this benchmark: single-node and multi-node. In total, 3 individual CCL configurations are described below. For each configuration, we ask for 5 replicate runs, for a total of 15 AllReduce runs.

#### Single-node

To demonstrate intra-node CCL performance, each collective should be run across all available accelerated devices within a single node (see table below for details).

#### Multi-node

To demonstrate inter-node CCL performance, each collective should be run in two multi-node jobs: one using 2 nodes and another using 15% of the nodes proposed by the offeror (see table below for details).

#### Summary of Requested Tests

In total, three tests (each consisting of 5 replicates) are required for this benchmark:

| Test      | Nodes Used | Ranks Used        | Number of Replicate Runs |
|:----------|:-----------|:------------------|:-------------------------|
| AllReduce | 1          | 1 per accelerator | 5                        |
| AllReduce | 2          | 1 per accelerator | 5                        |
| AllReduce | 15%**      | 1 per accelerator | 5                        |

**15% of nodes proposed by offeror. If this number exceeds the total number of nodes on the system used for benchmarking, then running the CCL benchmark on all available accelerated test nodes satisfies this requirement.
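
The node count for the third test can be derived mechanically. The sketch below is illustrative only: the input values are hypothetical examples, and rounding the 15% figure up to a whole node is an assumption, since the text does not specify a rounding rule.

```shell
#!/usr/bin/env bash
# Hypothetical example values -- substitute the real counts for your proposal.
proposed_nodes=512        # nodes in the proposed system (example)
test_system_nodes=64      # accelerated nodes on the test system (example)

# ceil(15% of proposed nodes), using integer arithmetic.
# Rounding up is an assumption; the benchmark text does not specify rounding.
want=$(( (proposed_nodes * 15 + 99) / 100 ))

# If 15% exceeds the test system size, fall back to all accelerated nodes.
if [ "$want" -gt "$test_system_nodes" ]; then
  want=$test_system_nodes
fi
echo "Run the third AllReduce test on ${want} nodes"
```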

### Run Rules

Any run must utilize all available accelerators on each node. For all configurations described above, the collective test should scan message sizes from 8B to 4GB, increasing by a factor of 2. The reduction operation must be `sum`, and the highest precision that is natively supported on the hardware (e.g., `double`) must be used. See [`run_nccl_cxi.sh`](./run_nccl_cxi.sh) for an example submission script running `all_reduce_perf` on Kestrel from the official [nccl-tests](https://github.com/NVIDIA/nccl-tests/tree/master) repository following these rules.
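
As a minimal sketch of these run rules, the command below builds an `all_reduce_perf` invocation using nccl-tests' documented flags (`-b` min bytes, `-e` max bytes, `-f` step factor, `-o` reduction op, `-d` datatype, `-g` GPUs per process). The `srun` launcher, build path, and `GPUS_PER_NODE` default are assumptions about the target system, and `double` is assumed to be the highest natively supported precision; the command is only echoed here, not executed.

```shell
#!/usr/bin/env bash
# Assumed ranks-per-node count; override to match the accelerators per node.
GPUS_PER_NODE="${GPUS_PER_NODE:-4}"

# One rank per accelerator (-g 1), scanning 8B..4GB by factors of 2.
CMD=(./build/all_reduce_perf
  -b 8        # minimum message size: 8 bytes
  -e 4G       # maximum message size: 4 GB
  -f 2        # multiply message size by 2 each step
  -o sum      # required reduction operation
  -d double   # highest natively supported precision (assumed)
  -g 1)       # one GPU per rank, i.e., one rank per accelerator

# srun is an assumed launcher; adapt to the system's scheduler.
echo "Would run: srun --ntasks-per-node=${GPUS_PER_NODE} ${CMD[*]}"
```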

## Benchmark Test Results to Report and Files to Return

**File response:** We request that the raw data associated with each CCL run be provided, demonstrating the bandwidth and latency for each message size. An example logfile is provided [below](#logfile-example-from-kestrel-reference-run-using-nccl). Additionally, any environment variables, fabric settings, and/or CCL configuration settings necessary to reproduce the results should be provided. The offeror should distinguish between parameters which may be set by an unprivileged user and those which would be globally set by system administrators.

**Spreadsheet response:** We request the out-of-place and in-place bandwidth and latencies, as well as high-level information about the system the benchmark was run on, to be reported in a spreadsheet (template [below](#spreadsheet-template)) for the following message sizes (in bytes):
