169 changes: 125 additions & 44 deletions onnxruntime/test/perftest/README.md
@@ -1,69 +1,150 @@
# ONNXRuntime Performance Test
# ONNX Runtime Performance Test

This tool reports performance results for running inference on a given model with ONNX Runtime and a specific execution provider, using sample input test data. It can provide a reliable measurement of inference latency using ONNX Runtime on the device. The options to use with the tool are listed below:
`onnxruntime_perf_test` measures inference latency and throughput of a model with ONNX Runtime using a chosen execution provider. It builds an inference session, runs a warm-up iteration, and then repeatedly runs the model (either for a fixed number of times or a fixed duration), reporting timing and resource-usage statistics.

`onnxruntime_perf_test [options...] model_path result_file`
## Building the tool

Options:
`onnxruntime_perf_test` is built together with the ONNX Runtime tests. Build from source with `--build` and tests enabled (the default), for example:

-A: Disable memory arena.
```bash
./build.sh --config Release --build_dir build/Release --parallel # Linux/macOS
.\build.bat --config Release --build_dir build\Release --parallel # Windows
```

-M: Disable memory pattern.
The binary is produced under the build directory, for example `build/Release/Release/onnxruntime_perf_test`. See the [build instructions](https://onnxruntime.ai/docs/build/) for prerequisites and execution-provider specific flags.

-P: Use parallel executor instead of sequential executor.
## Usage

-c: [parallel runs]: Specifies the (max) number of runs to invoke simultaneously. Default:1.
```
onnxruntime_perf_test [options...] model_path [result_file]
```

-e: [cpu|cuda|mkldnn|tensorrt|openvino|acl|vitisai]: Specifies the execution provider: 'cpu', 'cuda', 'mkldnn', 'tensorrt', 'openvino', 'acl', or 'vitisai'. Default is 'cpu'.
- `model_path`: path to the `.onnx` (or `.ort`) model file.
- `result_file`: optional path to append the run results to. If omitted, statistics (`-s`) are printed to stdout by default.

-m: [test_mode]: Specifies the test mode. The value can be 'duration' or 'times'. Use 'duration' to run the test for a fixed duration, or 'times' to repeat it a fixed number of times. Default: 'duration'.
Options may be given with a single dash (`-e cpu`) or a double dash (`--e cpu`); both forms are equivalent.

-o: [optimization level]: Default is 1. Valid values are 0 (disable), 1 (basic), 2 (extended), 3 (layout), 99 (all). Please see __onnxruntime_c_api.h__ (enum GraphOptimizationLevel) for the full list of all optimization levels.
For the complete, always-current list of options (including the many execution-provider specific runtime options passed via `-i`), run:

-u: [path to save optimized model]: Default is empty so no optimized model would be saved.
```bash
onnxruntime_perf_test --help
```

-p: [profile_file]: Specifies the profile name to enable profiling and dump the profile data to the file.
## Providing input data

-r: [repeated_times]: Specifies the number of repetitions when running in 'times' test mode. Default: 1000.
The tool needs one set of inputs per model. There are two ways to supply them.

-s: Show statistics result, like P75, P90.
### 1. Auto-generate random input (simplest)

-t: [seconds_to_run]: Specifies the seconds to run for 'duration' mode. Default:600.
Pass `-I` to have the tool generate input tensors automatically. No data files are required. Free (symbolic) dimensions are treated as `1` unless overridden with `-f`, and `-S` sets a fixed random seed for reproducible data.

-v: Show verbose information.
```bash
onnxruntime_perf_test -I -e cpu model.onnx
# override a symbolic dimension named "batch" to 4
onnxruntime_perf_test -I -f "batch:4" -e cpu model.onnx
```

-x: [intra_op_num_threads]: Sets the number of threads used to parallelize the execution within nodes. A value of 0 means the test will auto-select a default. Must be >= 0.
### 2. Provide test data files

-y: [inter_op_num_threads]: Sets the number of threads used to parallelize the execution of the graph (across nodes). A value of 0 means the test will auto-select a default. Must be >= 0.
The tool reuses the `onnx_test_runner` / ONNX backend-test data layout. Place one or more input-set subdirectories next to the model file:

-C: [session_config_entries]: Specify session configuration entries as key-value pairs: -C "<key1>|<val1> <key2>|<val2>"
Refer to onnxruntime_session_options_config_keys.h for valid keys and values.
[Example] -C "session.disable_cpu_ep_fallback|1 ep.context_enable|1"
```
<model_dir>/
├── model.onnx            # pass this path as model_path (any .onnx name works)
├── test_data_set_0/      # one input set
│   ├── input_0.pb        # first model input (serialized onnx.TensorProto)
│   └── input_1.pb        # second model input, ...
└── test_data_set_1/      # optional additional input set(s)
    └── input_0.pb
```

-h: help.
Notes:

Model path and input data dependency:
The performance test uses the same input structure as the *onnx_test_runner* tool. It requires a directory tree like the one below:
- Every subdirectory of the model's directory is treated as one input set, and the tool cycles through the sets across iterations. The conventional name is `test_data_set_<N>`, but any subdirectory name works.
- Within a set, files named `input_<N>.pb` are loaded in sorted order and bound to the model inputs by position. Each `.pb` file is a serialized `onnx.TensorProto`.
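The layout rules in these notes can be checked before a run with a small script. The following is a minimal sketch using only the Python standard library; the function name `list_input_sets` is hypothetical, and the plain lexical sort is an assumption standing in for the tool's "sorted order", which this README does not specify further.

```python
from pathlib import Path


def list_input_sets(model_path):
    """Map each input-set directory next to the model to its input_<N>.pb files."""
    model_dir = Path(model_path).resolve().parent
    input_sets = {}
    for entry in sorted(model_dir.iterdir()):
        # Every subdirectory of the model's directory counts as one input set.
        if not entry.is_dir():
            continue
        # Assumption: lexical sort approximates the order the tool loads files in.
        input_sets[entry.name] = sorted(p.name for p in entry.glob("input_*.pb"))
    return input_sets


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        for set_name, files in list_input_sets(sys.argv[1]).items():
            print(f"{set_name}: {', '.join(files) or '(no input_<N>.pb files)'}")
```

An empty list for a subdirectory is a hint that the directory would be picked up as an input set even though it holds no input files.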

    --ModelName
        --test_data_set_0
            --input0.pb
        --test_data_set_2
            --input0.pb
        --model.onnx
#### Creating `input_<N>.pb` files

The path of model.onnx needs to be provided as the `<model_path>` argument.
Use the helper script [`tools/python/onnx_test_data_utils.py`](../../../tools/python/onnx_test_data_utils.py) to generate a serialized `TensorProto`. For example, to create a random `float32` tensor of shape `10240x512` for a model input named `x`:

__Sample output__ from the tool will look something like this:
```bash
python tools/python/onnx_test_data_utils.py \
    --action random_to_pb \
    --name x \
    --shape 10240,512 \
    --datatype f4 \
    --output my_model/test_data_set_0/input_0.pb
```

Total time cost:58.8053
Total iterations:1000
Average time cost:58.8053 ms
Total run time:58.8102 s
Min Latency is 0.0559777sec
Max Latency is 0.0623472sec
P50 Latency is 0.0587108sec
P90 Latency is 0.0599845sec
P95 Latency is 0.0605676sec
P99 Latency is 0.0619517sec
P999 Latency is 0.0623472sec
`--name` must match the model's input name. `--datatype` is a numpy dtype string (for example `f4` = float32, `f2` = float16, `i8` = int64), and `--seed` can be used for deterministic values. Run `python tools/python/onnx_test_data_utils.py --help` for the full set of actions (for example converting existing `.npy` data to `.pb`).
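To get a feel for how large a generated input will be, the raw tensor payload can be estimated from `--shape` and `--datatype`. This is a rough sketch: `tensor_nbytes` and `ITEM_SIZE` are hypothetical names, only the three dtype strings mentioned above are covered, and the serialized `.pb` file will be slightly larger because it also stores the tensor name, shape, and type.

```python
from math import prod

# Bytes per element for the dtype strings mentioned above:
# f4 = float32, f2 = float16, i8 = int64.
ITEM_SIZE = {"f4": 4, "f2": 2, "i8": 8}


def tensor_nbytes(shape, datatype):
    """Raw payload size in bytes for a comma-separated shape such as '10240,512'."""
    dims = [int(d) for d in shape.split(",")]
    return prod(dims) * ITEM_SIZE[datatype]


print(tensor_nbytes("10240,512", "f4"))  # 20971520 bytes, i.e. 20 MiB
```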

## Examples

```bash
# Auto-generated input, CPU EP, show statistics
onnxruntime_perf_test -I -e cpu -s model.onnx

# Run for a fixed number of iterations (times mode) on CUDA
onnxruntime_perf_test -e cuda -m times -r 2000 model.onnx result.txt

# Run for a fixed duration (duration mode) for 30 seconds
onnxruntime_perf_test -e cpu -m duration -t 30 model.onnx

# Use test data directories located next to the model
onnxruntime_perf_test -e cpu my_model/model.onnx

# Pass an execution-provider specific runtime option (TensorRT FP16)
onnxruntime_perf_test -e tensorrt -i "trt_fp16_enable|true" model.onnx
```
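When the tool is driven from a script (for example to sweep providers or iteration counts), the same command lines can be assembled programmatically. The helper below is only a sketch: `build_perf_test_args` is a hypothetical name, it covers just the flags used in the examples above, and it assumes `onnxruntime_perf_test` is on `PATH`.

```python
def build_perf_test_args(model_path, provider="cpu", mode=None, count=None,
                         seconds=None, auto_input=False, stats=False,
                         result_file=None, exe="onnxruntime_perf_test"):
    """Build an argv list using only flags shown in this README."""
    args = [exe]
    if auto_input:
        args.append("-I")                # auto-generate model input
    args += ["-e", provider]             # execution provider
    if mode is not None:
        args += ["-m", mode]             # 'duration' or 'times'
    if mode == "times" and count is not None:
        args += ["-r", str(count)]       # iterations in 'times' mode
    if mode == "duration" and seconds is not None:
        args += ["-t", str(seconds)]     # seconds in 'duration' mode
    if stats:
        args.append("-s")                # show latency statistics
    args.append(model_path)
    if result_file is not None:
        args.append(result_file)
    return args


print(build_perf_test_args("model.onnx", provider="cuda", mode="times",
                           count=2000, result_file="result.txt"))
```

The returned list can be passed to `subprocess.run(args, check=True)`; passing a list rather than a shell string avoids quoting problems with paths that contain spaces.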

## Common options

| Option | Description |
| --- | --- |
| `-e [provider]` | Execution provider: `cpu` (default), `cuda`, `dnnl`, `tensorrt`, `nvtensorrtrtx`, `openvino`, `dml`, `acl`, `nnapi`, `coreml`, `qnn`, `snpe`, `migraphx`, `xnnpack`, `vitisai`, `webgpu`. |
| `-m [mode]` | Test mode: `duration` (default) or `times`. |
| `-r [count]` | Number of iterations to run in `times` mode. Default: 1000. |
| `-t [seconds]` | Seconds to run in `duration` mode. Default: 600. |
| `-c [count]` | Max number of runs to invoke simultaneously. Default: 1. |
| `-I` | Auto-generate model input; no test data files required. |
| `-S [seed]` | Random seed for generated input data (for reproducibility). Default: -1 (uninitialized). |
| `-f "name:value"` | Override a free (symbolic) dimension by name. May be repeated. |
| `-x [count]` | Intra-op thread count (0 lets ORT choose). |
| `-y [count]` | Inter-op thread count (0 lets ORT choose). |
| `-o [level]` | Graph optimization level: 0 (disable), 1 (basic), 2 (extended), 3 (layout), 99 (all). Default: 99. |
| `-p [file]` | Enable profiling and write the profile data to `file`. |
| `-i "k1\|v1 k2\|v2"` | Execution-provider specific runtime options (see `--help` for per-provider keys). |
| `-C "k1\|v1 k2\|v2"` | Session configuration entries. See `onnxruntime_session_options_config_keys.h` for valid keys. |
| `-s` | Show latency statistics (P50, P90, ...). Defaults to on when no `result_file` is given. |
| `-v` | Verbose output. |
| `-h` | Print the full usage, including all options. |

This is a curated subset of the most commonly used options. Run `onnxruntime_perf_test --help` for the authoritative and complete list.
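The `-i` and `-C` options both take a single string of space-separated `key|value` pairs (the backslash in the table is only Markdown escaping). A small helper can build that string from a dictionary. This is a sketch: `format_kv_options` is a hypothetical name, and rejecting keys or values that contain `|` or a space is a conservative assumption, since this README does not describe any escaping rules.

```python
def format_kv_options(options):
    """Join a dict into the 'k1|v1 k2|v2' form used by -i and -C."""
    parts = []
    for key, value in options.items():
        key, value = str(key), str(value)
        # Assumption: '|' and ' ' are delimiters and cannot appear in keys or values.
        if any(ch in key or ch in value for ch in "| "):
            raise ValueError(f"unsupported character in option {key!r}")
        parts.append(f"{key}|{value}")
    return " ".join(parts)


print(format_kv_options({"session.disable_cpu_ep_fallback": 1,
                         "ep.context_enable": 1}))
# session.disable_cpu_ep_fallback|1 ep.context_enable|1
```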

## Sample output

A typical summary printed to stdout looks like:

```
Session creation time cost: 0.512 s
First inference time cost: 12 ms
Total inference time cost: 58.8053 s
Total inference requests: 1000
Average inference time cost total: 58.8053 ms
Total inference run time: 58.8102 s
Number of inferences per second: 17.00
Avg CPU usage: 98 %
Peak working set size: 123456789 bytes
```
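The summary fields appear to be related by simple arithmetic: the average cost is the total inference time divided by the number of requests, and the throughput is the number of requests divided by the total run time. This relationship is inferred from the field names and the sample figures, not from the tool's source. The sketch below uses made-up numbers and a hypothetical function name, `summarize`.

```python
def summarize(total_requests, total_inference_s, total_run_s):
    """Return (average cost in ms, inferences per second)."""
    avg_ms = total_inference_s / total_requests * 1000.0
    per_second = total_requests / total_run_s
    return avg_ms, per_second


# Illustrative values only: 200 requests, 3.9 s spent in inference, 4.0 s wall time.
avg_ms, per_second = summarize(200, 3.9, 4.0)
print(f"Average: {avg_ms:.2f} ms, throughput: {per_second:.2f} inferences/s")
```

Checking these two ratios against a real summary is a quick way to spot a copy or unit error when results are pasted into a report.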

When `-s` is enabled, the latency percentiles are also reported:

```
Min Latency: 0.0559777 s
Max Latency: 0.0623472 s
P50 Latency: 0.0587108 s
P90 Latency: 0.0599845 s
P95 Latency: 0.0605676 s
P99 Latency: 0.0619517 s
P999 Latency: 0.0623472 s
```
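For automated comparisons between runs, the percentile lines can be parsed into numbers. The sketch below assumes the exact line format shown in this sample (`<label> Latency: <value> s`); other versions of the tool may print a different format, so the pattern may need adjusting. The names `LATENCY_LINE` and `parse_latencies_ms` are hypothetical.

```python
import re

# Matches lines such as "P50 Latency: 0.0587108 s" (format taken from the sample above).
LATENCY_LINE = re.compile(r"^(Min|Max|P\d+) Latency:\s*([0-9.]+)\s*s$")


def parse_latencies_ms(text):
    """Return {label: latency in milliseconds} for every latency line in text."""
    result = {}
    for line in text.splitlines():
        match = LATENCY_LINE.match(line.strip())
        if match:
            result[match.group(1)] = float(match.group(2)) * 1000.0
    return result


sample = "Min Latency: 0.0559777 s\nP50 Latency: 0.0587108 s\nP999 Latency: 0.0623472 s"
for label, value_ms in parse_latencies_ms(sample).items():
    print(f"{label}: {value_ms:.2f} ms")
```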