Complete guide to profiling model performance and analyzing library calls with madengine.
madengine integrates multiple profiling and tracing tools to analyze GPU usage, library calls, and system performance. Tools are configured via --additional-context and applied in a stackable design pattern.
```bash
madengine run --tags model \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "tools": [{"name": "rocprof"}]
  }'
```

Output: `rocprof_output/` directory with profiling results
For complex profiling setups, use configuration files:

`profiling-config.json`:

```json
{
  "gpu_vendor": "AMD",
  "guest_os": "UBUNTU",
  "tools": [
    {"name": "rocprof"}
  ]
}
```

```bash
madengine run --tags model --additional-context-file profiling-config.json
```

Profile GPU kernels and HIP API calls:
```json
{
  "tools": [
    {"name": "rocprof"}
  ]
}
```

Default Behavior: HIP trace mode

Output: `rocprof_output/` directory
Custom Configuration:
```json
{
  "tools": [
    {
      "name": "rocprof",
      "cmd": "rocprof --timestamp on",
      "env_vars": {
        "NCCL_DEBUG": "INFO"
      }
    }
  ]
}
```

madengine uses `rocprof_wrapper.sh` to automatically handle the transition between rocprof (legacy) and rocprofv3:
| ROCm Version | Profiler Used | Command Syntax |
|---|---|---|
| ROCm < 7.0 | rocprof (legacy) | rocprof [options] <app> |
| ROCm >= 7.0 | rocprofv3 (preferred) | rocprofv3 [options] -- <app> |
Key Points:
- Automatic Detection: The wrapper detects which profiler is available and uses the appropriate syntax.
- Separator Requirement: When using custom commands with `rocprof_wrapper.sh`, always include the trailing `--`:

```json
{
  "name": "rocprof",
  "cmd": "bash ../scripts/common/tools/rocprof_wrapper.sh --sys-trace --"
}
```

- Backward Compatibility: The `--` works with both rocprof and rocprofv3, ensuring your configurations work across ROCm versions.
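The version switch the wrapper performs can be pictured with a short sketch. This is a hypothetical helper, not the actual `rocprof_wrapper.sh` logic (which is a shell script and detects the installed profiler, e.g. via something like `shutil.which` in Python terms):

```python
def build_profiler_cmd(profiler: str, options: list[str], app: list[str]) -> list[str]:
    """Assemble a profiling command line per the table above.

    rocprofv3 (ROCm >= 7.0) requires a `--` separator before the
    application command; legacy rocprof takes the application directly.
    """
    if profiler == "rocprofv3":
        return ["rocprofv3", *options, "--", *app]
    return ["rocprof", *options, *app]
```

For example, `build_profiler_cmd("rocprofv3", ["--sys-trace"], ["python3", "train.py"])` produces `['rocprofv3', '--sys-trace', '--', 'python3', 'train.py']`, while the legacy branch omits the separator.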
Example - Custom Command with Wrapper:
```json
{
  "tools": [
    {
      "name": "rocprof",
      "cmd": "bash ../scripts/common/tools/rocprof_wrapper.sh --hip-trace --sys-trace --",
      "env_vars": {
        "HSA_ENABLE_SDMA": "0"
      }
    }
  ]
}
```

Collect comprehensive ROCm profiling data:
```json
{
  "tools": [
    {"name": "rpd"}
  ]
}
```

Output: ROCm profiler data files (e.g. `rpd_output/trace.rpd`).
Note: The rpd pre-script installs build dependencies in the container (e.g. nlohmann-json3-dev on Ubuntu) so the rocmProfileData tracer can compile; the first run may take longer while packages are installed.
rocm-trace-lite captures GPU kernel dispatch timestamps via HSA runtime interception and writes a SQLite .db file (RPD-compatible). It does not use rocprofiler-sdk or rocprofiler. Use it when you want a low-overhead dispatch timeline without installing the full rocprofv3 stack, or alongside workflows that already rely on RPD-style SQLite.
Do not wrap the same workload with both rocprofv3 (or rocprof via rocprof_wrapper.sh) and rocm_trace_lite / rocm_trace_lite_default in one run: choose one primary GPU profiler.
```json
{
  "tools": [
    {"name": "rocm_trace_lite"}
  ]
}
```

Use `rocm_trace_lite` for RTL lite mode (lower overhead; skips some dispatches that already carry a completion signal) or `rocm_trace_lite_default` for RTL default mode (broader coverage; higher overhead). Both set `RTL_MODE` for `rtl_trace_wrapper.sh`, which passes `rtl trace --mode <mode> …` when supported by your installed rocm-trace-lite. See upstream rocm-trace-lite (`--mode` / profiling modes). Example: `examples/profiling-configs/rocm_trace_lite_default.json`.
How madengine runs it: The tool prepends bash ../scripts/common/tools/rtl_trace_wrapper.sh around your model command. The wrapper runs rtl trace with -o rocm_trace_lite_output/trace.db and optional --mode from RTL_MODE (see the RTL quick start). If rtl is not on PATH but the Python package is installed, it falls back to python3 -m rocm_trace_lite.cli.
Installing rocm-trace-lite in the container: Upstream distributes wheels on GitHub Releases, not on PyPI. The trace pre-script (scripts/common/pre_scripts/trace.sh with args rocm_trace_lite) installs via pip from a pinned linux_x86_64 wheel URL by default (reproducible; bump the pin in that script when you intentionally upgrade RTL). To follow upstream’s latest release instead, set ROCM_TRACE_LITE_FOLLOW_LATEST=1 (uses the GitHub API; needs curl). For a specific wheel, set ROCM_TRACE_LITE_WHEEL_URL to the full URL of a .whl file (or bake the package into the image). You need outbound HTTPS to github.com for the default or latest path unless the wheel is already present. Published wheels target linux x86_64; other architectures require a compatible wheel and the env override.
Output: rocm_trace_lite_output/trace.db under the model workspace (and optionally trace.json.gz, trace_summary.txt, etc., depending on RTL version). The trace post-script copies rocm_trace_lite_output/ to /myworkspace/ like other profiling tools.
RTL vs rocprofv3:

| Topic | rocprofv3 (this guide, presets `rocprofv3_*`) | rocm-trace-lite |
|---|---|---|
| Stack | rocprofiler-sdk, rich traces and counters | HSA interception, SQLite timeline |
| Multi-node (K8s/SLURM) | rocprof is upgraded to rocprofv3 when available | Does not require rocprofv3 on the submission host; other rocprof-family tools are omitted if rocprofv3 is missing (see multi-node profiling behavior below) |
| When to prefer | Deep analysis, hardware counters, Perfetto from rocprofv3 | Minimal-deps dispatch trace, RPD-compatible `.db` |
Multi-node profiling: Multi-node runs that use only tools outside the rocprof/rocprofv3 family (such as rocm_trace_lite or rocm_trace_lite_default) keep profiling enabled even when rocprofv3 is not installed on the machine submitting the job. If the tool list includes rocprof or any rocprofv3_* preset and rocprofv3 is unavailable, those entries are dropped; if no tools remain, profiling is disabled and the usual rocprofiler-sdk installation guidance is logged.
ROCprofv3 is the next-generation profiler for ROCm 7.0+ with enhanced features and better performance. madengine provides pre-configured profiles for common bottleneck scenarios.
Compute-Bound Analysis:
```json
{
  "tools": [
    {"name": "rocprofv3_compute"}
  ]
}
```

- Use Case: Models bottlenecked by ALU operations
- Metrics: Wave execution, VALU/SALU instructions, wait states
- Output Format: Perfetto trace with hardware counters
Memory-Bound Analysis:
```json
{
  "tools": [
    {"name": "rocprofv3_memory"}
  ]
}
```

- Use Case: Models bottlenecked by memory bandwidth
- Metrics: Cache hits/misses, memory transfers, LDS usage
- Output Format: Perfetto trace with memory counters
Communication-Bound Analysis (Multi-GPU):
```json
{
  "tools": [
    {"name": "rocprofv3_communication"}
  ]
}
```

- Use Case: Multi-GPU distributed training
- Metrics: RCCL traces, inter-GPU transfers, synchronization
- Output Format: Perfetto trace with RCCL data
Comprehensive Profiling:
```json
{
  "tools": [
    {"name": "rocprofv3_full"}
  ]
}
```

- Use Case: Complete analysis with all metrics (high overhead)
- Metrics: All traces + counters + stats
- Output Format: Perfetto trace with full instrumentation
Lightweight Profiling:
```json
{
  "tools": [
    {"name": "rocprofv3_lightweight"}
  ]
}
```

- Use Case: Production-like profiling with minimal overhead
- Metrics: HIP and kernel traces only
- Output Format: JSON (compact)
Perfetto Visualization:
```json
{
  "tools": [
    {"name": "rocprofv3_perfetto"}
  ]
}
```

- Use Case: Generate Perfetto-compatible traces
- Metrics: HIP, kernel, memory traces
- Output Format: Perfetto trace file (`.pftrace`)
API Overhead Analysis:
```json
{
  "tools": [
    {"name": "rocprofv3_api_overhead"}
  ]
}
```

- Use Case: Analyze HIP/HSA API call overhead
- Metrics: API call timing and statistics
- Output Format: JSON with stats
PC Sampling (Hotspot Analysis):
```json
{
  "tools": [
    {"name": "rocprofv3_pc_sampling"}
  ]
}
```

- Use Case: Identify kernel hotspots
- Metrics: Program counter sampling at 1000 Hz
- Output Format: Perfetto trace with PC samples
madengine provides ready-to-use configuration files in examples/profiling-configs/:
```bash
# Compute-bound profiling
madengine run --tags your_model \
  --additional-context-file examples/profiling-configs/rocprofv3_compute_bound.json

# Memory-bound profiling
madengine run --tags your_model \
  --additional-context-file examples/profiling-configs/rocprofv3_memory_bound.json

# Multi-GPU profiling
madengine run --tags your_model \
  --additional-context-file examples/profiling-configs/rocprofv3_multi_gpu.json

# Comprehensive profiling
madengine run --tags your_model \
  --additional-context-file examples/profiling-configs/rocprofv3_comprehensive.json

# rocm-trace-lite (RTL) — not a rocprofv3 preset; do not mix with rocprof on the same run
madengine run --tags your_model \
  --additional-context-file examples/profiling-configs/rocm_trace_lite.json
madengine run --tags your_model \
  --additional-context-file examples/profiling-configs/rocm_trace_lite_default.json
```

See `examples/profiling-configs/README.md` for complete documentation.
For advanced users, customize rocprofv3 invocation:
```json
{
  "tools": [
    {
      "name": "rocprof",
      "cmd": "bash ../scripts/common/tools/rocprof_wrapper.sh --hip-trace --kernel-trace --memory-copy-trace --rccl-trace --counter-collection -i custom_counters.txt --output-format pftrace --stats -d ./my_output --",
      "env_vars": {
        "RCCL_DEBUG": "TRACE",
        "HSA_ENABLE_SDMA": "0"
      }
    }
  ]
}
```

Important: The `--` separator at the end of the cmd string is required when using `rocprof_wrapper.sh`. This separator distinguishes between profiler options and the application command:
- rocprofv3 (ROCm >= 7.0): Requires the `--` separator → `rocprofv3 [options] -- <app>`
- rocprof (legacy): Works with or without `--` → `rocprof [options] <app>`
The wrapper auto-detects which profiler is available and formats arguments correctly. Always include the trailing -- in your custom commands to ensure compatibility with both versions.
Custom counter files are in scripts/common/tools/counters/:
- `compute_bound.txt` - ALU and execution metrics
- `memory_bound.txt` - Cache and memory metrics
- `communication_bound.txt` - PCIe and synchronization metrics
- `full_profile.txt` - Comprehensive metrics
Create your own counter file:
```text
# my_counters.txt
pmc: SQ_WAVES
pmc: SQ_INSTS_VALU
pmc: L2CacheHit
pmc: TCC_HIT_sum
```

Then use it:

```bash
madengine run --tags your_model \
  --additional-context '{
    "tools": [{
      "name": "rocprof",
      "cmd": "bash ../scripts/common/tools/rocprof_wrapper.sh --counter-collection -i my_counters.txt --output-format pftrace -d ./output --"
    }]
  }'
```

Trace rocBLAS API calls and configurations:
```json
{
  "tools": [
    {"name": "rocblas_trace"}
  ]
}
```

Output:
- Trace logs in execution output
- `library_trace.csv` with library call summary
Use Case: Analyze BLAS operations, identify optimization opportunities
Trace MIOpen API calls for deep learning operations:
```json
{
  "tools": [
    {"name": "miopen_trace"}
  ]
}
```

Output:
- Trace logs in execution output
- `library_trace.csv` with convolution, pooling, and other DNN operations
Use Case: Optimize deep learning layers, analyze convolution configurations
Trace Tensile matrix operations:
```json
{
  "tools": [
    {"name": "tensile_trace"}
  ]
}
```

Output:
- Trace logs in execution output
- `library_trace.csv` with matrix operation details
Use Case: Analyze GEMM operations, optimize matrix multiplications
Trace RCCL collective communication operations:
```json
{
  "tools": [
    {"name": "rccl_trace"}
  ]
}
```

Output: Trace logs with communication patterns
Use Case: Debug multi-GPU communication, optimize distributed training
Profile real-time GPU power consumption:
```json
{
  "tools": [
    {"name": "gpu_info_power_profiler"}
  ]
}
```

Output: `gpu_info_power_profiler_output.csv`
Configuration:
```json
{
  "tools": [
    {
      "name": "gpu_info_power_profiler",
      "env_vars": {
        "POWER_DEVICE": "0",
        "POWER_SAMPLING_RATE": "0.1"
      }
    }
  ]
}
```

Environment Variables:
- `POWER_DEVICE` - GPU device(s): `"0"`, `"0,1,2"`, or `"all"` (default: `"all"`)
- `POWER_SAMPLING_RATE` - Sampling interval in seconds (default: `"0.1"`)
- `POWER_MODE` - Must be `"power"` for this tool (default: `"power"`)
- `POWER_DUAL_GCD` - Enable dual-GCD mode: `"true"` or `"false"` (default: `"false"`)
Note: To customize, override in tools configuration:
```json
{
  "tools": [
    {
      "name": "gpu_info_power_profiler",
      "env_vars": {
        "POWER_DEVICE": "0,1",
        "POWER_SAMPLING_RATE": "0.2"
      }
    }
  ]
}
```

Supported Platforms: ROCm and CUDA
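Once a run completes, the power CSV can be summarized with a few lines of Python. The `"power"` column name here is an assumption for illustration; inspect the header of `gpu_info_power_profiler_output.csv` to see the field names your madengine version actually writes:

```python
import csv
import io

def average_column(csv_text: str, column: str) -> float:
    """Average one numeric column of a sampled profiler CSV.

    The column name (e.g. "power") is illustrative; check the real
    CSV header before using this on actual profiler output.
    """
    rows = csv.DictReader(io.StringIO(csv_text))
    values = [float(row[column]) for row in rows]
    return sum(values) / len(values)
```

With a hypothetical two-sample file (`power` readings 210.5 W and 215.5 W), `average_column(text, "power")` returns 213.0.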
Profile real-time GPU memory consumption:
```json
{
  "tools": [
    {"name": "gpu_info_vram_profiler"}
  ]
}
```

Output: `gpu_info_vram_profiler_output.csv`
Configuration:
```json
{
  "tools": [
    {
      "name": "gpu_info_vram_profiler",
      "env_vars": {
        "VRAM_DEVICE": "all",
        "VRAM_SAMPLING_RATE": "0.5"
      }
    }
  ]
}
```

Environment Variables:
- `VRAM_DEVICE` - GPU device(s): `"0"`, `"0,1,2"`, or `"all"` (default: `"all"`)
- `VRAM_SAMPLING_RATE` - Sampling interval in seconds (default: `"0.1"`)
- `VRAM_MODE` - Must be `"vram"` for this tool (default: `"vram"`)
- `VRAM_DUAL_GCD` - Enable dual-GCD mode: `"true"` or `"false"` (default: `"false"`)
Using Both Profilers Together:
```json
{
  "tools": [
    {"name": "gpu_info_power_profiler"},
    {"name": "gpu_info_vram_profiler"}
  ]
}
```

This will generate both `gpu_info_power_profiler_output.csv` and `gpu_info_vram_profiler_output.csv`.
Supported Platforms: ROCm and CUDA
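For VRAM data, the usual question is the peak, not the average. A minimal sketch, assuming a numeric `"vram"` column (the real column names depend on the profiler version, so check the header of `gpu_info_vram_profiler_output.csv` first):

```python
import csv
import io

def peak_value(csv_text: str, column: str) -> float:
    """Peak of one numeric column from a sampled profiler CSV.

    The "vram" column name used by callers is illustrative only;
    inspect the actual CSV header for the real field names.
    """
    rows = csv.DictReader(io.StringIO(csv_text))
    return max(float(row[column]) for row in rows)
```

Given hypothetical samples 1024, 4096, 2048, `peak_value(text, "vram")` returns 4096.0.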
Tools can be stacked to collect multiple types of profiling data simultaneously. Tools are applied in order, with the first tool being innermost:
```json
{
  "tools": [
    {"name": "rocprof"},
    {"name": "miopen_trace"},
    {"name": "rocblas_trace"}
  ]
}
```

Execution Order:
- Setup: rocblas_trace → miopen_trace → rocprof
- Run: Model execution
- Teardown: rocprof → miopen_trace → rocblas_trace
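Conceptually, stacking nests each tool around the model command, with the first listed tool innermost. The sketch below illustrates the ordering only; madengine actually applies tools through pre/post scripts and environment variables rather than literal command prefixes:

```python
def compose_stack(tools: list[str], model_cmd: str) -> str:
    """Wrap model_cmd with each tool in list order; the first tool
    ends up innermost, matching the setup/teardown order above."""
    cmd = model_cmd
    for tool in tools:
        cmd = f"{tool} {cmd}"
    return cmd
```

For the three-tool stack above, `compose_stack(["rocprof", "miopen_trace", "rocblas_trace"], "python3 train.py")` yields `"rocblas_trace miopen_trace rocprof python3 train.py"`: rocprof sits closest to the model, rocblas_trace outermost.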
Example:
```bash
madengine run --tags pyt_torchvision_alexnet \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "tools": [
      {"name": "rocprof"},
      {"name": "miopen_trace"}
    ]
  }'
```

Analyze and compare performance of different library configurations by:
- Collecting library call traces
- Measuring performance of different configurations
- Comparing competitive implementations
Collect library API call traces:
```bash
# Trace MIOpen calls
madengine run --tags pyt_torchvision_alexnet \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "tools": [{"name": "miopen_trace"}]
  }'

# Trace rocBLAS calls
madengine run --tags pyt_torchvision_alexnet \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "tools": [{"name": "rocblas_trace"}]
  }'
```

Or collect both in one run:

```bash
madengine run --tags pyt_torchvision_alexnet \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "tools": [
      {"name": "miopen_trace"},
      {"name": "rocblas_trace"}
    ]
  }'
```

Output: `library_trace.csv` containing library calls and configurations
Use the collected traces to benchmark different library configurations:
```bash
madengine run --tags pyt_library_config_perf
```

Prerequisites:
- `library_trace.csv` must exist in the current directory
- Contains library call configurations from Step 1

Output: `library_perf.csv` with performance data for each configuration
Platform Support: Works on both AMD and NVIDIA GPUs
Compare results from library_perf.csv to:
- Identify optimal library configurations
- Compare performance across different implementations
- Validate optimization opportunities
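A comparison over `library_perf.csv` can be sketched in a few lines. The `"config"` and `"time_ms"` column names are placeholders for illustration; check the header of the file your run produced for the actual columns:

```python
import csv
import io

def fastest_per_config(csv_text: str, key: str = "config", metric: str = "time_ms") -> dict:
    """Keep the best (lowest) metric value seen for each configuration.

    Column names are assumptions; adapt them to the real header of
    library_perf.csv before use.
    """
    best: dict = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = float(row[metric])
        name = row[key]
        if name not in best or value < best[name]:
            best[name] = value
    return best
```

On a hypothetical file with two runs of config A (2.0 ms, 1.5 ms) and one of B (3.0 ms), this returns `{"A": 1.5, "B": 3.0}`.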
```bash
# Step 1: Collect comprehensive traces
madengine run --tags model \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "tools": [
      {"name": "rocprof"},
      {"name": "gpu_info_power_profiler"},
      {"name": "gpu_info_vram_profiler"}
    ]
  }'

# Step 2: Analyze results
ls -lh rocprof_output/
cat gpu_info_power_profiler_output.csv
cat gpu_info_vram_profiler_output.csv
```

```bash
# 1. Profile current implementation
madengine run --tags model \
  --additional-context '{"tools": [{"name": "miopen_trace"}]}'

# 2. Test library configurations
madengine run --tags pyt_library_config_perf

# 3. Analyze and compare
python analyze_library_perf.py library_perf.csv
```

```bash
madengine run --tags model \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "docker_gpus": "0,1,2,3",
    "tools": [
      {
        "name": "gpu_info_power_profiler",
        "env_vars": {
          "DEVICE": "all",
          "SAMPLING_RATE": "0.1"
        }
      },
      {"name": "rccl_trace"}
    ]
  }'
```

| Tool | Output File(s) | Content |
|---|---|---|
| `rocprof` | `rocprof_output/*` | GPU kernel traces, HIP API calls |
| `rpd` | Various RPD files | ROCm profiler data |
| `rocblas_trace` | `library_trace.csv`, logs | rocBLAS API calls |
| `miopen_trace` | `library_trace.csv`, logs | MIOpen API calls |
| `tensile_trace` | `library_trace.csv`, logs | Tensile operations |
| `rccl_trace` | Execution logs | RCCL communication |
| `gpu_info_power_profiler` | `gpu_info_power_profiler_output.csv` | Power consumption over time |
| `gpu_info_vram_profiler` | `gpu_info_vram_profiler_output.csv` | VRAM usage over time |
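A quick sanity check after a run is to verify the expected artifacts exist. A minimal sketch using the file names from the reference table (the mapping is partial and illustrative; adjust it to the tools you enabled):

```python
from pathlib import Path

# Artifact names taken from the output reference table; which files
# actually appear depends on the tools enabled for the run.
EXPECTED_OUTPUTS = {
    "rocprof": ["rocprof_output"],
    "rocblas_trace": ["library_trace.csv"],
    "miopen_trace": ["library_trace.csv"],
    "tensile_trace": ["library_trace.csv"],
    "gpu_info_power_profiler": ["gpu_info_power_profiler_output.csv"],
    "gpu_info_vram_profiler": ["gpu_info_vram_profiler_output.csv"],
}

def missing_outputs(tools: list[str], workdir: str = ".") -> list[str]:
    """Return expected artifacts that are absent from workdir."""
    root = Path(workdir)
    return [name for tool in tools for name in EXPECTED_OUTPUTS.get(tool, [])
            if not (root / name).exists()]
```

An empty return list means all expected files are present; anything returned points at a tool whose setup or teardown should be checked in the logs.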
All tools support these configuration keys:
Override the default profiling command:
```json
{
  "tools": [
    {
      "name": "rocprof",
      "cmd": "rocprof --timestamp on --hip-trace"
    }
  ]
}
```

Note: The tool binary name must be included in custom commands.
Set tool-specific environment variables:
```json
{
  "tools": [
    {
      "name": "rocprof",
      "env_vars": {
        "NCCL_DEBUG": "INFO",
        "HSA_ENABLE_SDMA": "0"
      }
    }
  ]
}
```

Profiling works best with single model tags:
```bash
# Good
madengine run --tags pyt_torchvision_alexnet \
  --additional-context '{"tools": [{"name": "rocprof"}]}'

# Avoid
madengine run --tags model1 model2 model3 \
  --additional-context '{"tools": [{"name": "rocprof"}]}'
```

For complex profiling setups:
```json
{
  "gpu_vendor": "AMD",
  "guest_os": "UBUNTU",
  "tools": [
    {
      "name": "rocprof",
      "cmd": "rocprof --timestamp on"
    },
    {
      "name": "gpu_info_power_profiler",
      "env_vars": {
        "POWER_DEVICE": "all",
        "POWER_SAMPLING_RATE": "0.1"
      }
    }
  ]
}
```

Balance detail vs. overhead:
```json
{
  "tools": [
    {
      "name": "gpu_info_power_profiler",
      "env_vars": {
        "SAMPLING_RATE": "1.0"
      }
    }
  ]
}
```

A larger interval such as `"1.0"` reduces overhead at the cost of detail. (Note that JSON does not allow inline comments, so keep such notes outside the configuration file.)

Group related profiling tools:
```json
{
  "tools": [
    {"name": "miopen_trace"},
    {"name": "rocblas_trace"},
    {"name": "tensile_trace"}
  ]
}
```

For performance-critical profiling:

```bash
# Baseline run (no profiling)
madengine run --tags model

# Profiling run
madengine run --tags model \
  --additional-context '{"tools": [{"name": "rocprof"}]}'
```

Error: Tool binary not available
Solution:
```bash
# Verify tool is installed
which rocprof
which rocblas-bench

# Check container has tools
docker run --rm rocm/pytorch:latest which rocprof
```

Error: Profiling produces empty results
Causes:
- Model execution too fast
- Incorrect device selection
- Tool configuration error
Solutions:
- Increase workload size
- Verify GPU device IDs
- Check tool logs for errors
Error: Profiling significantly slows execution
Solutions:
- Reduce sampling rate
- Use fewer stacked tools
- Profile subset of execution
- Use targeted profiling
Error: Library trace file missing
Causes:
- No library calls made
- Tool not properly initialized
- Output directory permission issues
Solutions:
- Verify model uses the library (e.g., uses convolutions for MIOpen)
- Check execution logs for errors
- Verify write permissions
Issue: Model runs marked as FAILURE despite successful execution
Symptoms:
- Status shows FAILURE but performance metrics are reported
- Log contains ROCProf messages like `E20251230 ... Opened result file`
- Error pattern `Error:` detected in logs
Root Cause:
ROCProf uses glog-style logging where E prefix means "Error level log" (not an actual error). These informational messages were incorrectly triggering failure detection.
Fixed in: madengine v2.0+
For false failures not caused by ROCProf (for example workloads that print benign RuntimeError: text), see Configuration — Run phase: log error pattern scan (log_error_pattern_scan, log_error_benign_patterns).
Verification:
```bash
# Run with profiling - should show SUCCESS status
madengine run --tags pyt_huggingface_gpt2 \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "tools": [{"name": "rocprof"}, {"name": "rpd"}]
  }'

# Check status in output
# ✅ Expected: Status = SUCCESS, Performance = ~38-40 samples/second
```

Technical Details:
- ROCProf log patterns now excluded from error detection
- Error patterns made more specific (e.g., `RuntimeError:` vs `Error:`)
- Performance extraction hardened against bash segfaults during profiling
- Tests: `pytest tests/unit/test_error_handling.py::TestErrorPatternMatching`
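The distinction between glog-style informational lines and genuine failures can be illustrated with a small filter. The patterns here are illustrative only, not madengine's actual implementation:

```python
import re

# glog-style lines start with a severity letter (E/W/I) followed by a
# date, so "E20251230 ..." marks log level, not an actual error.
GLOG_LINE = re.compile(r"^[EWI]\d{8} ")
# Match specific failure signatures rather than a generic "Error:".
REAL_ERROR = re.compile(r"\bRuntimeError:|\bTraceback \(most recent call last\)")

def is_real_error(line: str) -> bool:
    """True only for genuine failure lines, never for ROCProf glog output."""
    if GLOG_LINE.match(line):
        return False
    return bool(REAL_ERROR.search(line))
```

With these patterns, `E20251230 ... Opened result file` is ignored, `RuntimeError: HIP error` is flagged, and a generic `Error:` message no longer trips failure detection.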
Profiling functionality is implemented via pre/post scripts:
Location:
- Pre-scripts: `scripts/common/pre_scripts/`
- Post-scripts: `scripts/common/post_scripts/`
Workflow:
- Pre-script: Tool setup and initialization
- Model execution: Tool collects data
- Post-script: Save results, cleanup
Tool defaults are defined in scripts/common/tools.json:
```json
{
  "rocprof": {
    "cmd": "rocprof --hip-trace",
    "env_vars": {}
  },
  "gpu_info_power_profiler": {
    "env_vars": {
      "DEVICE": "0",
      "SAMPLING_RATE": "0.1",
      "MODE": "power",
      "DUAL-GCD": "false"
    }
  }
}
```

To add new profiling tools:
- Create pre-script: `scripts/common/pre_scripts/tool_name_pre.sh`
- Create post-script: `scripts/common/post_scripts/tool_name_post.sh`
- Add default config to `scripts/common/tools.json`
- Test with madengine
Validate TheRock ROCm installations before running models. TheRock is AMD's lightweight build system for HIP and ROCm, distributed via Python pip packages.
Enable TheRock validation:
```bash
madengine run --tags dummy_therock \
  --tools therock_check \
  --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'
```

Standalone detection:
```bash
# Shell script (quick check)
bash src/madengine/scripts/common/tools/detect_therock.sh

# Python script (detailed output)
python3 src/madengine/scripts/common/tools/therock_detector.py --verbose

# JSON output (for scripting)
python3 src/madengine/scripts/common/tools/therock_detector.py --json
```

Detection methods:
- Python pip installations (`~/.local/lib/python*/site-packages/rocm`)
- Virtual environments with rocm packages
- System packages (`/usr/lib/python*/site-packages/rocm`)
- Tarball installations
- Local build directories
- Environment variables (`ROCM_PATH`, `HIP_PATH`)
Configuration in tools.json:
```json
{
  "therock_check": {
    "pre_scripts": [
      {
        "path": "scripts/common/tools/detect_therock.sh"
      }
    ],
    "cmd": "",
    "env_vars": {},
    "post_scripts": []
  }
}
```

Features:
- Non-blocking validation (warnings only)
- Automatic integration in the `dummy_therock` model
- Reports GPU targets and installation paths
- Exit code 0 = found, 1 = not found
Resources:
- Configuration Guide - Detailed profiling configuration
- Usage Guide - Running models with profiling
- Deployment Guide - Profiling in distributed environments