Complete guide to profiling model performance and analyzing library calls with madengine.
madengine integrates multiple profiling and tracing tools to analyze GPU usage, library calls, and system performance. Tools are configured via --additional-context and applied in a stackable design pattern.
```bash
madengine run --tags model \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "tools": [{"name": "rocprof"}]
  }'
```

Output: `rocprof_output/` directory with profiling results
For complex profiling setups, use configuration files:

`profiling-config.json`:

```json
{
  "gpu_vendor": "AMD",
  "guest_os": "UBUNTU",
  "tools": [
    {"name": "rocprof"}
  ]
}
```

```bash
madengine run --tags model --additional-context-file profiling-config.json
```

Profile GPU kernels and HIP API calls:
```json
{
  "tools": [
    {"name": "rocprof"}
  ]
}
```

Default Behavior: HIP trace mode

Output: `rocprof_output/` directory
Custom Configuration:
```json
{
  "tools": [
    {
      "name": "rocprof",
      "cmd": "rocprof --timestamp on",
      "env_vars": {
        "NCCL_DEBUG": "INFO"
      }
    }
  ]
}
```

madengine uses `rocprof_wrapper.sh` to automatically handle the transition between rocprof (legacy) and rocprofv3:
| ROCm Version | Profiler Used | Command Syntax |
|---|---|---|
| ROCm < 7.0 | rocprof (legacy) | rocprof [options] <app> |
| ROCm >= 7.0 | rocprofv3 (preferred) | rocprofv3 [options] -- <app> |
Key Points:
- Automatic Detection: The wrapper detects which profiler is available and uses the appropriate syntax.
- Separator Requirement: When using custom commands with `rocprof_wrapper.sh`, always include the trailing `--`:

```json
{
  "name": "rocprof",
  "cmd": "bash ../scripts/common/tools/rocprof_wrapper.sh --sys-trace --"
}
```

- Backward Compatibility: The `--` works with both rocprof and rocprofv3, ensuring your configurations work across ROCm versions.
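The version switch the wrapper performs can be pictured with a short sketch. This is a hypothetical helper, not the actual `rocprof_wrapper.sh` logic (which is a shell script and detects the installed profiler, e.g. via something like `shutil.which` in Python terms):

```python
def build_profiler_cmd(profiler: str, options: list[str], app: list[str]) -> list[str]:
    """Assemble a profiling command line per the table above.

    rocprofv3 (ROCm >= 7.0) requires a `--` separator before the
    application command; legacy rocprof takes the application directly.
    """
    if profiler == "rocprofv3":
        return ["rocprofv3", *options, "--", *app]
    return ["rocprof", *options, *app]
```

For example, `build_profiler_cmd("rocprofv3", ["--sys-trace"], ["python3", "train.py"])` produces `['rocprofv3', '--sys-trace', '--', 'python3', 'train.py']`, while the legacy branch omits the separator.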
Example - Custom Command with Wrapper:
```json
{
  "tools": [
    {
      "name": "rocprof",
      "cmd": "bash ../scripts/common/tools/rocprof_wrapper.sh --hip-trace --sys-trace --",
      "env_vars": {
        "HSA_ENABLE_SDMA": "0"
      }
    }
  ]
}
```

Collect comprehensive ROCm profiling data:
```json
{
  "tools": [
    {"name": "rpd"}
  ]
}
```

Output: ROCm profiler data files (e.g. `rpd_output/trace.rpd`).
Note: The rpd pre-script installs build dependencies in the container (e.g. nlohmann-json3-dev on Ubuntu) so the rocmProfileData tracer can compile; the first run may take longer while packages are installed.
rocm-trace-lite captures GPU kernel dispatch timestamps via HSA runtime interception and writes a SQLite .db file (RPD-compatible). It does not use rocprofiler-sdk or rocprofiler. Use it when you want a low-overhead dispatch timeline without installing the full rocprofv3 stack, or alongside workflows that already rely on RPD-style SQLite.
Do not wrap the same workload with both rocprofv3 (or rocprof via rocprof_wrapper.sh) and rocm_trace_lite / rocm_trace_lite_default in one run: choose one primary GPU profiler.
```json
{
  "tools": [
    {"name": "rocm_trace_lite"}
  ]
}
```

Use `rocm_trace_lite` for RTL lite mode (lower overhead; skips some dispatches that already carry a completion signal) or `rocm_trace_lite_default` for RTL default mode (broader coverage; higher overhead). Both set `RTL_MODE` for `rtl_trace_wrapper.sh`, which passes `rtl trace --mode <mode> …` when supported by your installed rocm-trace-lite. See upstream rocm-trace-lite (`--mode` / profiling modes). Example: `examples/profiling-configs/rocm_trace_lite_default.json`.
How madengine runs it: The tool prepends bash ../scripts/common/tools/rtl_trace_wrapper.sh around your model command. The wrapper runs rtl trace with -o rocm_trace_lite_output/trace.db and optional --mode from RTL_MODE (see the RTL quick start). If rtl is not on PATH but the Python package is installed, it falls back to python3 -m rocm_trace_lite.cli.
Installing rocm-trace-lite in the container: Upstream distributes wheels on GitHub Releases, not on PyPI. The trace pre-script (scripts/common/pre_scripts/trace.sh with args rocm_trace_lite) installs via pip from a pinned linux_x86_64 wheel URL by default (reproducible; bump the pin in that script when you intentionally upgrade RTL). To follow upstream’s latest release instead, set ROCM_TRACE_LITE_FOLLOW_LATEST=1 (uses the GitHub API; needs curl). For a specific wheel, set ROCM_TRACE_LITE_WHEEL_URL to the full URL of a .whl file (or bake the package into the image). You need outbound HTTPS to github.com for the default or latest path unless the wheel is already present. Published wheels target linux x86_64; other architectures require a compatible wheel and the env override.
Output: rocm_trace_lite_output/trace.db under the model workspace (and optionally trace.json.gz, trace_summary.txt, etc., depending on RTL version). The trace post-script copies rocm_trace_lite_output/ to /myworkspace/ like other profiling tools.
RTL vs rocprofv3:

| Topic | rocprofv3 (this guide, presets `rocprofv3_*`) | rocm-trace-lite |
|---|---|---|
| Stack | rocprofiler-sdk, rich traces and counters | HSA interception, SQLite timeline |
| Multi-node (K8s/SLURM) | rocprof is upgraded to rocprofv3 when available | Does not require rocprofv3 on the submission host; other rocprof-family tools are omitted if rocprofv3 is missing (see multi-node profiling behavior below) |
| When to prefer | Deep analysis, hardware counters, Perfetto from rocprofv3 | Minimal-deps dispatch trace, RPD-compatible `.db` |
Multi-node profiling: Multi-node runs that use only tools outside the rocprof/rocprofv3 family (such as rocm_trace_lite or rocm_trace_lite_default) keep profiling enabled even when rocprofv3 is not installed on the machine submitting the job. If the tool list includes rocprof or any rocprofv3_* preset and rocprofv3 is unavailable, those entries are dropped; if no tools remain, profiling is disabled and the usual rocprofiler-sdk installation guidance is logged.
ROCprofv3 is the next-generation profiler for ROCm 7.0+ with enhanced features and better performance. madengine provides pre-configured profiles for common bottleneck scenarios.
Compute-Bound Analysis:
```json
{
  "tools": [
    {"name": "rocprofv3_compute"}
  ]
}
```

- Use Case: Models bottlenecked by ALU operations
- Metrics: Wave execution, VALU/SALU instructions, wait states
- Output Format: Perfetto trace with hardware counters
Memory-Bound Analysis:
```json
{
  "tools": [
    {"name": "rocprofv3_memory"}
  ]
}
```

- Use Case: Models bottlenecked by memory bandwidth
- Metrics: Cache hits/misses, memory transfers, LDS usage
- Output Format: Perfetto trace with memory counters
Communication-Bound Analysis (Multi-GPU):
```json
{
  "tools": [
    {"name": "rocprofv3_communication"}
  ]
}
```

- Use Case: Multi-GPU distributed training
- Metrics: RCCL traces, inter-GPU transfers, synchronization
- Output Format: Perfetto trace with RCCL data
Comprehensive Profiling:
```json
{
  "tools": [
    {"name": "rocprofv3_full"}
  ]
}
```

- Use Case: Complete analysis with all metrics (high overhead)
- Metrics: All traces + counters + stats
- Output Format: Perfetto trace with full instrumentation
Lightweight Profiling:
```json
{
  "tools": [
    {"name": "rocprofv3_lightweight"}
  ]
}
```

- Use Case: Production-like profiling with minimal overhead
- Metrics: HIP and kernel traces only
- Output Format: JSON (compact)
Perfetto Visualization:
```json
{
  "tools": [
    {"name": "rocprofv3_perfetto"}
  ]
}
```

- Use Case: Generate Perfetto-compatible traces
- Metrics: HIP, kernel, memory traces
- Output Format: Perfetto trace file (`.pftrace`)
API Overhead Analysis:
```json
{
  "tools": [
    {"name": "rocprofv3_api_overhead"}
  ]
}
```

- Use Case: Analyze HIP/HSA API call overhead
- Metrics: API call timing and statistics
- Output Format: JSON with stats
PC Sampling (Hotspot Analysis):
```json
{
  "tools": [
    {"name": "rocprofv3_pc_sampling"}
  ]
}
```

- Use Case: Identify kernel hotspots
- Metrics: Program counter sampling at 1000 Hz
- Output Format: Perfetto trace with PC samples
madengine provides ready-to-use configuration files in examples/profiling-configs/:
```bash
# Compute-bound profiling
madengine run --tags your_model \
  --additional-context-file examples/profiling-configs/rocprofv3_compute_bound.json

# Memory-bound profiling
madengine run --tags your_model \
  --additional-context-file examples/profiling-configs/rocprofv3_memory_bound.json

# Multi-GPU profiling
madengine run --tags your_model \
  --additional-context-file examples/profiling-configs/rocprofv3_multi_gpu.json

# Comprehensive profiling
madengine run --tags your_model \
  --additional-context-file examples/profiling-configs/rocprofv3_comprehensive.json

# rocm-trace-lite (RTL) — not a rocprofv3 preset; do not mix with rocprof on the same run
madengine run --tags your_model \
  --additional-context-file examples/profiling-configs/rocm_trace_lite.json
madengine run --tags your_model \
  --additional-context-file examples/profiling-configs/rocm_trace_lite_default.json
```

See `examples/profiling-configs/README.md` for complete documentation.
For advanced users, customize rocprofv3 invocation:
```json
{
  "tools": [
    {
      "name": "rocprof",
      "cmd": "bash ../scripts/common/tools/rocprof_wrapper.sh --hip-trace --kernel-trace --memory-copy-trace --rccl-trace --counter-collection -i custom_counters.txt --output-format pftrace --stats -d ./my_output --",
      "env_vars": {
        "RCCL_DEBUG": "TRACE",
        "HSA_ENABLE_SDMA": "0"
      }
    }
  ]
}
```

Important: The `--` separator at the end of the cmd string is required when using `rocprof_wrapper.sh`. This separator distinguishes between profiler options and the application command:
- rocprofv3 (ROCm >= 7.0): Requires the `--` separator → `rocprofv3 [options] -- <app>`
- rocprof (legacy): Works with or without `--` → `rocprof [options] <app>`
The wrapper auto-detects which profiler is available and formats arguments correctly. Always include the trailing -- in your custom commands to ensure compatibility with both versions.
Custom counter files are in scripts/common/tools/counters/:
- `compute_bound.txt` - ALU and execution metrics
- `memory_bound.txt` - Cache and memory metrics
- `communication_bound.txt` - PCIe and synchronization metrics
- `full_profile.txt` - Comprehensive metrics
Create your own counter file:
```text
# my_counters.txt
pmc: SQ_WAVES
pmc: SQ_INSTS_VALU
pmc: L2CacheHit
pmc: TCC_HIT_sum
```

Then use it:

```bash
madengine run --tags your_model \
  --additional-context '{
    "tools": [{
      "name": "rocprof",
      "cmd": "bash ../scripts/common/tools/rocprof_wrapper.sh --counter-collection -i my_counters.txt --output-format pftrace -d ./output --"
    }]
  }'
```

Trace rocBLAS API calls and configurations:
```json
{
  "tools": [
    {"name": "rocblas_trace"}
  ]
}
```

Output:
- Trace logs in execution output
- `library_trace.csv` with library call summary
Use Case: Analyze BLAS operations, identify optimization opportunities
Trace MIOpen API calls for deep learning operations:
```json
{
  "tools": [
    {"name": "miopen_trace"}
  ]
}
```

Output:
- Trace logs in execution output
- `library_trace.csv` with convolution, pooling, and other DNN operations
Use Case: Optimize deep learning layers, analyze convolution configurations
Trace Tensile matrix operations:
```json
{
  "tools": [
    {"name": "tensile_trace"}
  ]
}
```

Output:
- Trace logs in execution output
- `library_trace.csv` with matrix operation details
Use Case: Analyze GEMM operations, optimize matrix multiplications
Trace RCCL collective communication operations:
```json
{
  "tools": [
    {"name": "rccl_trace"}
  ]
}
```

Output: Trace logs with communication patterns
Use Case: Debug multi-GPU communication, optimize distributed training
Profile real-time GPU power consumption:
```json
{
  "tools": [
    {"name": "gpu_info_power_profiler"}
  ]
}
```

Output: `gpu_info_power_profiler_output.csv`
Configuration:
```json
{
  "tools": [
    {
      "name": "gpu_info_power_profiler",
      "env_vars": {
        "POWER_DEVICE": "0",
        "POWER_SAMPLING_RATE": "0.1"
      }
    }
  ]
}
```

Environment Variables:
- `POWER_DEVICE` - GPU device(s): `"0"`, `"0,1,2"`, or `"all"` (default: `"all"`)
- `POWER_SAMPLING_RATE` - Sampling interval in seconds (default: `"0.1"`)
- `POWER_MODE` - Must be `"power"` for this tool (default: `"power"`)
- `POWER_DUAL_GCD` - Enable dual-GCD mode: `"true"` or `"false"` (default: `"false"`)
Note: To customize, override in tools configuration:
```json
{
  "tools": [
    {
      "name": "gpu_info_power_profiler",
      "env_vars": {
        "POWER_DEVICE": "0,1",
        "POWER_SAMPLING_RATE": "0.2"
      }
    }
  ]
}
```

Supported Platforms: ROCm and CUDA
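Once a run completes, the power CSV can be summarized with a few lines of Python. The `"power"` column name here is an assumption for illustration; inspect the header of `gpu_info_power_profiler_output.csv` to see the field names your madengine version actually writes:

```python
import csv
import io

def average_column(csv_text: str, column: str) -> float:
    """Average one numeric column of a sampled profiler CSV.

    The column name (e.g. "power") is illustrative; check the real
    CSV header before using this on actual profiler output.
    """
    rows = csv.DictReader(io.StringIO(csv_text))
    values = [float(row[column]) for row in rows]
    return sum(values) / len(values)
```

With a hypothetical two-sample file (`power` readings 210.5 W and 215.5 W), `average_column(text, "power")` returns 213.0.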
Profile real-time GPU memory consumption:
```json
{
  "tools": [
    {"name": "gpu_info_vram_profiler"}
  ]
}
```

Output: `gpu_info_vram_profiler_output.csv`
Configuration:
```json
{
  "tools": [
    {
      "name": "gpu_info_vram_profiler",
      "env_vars": {
        "VRAM_DEVICE": "all",
        "VRAM_SAMPLING_RATE": "0.5"
      }
    }
  ]
}
```

Environment Variables:
- `VRAM_DEVICE` - GPU device(s): `"0"`, `"0,1,2"`, or `"all"` (default: `"all"`)
- `VRAM_SAMPLING_RATE` - Sampling interval in seconds (default: `"0.1"`)
- `VRAM_MODE` - Must be `"vram"` for this tool (default: `"vram"`)
- `VRAM_DUAL_GCD` - Enable dual-GCD mode: `"true"` or `"false"` (default: `"false"`)
Using Both Profilers Together:
```json
{
  "tools": [
    {"name": "gpu_info_power_profiler"},
    {"name": "gpu_info_vram_profiler"}
  ]
}
```

This will generate both `gpu_info_power_profiler_output.csv` and `gpu_info_vram_profiler_output.csv`.
Supported Platforms: ROCm and CUDA
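For VRAM data, the usual question is the peak, not the average. A minimal sketch, assuming a numeric `"vram"` column (the real column names depend on the profiler version, so check the header of `gpu_info_vram_profiler_output.csv` first):

```python
import csv
import io

def peak_value(csv_text: str, column: str) -> float:
    """Peak of one numeric column from a sampled profiler CSV.

    The "vram" column name used by callers is illustrative only;
    inspect the actual CSV header for the real field names.
    """
    rows = csv.DictReader(io.StringIO(csv_text))
    return max(float(row[column]) for row in rows)
```

Given hypothetical samples 1024, 4096, 2048, `peak_value(text, "vram")` returns 4096.0.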
Tools can be stacked to collect multiple types of profiling data simultaneously. Tools are applied in order, with the first tool being innermost:
```json
{
  "tools": [
    {"name": "rocprof"},
    {"name": "miopen_trace"},
    {"name": "rocblas_trace"}
  ]
}
```

Execution Order:
- Setup: rocblas_trace → miopen_trace → rocprof
- Run: Model execution
- Teardown: rocprof → miopen_trace → rocblas_trace
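Conceptually, stacking nests each tool around the model command, with the first listed tool innermost. The sketch below illustrates the ordering only; madengine actually applies tools through pre/post scripts and environment variables rather than literal command prefixes:

```python
def compose_stack(tools: list[str], model_cmd: str) -> str:
    """Wrap model_cmd with each tool in list order; the first tool
    ends up innermost, matching the setup/teardown order above."""
    cmd = model_cmd
    for tool in tools:
        cmd = f"{tool} {cmd}"
    return cmd
```

For the three-tool stack above, `compose_stack(["rocprof", "miopen_trace", "rocblas_trace"], "python3 train.py")` yields `"rocblas_trace miopen_trace rocprof python3 train.py"`: rocprof sits closest to the model, rocblas_trace outermost.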
Example:
```bash
madengine run --tags pyt_torchvision_alexnet \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "tools": [
      {"name": "rocprof"},
      {"name": "miopen_trace"}
    ]
  }'
```

Analyze and compare performance of different library configurations by:
- Collecting library call traces
- Measuring performance of different configurations
- Comparing competitive implementations
Collect library API call traces:
```bash
# Trace MIOpen calls
madengine run --tags pyt_torchvision_alexnet \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "tools": [{"name": "miopen_trace"}]
  }'

# Trace rocBLAS calls
madengine run --tags pyt_torchvision_alexnet \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "tools": [{"name": "rocblas_trace"}]
  }'
```

Or collect both in one run:

```bash
madengine run --tags pyt_torchvision_alexnet \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "tools": [
      {"name": "miopen_trace"},
      {"name": "rocblas_trace"}
    ]
  }'
```

Output: `library_trace.csv` containing library calls and configurations
Use the collected traces to benchmark different library configurations:
```bash
madengine run --tags pyt_library_config_perf
```

Prerequisites:
- `library_trace.csv` must exist in the current directory
- Contains library call configurations from Step 1

Output: `library_perf.csv` with performance data for each configuration
Platform Support: Works on both AMD and NVIDIA GPUs
Compare results from library_perf.csv to:
- Identify optimal library configurations
- Compare performance across different implementations
- Validate optimization opportunities
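A comparison over `library_perf.csv` can be sketched in a few lines. The `"config"` and `"time_ms"` column names are placeholders for illustration; check the header of the file your run produced for the actual columns:

```python
import csv
import io

def fastest_per_config(csv_text: str, key: str = "config", metric: str = "time_ms") -> dict:
    """Keep the best (lowest) metric value seen for each configuration.

    Column names are assumptions; adapt them to the real header of
    library_perf.csv before use.
    """
    best: dict = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = float(row[metric])
        name = row[key]
        if name not in best or value < best[name]:
            best[name] = value
    return best
```

On a hypothetical file with two runs of config A (2.0 ms, 1.5 ms) and one of B (3.0 ms), this returns `{"A": 1.5, "B": 3.0}`.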
```bash
# Step 1: Collect comprehensive traces
madengine run --tags model \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "tools": [
      {"name": "rocprof"},
      {"name": "gpu_info_power_profiler"},
      {"name": "gpu_info_vram_profiler"}
    ]
  }'

# Step 2: Analyze results
ls -lh rocprof_output/
cat gpu_info_power_profiler_output.csv
cat gpu_info_vram_profiler_output.csv
```

```bash
# 1. Profile current implementation
madengine run --tags model \
  --additional-context '{"tools": [{"name": "miopen_trace"}]}'

# 2. Test library configurations
madengine run --tags pyt_library_config_perf

# 3. Analyze and compare
python analyze_library_perf.py library_perf.csv
```

```bash
madengine run --tags model \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "docker_gpus": "0,1,2,3",
    "tools": [
      {
        "name": "gpu_info_power_profiler",
        "env_vars": {
          "DEVICE": "all",
          "SAMPLING_RATE": "0.1"
        }
      },
      {"name": "rccl_trace"}
    ]
  }'
```

| Tool | Output File(s) | Content |
|---|---|---|
| `rocprof` | `rocprof_output/*` | GPU kernel traces, HIP API calls |
| `rpd` | Various RPD files | ROCm profiler data |
| `rocblas_trace` | `library_trace.csv`, logs | rocBLAS API calls |
| `miopen_trace` | `library_trace.csv`, logs | MIOpen API calls |
| `tensile_trace` | `library_trace.csv`, logs | Tensile operations |
| `rccl_trace` | Execution logs | RCCL communication |
| `gpu_info_power_profiler` | `gpu_info_power_profiler_output.csv` | Power consumption over time |
| `gpu_info_vram_profiler` | `gpu_info_vram_profiler_output.csv` | VRAM usage over time |
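A quick sanity check after a run is to verify the expected artifacts exist. A minimal sketch using the file names from the reference table (the mapping is partial and illustrative; adjust it to the tools you enabled):

```python
from pathlib import Path

# Artifact names taken from the output reference table; which files
# actually appear depends on the tools enabled for the run.
EXPECTED_OUTPUTS = {
    "rocprof": ["rocprof_output"],
    "rocblas_trace": ["library_trace.csv"],
    "miopen_trace": ["library_trace.csv"],
    "tensile_trace": ["library_trace.csv"],
    "gpu_info_power_profiler": ["gpu_info_power_profiler_output.csv"],
    "gpu_info_vram_profiler": ["gpu_info_vram_profiler_output.csv"],
}

def missing_outputs(tools: list[str], workdir: str = ".") -> list[str]:
    """Return expected artifacts that are absent from workdir."""
    root = Path(workdir)
    return [name for tool in tools for name in EXPECTED_OUTPUTS.get(tool, [])
            if not (root / name).exists()]
```

An empty return list means all expected files are present; anything returned points at a tool whose setup or teardown should be checked in the logs.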
All tools support these configuration keys:
Override the default profiling command:
```json
{
  "tools": [
    {
      "name": "rocprof",
      "cmd": "rocprof --timestamp on --hip-trace"
    }
  ]
}
```

Note: The tool binary name must be included in custom commands.
Set tool-specific environment variables:
```json
{
  "tools": [
    {
      "name": "rocprof",
      "env_vars": {
        "NCCL_DEBUG": "INFO",
        "HSA_ENABLE_SDMA": "0"
      }
    }
  ]
}
```

Profiling works best with single model tags:
```bash
# Good
madengine run --tags pyt_torchvision_alexnet \
  --additional-context '{"tools": [{"name": "rocprof"}]}'

# Avoid
madengine run --tags model1 model2 model3 \
  --additional-context '{"tools": [{"name": "rocprof"}]}'
```

For complex profiling setups:
```json
{
  "gpu_vendor": "AMD",
  "guest_os": "UBUNTU",
  "tools": [
    {
      "name": "rocprof",
      "cmd": "rocprof --timestamp on"
    },
    {
      "name": "gpu_info_power_profiler",
      "env_vars": {
        "POWER_DEVICE": "all",
        "POWER_SAMPLING_RATE": "0.1"
      }
    }
  ]
}
```

Balance detail vs. overhead:
```json
{
  "tools": [
    {
      "name": "gpu_info_power_profiler",
      "env_vars": {
        "SAMPLING_RATE": "1.0"
      }
    }
  ]
}
```

A larger interval such as `"1.0"` reduces overhead at the cost of detail. (Note that JSON does not allow inline comments, so keep such notes outside the configuration file.)

Group related profiling tools:
```json
{
  "tools": [
    {"name": "miopen_trace"},
    {"name": "rocblas_trace"},
    {"name": "tensile_trace"}
  ]
}
```

For performance-critical profiling:

```bash
# Baseline run (no profiling)
madengine run --tags model

# Profiling run
madengine run --tags model \
  --additional-context '{"tools": [{"name": "rocprof"}]}'
```

Error: Tool binary not available
Solution:
```bash
# Verify tool is installed
which rocprof
which rocblas-bench

# Check container has tools
docker run --rm rocm/pytorch:latest which rocprof
```

Error: Profiling produces empty results
Causes:
- Model execution too fast
- Incorrect device selection
- Tool configuration error
Solutions:
- Increase workload size
- Verify GPU device IDs
- Check tool logs for errors
Error: Profiling significantly slows execution
Solutions:
- Reduce sampling rate
- Use fewer stacked tools
- Profile subset of execution
- Use targeted profiling
Error: Library trace file missing
Causes:
- No library calls made
- Tool not properly initialized
- Output directory permission issues
Solutions:
- Verify model uses the library (e.g., uses convolutions for MIOpen)
- Check execution logs for errors
- Verify write permissions
Issue: Model runs marked as FAILURE despite successful execution
Symptoms:
- Status shows FAILURE but performance metrics are reported
- Log contains ROCProf messages like `E20251230 ... Opened result file`
- Error pattern `Error:` detected in logs
Root Cause:
ROCProf uses glog-style logging where E prefix means "Error level log" (not an actual error). These informational messages were incorrectly triggering failure detection.
Fixed in: madengine v2.0+
For false failures not caused by ROCProf (for example workloads that print benign RuntimeError: text), see Configuration — Run phase: log error pattern scan (log_error_pattern_scan, log_error_benign_patterns).
Verification:
```bash
# Run with profiling - should show SUCCESS status
madengine run --tags pyt_huggingface_gpt2 \
  --additional-context '{
    "gpu_vendor": "AMD",
    "guest_os": "UBUNTU",
    "tools": [{"name": "rocprof"}, {"name": "rpd"}]
  }'

# Check status in output
# ✅ Expected: Status = SUCCESS, Performance = ~38-40 samples/second
```

Technical Details:
- ROCProf log patterns now excluded from error detection
- Error patterns made more specific (e.g., `RuntimeError:` vs `Error:`)
- Performance extraction hardened against bash segfaults during profiling
- Tests: `pytest tests/unit/test_error_handling.py::TestErrorPatternMatching`
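The distinction between glog-style informational lines and genuine failures can be illustrated with a small filter. The patterns here are illustrative only, not madengine's actual implementation:

```python
import re

# glog-style lines start with a severity letter (E/W/I) followed by a
# date, so "E20251230 ..." marks log level, not an actual error.
GLOG_LINE = re.compile(r"^[EWI]\d{8} ")
# Match specific failure signatures rather than a generic "Error:".
REAL_ERROR = re.compile(r"\bRuntimeError:|\bTraceback \(most recent call last\)")

def is_real_error(line: str) -> bool:
    """True only for genuine failure lines, never for ROCProf glog output."""
    if GLOG_LINE.match(line):
        return False
    return bool(REAL_ERROR.search(line))
```

With these patterns, `E20251230 ... Opened result file` is ignored, `RuntimeError: HIP error` is flagged, and a generic `Error:` message no longer trips failure detection.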
Profiling functionality is implemented via pre/post scripts:
Location:
- Pre-scripts: `scripts/common/pre_scripts/`
- Post-scripts: `scripts/common/post_scripts/`
Workflow:
- Pre-script: Tool setup and initialization
- Model execution: Tool collects data
- Post-script: Save results, cleanup
Tool defaults are defined in scripts/common/tools.json:
```json
{
  "rocprof": {
    "cmd": "rocprof --hip-trace",
    "env_vars": {}
  },
  "gpu_info_power_profiler": {
    "env_vars": {
      "DEVICE": "0",
      "SAMPLING_RATE": "0.1",
      "MODE": "power",
      "DUAL-GCD": "false"
    }
  }
}
```

To add new profiling tools:
- Create pre-script: `scripts/common/pre_scripts/tool_name_pre.sh`
- Create post-script: `scripts/common/post_scripts/tool_name_post.sh`
- Add default config to `scripts/common/tools.json`
- Test with madengine
Validate TheRock ROCm installations before running models. TheRock is AMD's lightweight build system for HIP and ROCm, distributed via Python pip packages.
Enable TheRock validation:
```bash
madengine run --tags dummy_therock \
  --tools therock_check \
  --additional-context '{"gpu_vendor": "AMD", "guest_os": "UBUNTU"}'
```

Standalone detection:
```bash
# Shell script (quick check)
bash src/madengine/scripts/common/tools/detect_therock.sh

# Python script (detailed output)
python3 src/madengine/scripts/common/tools/therock_detector.py --verbose

# JSON output (for scripting)
python3 src/madengine/scripts/common/tools/therock_detector.py --json
```

Detection methods:
- Python pip installations (`~/.local/lib/python*/site-packages/rocm`)
- Virtual environments with rocm packages
- System packages (`/usr/lib/python*/site-packages/rocm`)
- Tarball installations
- Local build directories
- Environment variables (`ROCM_PATH`, `HIP_PATH`)
Configuration in tools.json:
```json
{
  "therock_check": {
    "pre_scripts": [
      {
        "path": "scripts/common/tools/detect_therock.sh"
      }
    ],
    "cmd": "",
    "env_vars": {},
    "post_scripts": []
  }
}
```

Features:
- Non-blocking validation (warnings only)
- Automatic integration in the `dummy_therock` model
- Reports GPU targets and installation paths
- Exit code 0 = found, 1 = not found
Resources:
- Configuration Guide - Detailed profiling configuration
- Usage Guide - Running models with profiling
- Deployment Guide - Profiling in distributed environments