A Go CLI for analyzing PyTorch Profiler traces.
Reimplements most of the workflows in HolisticTraceAnalysis, with the following features:
- install-skills supports agent usage.
- Single Go binary.
- Uses SQLite tables to store intermediate state.
- Markdown output for CLI usage.
go build -o trace-blame ./cmd/trace-blame/
# Install the accompanying Claude skills
trace-blame install-skills
# Parse raw traces into a SQLite database
trace-blame pre-process --trace-dir ./traces --output trace.db
# Run analyses
trace-blame temporal-breakdown --db trace.db
trace-blame gpu-kernel-breakdown --db trace.db
trace-blame idle-time-breakdown --db trace.db --ranks 0,1

| Category | Subcommands |
|---|---|
| Preprocessing | pre-process |
| Overview | temporal-breakdown, comm-comp-overlap, profiler-steps, potential-stragglers |
| GPU Kernels | gpu-kernel-breakdown, gpu-kernels-with-annotations, cuda-kernel-launch-stats, aten-op-kernels-and-delay, frequent-cuda-kernel-sequences |
| Counters | queue-length-summary, queue-length-time-series, blocked-on-full-queue, memory-bw-summary, memory-bw-time-series, generate-trace-with-counters |
| Idle Time | idle-time-breakdown |
| Critical Path | critical-path |
| CUPTI | cupti-counter-data |
Run trace-blame with no arguments for usage, or trace-blame <subcommand> -h for flag details.
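
To collect several reports in one pass, the analyses can also be driven from a small Go program. The following is only a sketch, not part of trace-blame: it assumes the binary is on your PATH and that trace.db was already produced by pre-process, and it uses only the three subcommands and the --db flag shown in the usage above. The helper name analysisArgs is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// analysisArgs builds the argument list for one analysis subcommand.
// Hypothetical helper for this sketch; it only mirrors the
// `trace-blame <subcommand> --db <path>` form shown in the usage above.
func analysisArgs(subcommand, dbPath string) []string {
	return []string{subcommand, "--db", dbPath}
}

func main() {
	// Assumption: trace-blame has been built and placed on PATH.
	if _, err := exec.LookPath("trace-blame"); err != nil {
		fmt.Println("trace-blame not found on PATH; build it first")
		return
	}

	// Only subcommands whose --db usage appears in the README are listed.
	subcommands := []string{
		"temporal-breakdown",
		"gpu-kernel-breakdown",
		"idle-time-breakdown",
	}

	for _, sub := range subcommands {
		cmd := exec.Command("trace-blame", analysisArgs(sub, "trace.db")...)
		out, err := cmd.CombinedOutput()
		if err != nil {
			fmt.Fprintf(os.Stderr, "%s failed: %v\n%s", sub, err, out)
			os.Exit(1)
		}
		// Print each report under its own Markdown heading.
		fmt.Printf("## %s\n\n%s\n", sub, out)
	}
}
```

Redirecting the program's standard output to a file gives a single Markdown document with one section per analysis.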
- Expand debugging workflows beyond the current HTA coverage
- Navigate up and down the operator call stack from within the agent
- Traverse forward and backward along a CUDA stream in the agent
- Support memory-profiling workflows
Ideas and contributions are welcome! See CONTRIBUTING.md to get started.
Check out these tools that make debugging PyTorch training jobs easier: