Describe the Bug
I am running a simple single-GPU GPT training script with the PyTorch profiler to collect execution traces, which I then convert to Chakra ET format using chakra_trace_link and chakra_converter.
After inspecting the final chakra_read_rank0.json, I find that GPU kernel nodes (sgemm, elementwise kernels, etc.) are never listed as dataDeps or ctrlDeps of any other node in the graph. Every GPU kernel acts as a leaf. The only edges that exist go from CPU ops to their GPU kernel children—never in the other direction, and never between GPU kernels.
The same behavior occurs with NCCL collective operations. In practice, this causes incorrect behavior in simulators: because NCCL kernel nodes have no meaningful incoming dependencies, they are free to execute at the very start of the simulation. In a real training run, they would only start during the backward pass, after the relevant gradients have been computed.
I am not sure whether the problem originates in the chakra_trace_link step, the chakra_converter step, or in how the input traces are collected. However, I do not think it is a profiling issue, as the raw Kineto and PyTorch ET files appear to contain the expected events and correlation IDs.
Steps to Reproduce
1. Collect Traces
The relevant profiler setup in the training script is configured as follows:
import torch
from torch.profiler import ExecutionTraceObserver, profile
from torch.autograd.profiler import _ExperimentalConfig

# pyet_path, trace_handler, and train_dataloader are defined elsewhere in the script.
et = ExecutionTraceObserver()
et.register_callback(pyet_path)

with profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    schedule=torch.profiler.schedule(
        skip_first=2, wait=1, warmup=3, active=1, repeat=1
    ),
    record_shapes=True,
    with_flops=True,
    on_trace_ready=trace_handler,
    execution_trace_observer=et,
    experimental_config=_ExperimentalConfig(enable_cuda_sync_events=True),
) as prof:
    for step, batch in enumerate(train_dataloader):
        # ... forward / backward / optimizer step ...
        prof.step()

et.unregister_callback()
This produces pytorch_et_rank0.json (schema 1.1.0-chakra.0.0.4) and kineto_trace_rank0.json.
2. Link and Convert
Run the following post-processing pipeline:
chakra_trace_link \
    --rank 0 \
    --chakra-host-trace ./pytorch_et_rank0.json \
    --chakra-device-trace ./kineto_trace_rank0.json \
    --output-file ./merged/merged_rank0.json

chakra_converter PyTorch \
    --input ./merged/merged_rank0.json \
    --output ./chakra/chakra_rank.0.et --simulate

chakra_jsonizer \
    --input_filename ./chakra/chakra_rank.0.et \
    --output_filename ./chakra_read/chakra_read_rank0.json
Environment & Version Information:
- Chakra: 1.0.0 (via pip)
- PyTorch: 2.7 (Schema: 1.1.0-chakra.0.0.4)
3. Inspect the Output
In chakra_read_rank0.json, taking three representative nodes as an example:
- Node 361 (aten::mm): CPU op, 59 µs, dataDeps: [358, ...]
- Node 362 (ampere_sgemm_128x32_sliced1x4_tn): GPU kernel, 98078 µs, dataDeps: [361], ctrlDeps: [361]
- Node 372 (ampere_sgemm_64x64_tn): GPU kernel, next sgemm in the trace, dataDeps: [<its own CPU launcher>]
Node 362 does not appear in the dataDeps or ctrlDeps of any other node. Searching the entire JSON file for "362" as a dependency value returns zero results.
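The search above can also be scripted. This is a minimal sketch, assuming the jsonized nodes have already been parsed into a list of dicts with id, dataDeps, and ctrlDeps keys (the key names follow the fields quoted in this report; parsing the file is left out, and the helper name find_unreferenced is hypothetical). IDs are passed through int() in case they are serialized as strings.

```python
def find_unreferenced(nodes):
    """Return sorted IDs of nodes that no node lists as a data or control dependency."""
    referenced = set()
    for node in nodes:
        for key in ("dataDeps", "ctrlDeps"):
            referenced.update(int(dep) for dep in node.get(key, []))
    return sorted(int(n["id"]) for n in nodes if int(n["id"]) not in referenced)


# Toy data mirroring the three nodes above. Node 371 is a hypothetical
# CPU launcher standing in for "<its own CPU launcher>" of node 372.
sample = [
    {"id": 358},
    {"id": 361, "dataDeps": [358]},
    {"id": 362, "dataDeps": [361], "ctrlDeps": [361]},
    {"id": 371, "dataDeps": [361]},
    {"id": 372, "dataDeps": [371]},
]
print(find_unreferenced(sample))  # [362, 372]
```

On the toy data, both GPU kernels come back as unreferenced, which matches what I see in the real file.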
Expected Behavior
I would expect some dependency path to exist between consecutive GPU kernels (such as node 362 and node 372). As it stands, each GPU kernel node becomes ready as soon as its single CPU-launcher edge is satisfied, regardless of whether earlier kernels have finished. This leaves simulators with no ordering information to prevent issuing them all at once, which does not reflect real execution.
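To make the consequence concrete, here is a small worked sketch of the earliest start times a purely dependency-driven simulator could assign. It is not how any particular simulator works; it assumes unlimited resources, and the launcher node 371 and the durations of nodes 371 and 372 are hypothetical placeholders. Only the durations of nodes 361 and 362 come from the trace, and node 361's own dependency on 358 is omitted.

```python
def earliest_start(nodes):
    """Earliest start/finish of each node when only graph edges constrain it.

    `nodes` maps id -> (duration_us, [dependency ids]) and must be listed
    in topological order (dependencies before dependents).
    """
    start, finish = {}, {}
    for nid, (dur, deps) in nodes.items():
        start[nid] = max((finish[d] for d in deps), default=0)
        finish[nid] = start[nid] + dur
    return start, finish


nodes = {
    361: (59, []),        # aten::mm, CPU op (from the trace)
    362: (98078, [361]),  # sgemm kernel launched by 361 (from the trace)
    371: (50, [361]),     # hypothetical CPU launcher of the next sgemm
    372: (1000, [371]),   # next sgemm; duration is a placeholder
}
start, finish = earliest_start(nodes)
print(start[372], finish[362])  # 109 98137
```

With these numbers, node 372 may start at 109 µs, while node 362 is still running until 98137 µs. Nothing in the graph prevents the overlap.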
Two possible architectural approaches come to mind, though I am unsure which is intended:
- Intra-stream chaining: All kernels submitted to the same CUDA stream are linked sequentially (kernel[i+1] depends on kernel[i]), since CUDA guarantees stream-ordered execution.
- GPU kernel promoted into the CPU chain: Any node that currently has a dataDep on a CPU launcher (e.g., node 363 depending on 361) would instead depend on that launcher's GPU child (e.g., node 363 depends on 362). This places the GPU kernel in the middle of the execution chain rather than leaving it dangling as a leaf.
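Both ideas can be sketched in a few lines. This is only an illustration of what I mean, not a proposal for where it should live in Chakra. The function names, the (node_id, stream_id, start_ts) tuple layout, and the stream ID and timestamps in the example are all hypothetical.

```python
from collections import defaultdict


def chain_streams(kernels):
    """Return {kernel id: predecessor id} linking kernels on the same stream.

    `kernels` is an iterable of (node_id, stream_id, start_ts) tuples.
    """
    by_stream = defaultdict(list)
    for node_id, stream, ts in kernels:
        by_stream[stream].append((ts, node_id))
    extra = {}
    for entries in by_stream.values():
        entries.sort()  # order kernels by start timestamp within the stream
        for (_, prev), (_, cur) in zip(entries, entries[1:]):
            extra[cur] = prev
    return extra


def promote_gpu_children(deps, gpu_ids):
    """Re-point CPU nodes from a launcher to that launcher's GPU kernels.

    `deps` maps node id -> list of dependency ids; `gpu_ids` is the set of
    GPU kernel ids. GPU kernels keep their launcher edge unchanged.
    """
    children = defaultdict(list)
    for nid in gpu_ids:
        for d in deps.get(nid, []):
            children[d].append(nid)
    out = {}
    for nid, ds in deps.items():
        if nid in gpu_ids:
            out[nid] = list(ds)
        else:
            out[nid] = [k for d in ds for k in children.get(d, [d])]
    return out


# Approach 1: node 372 gains an edge to node 362 (stream 7 is made up).
print(chain_streams([(372, 7, 200), (362, 7, 100)]))  # {372: 362}

# Approach 2: node 363 now depends on 362 instead of 361.
print(promote_gpu_children({361: [358], 362: [361], 363: [361]}, {362}))
# {361: [358], 362: [361], 363: [362]}
```

The first approach only adds edges between kernels, while the second rewrites existing CPU edges, so it changes the CPU chain as well.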
Attachments
The reduced input and output trace files are attached below for reference: