GPU kernel nodes (sgemm, NCCL) have no outgoing dependencies in chakra_trace_link output #215

@XavierQuerol

Describe the Bug

I am running a simple single-GPU GPT training script with the PyTorch profiler to collect execution traces, which I then convert to Chakra ET format using chakra_trace_link and chakra_converter.

After inspecting the final chakra_read_rank0.json, I find that GPU kernel nodes (sgemm, elementwise kernels, etc.) are never listed in the dataDeps or ctrlDeps of any other node in the graph. Every GPU kernel acts as a leaf. The only edges that exist go from CPU ops to their GPU kernel children, never in the other direction and never between GPU kernels.

The same behavior occurs with NCCL collective operations. In practice, this causes incorrect behavior in simulators: because NCCL kernel nodes have no incoming dependencies other than their CPU launcher, they are free to execute at the very start of the simulation. In a real training run, they would start only during the backward pass, after the relevant gradients have been computed.

I am not sure whether the problem originates in the chakra_trace_link step, the chakra_converter step, or in how the input traces are collected. However, I do not think it is a profiling issue, as the raw Kineto and PyTorch ET files appear to contain the expected events and correlation IDs.


Steps to Reproduce

1. Collect Traces

The relevant profiler setup in the training script is configured as follows:

import torch
from torch.profiler import ExecutionTraceObserver, profile
from torch.autograd.profiler import _ExperimentalConfig

# pyet_path and trace_handler are defined elsewhere in the training script.
et = ExecutionTraceObserver()
et.register_callback(pyet_path)

with profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    schedule=torch.profiler.schedule(
        skip_first=2, wait=1, warmup=3, active=1, repeat=1
    ),
    record_shapes=True,
    with_flops=True,
    on_trace_ready=trace_handler,
    execution_trace_observer=et,
    experimental_config=_ExperimentalConfig(enable_cuda_sync_events=True),
) as prof:
    for step, batch in enumerate(train_dataloader):
        # ... forward / backward / optimizer step ...
        prof.step()

et.unregister_callback()

This produces pytorch_et_rank0.json (schema 1.1.0-chakra.0.0.4) and kineto_trace_rank0.json.
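
One way to sanity-check the Kineto side before linking is to count GPU kernel events per CUDA stream and confirm that each carries a correlation ID. The sketch below assumes the Chrome-trace layout that Kineto typically emits: a top-level `traceEvents` list whose kernel events have `cat == "kernel"` and `args.stream` / `args.correlation`. Those field names are assumptions and may differ between PyTorch versions. `summarize_kernels` is a hypothetical helper, not part of Chakra or PyTorch, and the example events use made-up timestamps and IDs.

```python
from collections import defaultdict


def summarize_kernels(trace: dict) -> dict:
    """Group kernel events by stream and count those lacking a correlation ID.

    Assumption: kernel events have cat == "kernel" and carry
    args["stream"] and args["correlation"].
    """
    per_stream = defaultdict(lambda: {"kernels": 0, "missing_correlation": 0})
    for event in trace.get("traceEvents", []):
        if event.get("cat") != "kernel":
            continue
        args = event.get("args", {})
        stats = per_stream[args.get("stream")]
        stats["kernels"] += 1
        if "correlation" not in args:
            stats["missing_correlation"] += 1
    return dict(per_stream)


# Made-up events for illustration; load kineto_trace_rank0.json with
# json.load() to run the same check on a real trace.
example_trace = {
    "traceEvents": [
        {"cat": "cpu_op", "name": "aten::mm", "ts": 10, "dur": 59},
        {"cat": "kernel", "name": "ampere_sgemm_128x32_sliced1x4_tn",
         "ts": 20, "dur": 100, "args": {"stream": 7, "correlation": 1001}},
        {"cat": "kernel", "name": "ampere_sgemm_64x64_tn",
         "ts": 130, "dur": 80, "args": {"stream": 7, "correlation": 1002}},
    ]
}
summary = summarize_kernels(example_trace)
print(summary)
```

If every stream reports zero missing correlation IDs, the inputs at least contain what the linker needs to pair CPU launches with kernels, which would point away from trace collection as the cause.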

2. Link and Convert

Run the following post-processing pipeline:

chakra_trace_link \
    --rank 0 \
    --chakra-host-trace ./pytorch_et_rank0.json \
    --chakra-device-trace ./kineto_trace_rank0.json \
    --output-file ./merged/merged_rank0.json

chakra_converter PyTorch \
    --input ./merged/merged_rank0.json \
    --output ./chakra/chakra_rank.0.et --simulate

chakra_jsonizer \
    --input_filename  ./chakra/chakra_rank.0.et \
    --output_filename ./chakra_read/chakra_read_rank0.json

Environment & Version Information:

  • Chakra: 1.0.0 (via pip)
  • PyTorch: 2.7 (Schema: 1.1.0-chakra.0.0.4)

3. Inspect the Output

In chakra_read_rank0.json, taking three representative nodes as an example:

  • Node 361 (aten::mm): CPU op, 59 µs, dataDeps: [358, ...]
  • Node 362 (ampere_sgemm_128x32_sliced1x4_tn): GPU kernel, 98078 µs, dataDeps: [361], ctrlDeps: [361]
  • Node 372 (ampere_sgemm_64x64_tn): GPU kernel, next sgemm in the trace, dataDeps: [<its own CPU launcher>]

Node 362 does not appear in the dataDeps or ctrlDeps of any other node. Searching the entire JSON file for "362" as a dependency value returns zero results.
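
The search can be made systematic with a short script that lists every node never referenced as a dependency. This is a minimal sketch under two assumptions: that the chakra_jsonizer output is a sequence of concatenated JSON objects (one per node, possibly preceded by a metadata object without an `id`), and that nodes expose `id`, `dataDeps` and `ctrlDeps` keys whose values may be serialized as strings. Both helpers are hypothetical, and the sample input is a toy reconstruction of the nodes above plus a made-up CPU consumer (363), not real trace data.

```python
import json


def iter_json_objects(text: str):
    """Yield each top-level JSON value from a string of concatenated JSON."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def nodes_never_depended_on(nodes):
    """Return sorted ids of nodes absent from every dataDeps/ctrlDeps list."""
    referenced = set()
    for node in nodes:
        for key in ("dataDeps", "ctrlDeps"):
            referenced.update(int(dep) for dep in node.get(key, []))
    return sorted(int(n["id"]) for n in nodes if int(n["id"]) not in referenced)


# Toy input; replace with the contents of chakra_read_rank0.json.
sample = """
{"id": "361", "name": "aten::mm", "dataDeps": ["358"]}
{"id": "362", "name": "ampere_sgemm_128x32_sliced1x4_tn", "dataDeps": ["361"], "ctrlDeps": ["361"]}
{"id": "363", "name": "hypothetical_cpu_op", "dataDeps": ["361"]}
"""
nodes = [obj for obj in iter_json_objects(sample) if "id" in obj]
leaves = nodes_never_depended_on(nodes)
print(leaves)
```

In the toy input, 362 is reported because nothing depends on it, and 363 only because it is the last node in the sample. On the real file, the interesting question is whether every GPU kernel id shows up in this list.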


Expected Behavior

I would expect some dependency path to exist between consecutive GPU kernels (such as node 362 and node 372). As it stands, each GPU kernel node becomes ready as soon as its single CPU-launcher edge is satisfied, regardless of what earlier kernels are still running. This leaves simulators with no ordering information to prevent issuing them all at once, which does not reflect real execution.
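
The effect can be illustrated with a toy model that computes the earliest start time of each node when dependencies are the only constraint. It is a deliberately simplified sketch (unlimited parallelism, made-up durations and node names) and does not describe how any particular simulator schedules nodes.

```python
def earliest_start_times(nodes):
    """Earliest start per node when only dependencies constrain execution.

    nodes: dict id -> {"dur": int, "deps": [ids]}, listed in topological order.
    """
    start, finish = {}, {}
    for node_id, node in nodes.items():
        start[node_id] = max((finish[d] for d in node["deps"]), default=0)
        finish[node_id] = start[node_id] + node["dur"]
    return start


# Three CPU launchers in sequence, each launching one GPU kernel.
unchained = {
    "cpu_a": {"dur": 5, "deps": []},
    "cpu_b": {"dur": 5, "deps": ["cpu_a"]},
    "cpu_c": {"dur": 5, "deps": ["cpu_b"]},
    "kernel_a": {"dur": 100, "deps": ["cpu_a"]},
    "kernel_b": {"dur": 100, "deps": ["cpu_b"]},
    "nccl_c": {"dur": 100, "deps": ["cpu_c"]},
}

# Same graph with each kernel also depending on the previous kernel.
chained = {k: dict(v) for k, v in unchained.items()}
chained["kernel_b"] = {"dur": 100, "deps": ["cpu_b", "kernel_a"]}
chained["nccl_c"] = {"dur": 100, "deps": ["cpu_c", "kernel_b"]}

unchained_start = earliest_start_times(unchained)
chained_start = earliest_start_times(chained)
print(unchained_start["nccl_c"], chained_start["nccl_c"])
```

Without kernel-to-kernel edges the collective becomes ready after only the CPU launch chain (time 15 in this toy graph); with them it waits for the preceding kernels (time 205).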

Two possible architectural approaches come to mind, though I am unsure which is intended:

  1. Intra-stream chaining: All kernels submitted to the same CUDA stream are linked sequentially (kernel[i+1] depends on kernel[i]), since CUDA guarantees stream-ordered execution.
  2. GPU kernel promoted into the CPU chain: Any node that currently has a dataDep on a CPU launcher (e.g., node 363 depending on 361) would instead depend on that launcher's GPU child (e.g., node 363 depends on 362). This places the GPU kernel in the middle of the execution chain rather than leaving it dangling as a leaf.
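
Both approaches can be sketched as small graph rewrites. The code below is illustrative only and is not taken from Chakra: the input shapes (`id`, `stream`, `ts` keys, a `deps` mapping, a `gpu_child_of` mapping with one kernel per launcher) are hypothetical, and the stream number and timestamps in the example are made up.

```python
from collections import defaultdict


def chain_kernels_per_stream(kernels):
    """Approach 1: map each kernel id to the previous kernel on its stream.

    kernels: list of dicts with hypothetical keys "id", "stream", "ts".
    """
    by_stream = defaultdict(list)
    for kernel in kernels:
        by_stream[kernel["stream"]].append(kernel)
    extra_deps = {}
    for stream_kernels in by_stream.values():
        stream_kernels.sort(key=lambda k: k["ts"])
        for prev, cur in zip(stream_kernels, stream_kernels[1:]):
            extra_deps[cur["id"]] = prev["id"]
    return extra_deps


def promote_gpu_children(deps, gpu_child_of):
    """Approach 2: redirect deps on a CPU launcher to its GPU child.

    deps: dict node_id -> list of dependency ids.
    gpu_child_of: dict launcher_id -> id of the GPU kernel it launched.
    GPU kernels keep their edge to their own launcher.
    """
    gpu_ids = set(gpu_child_of.values())
    rewritten = {}
    for node_id, node_deps in deps.items():
        if node_id in gpu_ids:
            rewritten[node_id] = list(node_deps)
        else:
            rewritten[node_id] = [gpu_child_of.get(d, d) for d in node_deps]
    return rewritten


# Approach 1: kernel 372 follows kernel 362 on the same (made-up) stream.
extra = chain_kernels_per_stream([
    {"id": 372, "stream": 7, "ts": 130},
    {"id": 362, "stream": 7, "ts": 20},
])
print(extra)

# Approach 2: node 363 now depends on kernel 362 instead of launcher 361.
promoted = promote_gpu_children(
    {361: [358], 362: [361], 363: [361]},
    gpu_child_of={361: 362},
)
print(promoted)
```

The first rewrite only adds edges, so existing CPU-side dependencies are untouched. The second changes existing edges, which would make every downstream CPU op wait for the kernel to finish and may be stricter than real asynchronous launch behavior.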

Attachments

The reduced input and output trace files are attached below for reference:
