GPU kernel nodes (sgemm, NCCL) have no outgoing dependencies in chakra_trace_linker output #215

## Describe the Bug
I am running a simple single-GPU GPT training script with the PyTorch profiler to collect execution traces, which I then convert to Chakra ET format using `chakra_trace_link` and `chakra_converter`. 

After inspecting the final `chakra_read_rank0.json`, I find that **GPU kernel nodes (sgemm, elementwise kernels, etc.) are never listed as `dataDeps` or `ctrlDeps` of any other node in the graph.** Every GPU kernel acts as a leaf. The only edges that exist go from CPU ops to their GPU kernel children—never in the other direction, and never between GPU kernels.

The same behavior occurs with **NCCL collective operations**. In practice, this causes incorrect behavior in simulators: because NCCL kernel nodes have no meaningful incoming dependencies, they are free to execute at the very start of the simulation. In a real training run, they should start only during the backward pass, after the relevant gradients have been computed.

I am not sure whether the problem originates in the `chakra_trace_link` step, the `chakra_converter` step, or in how the input traces are collected. However, I do not think it is a profiling issue, as the raw Kineto and PyTorch ET files appear to contain the expected events and correlation IDs.

---

## Steps to Reproduce

### 1. Collect Traces
The relevant profiler setup in the training script is configured as follows:

```python
import torch
from torch.profiler import ExecutionTraceObserver, profile
from torch.autograd.profiler import _ExperimentalConfig

# pyet_path, trace_handler, and train_dataloader are defined elsewhere in the script.
et = ExecutionTraceObserver()
et.register_callback(pyet_path)

with profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    schedule=torch.profiler.schedule(
        skip_first=2, wait=1, warmup=3, active=1, repeat=1
    ),
    record_shapes=True,
    with_flops=True,
    on_trace_ready=trace_handler,
    execution_trace_observer=et,
    experimental_config=_ExperimentalConfig(enable_cuda_sync_events=True),
) as prof:
    for step, batch in enumerate(train_dataloader):
        # ... forward / backward / optimizer step ...
        prof.step()

et.unregister_callback()
```
This produces `pytorch_et_rank0.json` (schema `1.1.0-chakra.0.0.4`) and `kineto_trace_rank0.json`.

### 2. Link and Convert
Run the following post-processing pipeline:

```bash
chakra_trace_link \
    --rank 0 \
    --chakra-host-trace ./pytorch_et_rank0.json \
    --chakra-device-trace ./kineto_trace_rank0.json \
    --output-file ./merged/merged_rank0.json

chakra_converter PyTorch \
    --input ./merged/merged_rank0.json \
    --output ./chakra/chakra_rank.0.et --simulate

chakra_jsonizer \
    --input_filename  ./chakra/chakra_rank.0.et \
    --output_filename ./chakra_read/chakra_read_rank0.json
```

**Environment & Version Information:**
* **Chakra:** 1.0.0 (via pip)
* **PyTorch:** 2.7 (Schema: `1.1.0-chakra.0.0.4`)

### 3. Inspect the Output
In `chakra_read_rank0.json`, taking three representative nodes as an example:
* **Node 361 (`aten::mm`):** CPU op, 59 µs, `dataDeps: [358, ...]`
* **Node 362 (`ampere_sgemm_128x32_sliced1x4_tn`):** GPU kernel, 98078 µs, `dataDeps: [361]`, `ctrlDeps: [361]`
* **Node 372 (`ampere_sgemm_64x64_tn`):** GPU kernel, next sgemm in the trace, `dataDeps: [<its own CPU launcher>]`

Node 362 **does not appear** in the `dataDeps` or `ctrlDeps` of any other node. Searching the entire JSON file for `"362"` as a dependency value returns zero results.
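
The search described above can also be automated. The following is a minimal sketch, assuming the `chakra_jsonizer` output has already been parsed into a list of node dictionaries with `id`, `dataDeps`, and `ctrlDeps` keys. The helper name `find_unreferenced_nodes`, the placeholder node names, and the miniature sample graph are hypothetical. IDs are normalized to strings because the JSON may encode 64-bit integers as strings.

```python
from typing import Iterable, Set


def find_unreferenced_nodes(nodes: Iterable[dict]) -> Set[str]:
    """Return IDs of nodes that no other node lists in dataDeps or ctrlDeps."""
    nodes = list(nodes)
    all_ids = {str(n["id"]) for n in nodes}
    referenced = set()
    for n in nodes:
        for key in ("dataDeps", "ctrlDeps"):
            # Missing keys are treated as "no dependencies of this kind".
            referenced.update(str(d) for d in n.get(key, []))
    return all_ids - referenced


# Hypothetical miniature graph that mirrors the edges quoted in this report.
sample = [
    {"id": "358", "name": "cpu_op_358"},
    {"id": "361", "name": "aten::mm", "dataDeps": ["358"]},
    {
        "id": "362",
        "name": "ampere_sgemm_128x32_sliced1x4_tn",
        "dataDeps": ["361"],
        "ctrlDeps": ["361"],
    },
    {"id": "363", "name": "cpu_op_363", "dataDeps": ["361"]},
]

leaves = find_unreferenced_nodes(sample)
print(sorted(leaves))
```

In this sample, node 362 is reported as unreferenced alongside the last CPU op, which matches the leaf behavior seen in the real trace.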

---

## Expected Behavior
I would expect some dependency path to exist between consecutive GPU kernels (such as node 362 and node 372). As it stands, each GPU kernel node becomes ready as soon as its single CPU-launcher edge is satisfied, so simulators have no ordering information to prevent issuing kernels concurrently. This does not reflect real execution.

Two possible architectural approaches come to mind, though I am unsure which is intended:
1. **Intra-stream chaining:** All kernels submitted to the same CUDA stream are linked sequentially (`kernel[i+1]` depends on `kernel[i]`), since CUDA guarantees stream-ordered execution.
2. **GPU kernel promoted into the CPU chain:** Any node that currently has a `dataDep` on a CPU launcher (e.g., node 363 depending on 361) would instead depend on that launcher's GPU child (e.g., node 363 depends on 362). This places the GPU kernel in the middle of the execution chain rather than leaving it dangling as a leaf.
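
To make the two options concrete, here is a rough sketch of each as a graph transformation. It is an illustration under stated assumptions, not Chakra's implementation: the `Node` dataclass, its `stream` and `start_ts` fields, both helper functions, the stream number, and node 371 (standing in for the unnamed CPU launcher of node 372) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    id: int
    is_gpu: bool
    data_deps: List[int] = field(default_factory=list)
    stream: Optional[int] = None  # CUDA stream ID, GPU kernels only (assumed known)
    start_ts: int = 0             # kernel start time from the device trace (assumed)


def chain_intra_stream(nodes: List[Node]) -> None:
    """Approach 1: kernel[i+1] depends on kernel[i] within the same stream."""
    last_on_stream: Dict[Optional[int], Node] = {}
    gpu_nodes = sorted((n for n in nodes if n.is_gpu), key=lambda n: n.start_ts)
    for node in gpu_nodes:
        prev = last_on_stream.get(node.stream)
        if prev is not None and prev.id not in node.data_deps:
            node.data_deps.append(prev.id)
        last_on_stream[node.stream] = node


def promote_gpu_children(nodes: List[Node]) -> None:
    """Approach 2: CPU consumers of a launcher depend on its GPU child instead."""
    by_id = {n.id: n for n in nodes}
    children: Dict[int, List[int]] = {}
    for n in nodes:
        if n.is_gpu:
            for d in n.data_deps:
                # Only a CPU dependency counts as the kernel's launcher.
                if d in by_id and not by_id[d].is_gpu:
                    children.setdefault(d, []).append(n.id)
    for n in nodes:
        if not n.is_gpu:
            new_deps: List[int] = []
            for d in n.data_deps:
                # Keep the original dependency when the launcher has no GPU child.
                new_deps.extend(children.get(d, [d]))
            n.data_deps = new_deps


def build_sample() -> List[Node]:
    # Hypothetical graph; 371 is an assumed launcher for kernel 372.
    return [
        Node(361, False, [358]),
        Node(362, True, [361], stream=7, start_ts=100),
        Node(363, False, [361]),
        Node(371, False, [363]),
        Node(372, True, [371], stream=7, start_ts=200),
    ]


chained = build_sample()
chain_intra_stream(chained)
chained_deps = {n.id: n.data_deps for n in chained}

promoted = build_sample()
promote_gpu_children(promoted)
promoted_deps = {n.id: n.data_deps for n in promoted}
```

With the first approach, kernel 372 gains a dependency on kernel 362. With the second, node 363 depends on kernel 362 rather than on launcher 361. Either way, node 362 stops being a leaf.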

---

## Attachments
The reduced input and output trace files are attached below for reference:

* [chakra_read_rank0.json](https://github.com/user-attachments/files/28309677/chakra_read_rank0.json)
* [pytorch_et_rank_reduced.json](https://github.com/user-attachments/files/28309678/pytorch_et_rank_reduced.json)
* [kineto_trace_reduced.json](https://github.com/user-attachments/files/28309676/kineto_trace_reduced.json)
