Fix alias output naming and improve Spike I/O diagnostics#51

Open
hgt312 wants to merge 1 commit into aws-neuron:main from hgt312:alias-output-naming-fix

hgt312 commented Mar 31, 2026

Fix alias output naming and improve Spike I/O diagnostics

Problem

The tracer's output naming step (Step 3 in NKIPyKernel._build_code) used a truthiness check (if not r.backend_tensor.name) to decide whether to assign the canonical output{idx} name. This failed when a non-aliased output tensor already carried a name from tracing (e.g. "intermediate0" from an np.add result). The tensor kept its intermediate name, causing a mismatch between the NEFF I/O table and what callers pass to kernel(inputs={...}, outputs={...}).
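
A minimal sketch of the failure mode described above, using a hypothetical Tensor stand-in rather than the actual NKIPy tracer types. The aliased flag and both helper functions are assumptions for illustration; only the truthiness check and the output{idx} convention come from the description.

```python
from dataclasses import dataclass


@dataclass
class Tensor:
    """Hypothetical stand-in for a traced backend tensor."""
    name: str = ""
    aliased: bool = False  # assumption: marks outputs that alias an input


def name_outputs_truthiness(outputs):
    """Old behaviour: only unnamed tensors get the canonical name."""
    for idx, t in enumerate(outputs):
        if not t.name:
            t.name = f"output{idx}"
    return [t.name for t in outputs]


def name_outputs_canonical(outputs):
    """Sketch of the fix: every non-aliased output gets output{idx}."""
    for idx, t in enumerate(outputs):
        if not t.aliased:
            t.name = f"output{idx}"
    return [t.name for t in outputs]


def traced():
    # First output carries a leftover name from tracing, second is unnamed.
    return [Tensor("intermediate0"), Tensor("")]


print(name_outputs_truthiness(traced()))  # ['intermediate0', 'output1']
print(name_outputs_canonical(traced()))   # ['output0', 'output1']
```

With the truthiness check, the leftover "intermediate0" name survives, so a caller passing outputs={"output0": ...} would not match the compiled I/O table.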

This was discovered during sglang-nkipy integration with the stable neuronx-cc 2.23.6484.0 compiler, where three kernel call patterns broke:

  • Standalone aliased kernels (e.g. update_kv_cache): NEFF input renamed to kv_cache.must_alias_input but callers still passed kv_cache
  • Fused graphs with aliased params (e.g. prefill_pre_moe): aliased output named kv_cache shifted other outputs to output1, output2, ...
  • Broken-identity aliases (e.g. prefill_post_moe): mutated param passed through an NKI wrapper that breaks tensor identity, causing the tracer to auto-append a 3rd output

Debugging these required extracting NEFF I/O names from HLO protobuf binaries because Spike's _validate_io only produced a bare KeyError.
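
For illustration, a validation helper that reports both sides of a name mismatch might look like the following. This is a sketch under assumptions, not Spike's actual _validate_io: the function name check_io_names and its arguments are hypothetical.

```python
def check_io_names(kind, provided, expected):
    """Raise a descriptive KeyError when provided names differ from expected ones."""
    unknown = sorted(set(provided) - set(expected))
    missing = sorted(set(expected) - set(provided))
    if unknown or missing:
        raise KeyError(
            f"{kind} name mismatch: unknown={unknown}, missing={missing}, "
            f"expected={sorted(expected)}"
        )


try:
    # The caller passes the plain name while the NEFF expects the aliased one.
    check_io_names("input", {"kv_cache": None}, ["kv_cache.must_alias_input"])
except KeyError as err:
    print(err)
```

Listing the expected names in the message lets the caller see the suffix mismatch directly, without inspecting the compiled artifact.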

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

  Non-aliased outputs that already had a name from tracing kept that name
  instead of the canonical "output{idx}", breaking NEFF I/O name matching.

  Also adds alias input auto-resolution and better _validate_io errors
  in Spike, plus tests for all alias naming patterns.
@hgt312 hgt312 requested a review from a team March 31, 2026 17:13
f"got {actual_dtype}"
)

_ALIAS_SUFFIX = ".must_alias_input"

I don't want to complicate the handling of alias naming in Spike.
Spike is a wrapper on the runtime; it should not need to know how we lower the function into NEFFs. It should only deal with what's available in the NEFFs.

Can this specific problem be addressed at the user level? The caller can pass .must_alias_input in the input tensor list.

A proper solution can be in the NEFF lowering in NKIPy (but we want to make sure we are aligned with NKI)
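
A minimal sketch of the caller-side workaround suggested in this comment, assuming the caller knows which inputs the kernel mutates in place. The helper name with_alias_suffix and its aliased argument are hypothetical; only the .must_alias_input suffix comes from the diff.

```python
ALIAS_SUFFIX = ".must_alias_input"  # suffix taken from the diff context


def with_alias_suffix(inputs, aliased):
    """Return a copy of inputs with the alias suffix added to the given names."""
    return {
        (name + ALIAS_SUFFIX if name in aliased else name): value
        for name, value in inputs.items()
    }


# Placeholder string values stand in for real tensors.
inputs = with_alias_suffix({"kv_cache": "cache", "tokens": "ids"}, aliased={"kv_cache"})
print(sorted(inputs))  # ['kv_cache.must_alias_input', 'tokens']
```

This keeps the renaming out of Spike at the cost of every caller having to know which inputs are aliased.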

model_core_id = self.model_ref.core_id

unknown_inputs = set(inputs) - set(self.input_tensors_info)
if unknown_inputs:

I like the checks!
