Skip to content

[Issue]: Systemic Root Causes #3713

@red1239109-cmd

Description

@red1239109-cmd

Problem Description

Issue 1: GetAlignment*() can return 0 leading to divide-by-zero / invalid tile distributions (SDC risk)

Problem Description

Several alignment helpers compute kVecLoad using integer division that can legally produce 0:

Example (GetAlignmentQ, similar pattern exists in GetAlignmentK, GetAlignmentOGrad, GetAlignmentBias):

constexpr index_t total_pixels = kMNPerBlock * kKPerBlock / kBlockSize;

constexpr index_t kVecLoad =
    ((total_pixels / kMaxVecLoad) >= kMinVecLoad) ? kMaxVecLoad : (total_pixels / kMinVecLoad);

return kVecLoad;

If total_pixels < kMinVecLoad, then (total_pixels / kMinVecLoad) == 0, so kVecLoad becomes 0.

Impact

  • Divide-by-zero in downstream computations (e.g. Q_VMEM_READ = ... / GetAlignmentQ<Problem>() in HotLoopScheduler).
  • Invalid alignment propagates into tile distributions and LDS layouts, potentially causing silent incorrect addressing / SDC under certain configurations.

Suggestion / Fix

Enforce a minimum of 1 and add divisibility/validity checks:

static_assert(total_pixels > 0, "total_pixels must be > 0");
constexpr index_t kVecLoadCandidate =
    ((total_pixels / kMaxVecLoad) >= kMinVecLoad) ? kMaxVecLoad : (total_pixels / kMinVecLoad);

constexpr index_t kVecLoad = (kVecLoadCandidate > 0) ? kVecLoadCandidate : 1;
return kVecLoad;

Optionally add:

static_assert((16 % sizeof(QDataType)) == 0, "vector load granularity mismatch");

Issue 2: Multiple tile distribution builders rely on implicit divisibility; missing static_assert allows silent truncation (SDC risk)

Problem Description

Many distribution constructors compute factors via integer division without verifying divisibility, e.g.:

constexpr index_t K0 = kKPerBlock / K1;
constexpr index_t N1 = get_warp_size() / K0;
constexpr index_t N2 = kNPerBlock / (N1 * N0);

Similar patterns exist in:

  • MakeQDramTileDistribution
  • MakeKDramTileDistribution
  • MakeVDramTileDistribution
  • MakeOGradDramTileDistribution
  • MakeBiasTileDistribution
  • MakePreXDramTileDistribution (e.g., M0 = MPerBlock / M1)

These assume exact divisibility but don’t consistently enforce it.

Impact

If any of the intermediate divisions truncates:

  • Thread↔element mapping becomes incorrect (tiles partially duplicated or skipped).
  • Leads to wrong memory addressing and may cause Silent Data Corruption rather than a compile-time failure.

Suggestion / Fix

Add compile-time guards near each derivation chain, e.g.:

static_assert(kKPerBlock % K1 == 0, "kKPerBlock must be divisible by alignment (K1)");
static_assert(get_warp_size() % K0 == 0, "warp_size must be divisible by K0");
static_assert(kNPerBlock % (N1 * N0) == 0, "kNPerBlock must be divisible by (N1*N0)");

Apply similarly for M-side computations (kMPerBlock % (M1*M0) == 0, etc.).


Issue 3: GetTransposedAlignment*() can silently produce 0 or truncated results (invalid vectorization / layout risk)

Problem Description

Transposed alignments are computed as:

return total_pixels / GetAlignmentQ<Problem>();

No checks ensure:

  • GetAlignmentQ is non-zero
  • total_pixels is divisible by GetAlignmentQ
  • result is non-zero

Impact

  • If alignment > total_pixels, the result becomes 0.
  • If not divisible, truncation occurs → layout mismatch.
  • Can break transposed LDS read/write descriptors and cause incorrect loads/stores → SDC risk.

Suggestion / Fix

Add compile-time checks:

static_assert(GetAlignmentQ<Problem>() > 0);
static_assert(total_pixels % GetAlignmentQ<Problem>() == 0,
              "total_pixels must be divisible by alignment for transposed path");
static_assert((total_pixels / GetAlignmentQ<Problem>()) > 0,
              "Transposed alignment must be > 0");

(Apply analogously to K/OGrad/Bias transposed alignments.)


Issue 4: MakeXLdsBlockDescriptor() hardcodes DataTypeSize = 2 (type fragility / incorrect LDS layout risk)

Problem Description

MakeXLdsBlockDescriptor assumes FP16/BF16 by hardcoding:

constexpr auto DataTypeSize = 2; // sizeof(F16/BF16)

But the policy templates are used with Problem::QDataType/KDataType/VDataType/... which may vary (e.g., fp8 / int8 / fp32 in mixed precision or future paths).

Impact

If DataType is not 2 bytes:

  • LDS layer calculations (MNLdsLayer) become wrong.
  • Descriptor strides/lengths mismatch real element sizes.
  • Can cause incorrect LDS indexing and memory overlap → SDC / race-like symptoms.

Suggestion / Fix

Replace the hardcoded constant with sizeof(DataType) (or a template parameter) and add guards:

constexpr auto DataTypeSize = sizeof(DataType);
static_assert(DataTypeSize == 2, "This descriptor assumes 2-byte types; update for generic types.");

Or fully generalize computations to support multiple sizes safely.


Issue 5: GetSmemSize() uses max(stage sizes) implying aggressive LDS reuse; correctness depends on perfect pipeline separation (race/SDC risk)

Problem Description

Shared memory size is computed as:

return max(smem_size_stage0_0, smem_size_stage0_1, smem_size_stage1);

This indicates stages reuse the same LDS region rather than allocating disjoint memory (sum). That is safe only if all stage transitions are strictly separated by correct waits/barriers.

Impact

If any async pipeline timing/barrier condition is incorrect (especially under custom tile configs):

  • One stage can overwrite LDS data still being consumed by another stage.
  • Leads to hard-to-debug race conditions / silent data corruption.

Suggestion / Fix

  • Provide a “safe mode” to disable reuse (allocate disjoint regions using sum) for correctness validation.
  • Add conservative barriers/waitcnt at stage boundaries in debug builds.
  • At minimum, document + enforce invariants guaranteeing no overlap.

Issue 6: HotLoopScheduler can compute 0 “per-step” values (scheduler degenerates / timing assumptions break)

Problem Description

The scheduler computes:

constexpr index_t MFMA_PER_VMEM_READ = MFMA_INST / VMEM_READ_INST;
constexpr index_t LDS_READ_PER_MFMA  = LDS_READ_INST / MFMA_INST;

If VMEM_READ_INST > MFMA_INST or LDS_READ_INST < MFMA_INST, the computed values can become 0, which changes the intended scheduling pattern.

Impact

  • Scheduler may fail to distribute barriers as intended (performance regression).
  • If correctness relies (even implicitly) on certain ordering/timing properties, degenerate scheduling can increase hazard exposure and make existing race windows worse.

Suggestion / Fix

Clamp to at least 1 where appropriate:

constexpr index_t MFMA_PER_VMEM_READ = max<index_t>(1, MFMA_INST / VMEM_READ_INST);
constexpr index_t LDS_READ_PER_MFMA  = max<index_t>(1, LDS_READ_INST / MFMA_INST);

Following up on the previous report, I have performed a deeper architectural audit and identified the following systemic root causes. These issues explain why the kernel exhibits unpredictable behavior and SDC under custom configurations.

Operating System

Ubuntu 22.04 LTS

CPU

AMD Ryzen 9 7950X

GPU

AMD Radeon RX 7900 XT

Other

No response

ROCm Version

ROCm 6.0.0

ROCm Component

Composable Kernel

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions