[Issue]: Systemic Root Causes

### Problem Description

## Issue 1: `GetAlignment*()` can return 0 leading to divide-by-zero / invalid tile distributions (SDC risk)

### Problem Description

Several alignment helpers compute `kVecLoad` using integer division that can legally produce **0**:

Example (`GetAlignmentQ`, similar pattern exists in `GetAlignmentK`, `GetAlignmentOGrad`, `GetAlignmentBias`):

```cpp
constexpr index_t total_pixels = kMNPerBlock * kKPerBlock / kBlockSize;

constexpr index_t kVecLoad =
    ((total_pixels / kMaxVecLoad) >= kMinVecLoad) ? kMaxVecLoad : (total_pixels / kMinVecLoad);

return kVecLoad;
```

If `total_pixels < kMinVecLoad`, then `(total_pixels / kMinVecLoad) == 0`, so `kVecLoad` becomes 0.

### Impact

* **Divide-by-zero** in downstream computations (e.g. `Q_VMEM_READ = ... / GetAlignmentQ<Problem>()` in `HotLoopScheduler`).
* Invalid alignment propagates into tile distributions and LDS layouts, potentially causing **silent incorrect addressing / SDC** under certain configurations.

### Suggestion / Fix

Enforce a minimum of 1 and add divisibility/validity checks:

```cpp
static_assert(total_pixels > 0, "total_pixels must be > 0");
constexpr index_t kVecLoadCandidate =
    ((total_pixels / kMaxVecLoad) >= kMinVecLoad) ? kMaxVecLoad : (total_pixels / kMinVecLoad);

constexpr index_t kVecLoad = (kVecLoadCandidate > 0) ? kVecLoadCandidate : 1;
return kVecLoad;
```

Optionally add:

```cpp
static_assert((16 % sizeof(QDataType)) == 0, "vector load granularity mismatch");
```

---

## Issue 2: Multiple tile distribution builders rely on implicit divisibility; missing `static_assert` allows silent truncation (SDC risk)

### Problem Description

Many distribution constructors compute factors via integer division without verifying divisibility, e.g.:

```cpp
constexpr index_t K0 = kKPerBlock / K1;
constexpr index_t N1 = get_warp_size() / K0;
constexpr index_t N2 = kNPerBlock / (N1 * N0);
```

Similar patterns exist in:

* `MakeQDramTileDistribution`
* `MakeKDramTileDistribution`
* `MakeVDramTileDistribution`
* `MakeOGradDramTileDistribution`
* `MakeBiasTileDistribution`
* `MakePreXDramTileDistribution` (e.g., `M0 = MPerBlock / M1`)

These assume exact divisibility but don’t consistently enforce it.

### Impact

If any of the intermediate divisions truncates:

* Thread↔element mapping becomes incorrect (tiles partially duplicated or skipped).
* Leads to wrong memory addressing and may cause **Silent Data Corruption** rather than a compile-time failure.

### Suggestion / Fix

Add compile-time guards near each derivation chain, e.g.:

```cpp
static_assert(kKPerBlock % K1 == 0, "kKPerBlock must be divisible by alignment (K1)");
static_assert(get_warp_size() % K0 == 0, "warp_size must be divisible by K0");
static_assert(kNPerBlock % (N1 * N0) == 0, "kNPerBlock must be divisible by (N1*N0)");
```

Apply similarly for M-side computations (`kMPerBlock % (M1*M0) == 0`, etc.).

---

## Issue 3: `GetTransposedAlignment*()` can silently produce 0 or truncated results (invalid vectorization / layout risk)

### Problem Description

Transposed alignments are computed as:

```cpp
return total_pixels / GetAlignmentQ<Problem>();
```

No checks ensure:

* `GetAlignmentQ` is non-zero
* `total_pixels` is divisible by `GetAlignmentQ`
* result is non-zero

### Impact

* If alignment > total_pixels, the result becomes **0**.
* If not divisible, truncation occurs → layout mismatch.
* Can break transposed LDS read/write descriptors and cause incorrect loads/stores → **SDC risk**.

### Suggestion / Fix

Add compile-time checks:

```cpp
static_assert(GetAlignmentQ<Problem>() > 0);
static_assert(total_pixels % GetAlignmentQ<Problem>() == 0,
              "total_pixels must be divisible by alignment for transposed path");
static_assert((total_pixels / GetAlignmentQ<Problem>()) > 0,
              "Transposed alignment must be > 0");
```

(Apply analogously to K/OGrad/Bias transposed alignments.)

---

## Issue 4: `MakeXLdsBlockDescriptor()` hardcodes `DataTypeSize = 2` (type fragility / incorrect LDS layout risk)

### Problem Description

`MakeXLdsBlockDescriptor` assumes FP16/BF16 by hardcoding:

```cpp
constexpr auto DataTypeSize = 2; // sizeof(F16/BF16)
```

But the policy templates are used with `Problem::QDataType/KDataType/VDataType/...` which may vary (e.g., fp8 / int8 / fp32 in mixed precision or future paths).

### Impact

If DataType is not 2 bytes:

* LDS layer calculations (`MNLdsLayer`) become wrong.
* Descriptor strides/lengths mismatch real element sizes.
* Can cause incorrect LDS indexing and memory overlap → **SDC / race-like symptoms**.

### Suggestion / Fix

Replace the hardcoded constant with `sizeof(DataType)` (or a template parameter) and add guards:

```cpp
constexpr auto DataTypeSize = sizeof(DataType);
static_assert(DataTypeSize == 2, "This descriptor assumes 2-byte types; update for generic types.");
```

Or fully generalize computations to support multiple sizes safely.

---

## Issue 5: `GetSmemSize()` uses `max(stage sizes)` implying aggressive LDS reuse; correctness depends on perfect pipeline separation (race/SDC risk)

### Problem Description

Shared memory size is computed as:

```cpp
return max(smem_size_stage0_0, smem_size_stage0_1, smem_size_stage1);
```

This indicates **stages reuse the same LDS region** rather than allocating disjoint memory (sum). That is safe only if all stage transitions are strictly separated by correct waits/barriers.

### Impact

If any async pipeline timing/barrier condition is incorrect (especially under custom tile configs):

* One stage can overwrite LDS data still being consumed by another stage.
* Leads to hard-to-debug **race conditions / silent data corruption**.

### Suggestion / Fix

* Provide a “safe mode” to disable reuse (allocate disjoint regions using sum) for correctness validation.
* Add conservative barriers/waitcnt at stage boundaries in debug builds.
* At minimum, document + enforce invariants guaranteeing no overlap.

---

## Issue 6: `HotLoopScheduler` can compute 0 “per-step” values (scheduler degenerates / timing assumptions break)

### Problem Description

The scheduler computes:

```cpp
constexpr index_t MFMA_PER_VMEM_READ = MFMA_INST / VMEM_READ_INST;
constexpr index_t LDS_READ_PER_MFMA  = LDS_READ_INST / MFMA_INST;
```

If `VMEM_READ_INST > MFMA_INST` or `LDS_READ_INST < MFMA_INST`, the computed values can become 0, which changes the intended scheduling pattern.

### Impact

* Scheduler may fail to distribute barriers as intended (performance regression).
* If correctness relies (even implicitly) on certain ordering/timing properties, degenerate scheduling can increase hazard exposure and make existing race windows worse.

### Suggestion / Fix

Clamp to at least 1 where appropriate:

```cpp
constexpr index_t MFMA_PER_VMEM_READ = max<index_t>(1, MFMA_INST / VMEM_READ_INST);
constexpr index_t LDS_READ_PER_MFMA  = max<index_t>(1, LDS_READ_INST / MFMA_INST);
```

---
Following up on the previous report, I have performed a deeper architectural audit and identified the following systemic root causes. These issues explain why the kernel exhibits unpredictable behavior and SDC under custom configurations.


### Operating System

Ubuntu 22.04 LTS

### CPU

AMD Ryzen 9 7950X

### GPU

AMD Radeon RX 7900 XT

### Other

_No response_

### ROCm Version

ROCm 6.0.0

### ROCm Component

Composable Kernel

### Steps to Reproduce

_No response_

### (Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

_No response_

### Additional Information

_No response_

[Issue]: Systemic Root Causes #3713

Description

Problem Description

Issue 1: GetAlignment*() can return 0 leading to divide-by-zero / invalid tile distributions (SDC risk)

Problem Description

Impact

Suggestion / Fix

Issue 2: Multiple tile distribution builders rely on implicit divisibility; missing static_assert allows silent truncation (SDC risk)

Problem Description

Impact

Suggestion / Fix

Issue 3: GetTransposedAlignment*() can silently produce 0 or truncated results (invalid vectorization / layout risk)

Problem Description

Impact

Suggestion / Fix

Issue 4: MakeXLdsBlockDescriptor() hardcodes DataTypeSize = 2 (type fragility / incorrect LDS layout risk)

Problem Description

Impact

Suggestion / Fix

Issue 5: GetSmemSize() uses max(stage sizes) implying aggressive LDS reuse; correctness depends on perfect pipeline separation (race/SDC risk)

Problem Description

Impact

Suggestion / Fix

Issue 6: HotLoopScheduler can compute 0 “per-step” values (scheduler degenerates / timing assumptions break)

Problem Description

Impact

Suggestion / Fix

Operating System

CPU

GPU

Other

ROCm Version

ROCm Component

Steps to Reproduce

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

Additional Information

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Issue 1: `GetAlignment*()` can return 0 leading to divide-by-zero / invalid tile distributions (SDC risk)

Issue 2: Multiple tile distribution builders rely on implicit divisibility; missing `static_assert` allows silent truncation (SDC risk)

Issue 3: `GetTransposedAlignment*()` can silently produce 0 or truncated results (invalid vectorization / layout risk)

Issue 4: `MakeXLdsBlockDescriptor()` hardcodes `DataTypeSize = 2` (type fragility / incorrect LDS layout risk)

Issue 5: `GetSmemSize()` uses `max(stage sizes)` implying aggressive LDS reuse; correctness depends on perfect pipeline separation (race/SDC risk)

Issue 6: `HotLoopScheduler` can compute 0 “per-step” values (scheduler degenerates / timing assumptions break)