-
Notifications
You must be signed in to change notification settings - Fork 270
Description
Problem Description
Issue 1: GetAlignment*() can return 0 leading to divide-by-zero / invalid tile distributions (SDC risk)
Problem Description
Several alignment helpers compute kVecLoad using integer division that can legally produce 0:
Example (GetAlignmentQ, similar pattern exists in GetAlignmentK, GetAlignmentOGrad, GetAlignmentBias):
constexpr index_t total_pixels = kMNPerBlock * kKPerBlock / kBlockSize;
constexpr index_t kVecLoad =
((total_pixels / kMaxVecLoad) >= kMinVecLoad) ? kMaxVecLoad : (total_pixels / kMinVecLoad);
return kVecLoad;If total_pixels < kMinVecLoad, then (total_pixels / kMinVecLoad) == 0, so kVecLoad becomes 0.
Impact
- Divide-by-zero in downstream computations (e.g.
Q_VMEM_READ = ... / GetAlignmentQ<Problem>()inHotLoopScheduler). - Invalid alignment propagates into tile distributions and LDS layouts, potentially causing silent incorrect addressing / SDC under certain configurations.
Suggestion / Fix
Enforce a minimum of 1 and add divisibility/validity checks:
static_assert(total_pixels > 0, "total_pixels must be > 0");
constexpr index_t kVecLoadCandidate =
((total_pixels / kMaxVecLoad) >= kMinVecLoad) ? kMaxVecLoad : (total_pixels / kMinVecLoad);
constexpr index_t kVecLoad = (kVecLoadCandidate > 0) ? kVecLoadCandidate : 1;
return kVecLoad;Optionally add:
static_assert((16 % sizeof(QDataType)) == 0, "vector load granularity mismatch");Issue 2: Multiple tile distribution builders rely on implicit divisibility; missing static_assert allows silent truncation (SDC risk)
Problem Description
Many distribution constructors compute factors via integer division without verifying divisibility, e.g.:
constexpr index_t K0 = kKPerBlock / K1;
constexpr index_t N1 = get_warp_size() / K0;
constexpr index_t N2 = kNPerBlock / (N1 * N0);Similar patterns exist in:
MakeQDramTileDistributionMakeKDramTileDistributionMakeVDramTileDistributionMakeOGradDramTileDistributionMakeBiasTileDistributionMakePreXDramTileDistribution(e.g.,M0 = MPerBlock / M1)
These assume exact divisibility but don’t consistently enforce it.
Impact
If any of the intermediate divisions truncates:
- Thread↔element mapping becomes incorrect (tiles partially duplicated or skipped).
- Leads to wrong memory addressing and may cause Silent Data Corruption rather than a compile-time failure.
Suggestion / Fix
Add compile-time guards near each derivation chain, e.g.:
static_assert(kKPerBlock % K1 == 0, "kKPerBlock must be divisible by alignment (K1)");
static_assert(get_warp_size() % K0 == 0, "warp_size must be divisible by K0");
static_assert(kNPerBlock % (N1 * N0) == 0, "kNPerBlock must be divisible by (N1*N0)");Apply similarly for M-side computations (kMPerBlock % (M1*M0) == 0, etc.).
Issue 3: GetTransposedAlignment*() can silently produce 0 or truncated results (invalid vectorization / layout risk)
Problem Description
Transposed alignments are computed as:
return total_pixels / GetAlignmentQ<Problem>();No checks ensure:
GetAlignmentQis non-zerototal_pixelsis divisible byGetAlignmentQ- result is non-zero
Impact
- If alignment > total_pixels, the result becomes 0.
- If not divisible, truncation occurs → layout mismatch.
- Can break transposed LDS read/write descriptors and cause incorrect loads/stores → SDC risk.
Suggestion / Fix
Add compile-time checks:
static_assert(GetAlignmentQ<Problem>() > 0);
static_assert(total_pixels % GetAlignmentQ<Problem>() == 0,
"total_pixels must be divisible by alignment for transposed path");
static_assert((total_pixels / GetAlignmentQ<Problem>()) > 0,
"Transposed alignment must be > 0");(Apply analogously to K/OGrad/Bias transposed alignments.)
Issue 4: MakeXLdsBlockDescriptor() hardcodes DataTypeSize = 2 (type fragility / incorrect LDS layout risk)
Problem Description
MakeXLdsBlockDescriptor assumes FP16/BF16 by hardcoding:
constexpr auto DataTypeSize = 2; // sizeof(F16/BF16)But the policy templates are used with Problem::QDataType/KDataType/VDataType/... which may vary (e.g., fp8 / int8 / fp32 in mixed precision or future paths).
Impact
If DataType is not 2 bytes:
- LDS layer calculations (
MNLdsLayer) become wrong. - Descriptor strides/lengths mismatch real element sizes.
- Can cause incorrect LDS indexing and memory overlap → SDC / race-like symptoms.
Suggestion / Fix
Replace the hardcoded constant with sizeof(DataType) (or a template parameter) and add guards:
constexpr auto DataTypeSize = sizeof(DataType);
static_assert(DataTypeSize == 2, "This descriptor assumes 2-byte types; update for generic types.");Or fully generalize computations to support multiple sizes safely.
Issue 5: GetSmemSize() uses max(stage sizes) implying aggressive LDS reuse; correctness depends on perfect pipeline separation (race/SDC risk)
Problem Description
Shared memory size is computed as:
return max(smem_size_stage0_0, smem_size_stage0_1, smem_size_stage1);This indicates stages reuse the same LDS region rather than allocating disjoint memory (sum). That is safe only if all stage transitions are strictly separated by correct waits/barriers.
Impact
If any async pipeline timing/barrier condition is incorrect (especially under custom tile configs):
- One stage can overwrite LDS data still being consumed by another stage.
- Leads to hard-to-debug race conditions / silent data corruption.
Suggestion / Fix
- Provide a “safe mode” to disable reuse (allocate disjoint regions using sum) for correctness validation.
- Add conservative barriers/waitcnt at stage boundaries in debug builds.
- At minimum, document + enforce invariants guaranteeing no overlap.
Issue 6: HotLoopScheduler can compute 0 “per-step” values (scheduler degenerates / timing assumptions break)
Problem Description
The scheduler computes:
constexpr index_t MFMA_PER_VMEM_READ = MFMA_INST / VMEM_READ_INST;
constexpr index_t LDS_READ_PER_MFMA = LDS_READ_INST / MFMA_INST;If VMEM_READ_INST > MFMA_INST or LDS_READ_INST < MFMA_INST, the computed values can become 0, which changes the intended scheduling pattern.
Impact
- Scheduler may fail to distribute barriers as intended (performance regression).
- If correctness relies (even implicitly) on certain ordering/timing properties, degenerate scheduling can increase hazard exposure and make existing race windows worse.
Suggestion / Fix
Clamp to at least 1 where appropriate:
constexpr index_t MFMA_PER_VMEM_READ = max<index_t>(1, MFMA_INST / VMEM_READ_INST);
constexpr index_t LDS_READ_PER_MFMA = max<index_t>(1, LDS_READ_INST / MFMA_INST);Following up on the previous report, I have performed a deeper architectural audit and identified the following systemic root causes. These issues explain why the kernel exhibits unpredictable behavior and SDC under custom configurations.
Operating System
Ubuntu 22.04 LTS
CPU
AMD Ryzen 9 7950X
GPU
AMD Radeon RX 7900 XT
Other
No response
ROCm Version
ROCm 6.0.0
ROCm Component
Composable Kernel
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response