
feat(docs): Add A100 validation documentation #157

Open
MengjieLee wants to merge 1 commit into RL-Align:main from MengjieLee:feat/a100-notebook-validation


@MengjieLee MengjieLee commented Jun 18, 2026


Document a reproducible NVIDIA A100 smoke validation run and exclude local agent workspace artifacts from version control.

Summary by CodeRabbit

  • Documentation
    • Added Hardware Profiling Guide providing a repeatable workflow for profiling RL-Kernel on NVIDIA Ampere GPUs, covering system setup, GPU configuration, smoke testing, profiling execution, workload validation, and guidance for sharing profiling results effectively.


coderabbitai Bot commented Jun 18, 2026


Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds a new docs/getting_started/hardware-profiling.md page documenting a repeatable NVIDIA Ampere GPU profiling workflow for RL-Kernel, covering environment inspection, GPU selection, smoke tests, representative profiling runs, strict fused dispatch validation, and reporting guidance. The page is registered in docs/.nav.yml.

Changes

Hardware Profiling Guide

| Layer | File(s) | Summary |
| --- | --- | --- |
| Nav wiring and guide purpose/scope | docs/.nav.yml, docs/getting_started/hardware-profiling.md | Adds the nav entry linking to the new guide and defines its purpose as a profiling checklist, with explicit scope limits on supported GPU/software configurations. |
| Environment inspection, GPU selection, and smoke tests | docs/getting_started/hardware-profiling.md | Documents nvidia-smi and Python/PyTorch/CUDA probe steps, CUDA_VISIBLE_DEVICES-based single-GPU isolation, and pytest smoke test commands for dispatch and profiler paths. |
| Profiling runs, strict dispatch validation, and reporting | docs/getting_started/hardware-profiling.md | Covers smoke-scale and representative-shape run_profile_suite.py invocations, CSV/JSON report field interpretation, strict fused LogP dispatch validation via --require-fused-logp, and required metadata elements for reporting results. |
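
For readers without the guide at hand, the environment inspection and single-GPU isolation steps summarized above could look roughly like the sketch below. It is not taken from the guide: the helper names (`single_gpu_env`, `parse_gpu_rows`, `probe_gpus`, `probe_torch`) are hypothetical, and the chosen `nvidia-smi` query fields are an assumption about what is worth recording.

```python
import os
import shutil
import subprocess

# Assumed set of fields to record; the guide may query different ones.
QUERY = ["nvidia-smi", "--query-gpu=name,driver_version,memory.total",
         "--format=csv,noheader"]


def single_gpu_env(index: int = 0) -> dict:
    """Return a copy of the environment that exposes only one GPU to CUDA."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(index)
    return env


def parse_gpu_rows(csv_text: str) -> list:
    """Parse the CSV rows printed by the nvidia-smi query above."""
    rows = []
    for line in csv_text.strip().splitlines():
        name, driver, memory = [field.strip() for field in line.split(",")]
        rows.append({"name": name, "driver_version": driver,
                     "memory_total": memory})
    return rows


def probe_gpus() -> list:
    """Query nvidia-smi if it is installed; return [] otherwise."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=False)
    return parse_gpu_rows(result.stdout) if result.returncode == 0 else []


def probe_torch() -> dict:
    """Report PyTorch/CUDA details, or note that PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return {"torch": None}
    info = {"torch": torch.__version__,
            "cuda_build": torch.version.cuda,
            "cuda_available": torch.cuda.is_available()}
    if info["cuda_available"]:
        info["device"] = torch.cuda.get_device_name(0)
        # An A100 reports (8, 0), i.e. SM80.
        info["capability"] = torch.cuda.get_device_capability(0)
    return info


if __name__ == "__main__":
    print(probe_gpus())
    print(probe_torch())
```

The mapping returned by `single_gpu_env` can be passed as `env=` to `subprocess.run` when launching the pytest smoke tests or `run_profile_suite.py`, so that only the selected card is visible to the child process.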

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~5 minutes

Poem

🐇 A guide for the GPU, clear and bright,
No artifacts committed — kept out of sight!
Check your CUDA, select a lone card,
Profile your kernels, it isn't too hard.
Hop through the steps, then share what you found —
The rabbit approves, let the benchmarks resound! 🚀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ⚠️ Warning | The PR title mentions A100 validation documentation, but the actual changes describe a hardware profiling guide for NVIDIA Ampere GPUs (not specifically A100) and do not include any version control configuration changes as mentioned in PR objectives. | Update the title to accurately reflect the main changes: consider 'feat(docs): Add hardware profiling guide for NVIDIA Ampere GPUs' or similar to match the actual content of the hardware-profiling.md file. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
docs/getting_started/nvidia-a100-validation/index.md (1)

128-132: ⚡ Quick win

Improve sentence variety in scope-limitation section.

Lines 128–132 begin three successive sentences with "This does not," which affects readability. Consider rewording one or two for variety while preserving clarity.

Example revision:

  • Line 129: "This does not validate AMD ROCm..." → "AMD ROCm is not validated..."
  • Line 130: "This does not validate H100..." → "H100, SM90, and TMA-specific behavior are not validated..."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/getting_started/nvidia-a100-validation/index.md` around lines 128 - 132,
The scope-limitation section contains multiple consecutive bullet points that
begin with "This does not," which creates repetitive sentence structure and
reduces readability. Reword at least two of the bullet points (such as the ones
about AMD ROCm validation and H100/SM90 validation) to use alternative sentence
constructions, such as passive voice ("AMD ROCm is not validated...") or
different grammatical structures, while preserving the original meaning and
clarity of each limitation statement.

Source: Linters/SAST tools


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ebeb648e-b7e9-4aca-a365-0fcd436dc1fa

📥 Commits

Reviewing files that changed from the base of the PR and between b9b67a9 and 01fcfaa.

📒 Files selected for processing (4)
  • .gitignore
  • docs/.nav.yml
  • docs/getting_started/nvidia-a100-validation/a100_benchmark_notes.ipynb
  • docs/getting_started/nvidia-a100-validation/index.md

Comment thread docs/getting_started/nvidia-a100-validation/a100_benchmark_notes.ipynb Outdated
@MengjieLee MengjieLee force-pushed the feat/a100-notebook-validation branch from 01fcfaa to 38ffe31 Compare June 22, 2026 07:47

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
docs/getting_started/nvidia-a100-validation/index.md (1)

126-133: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Reduce repetitive "This does not..." sentence beginnings in the scope section.

The section "What This Does Not Claim" (lines 128–133) lists five scope limitations, four of which begin with "This does not." This repetition diminishes readability. Consider rephrasing 1–2 items to vary the sentence structure.

Example refactoring:

  • "This does not validate AMD ROCm; ROCm was unavailable on this node." → keep as-is
  • "This does not validate H100, SM90, or TMA-specific fused LogP behavior; A100 is SM80." → "H100, SM90, and TMA-specific fused LogP behavior remain unvalidated; this run only covers A100 (SM80)."
  • "This does not reproduce the full benchmark tables in the project README." → keep as-is
  • "This does not claim that every CUDA, driver, or PyTorch combination is supported." → "Support across all CUDA, driver, and PyTorch combinations is not claimed."
  • "This does not validate strict fused mode with --require-fused-logp; ..." → keep as-is
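
The SM80 versus SM90 distinction in the bullets above comes down to the CUDA compute capability pair that the device reports. As a minimal sketch, assuming a hypothetical helper that treats only SM80 as in scope for this run (the function names are illustrative and not part of RL-Kernel):

```python
def sm_tag(major: int, minor: int) -> str:
    """Format a CUDA compute capability as an SM tag, e.g. (8, 0) -> 'SM80'."""
    return f"SM{major}{minor}"


def in_validated_scope(major: int, minor: int) -> bool:
    """True only for SM80, the single architecture the quoted scope notes cover."""
    return (major, minor) == (8, 0)


def describe(major: int, minor: int) -> str:
    """Summarize whether a device falls inside the scope of the A100 run."""
    tag = sm_tag(major, minor)
    if in_validated_scope(major, minor):
        return f"{tag}: covered by the A100 run"
    return f"{tag}: not validated by this run"


if __name__ == "__main__":
    # With PyTorch and a visible GPU, the pair would come from
    # torch.cuda.get_device_capability(0); fixed values keep this
    # sketch runnable on machines without a GPU.
    print(describe(8, 0))  # A100
    print(describe(9, 0))  # H100
```

A check like this only identifies the architecture; it says nothing about whether the fused LogP path was actually dispatched, which is what `--require-fused-logp` is described as enforcing.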
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/getting_started/nvidia-a100-validation/index.md` around lines 126 - 133,
In the "What This Does Not Claim" section, reduce repetitive sentence structure
by rephrasing 1-2 of the five bullet points to vary away from the "This does
not..." opening. Specifically, consider restructuring the second bullet point
about H100, SM90, and TMA-specific fused LogP behavior to emphasize what was
covered instead (A100 SM80), and optionally rephrase the fourth bullet point
about CUDA, driver, and PyTorch support combinations using passive voice or
alternative phrasing. Maintain the factual content and clarity while improving
readability through varied sentence construction.

Source: Linters/SAST tools


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 30fce235-5204-482d-904b-e8a73fa3781b

📥 Commits

Reviewing files that changed from the base of the PR and between 01fcfaa and 38ffe31.

📒 Files selected for processing (4)
  • .gitignore
  • docs/.nav.yml
  • docs/getting_started/nvidia-a100-validation/a100_benchmark_notes.ipynb
  • docs/getting_started/nvidia-a100-validation/index.md
✅ Files skipped from review due to trivial changes (3)
  • docs/.nav.yml
  • .gitignore
  • docs/getting_started/nvidia-a100-validation/a100_benchmark_notes.ipynb

@MengjieLee MengjieLee force-pushed the feat/a100-notebook-validation branch from 38ffe31 to 89b97fc Compare June 22, 2026 07:52
@Flink-ddd

Collaborator

Please resolve the CI error first. Thanks.

@MengjieLee MengjieLee force-pushed the feat/a100-notebook-validation branch from 6d39e9c to a9014d3 Compare June 23, 2026 03:30
@MengjieLee MengjieLee changed the title from "Add A100 validation documentation" to "feat(docs): Add A100 validation documentation" Jun 23, 2026
@MengjieLee

Author

> Please resolve the CI error first. Thanks.

Done! A newline was missing at the end of the file.

@Flink-ddd Flink-ddd left a comment

Collaborator


Thanks for the A100 validation PR, @MengjieLee.

Two changes are needed before merge:

1. .gitignore cleanup: Revert the additions of .agents/, .claude/, etc. These are local tool artifacts and should be handled by your global gitignore.

2. Doc format: We avoid committing static .ipynb files with hardcoded execution logs, as they rot quickly. Please convert this into a generalized "Hardware Profiling Guide" in pure Markdown that explains the process of profiling on Ampere.

Please update the PR when you have a chance.
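
For the first point, one possible way to move such entries into a per-user ignore file is sketched below. The helper names and the demo file are hypothetical; only the `.agents/` and `.claude/` patterns come from this thread, and `core.excludesFile` is the git setting that points at a global ignore file.

```python
import subprocess
import tempfile
from pathlib import Path


def add_global_ignores(ignore_file: Path, patterns: list) -> list:
    """Append patterns that are not already listed; return the ones added."""
    text = ignore_file.read_text() if ignore_file.exists() else ""
    existing = text.splitlines()
    missing = [p for p in patterns if p not in existing]
    if missing:
        # Start on a fresh line if the file lacks a trailing newline.
        prefix = "" if text == "" or text.endswith("\n") else "\n"
        ignore_file.write_text(text + prefix + "".join(f"{p}\n" for p in missing))
    return missing


def register_global_ignore(ignore_file: Path) -> None:
    """Point git at the file. Note: this edits the user's global git config."""
    subprocess.run(
        ["git", "config", "--global", "core.excludesFile", str(ignore_file)],
        check=True,
    )


if __name__ == "__main__":
    # The demo writes to a temporary file so that nothing in the home
    # directory or the global git config is touched.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "gitignore_global"
        print(add_global_ignores(demo, [".agents/", ".claude/"]))
        print(add_global_ignores(demo, [".agents/", ".claude/"]))
```

Calling `register_global_ignore` on a real file in the home directory would make the patterns apply to every repository for that user, which keeps them out of the project's own .gitignore.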

@MengjieLee MengjieLee force-pushed the feat/a100-notebook-validation branch from a9014d3 to ea9655a Compare June 23, 2026 14:43
Document a reproducible NVIDIA A100 smoke validation run and exclude local agent workspace artifacts from version control.
@MengjieLee MengjieLee force-pushed the feat/a100-notebook-validation branch from ea9655a to 3e3394b Compare June 23, 2026 15:18
@MengjieLee MengjieLee requested a review from Flink-ddd June 23, 2026 15:18
@MengjieLee

MengjieLee commented Jun 23, 2026

Author

> Thanks for the A100 validation PR, @MengjieLee.
>
> Two changes are needed before merge:
>
> 1. .gitignore cleanup: Revert the additions of .agents/, .claude/, etc. These are local tool artifacts and should be handled by your global gitignore.
>
> 2. Doc format: We avoid committing static .ipynb files with hardcoded execution logs, as they rot quickly. Please convert this into a generalized "Hardware Profiling Guide" in pure Markdown that explains the process of profiling on Ampere.
>
> Please update the PR when you have a chance.

Done! @Flink-ddd

3 participants