Tutorials/Accelerated Python: Fixes and improvements to memory spaces, asynchrony, and kernel authoring notebooks#148

Open
brycelelbach wants to merge 18 commits into main from fix/accelerated-python-memory-space-asynchrony-and-kernel-notebooks

@brycelelbach brycelelbach commented Mar 3, 2026

Summary

Fixes and improvements across four Accelerated Python tutorial notebooks (and their solutions): Memory Spaces, Asynchrony, Copy Kernel Authoring, and Book Histogram Kernel Authoring.

Memory Spaces (Power Iteration)

  • Fix benchmark to compare CPU wall-clock times for both host and device, and revert to using the same input matrix for both.
  • Unify benchmark reporting to use milliseconds with mean ± relative stdev format.
  • Remove inline benchmarking from sections 3 and 4.
  • Separate eigvals into its own cell, fix capitalization, and restore eigvals timing.
  • Remove outdated note about cupy.linalg.eigvals not being implemented.
  • Remove unnecessary isinstance check before cp.asarray().
  • Rename estimate_device_exercise / generate_device_exercise back to estimate_device / generate_device.
  • Re-add checkpoint I/O to power iteration.

Asynchrony (Power Iteration)

  • Fix compute step NVTX annotations with accurate step ranges and per-step regions.
  • Limit warmup calls to 1 step to avoid unnecessary computation.

Cross-Cutting (Memory Spaces + Asynchrony)

  • Unify benchmark reporting to use milliseconds with mean ± relative stdev format.
  • Fix typos in example usage of cupyx.profiler.profile and NVTX.

Kernel Authoring (Copy)

  • Add configurable correctness check to the copy kernel launch function.
  • Add correctness check cell before profiling copy_blocked kernel.
  • Add output mode to copy kernel scripts to print problem size and dtype.
  • Apply title, TOC, section header, and text updates to solution notebooks.

Kernel Authoring (Book Histogram)

  • Improve plot formatting with titles and dataset size display.
  • Display dataset size in megabytes instead of bytes.
  • Apply title, TOC, section header, and text updates to solution notebooks.

copy-pr-bot bot commented Mar 3, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions bot commented Mar 3, 2026

❌ Commit Signature Check Failed

Found 1 unsigned commit(s):

🔗 View workflow run logs

  • f4add4b: Tutorials/Accelerated Python/Kernel Authoring: Add configurable correctness check to copy kernel launch function. (unsigned)

All commits must be signed

How to fix:

  1. Configure commit signing (if not already done):

    # For GPG signing
    git config --global commit.gpgsign true

    # Or for SSH signing (Git 2.34+)
    git config --global gpg.format ssh
    git config --global user.signingkey ~/.ssh/id_ed25519.pub
    git config --global commit.gpgsign true

  2. Re-sign your commits:

    # Replay each commit on top of origin/main, re-signing it with -S
    git rebase -i origin/main --exec "git commit --amend --no-edit -S"
    # Force-push, refusing to overwrite remote work you have not fetched
    git push --force-with-lease

📚 GitHub documentation on signing commits

@brycelelbach

/ok to test e9e650d

github-actions bot commented Mar 3, 2026

❌ Link Check Failed

Broken links were detected in this PR.

Please check the workflow run logs for details on which links are broken.

Common fixes:

  1. Typo in URL - Check for spelling mistakes in the link
  2. Outdated link - The page may have moved or been deleted
  3. Relative path issue - Ensure relative links use the correct path
  4. External site down - If the external site is temporarily down, you can add it to brev/.lycheeignore

To test links locally:

./brev/test-links.bash .

📚 Lychee documentation

@brycelelbach brycelelbach force-pushed the fix/accelerated-python-memory-space-asynchrony-and-kernel-notebooks branch from 1a8ecc0 to bc7da88 Compare March 3, 2026 20:48
@brycelelbach

/ok to test eab9896

… plot formatting, add title and dataset size display.
…ctness check to copy kernel launch function.
… dataset size in megabytes instead of bytes.

Made-with: Cursor
…cell before profiling copy_blocked kernel.

Add a verification cell that runs the script immediately after the %%writefile cell, matching the pattern used in the book histogram notebook.

Made-with: Cursor
…ion header, and text updates to solution notebooks.

These changes were made to the exercise notebooks in 2f5c4fc but
were not applied to the corresponding solution notebooks.

Made-with: Cursor
…y kernel scripts to print problem size and dtype.

Made-with: Cursor
…power iteration.

The savetxt checkpoint I/O was removed from the 05 memory spaces
notebooks in 2f5c4fc. This I/O is needed to set up the narrative for
Notebook 06 (Asynchrony), whose baseline is the synchronous
device-to-host copy + file write pattern introduced here.

Made-with: Cursor
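
A minimal sketch of the synchronous checkpoint pattern this commit restores, assuming a host-side power iteration that writes its current vector with np.savetxt every few steps. The function signature, the checkpoint_every and checkpoint_dir parameters, and the file naming are hypothetical and not taken from the notebooks.

```python
import os
import tempfile

import numpy as np


def estimate_host(A, max_steps=100, checkpoint_every=10, checkpoint_dir="."):
    """Power iteration on the host with periodic checkpoints (hypothetical signature)."""
    x = np.ones(A.shape[0])
    lam = 0.0
    for step in range(max_steps):
        y = A @ x
        lam = (x @ y) / (x @ x)          # Rayleigh quotient estimate of the dominant eigenvalue
        x = y / np.linalg.norm(y)        # Normalize to keep the vector well scaled
        if step % checkpoint_every == 0:
            # Synchronous file write: the loop waits until the checkpoint is on disk.
            np.savetxt(os.path.join(checkpoint_dir, f"host_{step}.txt"), x)
    return lam


# Usage on a small matrix whose dominant eigenvalue is known to be 3.
A_small = np.diag([3.0, 1.0])
with tempfile.TemporaryDirectory() as tmp_dir:
    lam = estimate_host(A_small, checkpoint_dir=tmp_dir)
    n_checkpoints = len(os.listdir(tmp_dir))
print(lam, n_checkpoints)
```

In a device version, the vector would first need a device-to-host copy (for example with cp.asnumpy) before np.savetxt can write it, which is the synchronous copy + file write baseline the commit message refers to.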
…ance check before cp.asarray().

The isinstance(A, np.ndarray) guard added in 2f5c4fc is unnecessary
because cp.asarray() already handles both cases: it copies a host
array to the GPU, and is a no-op when the array is already on the
GPU. Teaching users to call cp.asarray() unconditionally is the
intended lesson.

Made-with: Cursor
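
A short sketch of the unconditional asarray pattern described above. To stay runnable without a GPU, the executed part uses NumPy as the array module; the CuPy calls appear only in comments and assume CuPy is installed with a usable device. The helper name to_array and its xp parameter are hypothetical.

```python
import numpy as np


def to_array(A, xp=np):
    """Convert A to an array of module xp, with no isinstance guard (hypothetical helper)."""
    return xp.asarray(A)


A_host = np.eye(3)

# The input is already an array of the target module, so no copy is made.
A_same = to_array(A_host)
print(A_same is A_host)

# With CuPy (assumption, not executed here):
#   import cupy as cp
#   A_device = to_array(A_host, xp=cp)    # copies the host array to the GPU
#   A_again = to_array(A_device, xp=cp)   # already on the GPU, no copy needed
```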
…ercise and generate_device_exercise back to estimate_device and generate_device.

The _exercise suffix was added in 2f5c4fc but breaks the naming
symmetry with estimate_host and generate_host. The host/device naming
convention is cleaner and mirrors the pattern used in the 06
(Asynchrony) notebooks.

Made-with: Cursor
… to avoid unnecessary computation.

Made-with: Cursor
…e the same input matrix for host and device.

Different matrices converge at different rates, so it's only valid to
benchmark on the same inputs. Use A_host for both host and device benchmarks
instead of comparing A_host (CPU) against A_device (GPU-generated).

Made-with: Cursor
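
To illustrate why the input must be shared, here is a sketch assuming a power iteration that stops once successive eigenvalue estimates agree to a tolerance. The stopping rule, the tolerance, and the two diagonal test matrices are assumptions chosen for illustration; the notebooks may use a different criterion.

```python
import numpy as np


def power_iteration_steps(A, tol=1e-10, max_steps=100_000):
    """Return (estimate, steps taken) for a tolerance-based power iteration."""
    x = np.ones(A.shape[0])
    x /= np.linalg.norm(x)
    prev = 0.0
    est = 0.0
    for step in range(1, max_steps + 1):
        y = A @ x
        est = x @ y                      # Rayleigh quotient, valid because ||x|| == 1
        x = y / np.linalg.norm(y)
        if abs(est - prev) < tol:
            return est, step
        prev = est
    return est, max_steps


# Both matrices have dominant eigenvalue 1, but different gaps to the second eigenvalue.
A_fast = np.diag([1.0, 0.5])     # ratio 0.5: converges in few steps
A_slow = np.diag([1.0, 0.99])    # ratio 0.99: converges in many more steps

est_fast, fast_steps = power_iteration_steps(A_fast)
est_slow, slow_steps = power_iteration_steps(A_slow)
print(fast_steps, slow_steps)
```

Because the step count depends on the matrix, timing the host on one matrix and the device on another would mix the cost of extra iterations into the comparison.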
…econds with mean ± relative stdev format.

- 05 Memory Spaces: Use cupyx.profiler.benchmark with mean/stdev/runs format.
- 06 Asynchrony: Use time.perf_counter for single-run timing in ms.
- 40/41 Kernel Authoring: Convert benchmark output from seconds to ms.
- Rename timing variable from D to T across all notebooks.

Made-with: Cursor
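
One way the unified report could look, assuming the timings arrive as an array of seconds (for example, the cpu_times attribute of the result returned by cupyx.profiler.benchmark). The helper name format_timing and the exact output string are hypothetical; only the milliseconds and mean ± relative stdev conventions come from the commit message.

```python
import time

import numpy as np


def format_timing(times_s):
    """Format timings given in seconds as 'mean ms ± relative stdev% of N runs'."""
    T = np.asarray(times_s, dtype=float) * 1e3   # seconds -> milliseconds
    mean = T.mean()
    rel_stdev = T.std() / mean * 100.0           # stdev as a percentage of the mean
    return f"{mean:.3f} ms ± {rel_stdev:.2f}% (mean ± relative stdev of {T.size} runs)"


# Single-run timing in ms with time.perf_counter, as described for 06 Asynchrony.
start = time.perf_counter()
np.linalg.norm(np.ones(1000))
T_single_ms = (time.perf_counter() - start) * 1e3
print(f"{T_single_ms:.3f} ms")

# Multi-run report from a list of per-run timings in seconds.
summary = format_timing([0.001, 0.003])
print(summary)
```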
… own cell, fix capitalization, restore eigvals timing.

- Split expensive np.linalg.eigvals call into a dedicated cell timed with
  time.perf_counter.
- Report eigvals timing alongside host/device benchmarks.
- Capitalize print labels consistently (Power Iteration, Relative Error).
- Use "Timing Host"/"Timing Device" instead of "Timing CPU"/"Timing GPU".

Made-with: Cursor
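
A sketch of what a dedicated eigvals cell might look like, assuming a random dense input. The matrix size, the seed, the variable names, and the print labels are hypothetical; np.linalg.eigvals and time.perf_counter are the calls named in the commit message.

```python
import time

import numpy as np

rng = np.random.default_rng(0)
A_host = rng.random((256, 256))          # hypothetical problem size

# Time the expensive reference computation on its own, in milliseconds.
start = time.perf_counter()
eigenvalues = np.linalg.eigvals(A_host)
T_eigvals_ms = (time.perf_counter() - start) * 1e3

# The dominant eigenvalue is the one with the largest magnitude.
reference = np.abs(eigenvalues).max()

print(f"Timing Eigvals: {T_eigvals_ms:.3f} ms")
print(f"Reference Eigenvalue: {reference:.6f}")
```

Keeping this call in its own cell means rerunning the power iteration cells does not repeat the full eigendecomposition.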
…tions with accurate step ranges and per-step regions.

Made-with: Cursor
…g from sections 3 and 4.

Restore the original style from before 2f5c4fc: just call the functions, print
the estimates, and show both matrices side by side. Benchmarking belongs in
section 5 where cupyx.profiler.benchmark is used properly.

Made-with: Cursor
…t cupy.linalg.eigvals not being implemented.

Made-with: Cursor
…CPU wall-clock times for both host and device.

The benchmarking cell was using .gpu_times[0] for the device benchmark
but .cpu_times for the host benchmark, which is an apples-to-oranges
comparison. The GPU time measures only device execution, excluding kernel
launch overhead, synchronization, and other CPU-side costs. The CPU time
(wall-clock) is the end-to-end time, which is the fair metric for both.
This was introduced in 0c365ed.

Made-with: Cursor
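
The distinction can be sketched with two small helpers, assuming a result object shaped like the one cupyx.profiler.benchmark returns: cpu_times as a 1-D array of seconds and gpu_times as a 2-D array with one row per device. The helper names are hypothetical, and the stand-in object below holds placeholder numbers, not measurements, so the sketch runs without a GPU.

```python
from types import SimpleNamespace

import numpy as np


def end_to_end_ms(result):
    """Mean CPU wall-clock time in ms: the same metric for host and device runs."""
    return float(np.mean(result.cpu_times)) * 1e3


def device_only_ms(result, device=0):
    """Mean GPU time in ms for one device: excludes CPU-side costs."""
    return float(np.mean(np.asarray(result.gpu_times)[device])) * 1e3


# Stand-in for a benchmark result; the values are placeholders, not measurements.
placeholder_result = SimpleNamespace(
    cpu_times=np.array([0.004, 0.004]),
    gpu_times=np.array([[0.003, 0.003]]),
)

print(end_to_end_ms(placeholder_result))    # metric to report for both host and device
print(device_only_ms(placeholder_result))   # metric the earlier cell used for the device
```

Reporting end_to_end_ms for both runs keeps the comparison on a single clock, which is the fix this commit describes.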
@brycelelbach brycelelbach force-pushed the fix/accelerated-python-memory-space-asynchrony-and-kernel-notebooks branch from eab9896 to 7f7e7a4 Compare March 3, 2026 22:48