Skip to content

fix: preserve system IDs during inflight refill#113

Open
architdatar wants to merge 2 commits into
NVIDIA:mainfrom
architdatar:fix/112-refill-system-id
Open

fix: preserve system IDs during inflight refill#113
architdatar wants to merge 2 commits into
NVIDIA:mainfrom
architdatar:fix/112-refill-system-id

Conversation

@architdatar

@architdatar architdatar commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Description

Preserve sampler-assigned system_id values for replacement systems during inflight refill. BaseDynamics.refill_check() now seeds rebuilt bookkeeping tensors from values already present on the appended result batch before restoring values for systems that stayed active.

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Performance improvement
  • Documentation update
  • Refactoring (no functional changes)
  • CI/CD or infrastructure change

Related Issues

Fixes #112

Changes Made

  • Preserve system_id values assigned by SizeAwareSampler during refill_check() for inflight batching.
  • Preserve system_id values for systems already in batch.
  • Add test to verify this workflow.

Testing

  • Unit tests pass locally (uv run pytest test/dynamics/test_inflight.py::TestRefillCheck)
  • Linting passes (make lint)
  • New tests added for new functionality meets coverage expectations?
  • Added regression test for system_id preservation during inflight refill.

Checklist

  • I have read and understand the Contributing Guidelines
  • I have updated the CHANGELOG.md
  • I have performed a self-review of my code
  • I have added docstrings to new functions/classes
  • I have updated the documentation (if applicable)

Additional Notes

This is a Python-only toolkit fix. No nvalchemi-toolkit-ops changes are required.

Signed-off-by: architd <architd@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 15, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@architdatar architdatar marked this pull request as ready for review June 15, 2026 22:16
@greptile-apps

greptile-apps Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR fixes a bug where BaseDynamics.refill_check() was overwriting sampler-assigned system_id values on replacement systems with defaults when rebuilding bookkeeping tensors. The fix seeds the new tensor from existing values on the appended result batch before restoring the prefix for systems that stayed active.

  • base.py: Introduces a seed step (new_tensor.copy_(result_vals)) before the existing prefix-overwrite, so replacement systems' bookkeeping values (especially system_id) survive the rebuild. Shape mismatches now raise a RuntimeError rather than being silently skipped.
  • test_inflight.py: Adds test_refill_preserves_replacement_system_id (all-graduated path) and test_refill_preserves_mixed_system_ids (one remaining, one replacement) to guard against regressions on both branches of the fix.

Important Files Changed

Filename Overview
nvalchemi/dynamics/base.py Adds a seed-then-overwrite pass for bookkeeping tensors in refill_check: values already on the appended result batch (including sampler-assigned system_id for replacements) are copied first, then the remaining-systems prefix is restored from the original batch. Shape mismatch now raises RuntimeError instead of being silently skipped, improving diagnosability.
test/dynamics/test_inflight.py Adds two regression tests: one for the all-graduated case (both systems exit, one replacement appended) and one for the mixed case (one system remains, one replacement appended). Both assert that sampler-assigned system_id values are preserved for replacement systems and that remaining systems keep their original IDs.

Reviews (2): Last reviewed commit: "fix: validate refill bookkeeping and cov..." | Re-trigger Greptile

Comment thread nvalchemi/dynamics/base.py
Comment thread test/dynamics/test_inflight.py
Signed-off-by: architd <architd@nvidia.com>

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add another test case where not all samples are graduating?

Essentially, we want to make sure that the refill case where only some slots are being replaced that the system ids are also being preserved then

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I believe that this case is covered in the latest commit (dbcef64) by test_refill_preserves_mixed_system_ids.

That test starts with initial system_ids [0, 1], marks only the first sample as graduated with status = [[1], [0]], then verifies the refill result preserves the remaining system ID and the replacement ID as [1, 2].

Please let me know if you have a different partial-refill scenario in mind.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

🐛[BUG]: BaseDynamics.refill_check overwrites replacement system_id during inflight refill

2 participants