Adding training functionalities to Toolkit #108
Open
laserkelvin wants to merge 328 commits into
Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>
Brings in 5 upstream commits from main:
- a85db34 Refactor hook contexts (NVIDIA#93): splits HookContext into base + DynamicsContext + TrainContext
- 84d8119 chore: bumping torch minimum version to 2.8 (NVIDIA#85)
- 8f7e628 fix(dynamics): MTK NPT/NPH barostat thermostat coupling (NVIDIA#90)
- 001f1cb fix(models): tensile-positive stress convention (NVIDIA#87)
- 7fe7756 fix(models): merge force and stress autograd (NVIDIA#88)

This propagates the new TrainContext shape to all stacked PRs in the training-epic series (#4, #5, #6, #7, #8, #9). Stacked PR branches will need to be rebased or merged on top of this updated training-epic.
Brings training-epic up to date with origin/main on this branch:
- a85db34 Refactor hook contexts (NVIDIA#93): HookContext/DynamicsContext/TrainContext split
- 84d8119 chore: torch>=2.8 (NVIDIA#85)
- 8f7e628 fix(dynamics): MTK NPT/NPH barostat (NVIDIA#90)
- 001f1cb fix(models): tensile-positive stress (NVIDIA#87)
- 7fe7756 fix(models): merge force and stress autograd (NVIDIA#88)

Replaces an earlier direct origin/main merge to ensure a single merge base between this branch and training-epic, so the PR diff displays the true contribution scope (training primitives only, not the upstream merge churn).

# Conflicts:
#   nvalchemi/hooks/_context.py
#   test/hooks/test_context.py
Add supporting functions for upcoming `TrainingStrategy`
Bring in PR #4 (training runtime primitives) and PR NVIDIA#93 (hook context refactor).

Conflict resolution:
- nvalchemi/hooks/_context.py: take upstream's split (HookContext base + DynamicsContext / TrainContext subclasses); keep our additions on TrainContext only:
  * grad_scaler: torch.amp.GradScaler | None = None
  * optimizers / lr_schedulers default to empty list (field(default_factory=list)) instead of None, so the orchestrator's gated-op consumers can iterate without None guards.
- test/hooks/test_context.py: take upstream verbatim, flip optimizers/lr_schedulers default assertions to == [], cover grad_scaler default + populated cases, and add test_optimizers_default_is_independent_per_instance to guard against shared-list aliasing.

Strategy + orchestrator wiring:
- TrainingStrategy._build_context now returns TrainContext and passes model=self.models["main"] to preserve the legacy ctx.model alias for hooks that read a single main model (upstream PR NVIDIA#93 decoupled model from models, so we re-establish the alias at the producer rather than via a property).
- TrainingUpdateHook / TrainingUpdateOrchestrator type hints narrowed from HookContext to TrainContext (no runtime change; TrainContext IS-A HookContext).

Verification: 146 targeted tests / 462 training / 1071 hooks+dynamics passing; make lint + make interrogate green.
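The default_factory choice described in this commit can be illustrated with a minimal sketch. `TrainContextSketch` and `step_all` are hypothetical stand-ins written for this example, not the actual `TrainContext` from nvalchemi; only the three fields named in the commit message are modeled.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class TrainContextSketch:
    """Hypothetical stand-in for TrainContext (assumption: a dataclass-style container)."""

    # default_factory builds a fresh list per instance, so no two contexts share state.
    optimizers: List[Any] = field(default_factory=list)
    lr_schedulers: List[Any] = field(default_factory=list)
    grad_scaler: Optional[Any] = None


def step_all(ctx: TrainContextSketch) -> int:
    """Iterate the optimizers without a None guard; return how many were visited."""
    visited = 0
    for _optimizer in ctx.optimizers:
        visited += 1
    return visited


first = TrainContextSketch()
second = TrainContextSketch()
first.optimizers.append("sgd-placeholder")  # mutating one instance must not leak
```

With a `None` default, every consumer would need an `if ctx.optimizers is not None` check before looping; with a shared mutable default, `second.optimizers` would see the placeholder appended to `first`. The per-instance factory avoids both.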
…strategy-orchestration Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com>
Adds archetypal hook that drives torch.amp autocast and GradScaler via the DO_BACKWARD and DO_OPTIMIZER_STEP stages, keeping TrainingStrategy AMP-agnostic. Supports fp32/bf16/fp16 with skip-safe scheduler gating and accepts both torch.dtype objects and canonical dtype strings. Tests consolidated via precision x device parametrized fixtures plus a CUDA end-to-end case that exercises real autocast and GradScaler without mocking.
Pulls AtomicData/Batch/Dataset/model/optimizer/strategy construction into test/training/conftest.py as pure-value fixtures backed by private _build_* helpers, removing the cross-module import between test_mixed_precision and test_strategy. Adds an autouse fixture that seeds torch (and CUDA when visible) to 0 before each test, dropping 20 inline torch.manual_seed(0) calls. Training-fn symbols stay in test_strategy.py to preserve spec round-trip identity assertions.
Introduce a hook that lazily maintains a torch.optim.swa_utils.AveragedModel over a selected training model at TrainingStage.AFTER_OPTIMIZER_STEP. Keeps the hook a pure observer (no backward, no grad/optimizer/scheduler mutation, no ctx.models mutation) and exposes averaged_model / get_averaged_model for explicit eval and checkpoint use. state_dict / load_state_dict land in a follow-up step.
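A minimal sketch of the lazy-averaging idea, assuming the default equal-weight average of `torch.optim.swa_utils.AveragedModel`. `AveragingObserver` and its `after_optimizer_step` method are hypothetical names for this example; the actual hook is driven by TrainingStage.AFTER_OPTIMIZER_STEP and its constructor options are not shown in this commit message.

```python
import torch
from torch.optim.swa_utils import AveragedModel


class AveragingObserver:
    """Hypothetical observer: builds the averaged copy on first use, never touches the model."""

    def __init__(self):
        self._averaged = None

    def after_optimizer_step(self, model):
        # Lazy construction: AveragedModel deep-copies the model the first time we see it.
        if self._averaged is None:
            self._averaged = AveragedModel(model)
        self._averaged.update_parameters(model)

    @property
    def averaged_model(self):
        return self._averaged


model = torch.nn.Linear(1, 1, bias=False)
observer = AveragingObserver()
for value in (1.0, 3.0):
    with torch.no_grad():
        model.weight.fill_(value)  # stands in for an optimizer step
    observer.after_optimizer_step(model)
```

After the two updates the averaged weight is the mean of the two values, while the training model keeps its latest weight, which is the "pure observer" property the commit describes.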
…ration Add `TrainingStrategy` orchestration
Add shared profiling hooks for training and dynamics
Adding workflow reporter abstraction
update pipeline to be compatible with ema
Signed-off-by: Ying Shi Teh <yteh@nvidia.com>
Fix unweighted validation loss reporting
Add Fix from NGNP Integration
ys-teh reviewed Jun 16, 2026
        left=_loss_weight_to_spec(weight.left),
        right=_loss_weight_to_spec(weight.right),
    )
    if hasattr(weight, "model_dump"):
Collaborator
I notice that for objects that lack model_dump(), strategy.json can be missing/problematic and checkpoint resume can fail. Maybe raise an explicit error when a weight schedule isn't spec-serializable (e.g., require subclassing _BaseWeightSchedule) when initializing?
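One way the suggested fail-fast check could look is sketched below. `require_spec_serializable` and `LinearRampSketch` are hypothetical names, and the check tests for a callable `model_dump` (mirroring the `hasattr` branch in the diff) rather than for the `_BaseWeightSchedule` base class, whose definition is not visible here.

```python
from typing import Any


def require_spec_serializable(name: str, weight: Any) -> Any:
    """Reject weight schedules that cannot be written to a spec, at construction time."""
    # Plain numbers are trivially JSON-serializable (bool is excluded on purpose).
    if isinstance(weight, (int, float)) and not isinstance(weight, bool):
        return weight
    # Pydantic-style objects expose model_dump(); assumed here to be the spec hook.
    if callable(getattr(weight, "model_dump", None)):
        return weight
    raise TypeError(
        f"loss weight {name!r} of type {type(weight).__name__} is not "
        "spec-serializable; checkpoint resume would not be able to rebuild it"
    )


class LinearRampSketch:
    """Hypothetical schedule that can describe itself as a plain dict."""

    def __init__(self, start: float, end: float):
        self.start = start
        self.end = end

    def model_dump(self) -> dict:
        return {"start": self.start, "end": self.end}


ramp = require_spec_serializable("energy", LinearRampSketch(0.0, 1.0))
```

Calling the same function with a bare lambda raises `TypeError` immediately, so the problem surfaces when the strategy is built instead of when a checkpoint is resumed.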
Introduce a user-facing guide for the training API, walking from a minimal script through the strategy lifecycle, configuration, counters, the per-batch forward/loss/backward/update path, optimizer orchestration and update hooks, validation and logging, and checkpointing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Distill the prose style of the training strategy guide into the distributed docs: an intention-first preamble, bridge-and-motivate section openings, and a note for the single-process no-op behavior. Also correct the manager initialization ordering and tighten the data loader/sampler walkthrough. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rewrite the checkpoint overview and manual save/restart walkthrough for clarity, and add a "Serialization scope" section that spells out what a checkpoint manifest can and cannot embed: importable reconstruction specs and JSON/registered-type arguments versus non-importable callables (the training function being the common case), non-serializable spec arguments, and runtime objects such as hooks and dataloaders that must be re-supplied at load time. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
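The importable versus non-importable distinction can be made concrete with a small sketch. The spec layout (`module` plus `qualname`) and the helper names `callable_to_spec` / `spec_to_callable` are assumptions for illustration; the manifest format used by the toolkit is not shown in this commit message.

```python
import importlib
import json
import math


def spec_to_callable(spec):
    """Resolve a {"module", "qualname"} spec back to the object it names."""
    obj = importlib.import_module(spec["module"])
    for part in spec["qualname"].split("."):
        obj = getattr(obj, part)
    return obj


def callable_to_spec(fn):
    """Describe a callable by import path, or fail if it cannot be re-imported."""
    module = getattr(fn, "__module__", None)
    qualname = getattr(fn, "__qualname__", None)
    # Lambdas and closures carry "<lambda>" / "<locals>" in their qualified name.
    if not module or not qualname or "<" in qualname:
        raise ValueError(f"{fn!r} is not importable by name; re-supply it at load time")
    spec = {"module": module, "qualname": qualname}
    if spec_to_callable(spec) is not fn:
        raise ValueError(f"{fn!r} does not round-trip through its import path")
    return spec


# An importable function survives a JSON round trip through the manifest.
manifest = json.dumps({"fn": callable_to_spec(math.sqrt)})
restored = spec_to_callable(json.loads(manifest)["fn"])
```

A training function defined inline in a script or notebook would fail the same check, which is why such callables have to be passed again when the checkpoint is loaded.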
- validation.rst: fix a dropped-word sentence and an em-dash spacing typo.
- hooks.rst: note that most users register concrete TrainingUpdateHook instances directly, and that the orchestrator is created automatically.
- losses.rst: add a short composition overview and a runnable example ahead of the autosummary tables.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
List the training strategy guide in the userguide landing page and add it to the toctree so it is reachable from navigation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Convert prose triple-hyphens to spaced unicode em-dashes so the hooks guide matches the em-dash convention used across the other userguide pages. The ASCII em-dash inside the code-block comment is left as-is. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Call out that training examples are labeled explicitly in the intermediate example list, and flag splitting training workflows into a dedicated section if the collection grows. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Kelvin Lee <kinlongkelvi@nvidia.com> # Conflicts: # CHANGELOG.md
Collaborator
Author
/ok to test 6f08683
ALCHEMI Toolkit Pull Request
Description
This PR introduces the core functionality required to support training and fine-tuning of models in nvalchemi-toolkit.

This PR is still a WIP - do not merge!
Type of Change
Related Issues
Changes Made
- `create_model_spec` methods and dynamic `pydantic` model creation for `pickle`-less serialization of configuration
- `TrainingStrategy` `pydantic` model as a recipe validator and loop executor. The execution is highly modular and extensible, allowing for (hopefully) arbitrarily complex training workflows to be built, and not limited to MLIPs
- `FineTuningStrategy` that specializes `TrainingStrategy` for... fine-tuning workflows by making pre-existing checkpoints and layer addition/modification integral to the workflow
Testing
- (`make pytest`)
- (`make lint`)
Checklist
Additional Notes
Tip
This repository uses Greptile, an AI code review service, to help conduct
pull request reviews. We encourage contributors to read and consider suggestions
made by Greptile, but note that human maintainers will provide the necessary
reviews for merging: Greptile's comments are not a qualitative judgement
of your code, nor is it an indication that the PR will be accepted/rejected.
We encourage the use of emoji reactions to Greptile comments, depending on
their usefulness and accuracy.