Fixing up model wrapping and tracking of metrics code, learning scheduler #707
Open
Conversation
Contributor
Pull request overview
This PR fixes a critical bug in model wrapping for distributed training in Hyrax. The issue stems from PyTorch Ignite's idist.auto_model(), which wraps models for distributed execution: the previous code accessed the optimizer and scheduler, and stored state, on the wrapped model instead of the unwrapped model. This caused failures in GPU/distributed scenarios.
Changes:
- Introduced explicit `wrapped_model` and `model` variables in all engine creation functions to distinguish between wrapped (for execution) and unwrapped (for state access) models
- Changed optimizer and scheduler access from local variables to direct attribute access on the unwrapped model (`model.optimizer`, `model.scheduler`)
- Ensured all model state modifications (metrics, learning rate history) are performed on the unwrapped model
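To make the failure mode concrete, here is a minimal, self-contained sketch (not Hyrax code; the `optimizer` and `lr_history` attribute names are illustrative) showing that the parallel wrappers idist.auto_model() may return neither expose plain Python attributes of the inner model nor pass attribute writes through to it:

```python
import torch

inner = torch.nn.Linear(4, 2)
inner.optimizer = torch.optim.SGD(inner.parameters(), lr=0.1)

# DataParallel stands in here for whatever idist.auto_model() returns on GPU
# hardware (DataParallel or DistributedDataParallel, depending on the setup).
wrapped = torch.nn.DataParallel(inner)

assert wrapped.module is inner            # the unwrapped model is still reachable

# Reading state through the wrapper fails ...
try:
    _ = wrapped.optimizer
except AttributeError:
    print("the wrapper does not expose inner.optimizer")

# ... and writing through the wrapper never reaches the inner model.
wrapped.lr_history = [0.1]
print(hasattr(inner, "lr_history"))       # False: the history landed on the wrapper
```

Under these semantics, any optimizer access or metric memoization done on the wrapped object either raises AttributeError or silently lands on the wrapper, which is the GPU/distributed failure described above.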
Codecov Report: ✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

@@            Coverage Diff             @@
##             main     #707      +/-   ##
==========================================
- Coverage   64.17%   64.15%   -0.02%
==========================================
  Files          61       61
  Lines        5892     5890       -2
==========================================
- Hits         3781     3779       -2
  Misses       2111     2111
drewoldag approved these changes on Feb 19, 2026

Collaborator
drewoldag left a comment:
This looks good to me.
This is a follow-up to #706, which was a targeted fix for the bug reported in the DIRAC Slack: https://uw-dirac.slack.com/archives/C08F5FLEY5A/p1771350453909469
This is the more complete and thorough fix.
To avoid storing data on, or calling functions through, the wrong model, the create_* methods now have two local variables, "model" and "wrapped_model". In the case where idist.auto_model() does no wrapping, these refer to the same object. I've tried to update all the relevant accesses, but I want both @drewoldag and @SamSandwich07 to sign off before I merge.
I've also removed the local scheduler and optimizer variables in favor of model.scheduler and model.optimizer, which always live on the inner model.
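As a rough sketch of that pattern (assumptions: the engine name, loss, and batch layout below are made up for illustration; only the model/wrapped_model split and the model.optimizer / model.scheduler access mirror the PR description), a create_* function now looks something like:

```python
import ignite.distributed as idist
import torch
from ignite.engine import Engine


def create_train_engine(model: torch.nn.Module) -> Engine:
    # `wrapped_model` is used for execution; it may be a DataParallel/DDP
    # wrapper, or the same object as `model` if no wrapping is needed.
    wrapped_model = idist.auto_model(model)

    def train_step(engine, batch):
        inputs, targets = batch
        # Optimizer and scheduler always live on the inner (unwrapped) model.
        model.optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(wrapped_model(inputs), targets)
        loss.backward()
        model.optimizer.step()
        model.scheduler.step()
        # State memoized during training also goes on the unwrapped model,
        # so it is found there regardless of how idist wrapped the model.
        model.lr_history = getattr(model, "lr_history", [])
        model.lr_history.append(model.optimizer.param_groups[0]["lr"])
        return loss.item()

    return Engine(train_step)
```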
This has also exposed a larger issue: our CI doesn't have real GPUs. We would have caught the original bug, and perhaps several other issues where we memoize data onto the model, if our CI ran on even an old and cheap GPU.
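One possible cheap mitigation, offered here only as an assumption rather than anything in this PR, would be a GPU-gated test that exercises the wrapping path whenever a runner does have a device:

```python
import pytest
import torch

requires_gpu = pytest.mark.skipif(
    not torch.cuda.is_available(), reason="needs a real GPU to trigger model wrapping"
)


@requires_gpu
def test_state_stays_on_unwrapped_model():
    import ignite.distributed as idist

    # Hypothetical test; attribute names mirror the PR description.
    model = torch.nn.Linear(4, 2).cuda()
    model.optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    wrapped_model = idist.auto_model(model)
    out = wrapped_model(torch.randn(8, 4, device="cuda"))
    assert out.shape == (8, 2)

    # Anything memoized during training must be readable on the unwrapped model.
    model.lr_history = [0.1]
    assert hasattr(model, "lr_history")
    assert model.optimizer is not None
```

Even a single short test like this, run on an old and cheap GPU, would have surfaced the original attribute-access failure before release.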