Summary
The extended CPS build retrains all QRF models from scratch every time, even when only calibration weights change. Since the QRF imputation depends on source CPS + PUF data (not weights), the fitted models could be cached and reused.
Current cost
- 85+ variable sequential QRF on ~20K PUF records: ~30-60 min
- Additional QRF calls for weeks_unemployed, retirement contributions, SS sub-components
- This runs on every `make data` or CI build
Proposed approach
- Serialize fitted QRF models (e.g. pickle/joblib) keyed by a hash of the training data
- On rebuild, check whether the source data hash matches the one the cached model was keyed by; if so, skip training and just predict
- microimpute could potentially support this natively (save/load fitted models)
- Could also cache the full `extended_cps_2024.h5` and only rebuild when CPS/PUF inputs change
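
A minimal sketch of the hash-keyed cache described above, assuming the CPS/PUF inputs are available as files on disk and that the fitted model object can be pickled. The names `hash_files`, `load_or_fit`, and `fit_fn`, and the `qrf_<hash>.pkl` file layout, are hypothetical and not part of the current build or of microimpute:

```python
import hashlib
import pickle
from pathlib import Path
from typing import Any, Callable, Iterable, Tuple


def hash_files(paths: Iterable[Path], chunk_size: int = 1 << 20) -> str:
    """Return a SHA-256 digest over the names and contents of the input files."""
    digest = hashlib.sha256()
    # Sort so the key does not depend on the order the paths were passed in.
    for path in sorted(Path(p) for p in paths):
        digest.update(path.name.encode())
        with open(path, "rb") as f:
            # Read in chunks so large source files are not loaded into memory.
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
    return digest.hexdigest()


def load_or_fit(
    source_paths: Iterable[Path],
    cache_dir: Path,
    fit_fn: Callable[[], Any],
) -> Tuple[Any, bool]:
    """Return (model, cache_hit); fit_fn is only called on a cache miss."""
    key = hash_files(source_paths)
    cache_dir = Path(cache_dir)
    cache_path = cache_dir / f"qrf_{key[:16]}.pkl"  # hypothetical naming scheme

    if cache_path.exists():
        with open(cache_path, "rb") as f:
            return pickle.load(f), True

    # Cache miss: train, then write via a temp file so an interrupted
    # build never leaves a truncated pickle behind.
    model = fit_fn()
    cache_dir.mkdir(parents=True, exist_ok=True)
    tmp_path = cache_path.with_suffix(".tmp")
    with open(tmp_path, "wb") as f:
        pickle.dump(model, f)
    tmp_path.replace(cache_path)
    return model, False
```

Here `fit_fn` would wrap the existing training call, so a weights-only rebuild would load the pickle and go straight to prediction. In practice the key would probably also need to cover anything else that changes the fitted model (for example hyperparameters or the library version), and pickles should only be loaded from a trusted cache. The same hash could guard the full `extended_cps_2024.h5` output.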
Context
Related to the sequential QRF migration in #594 — now that all 85 variables are in a single fit() call, caching the one fitted model would skip the entire training phase.