PE-US rebuild smoke expects missing policyengine_us_data/storage/soi.csv #147

@anth-volk

Description

The local PE-US rebuild smoke run progressed through CPS download/cache and PUF/demographics loading, then failed while loading the PUF provider because the local policyengine-us-data checkout does not contain policyengine_us_data/storage/soi.csv.

Command shape:

uv run --no-sync python -m microplex_us.pipelines.pe_us_data_rebuild_checkpoint \
  --output-root artifacts/local_us_microplex_smoke \
  --version-id local-smoke-v1 \
  --baseline-dataset /Users/administrator/Documents/PolicyEngine/policyengine-us-data/policyengine_us_data/storage/enhanced_cps_2024.h5 \
  --targets-db /Users/administrator/Documents/PolicyEngine/calibration-diagnostics/.artifacts/policy_data.db \
  --policyengine-us-data-repo /Users/administrator/Documents/PolicyEngine/policyengine-us-data \
  --calibration-backend microcalibrate \
  --donor-imputer-backend zi_qrf \
  --policyengine-materialize-batch-size 100000 \
  --cps-sample-n 1000 --puf-sample-n 1000 --donor-sample-n 1000 \
  --n-synthetic 1000 \
  --defer-policyengine-harness \
  --defer-policyengine-native-score \
  --defer-native-audit \
  --defer-imputation-ablation

Failure:

FileNotFoundError: Could not find PE SOI file at /Users/administrator/Documents/PolicyEngine/policyengine-us-data/policyengine_us_data/storage/soi.csv

Relevant traceback:

microplex_us/data_sources/puf.py:2253 load_frame
microplex_us/data_sources/puf.py:1970 _build_puf_tax_units
microplex_us/data_sources/puf.py:587 uprate_raw_puf_pe_style
microplex_us/data_sources/puf.py:486 _resolve_pe_soi_path

Observed local state:

  • policyengine_us_data/storage/calibration_targets/soi_targets.csv exists and appears to have the schema used by _get_pe_soi_aggregate (Year, Variable, Filing status, AGI lower bound, AGI upper bound, Count, Taxable only, Value).
  • policyengine_us_data/storage/download_prerequisites.py downloads PUF/demographics/np2023/geography prerequisites, but does not download or materialize soi.csv.
  • The rebuild checkpoint CLI exposes --puf-path and --puf-demographics-path, but not a --soi-path, even though PUFSourceProvider has a soi_path field/filter.
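
The first observation can be checked mechanically by comparing the header of soi_targets.csv against the columns listed above. A minimal sketch, assuming the file is a comma-separated CSV with a single header row; `missing_soi_columns` and `EXPECTED_SOI_COLUMNS` are hypothetical names for illustration and are not part of microplex_us or policyengine-us-data:

```python
import csv
from pathlib import Path

# Columns taken from the observation above. Whether _get_pe_soi_aggregate
# actually requires every one of them is an assumption.
EXPECTED_SOI_COLUMNS = (
    "Year",
    "Variable",
    "Filing status",
    "AGI lower bound",
    "AGI upper bound",
    "Count",
    "Taxable only",
    "Value",
)


def missing_soi_columns(csv_path):
    """Return the expected columns that are absent from the CSV header."""
    # utf-8-sig strips a leading byte-order mark if the file has one.
    with Path(csv_path).open(newline="", encoding="utf-8-sig") as handle:
        header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return [name for name in EXPECTED_SOI_COLUMNS if name not in present]
```

An empty list means the header covers every listed column; it says nothing about whether the row contents match what the loader expects.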

This makes a fresh local policyengine-us-data checkout insufficient for the documented --policyengine-us-data-repo path unless soi.csv is manually supplied at the legacy location.
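
One possible shape for a fix is a resolver that honors an explicit override first, then the legacy location, then the calibration targets file. The sketch below is illustrative only: `resolve_soi_path` is a hypothetical function, not the existing `_resolve_pe_soi_path`, the fallback order is an assumption, and `explicit_path` stands in for whatever a hypothetical `--soi-path` option might pass through.

```python
from pathlib import Path

# Both relative locations come from the issue text. The fallback order and
# this function are assumptions, not current microplex_us behavior.
LEGACY_SOI = Path("policyengine_us_data") / "storage" / "soi.csv"
TARGETS_SOI = (
    Path("policyengine_us_data") / "storage" / "calibration_targets" / "soi_targets.csv"
)


def resolve_soi_path(repo_root, explicit_path=None):
    """Return the SOI file to use, or raise FileNotFoundError."""
    if explicit_path is not None:
        explicit = Path(explicit_path)
        if not explicit.is_file():
            # An explicit override that is wrong should fail loudly
            # rather than silently fall back to another file.
            raise FileNotFoundError(f"Could not find PE SOI file at {explicit}")
        return explicit

    root = Path(repo_root)
    candidates = [root / LEGACY_SOI, root / TARGETS_SOI]
    for candidate in candidates:
        if candidate.is_file():
            return candidate

    tried = ", ".join(str(candidate) for candidate in candidates)
    raise FileNotFoundError(f"Could not find PE SOI file; tried: {tried}")
```

Listing every attempted location in the error message would make a failure like the one above easier to diagnose from a fresh checkout.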

Local workaround I will try next: copy storage/calibration_targets/soi_targets.csv to storage/soi.csv in the local policyengine-us-data checkout and rerun the smoke command.
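
For repeatability, the copy step can be scripted. A minimal sketch using only the standard library; `copy_soi_targets_to_legacy` is a hypothetical helper, and the `overwrite` guard is an added assumption so that an existing soi.csv is not clobbered by accident:

```python
import shutil
from pathlib import Path


def copy_soi_targets_to_legacy(repo_root, overwrite=False):
    """Copy calibration_targets/soi_targets.csv to the legacy storage/soi.csv."""
    storage = Path(repo_root) / "policyengine_us_data" / "storage"
    source = storage / "calibration_targets" / "soi_targets.csv"
    destination = storage / "soi.csv"

    if not source.is_file():
        raise FileNotFoundError(f"Missing source file: {source}")
    if destination.exists() and not overwrite:
        # Leave any existing legacy file untouched.
        return destination

    shutil.copy2(source, destination)
    return destination
```

Called with the same directory passed to --policyengine-us-data-repo, this places the file where the failing lookup expects it. Whether the two files are actually interchangeable is still untested at this point.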
