Donor survey loader subprocess uses local policyengine-us-data venv missing runtime deps #151

@anth-volk

On a clean-main PE-US rebuild smoke run, after working around missing PE-US-data prerequisite files (#147, #148), the pipeline got through CPS/PUF loading and failed while loading the ACS donor provider.

Clean-main worktree used for this run:

/Users/administrator/Documents/PolicyEngine/worktrees/microplex-us/fix-pe-rebuild-smoke-issues

Command shape:

python -m microplex_us.pipelines.pe_us_data_rebuild_checkpoint \
  --output-root artifacts/local_us_microplex_smoke \
  --version-id local-smoke-v1 \
  --baseline-dataset /Users/administrator/Documents/PolicyEngine/policyengine-us-data/policyengine_us_data/storage/enhanced_cps_2024.h5 \
  --targets-db /Users/administrator/Documents/PolicyEngine/calibration-diagnostics/.artifacts/policy_data.db \
  --policyengine-us-data-repo /Users/administrator/Documents/PolicyEngine/policyengine-us-data \
  --calibration-backend microcalibrate \
  --donor-imputer-backend zi_qrf \
  --policyengine-materialize-batch-size 100000 \
  --cps-sample-n 1000 --puf-sample-n 1000 --donor-sample-n 1000 \
  --n-synthetic 1000 \
  --defer-policyengine-harness \
  --defer-policyengine-native-score \
  --defer-native-audit \
  --defer-imputation-ablation

Progress before failure:

Loading processed CPS ASEC 2023 from ~/.cache/microplex/cps_asec_2023_processed_v20260601_ecps_spm_takeup_inputs.parquet
Loading PUF from ~/.cache/microplex/puf_2015.csv...
  Raw records: 207,692
Loading demographics from ~/.cache/microplex/demographics_2015.csv...
  After demographics merge: 207,692
Expanded 1,000 tax units to 1,921 persons
Loading processed CPS ASEC 2023 from ~/.cache/microplex/cps_asec_2023_processed_v20260601_ecps_spm_takeup_inputs.parquet

Failure from the ACS donor loader subprocess:

Traceback (most recent call last):
  File "<string>", line 6, in <module>
    import numpy as np
ModuleNotFoundError: No module named 'numpy'

The parent traceback shows the subprocess command uses:

/Users/administrator/Documents/PolicyEngine/policyengine-us-data/.venv/bin/python

Relevant traceback:

microplex_us/data_sources/donor_surveys.py:880 load_frame
microplex_us/data_sources/donor_surveys.py:598 _default_acs_tables_loader
microplex_us/data_sources/donor_surveys.py:572 _run_policyengine_dataset_loader_from_spec
microplex_us/data_sources/donor_surveys.py:539 _run_policyengine_dataset_loader
subprocess.CalledProcessError

For a fresh smoke command, passing --policyengine-us-data-repo alone is not enough when the sibling checkout has a stale or incomplete .venv: the loader subprocess runs under that checkout's .venv/bin/python, which in this run lacks numpy. The CLI already exposes --policyengine-us-data-python, so the workaround is to rerun with that flag pointing at the active smoke environment's Python. A better contract may be either to default to sys.executable unless the caller explicitly supplies a PE-US-data Python, or to validate that the selected subprocess Python can import numpy, pandas, and policyengine_us_data before starting the expensive source load.
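A minimal sketch of the proposed preflight check, to make the suggestion concrete. The names `REQUIRED_MODULES`, `missing_modules`, and `resolve_loader_python` are hypothetical and do not come from microplex_us. The sketch assumes the check can run as a cheap `python -c` probe against the candidate interpreter, and that falling back to `sys.executable` is acceptable when no interpreter is supplied.

```python
from __future__ import annotations

import json
import subprocess
import sys
from typing import Optional, Sequence

# Hypothetical list of modules the donor loader subprocess needs (from this issue).
REQUIRED_MODULES = ("numpy", "pandas", "policyengine_us_data")

# Probe script run inside the candidate interpreter. It only looks up import
# specs, so it does not pay the cost of actually importing the packages.
_PROBE = (
    "import importlib.util, json, sys\n"
    "missing = [m for m in sys.argv[1:] if importlib.util.find_spec(m) is None]\n"
    "print(json.dumps(missing))\n"
)


def missing_modules(
    python_executable: str, modules: Sequence[str] = REQUIRED_MODULES
) -> list[str]:
    """Return the top-level modules that python_executable cannot find."""
    result = subprocess.run(
        [python_executable, "-c", _PROBE, *modules],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(result.stdout)


def resolve_loader_python(
    explicit_python: Optional[str] = None,
    modules: Sequence[str] = REQUIRED_MODULES,
) -> str:
    """Pick the subprocess interpreter and fail fast if it lacks dependencies."""
    # Assumption: default to the current interpreter unless one is given explicitly.
    python = explicit_python or sys.executable
    missing = missing_modules(python, modules)
    if missing:
        raise RuntimeError(
            f"{python} is missing required modules: {', '.join(missing)}. "
            "Pass --policyengine-us-data-python to select another interpreter."
        )
    return python
```

Called before the CPS/PUF load, a check like this would have surfaced the missing numpy in seconds rather than after the source loads completed. Whether the real fix belongs in the CLI argument handling or in donor_surveys.py is left open here.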
