diff --git a/.github/codex/prompts/pr_review.md b/.github/codex/prompts/pr_review.md index 43736c1e..33805f78 100644 --- a/.github/codex/prompts/pr_review.md +++ b/.github/codex/prompts/pr_review.md @@ -56,7 +56,8 @@ When reviewing new features or code paths, specifically check: ## Deferred Work Acceptance -This project tracks deferred technical debt in `TODO.md` under "Tech Debt from Code Reviews." +This project tracks deferred technical debt in `TODO.md` under "Deferred / Documented" +(blocked items, sub-grouped by blocker) and shippable items under "Actionable Backlog." - If a limitation is already tracked in `TODO.md` with a PR reference, it is NOT a blocker. - If a PR ADDS a new `TODO.md` entry for deferred work, that counts as properly tracking @@ -96,7 +97,7 @@ Apply the assessment based on the HIGHEST severity of UNMITIGATED findings: A finding is MITIGATED (does not count toward assessment) if: - The deviation is documented in `docs/methodology/REGISTRY.md` with a Note/Deviation label -- The limitation is tracked in `TODO.md` under "Tech Debt from Code Reviews" +- The limitation is tracked in `TODO.md` under "Deferred / Documented" or "Actionable Backlog" - The PR itself adds a TODO.md entry or REGISTRY.md note for the issue - The finding is about an implementation choice between valid numerical approaches diff --git a/CLAUDE.md b/CLAUDE.md index fede87d2..d07a828b 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -107,8 +107,23 @@ wording will cause a P1 finding ("undocumented methodology deviation"). **TODO.md format** — for deferring P2/P3 items only (P0/P1 cannot be deferred): -Add a row to the table in `TODO.md` under "Tech Debt from Code Reviews" in the appropriate -category (`Methodology/Correctness`, `Performance`, or `Testing/Docs`): +Add a row to `TODO.md`. 
If the item is genuinely shippable (clear path, no external +blocker), put it under **Actionable Backlog** in the appropriate sub-section +(`Methodology / correctness`, `Performance`, or `Testing / docs`). If it is blocked, put it +under **Deferred / Documented** in the matching blocker sub-section (`Paper-gated / needs +methodology derivation`, `Needs external reference (R / Stata / Julia)`, `Parked — pending +user demand / out of scope`, or `Won't-fix / waived`). Either way the AI reviewer's +deviation-grep resolves on the row's `Location` + reason text. The two buckets use +different table shapes — Actionable rows carry an `Effort` column, Deferred rows a `PR` +column: + +Actionable Backlog: + +| Issue | Location | Origin | Effort | Priority | +|-------|----------|--------|--------|----------| +| Description of the work item | `file.py` | #NNN | Quick/Mid/Heavy | Medium/Low | + +Deferred / Documented: | Issue | Location | PR | Priority | |-------|----------|----|----------| diff --git a/TODO.md b/TODO.md index 679869f2..72b298c4 100644 --- a/TODO.md +++ b/TODO.md @@ -4,11 +4,162 @@ Internal tracking for technical debt, known limitations, and maintenance tasks. For the public feature roadmap, see [ROADMAP.md](ROADMAP.md). +## How this file is organized + +- **[Actionable Backlog](#actionable-backlog)** — work with a clear implementation path and + no external blocker. **Pull from here.** Effort (`Quick` ≤1 day · `Mid` 3-10 CI rounds · + `Heavy` derivation-free but large) is noted per row; `Priority` is carried from the + originating PR review. +- **[Deferred / Documented](#deferred--documented)** — known gaps that are **not currently + actionable**: blocked on a methodology derivation, on external tooling (R / Stata / Julia) + absent from CI, parked pending user demand / out of paper scope, or explicitly won't-fix. 
+ Retained for provenance and AI-review deviation-documentation — **do not pull from here + without first clearing the named blocker.** +- **[Reference / Status](#reference--status)** — not backlog: user-facing limitations, + module-size monitoring, deprecations, and current-state notes. + +The `Origin` column (Actionable tables) and the `PR` column (Deferred tables) both point to the originating PR number or review tag. + +--- + +## Actionable Backlog + +### Methodology / correctness + +| Issue | Location | Origin | Effort | Priority | +|-------|----------|--------|--------|----------| +| `SyntheticControl` cv: thread an `"infeasible"` reason-code from `_outer_solve_V_cv()` / `_placebo_fit_unit()` so `in_space_placebo()` / `leave_one_out()` distinguish a structural cv-refit exclusion (donor-indistinguishable re-aggregated window) from a genuine inner-solver non-convergence — mirror the split `in_time_placebo()` already emits. Warnings already distinguish the two causes; only the machine-readable status/count is missing. | `synthetic_control.py`, `synthetic_control_results.py` | follow-up | Mid | Low | +| `CallawaySantAnna`: materialize NaN entries for non-estimable `(g,t)` cells in `group_time_effects` (currently omitted with a consolidated warning); requires updating downstream consumers (event study, `balance_e`, aggregation). | `staggered.py` | #256 | Mid | Low | +| `CallawaySantAnna` / `StaggeredTripleDifference` fit their covariate OR nuisance via estimator-local `cho_solve(X'X)` / `scipy.lstsq(cond=1e-7)` that bypass `solve_ols`, so they are NOT scale-equilibrated — a large-scale covariate can perturb the nuisance fit (`TripleDifference`'s OR fit already routes through `solve_ols`). Route the local OR fits through the shared scale-robust solver (or equilibrate locally). 
| `staggered.py`, `staggered_triple_diff.py` | covariate-review | Mid | Medium | +| Adopt the shared `_rank_guarded_inv` for the *structural* (non-covariate) matrix inverses sharing the `LinAlgError`-only fallback that can go near-singular: `continuous_did.py:1056` (dose B-spline), `spillover.py:3371` (ring-solve, partially guarded), `two_stage.py:3154` (TSL Stage-2 variance), `imputation.py:2403`, `had.py:2413`, `conley.py:1109`. These invert internal bases users cannot perturb with `covariates=` (distinct from the already-fixed covariate-triggered SE bug); the helper is the seam. | `continuous_did.py`, `spillover.py`, `two_stage.py`, `imputation.py`, `had.py`, `conley.py` | dr-or-se-rank-guard | Mid | Low | +| Survey-design resolution / collapse patterns are inconsistent across panel estimators — `ContinuousDiD` rebuilds unit-level design in SE code, `EfficientDiD` builds once in `fit()`, `StackedDiD` re-resolves on stacked data. Extract shared helpers for panel-to-unit collapse, post-filter re-resolution, and metadata recomputation. | `continuous_did.py`, `efficient_did.py`, `stacked_did.py` | #226 | Mid | Low | +| `SyntheticControl` remaining ADH-2015 §4 items: the regression-weight `W^reg = X_0'(X_0 X_0')^{-1} X_1` extrapolation diagnostic (flag implied OLS weights outside `[0,1]`) and sparse-SC subset search (`l < J`, holding `V` fixed). LOO, in-time placebo, CV `V`-selection, and inverse-variance `V` have landed; these two are the deferred tail. | `synthetic_control.py`, `synthetic_control_results.py` | ADH-2015 | Mid | Low | +| `SyntheticControl` conformal (CWZ 2021) extensions: (a) one-sided / signed-`t` variants (§7); (b) covariates in the conformal proxy (`X_jt`, eqs 4/6 — current proxy is outcomes-only); (c) AR / innovation-permutation path (Lemmas 5-7) for time-series proxies. The joint test, pointwise CIs, and average-effect CI have landed. 
| `conformal.py`, `synthetic_control_results.py` | CWZ-2021 | Heavy | Low | +| `ContinuousDiD` CGBS-2024 extensions (matches R `contdid` v0.1.0 deferral set): (a) `covariates=` kwarg; (b) discrete-treatment saturated regression (integer dose currently warned, not routed to per-level coefficients); (c) lowest-dose-as-control per Remark 3.1 when `P(D=0)=0`. REGISTRY `## ContinuousDiD` → Implementation Checklist marks these `[ ]`. | `continuous_did.py` | CGBS-2024 | Heavy | Low | +| `EfficientDiD` survey-weighted Silverman bandwidth in conditional Omega* — `_silverman_bandwidth()` uses unweighted mean/std; survey-weighted statistics better reflect the population distribution (second-order refinement). | `efficient_did_covariates.py` | — | Quick | Low | +| Survey sandwich SE is not exactly invariant to zero-weight (subpopulation / padded) rows: `_compute_stratified_psu_meat`'s finite-sample correction counts zero-weight units as PSUs, so padding shifts the SE ~2e-4 relative. Point estimate is exactly invariant. Fix: count only positive-weight PSUs in the correction (cross-cutting across all survey-enabled estimators). | `survey.py` (`_compute_stratified_psu_meat`) | PR-B | Mid | Low | +| `ImputationDiD` LOO conservative-variance refinement (BJS 2024 Supp. Appendix A.9) — a finite-sample improvement to the auxiliary-model residuals reducing overfit of `tau_tilde_g` to `epsilon`. Asymptotic Theorem-3 variance is implemented and matches R `didimputation` (which also omits LOO by default). | `imputation.py` | imputation-validation | Mid | Low | +| Multi-absorb weighted demeaning needs iterative alternating projections for `N > 1` absorbed FE with survey weights; unweighted multi-absorb also uses single-pass (exact only for balanced panels). | `estimators.py` | #218 | Heavy | Medium | +| `TwoWayFixedEffects(vcov_type in {hc2, hc2_bm})` with replicate-weight designs raises `NotImplementedError` (`twfe.py:~233`). 
The replicate path re-demeans per replicate, which doesn't compose with the full-dummy HC2/HC2-BM build — a correct impl needs per-replicate full-dummy refit. Workaround: `hc1` for replicate-weight CR1. | `twfe.py::fit` | follow-up | Heavy | Low | +| TWFE's HC2/HC2-BM inline full-dummy build (`twfe.py:280-315`) duplicates the dummy-construction logic in `DifferenceInDifferences(fixed_effects=...)` (`estimators.py:478-486`). Extract a shared helper, or delegate TWFE's HC2/HC2-BM path to DiD's `fixed_effects=` branch (with TWFE-specific cluster-default threading), to reduce drift risk on FE naming / survey behavior / result-surface conventions. Substantive refactor — touches both estimators. | `twfe.py::fit`, `estimators.py::DifferenceInDifferences.fit` | follow-up | Heavy | Low | +| Decide whether to formally deprecate `CallawaySantAnna.cluster=X` in favor of `survey_design=SurveyDesign(psu=X)` (the bare-cluster path already synthesizes a minimal SurveyDesign). Two equivalent paths = redundant surface. Mirrors the question for ImputationDiD / EfficientDiD / TwoStageDiD. | `staggered.py` | follow-up | Mid | Low | +| `HeterogeneousAdoptionDiD` continuous paths: thread `cluster=` through `bias_corrected_local_linear` (the Phase-1c wrapper already supports cluster; Phase 2a ignores it with a `UserWarning`). | `had.py`, `local_linear.py` | Phase 2a | Mid | Low | +| `SpilloverDiDResults` not registered in `DiagnosticReport`'s `_APPLICABILITY` / `_PT_METHOD` tables, so `DiagnosticReport(spillover_result)` doesn't route to event-study diagnostics. Decide which diagnostics apply (PT, pre-trends power, heterogeneity, design-effect) and add an end-to-end test. | `diagnostic_report.py` | Wave C | Mid | Low | + +### Performance + +Consolidates the former `#### Performance` tech-debt table and the standalone +`## Performance Optimizations` section. 
(Speculative / low-value perf notes — numba JIT, +generic sparse-FE, QR+SVD rank-detection redundancy, `check_finite` bypass — moved to +[Deferred → Parked](#parked--pending-user-demand--out-of-scope).) + +| Issue | Location | Origin | Effort | Priority | +|-------|----------|--------|--------|----------| +| `ImputationDiD` conservative-variance projection (`_compute_v_untreated_with_covariates`) rebuilds A0/A1 and refactorizes `A0'WA0` for EVERY estimand target (overall, each ES horizon, each group, bootstrap precompute). `A0'WA0` is target-invariant; cache the design + a single factorization per `fit()` and solve only the target-specific RHS `A1'w`. | `imputation.py` | #141 | Mid | Low | +| `ImputationDiD` dense `(A0'A0).toarray()` scales `O((U+T+K)^2)` — OOM risk on large panels (only triggers when the sparse solver fails). Needs an alternative dense fallback or richer sparse strategy. | `imputation.py` | #141 | Heavy | Medium | +| `LinearRegression.fit()` pays the CR2 cost twice on the weighted `hc2_bm` path: once in `solve_ols(..., return_vcov=True)` and again via `compute_robust_vcov(..., return_dof=True)` for `_bm_dof`. Fix: thread `return_dof` through `solve_ols`, or cache the per-cluster `A_g` / `MUWTWUM` precomputes. (CI codex P3 on #475.) | `linalg.py` | #475 | Mid | Low | +| MPD `cluster+hc2_bm` computes CR2 precomputes twice — `solve_ols → _compute_cr2_bm` for vcov+DOF, then `_compute_cr2_bm_contrast_dof` for the post-period-average contrast DOF. Both rebuild `H`, `M`, per-cluster `A_g`. Plumb the contrast DOF through the vcov path or share via a cached helper. | `linalg.py`, `estimators.py::MultiPeriodDiD.fit` | follow-up | Mid | Low | +| CR2 Bell-McCaffrey DOF uses a naive `O(n²k)` per-coefficient loop over cluster pairs; Pustejovsky-Tipton (2018) Appendix B has a scores-based formulation avoiding the full `n×n` `M`. Switch when a user hits a large-`n` cluster-robust design. 
| `linalg.py::_compute_cr2_bm` | Phase 1a | Heavy | Low | +| Rust-backend HC2: the Rust path only supports HC1; HC2 and CR2 Bell-McCaffrey fall through to NumPy. Noticeable for large-`n` fits. | `rust/src/linalg.rs` | Phase 1a | Mid | Low | +| `SyntheticControl` retains a full `_SyntheticControlFitSnapshot` (pivoted panels) on EVERY fit for the opt-in `in_space_placebo()`, so callers who never run the placebo pay `O(units × periods × predictor-vars)` memory. Store a compact array/index representation, or build the snapshot lazily on first placebo call. | `synthetic_control.py`, `synthetic_control_results.py` | follow-up | Mid | Low | +| Wild cluster bootstrap CI inversion calls `_t_star(r)` ~O(100) times, each materializing a fresh `(B×n)` `y_star` + `(k×B)` refit + `(n×B)` residual arrays. Acceptable for the few-cluster regime; for large-`n`/large-`B`, chunk `_t_star` over draws or precompute the `r`-independent cluster-level pieces (restricted residuals are linear in `r`). | `utils.py::wild_bootstrap_se._t_star` | #543 | Mid | Low | +| `SpilloverDiD` sparse cKDTree path for the staggered nearest-treated-distance helper (mirrors the static helper's sparse branch). `_compute_nearest_treated_distance_staggered` always builds dense `(n_units, n_treated_by_onset)` matrices per cohort; add a sparse branch gated on `n > _CONLEY_SPARSE_N_THRESHOLD`. | `spillover.py` | Wave B | Mid | Low | +| `HeterogeneousAdoptionDiD` Phase 3 Stute: Appendix-D vectorized form replaces the per-iteration OLS refit with a single precomputed `M = I - X(X'X)^{-1}X'` applied to `eps*eta` (~2× faster, functionally identical). Shipped the literal-refit form to match paper text. | `had_pretests.py::stute_test` | Phase 3 | Mid | Low | +| Rust faer SVD ndarray-to-faer conversion overhead (minimal vs SVD cost). 
| `rust/src/linalg.rs:67` | #115 | Quick | Low | + +### Testing / docs + +| Issue | Location | Origin | Effort | Priority | +|-------|----------|--------|--------|----------| +| Render `docs/methodology/REPORTING.md` and `REGISTRY.md` as in-site Sphinx pages so cross-refs can use `:doc:` instead of off-site `blob/main` URLs (stable-docs readers can otherwise land on a different revision than their package version). Two paths: (a) add `myst-parser` to `conf.py` + docs extras and link with `:doc:`, or (b) convert both to `.rst`. **Note:** REGISTRY.md is ~4.5k lines of LaTeX-heavy markdown — high risk under the `-W` (warnings-as-errors) Sphinx build; budget multiple rounds. | `docs/conf.py`, `docs/api/business_report.rst`, `docs/api/diagnostic_report.rst`, tutorials 18 & 19 | follow-up | Mid | Low | +| `ImputationDiD` covariate-path variance lacks a dedicated parity anchor — only the no-covariate staggered panel is R-parity'd, though the covariate path shares the same validated projection code. Add a small dense-design **hand-calc** for the covariate projection (no external tooling), or a covariate (time-varying X) R `didimputation` golden asserting overall/ES SE parity (the golden variant needs local R). | `tests/test_methodology_imputation.py`, `benchmarks/R/generate_didimputation_golden.R` | imputation-validation | Mid | Low | +| Add true half-sample BRR replicate-weight regressions per estimator family (current tests use Fay-like 0.5/1.5 perturbations; `test_survey_phase6.py` covers true BRR at the helper level). | `tests/test_replicate_weight_expansion.py` | #253 | Mid | Low | +| Port the CI `` extraction into the reviewer-eval harness so `docs/tutorials/*.ipynb` cases (currently guarded out of `verify-corpus`/`run`) can be reviewed with CI-equivalent context. | `tools/reviewer-eval/adapters/ci_prompt.py` | local-review | Mid | Low | + --- -## Known Limitations +## Deferred / Documented + +Not currently actionable. 
Retained for provenance + AI-review deviation-documentation. + +### Paper-gated / needs methodology derivation -Current limitations that may affect users: +| Issue | Location | PR | Priority | +|-------|----------|----|----------| +| CBWSDID covariate balancing (`StackedDiD(balance="entropy")`) v1 supports only balanced event windows + `weighting="aggregate"`; unbalanced/ragged panels fail closed (unit-count vs observation-count corrector convention unresolved off balanced panels). Matching-based balancing and the repeated `0→1`/`1→0` episode extension are also deferred. Documented in REGISTRY StackedDiD "Covariate balancing (CBWSDID)" Notes. | `stacked_did.py`, `balancing.py`, REGISTRY | follow-up | Low | +| dCDH: Phase-1 per-period placebo `DID_M^pl` has NaN SE (no IF derivation for the per-period aggregation path). Multi-horizon placebos (`L_max ≥ 1`) have valid SE. | `chaisemartin_dhaultfoeuille.py` | #294 | Low | +| dCDH: survey cell-period allocator's post-period attribution is a library convention, not derived from the observation-level survey linearization. MC coverage is empirically close to nominal; a formal derivation (or covariance-aware two-cell alternative) is deferred. Documented in REGISTRY survey IF expansion Note. | `chaisemartin_dhaultfoeuille.py`, REGISTRY | #408 | Medium | +| dCDH by_path: survey-aware backward-horizon (`placebo + predict_het + survey_design`) raises `NotImplementedError` (and `_compute_heterogeneity_test` warn-and-skips to forward-horizon-only heterogeneity) — the Binder TSL cell-period allocator's REGISTRY justification is tied to post-period attribution; backward horizons would put ψ_g mass on a pre-period cell. Needs the pre-period cell allocator derived. | `chaisemartin_dhaultfoeuille.py`, REGISTRY | follow-up | Medium | +| **HonestDiD Δ^RM ARP confidence sets** (consolidates the former "Honest DiD Improvements" checklist): uses a naive FLCI instead of the paper's ARP conditional/hybrid sets (Sections 3.2.1-3.2.2). 
ARP infrastructure exists but the moment-inequality transformation needs calibration; CIs are conservative (valid coverage). Sub-items folded here: improved C-LF via direct optimization instead of grid search (`honest_did.py:947`); hybrid inference methods; event-study-specific bounds per post-period; simulation-based power analysis for honest bounds. (`CallawaySantAnnaResults` support has **landed**.) | `honest_did.py` | #248 | Medium | +| **Conley `vcov_type` for IF / GMM estimators** (consolidates 8 near-identical rows). No reference implementation exists for any of these spatial-HAC × influence-function/GMM compositions; each was rejected at `__init__` with a deferral pointer here. SunAbraham + WooldridgeDiD-OLS conley have **shipped** (within-transform via `solve_ols`). Per estimator:
• `CallawaySantAnna` — Conley kernel × per-(g,t) IF aggregation (`staggered.py`).
• `TripleDifference` — Conley kernel × the 3-pairwise-DiD IF decomposition `w3·IF_3 + w2·IF_2 - w1·IF_1` (`triple_diff.py`).
• `ImputationDiD` — Conley kernel × Theorem-3 per-unit IF `sigma_sq = (cluster_psi_sums**2).sum()` (`imputation.py`).
• `EfficientDiD` — Conley kernel × per-unit EIF `_compute_se_from_eif` (`efficient_did.py`).
• `TwoStageDiD` — thread into the GMM sandwich meat `_compute_gmm_variance`; the SpilloverDiD `_compute_gmm_corrected_meat` machinery could be adapted to score `S_g = gamma_hat' c_g - X'_{2g} eps_{2g}` but two-stage-GMM × Conley has no reference (`two_stage.py`).
• `StackedDiD` — **methodology-blocked, not plumbing**: the stacked design replicates each control unit across sub-experiments, so Conley's distance matrix sees same-unit copies at distance 0 (`K(0)=1`); needs a per-stack spatial identifier (`stacked_did.py`).
• `SyntheticDiD` — uses `variance_method ∈ {bootstrap, jackknife, placebo}`, no analytical sandwich for Conley to plug into; needs an analytical-sandwich path or a spatial-block bootstrap (Politis-Romano 1994) (`synthetic_did.py`).
• Conley + survey weights / `survey_design` — score-reweighting is mechanical but the PSU×spatial-kernel interaction and replicate-weight spatial variance are non-trivial (Bertanha-Imbens 2014 covers cluster-sample, not Conley); raises `NotImplementedError` at the linalg validator (`linalg.py::_validate_vcov_args`). | (per sub-item) | follow-up · Phase 1b · Phase 5 | Low-Med | +| `HeterogeneousAdoptionDiD` Phase 4.5 C still-open: (a) **replicate-weight designs** (BRR/Fay/JK1/JKn/SDR) — per-replicate weight-ratio rescaling for the OLS-on-residuals refit isn't covered by the multiplier-bootstrap composition; each linearity-family helper raises `NotImplementedError` on replicate weights. (b) **`lonely_psu='adjust'` + singleton-strata** on the Stute family — the pseudo-stratum centering transform isn't derived for the Stute CvM functional. | `had_pretests.py` | Phase 4.5 C | Low | +| `HeterogeneousAdoptionDiD` mass-point `vcov_type in {hc2, hc2_bm}` raises `NotImplementedError` — OLS leverage `x_i'(X'X)^{-1}x_i` is wrong for 2SLS; needs the `x_i'(Z'X)^{-1}(...)(X'Z)^{-1}x_i` correction plus an R/Stata (`ivreg2 small robust`) parity anchor. | `had.py::_fit_mass_point_2sls` | Phase 2a | Medium | +| `HeterogeneousAdoptionDiD` `trends_lin × survey_design`: per-group linear-trend slope under survey weighting is not derived from the paper. Raises `NotImplementedError` across all 3 `trends_lin` surfaces. | `had.py`, `had_pretests.py` | #389 | Low | +| `SpilloverDiD(survey_design=...)` replicate-weight variance (BRR/Fay/JK1/JKn/SDR): Wave E.1 ships Taylor-linearization only. Per Gerber (2026) Appendix A the IF-reweighting shortcut does NOT apply to TwoStageDiD-class estimators (`gamma_hat` is weight-sensitive); correct support needs per-replicate full re-fit of both stages. 
| `spillover.py`, `survey.py::compute_replicate_refit_variance` | follow-up | Low | +| `SpilloverDiD(vcov_type="conley", conley_lag_cutoff>0, survey_design=...)` no-effective-PSU serial Bartlett HAC: weights-only / strata-only designs without a cluster fallback raise `NotImplementedError` (each pseudo-PSU appears in one period, so the serial cross-period loop contributes zero). Needs a unit-level serial fallback derivation or routing through `conley_unit` with documented IF-allocator asymmetry. | `spillover.py`, `two_stage.py::_compute_stratified_serial_bartlett_meat` | Wave E.2 tail | Low | +| `SpilloverDiD` data-driven `d_bar` selection (Butts 2021b / 2023 JUE Insight cross-validation). | `spillover.py` | follow-up | Low | + +### Needs external reference (R / Stata / Julia) + +Blocked on tooling absent from CI (no workflow installs R/Stata/Julia). A clear path +exists but parity can't be verified without a local toolchain. + +| Issue | Location | PR | Priority | +|-------|----------|----|----------| +| `StaggeredTripleDifference` R cross-validation: CSV fixtures not committed (gitignored); tests skip without local R + `triplediff`. Commit fixtures or generate deterministically. | `tests/test_methodology_staggered_triple_diff.py` | #245 | Medium | +| `StaggeredTripleDifference` R parity: benchmark only tests the no-covariate path (`xformla=~1`). Add covariate-adjusted scenarios + aggregation-SE parity assertions. | `benchmarks/R/benchmark_staggered_triplediff.R` | #245 | Medium | +| `StaggeredTripleDifference` per-cohort group-effect SEs include WIF (conservative vs R's `wif=NULL`); documented in REGISTRY. Could override the mixin for an exact R match (verification needs R `triplediff`). | `staggered_triple_diff.py` | #245 | Low | +| **WooldridgeDiD follow-up cluster** (PR-B Stage D/E fail-closed surfaces; re-enable after R/Stata validation):
• QMLE sandwich uses `aweight` cluster adjustment `(G/(G-1))·(n-1)/(n-k)` vs Stata's `G/(G-1)` (conservative); add a `qmle` weight type if Stata goldens confirm a material difference (`wooldridge.py`, `linalg.py`).
• response-scale APE / log-link coefficient bridge for R `etwfe(family=poisson|logit)` cell-level parity — needs `emfx()` APE extraction or link-inversion with baseline-mean adjustment (`generate_wooldridge_golden.R`, `test_methodology_wooldridge.py`).
• `aggregate(weights="cohort_share")` on survey-weighted fits: `_n_g_per_cohort` uses raw `unit.nunique()`; implement design-weighted unit totals per cohort (paper W2025 §7) and lift the `ValueError` gate (`wooldridge.py`, `wooldridge_results.py`).
• unconditional inference for `cohort_share` accounting for ω̂_g sampling uncertainty (W2025 §7.5); currently NaN-closed (`wooldridge_results.py`).
• `cohort_trends=True × survey_design` and `× control_group="never_treated"` raise `NotImplementedError` (unvalidated TSL variance / trend columns spanned by placebo cell-dummies) (`wooldridge.py`).
• Stata `jwdid` golden-value `TestReferenceValues` (`tests/test_wooldridge.py`). | `wooldridge.py`, `wooldridge_results.py`, `linalg.py`, benchmarks | #216 · PR-B | Med-Low | +| Extend `WooldridgeDiD` `method ∈ {logit, poisson}` with `vcov_type ∈ {classical, hc2, hc2_bm}`: composing HC2 leverage + Bell-McCaffrey DOF with the QMLE pseudo-residual sandwich needs derivation + R parity vs `clubSandwich::vcovCR(glm, type="CR2")`. Rejected at `__init__`. | `wooldridge.py` | follow-up | Medium | +| `PreTrendsPower` CS/SA `anticipation=1` R-parity fixture: R `pretrends` has no anticipation parameter, so the Python `_extract_pre_period_params` anticipation filter isn't R-parity-locked. Build a synthetic CS/SA result with `anticipation=1` and assert γ_p matches R's `slope_for_power()`. (Mechanism already covered by MC + full-VCV tests.) | `tests/test_methodology_pretrends.py`, `generate_pretrends_golden.R` | PR-C | Low | +| Harmonize SunAbraham's HC1 within-transform finite-sample correction with `fixest::sunab()` — SA applies `n/(n-k_dm)`, fixest applies `n/(n-k_total)` (counts absorbed FE); ~1-2% SE difference, documented as a "Deviation from R" and pinned at `atol=5e-3`. Either thread `df_adjustment` or keep as an intentional, R-verified difference. | `sun_abraham.py`, `linalg.py` | follow-up | Low | +| Rust multiplier-bootstrap weight RNG (`generate_bootstrap_weights_batch`) seeds `Xoshiro256PlusPlus::seed_from_u64(seed+i)` per row; audit Python callers (`sdid.py`, `efficient_did_bootstrap.py`, `bootstrap_utils.py`) for parity-test gaps and, where a numpy-canonical equivalent exists, pre-generate in Python and pass through PyO3 (same fix shape as TROP RNG parity #354). | `rust/src/bootstrap.rs`, `bootstrap_utils.py` | follow-up | Medium | +| `SyntheticDiD` bootstrap cross-language parity anchor vs R `synthdid::vcov(method="bootstrap")` or Julia `Synthdid.jl` (refit-native). Same-library validation is in place; Julia is the cleanest target. 
Tolerance ~1e-6 (BLAS+RNG paths preclude 1e-10). | `benchmarks/R/`, `benchmarks/julia/`, `tests/` | follow-up | Low | +| CS R helpers hard-code `xformla = ~1`; no covariate-adjusted R benchmark for the IRLS path. | `tests/test_methodology_callaway.py` | #202 | Low | +| `CallawaySantAnna` bootstrap: align p-value computation with R `did`'s symmetric-percentile method (former "CallawaySantAnna Bootstrap Improvements" section). | `staggered.py` | — | Low | +| **`bias_corrected_local_linear` (lprobust) Phase-1c follow-ups:** extend golden parity to `kernel ∈ {triangular, uniform}` (epa-only today); expose `vce ∈ {hc0,hc1,hc2,hc3}` on the public wrapper once R goldens exist (port supports all four; needs a per-mode generator + a hc2/hc3 q-fit-leverage decision); clustered-DGP auto-bandwidth parity is **blocked upstream** on an nprobust singleton-cluster bug in `lpbwselect.mse.dpi` (Phase-1c DGP 4 uses manual `h=b=0.3`). | `_nprobust_port.py`, `local_linear.py`, `generate_nprobust_lprobust_golden.R` | Phase 1c | Low-Med | +| `HeterogeneousAdoptionDiD` Stute-family Stata-bridge parity: no public R `Stutetest` package exists; would add `benchmarks/stata/generate_stute_golden.do` + a Stata dependency. | `benchmarks/stata/`, `tests/test_stute_test_parity.py` | follow-up | Low | +| `HeterogeneousAdoptionDiD` Phase-3 R-parity: ships coverage-rate validation on synthetic DGPs, not tight point parity vs `chaisemartin::stute_test` / `yatchew_test` (needs bootstrap-seed-semantics + `B` alignment across numpy/R). | `tests/test_had_pretests.py` | Phase 3 | Low | + +### Parked — pending user demand / out of scope + +Doable in principle, but no current caller and/or explicitly out of paper scope. + +| Issue | Location | PR | Priority | +|-------|----------|----|----------| +| dCDH parity-test SE/CI assertions only cover pure-direction scenarios; mixed-direction SE comparison is structurally apples-to-oranges (cell-count vs obs-count weighting). 
| `test_chaisemartin_dhaultfoeuille_parity.py` | #294 | Low | +| `HeterogeneousAdoptionDiD` survey-design API consolidation (**scheduled: next minor bump**): drop the deprecated `survey=` / `weights=` kwargs on all 8 HAD surfaces; only `survey_design=` remains. Also fold the legacy back-end `weights=` routing into the unified `_resolve_survey_for_fit` path. DeprecationWarning has shipped; the removal is ~50 LoC gated on the semver bump. | `had.py`, `had_pretests.py` | next minor | Medium | +| `HeterogeneousAdoptionDiD` joint cross-horizon covariance / sup-t bands: per-horizon SEs use independent sandwiches (paper-faithful pointwise CIs per Pierce-Schott Fig 2). Follow-ups (low demand): IF-based stacking for joint cross-horizon inference; analytical H×H covariance on the weighted ES path; a sup-t band on the unweighted ES path. | `had.py::_fit_event_study` | Phase 2b / 4.5 B | Low | +| `HeterogeneousAdoptionDiD` event-study staggered timing beyond the last cohort: Phase 2b auto-filters to the last cohort (paper App B.2); earlier-cohort effects aren't HAD-identified (redirect to dCDH). Full staggered HAD needs a different identification path (out of paper scope). | `had.py::_validate_had_panel_event_study` | Phase 2b | Low | +| `HeterogeneousAdoptionDiD` survey-aware support-endpoint test (**research, waits on literature**): needs a calibrated support-infimum test under complex sampling (endpoint EVT × survey-aware functional CLT × tail-empirical-process theory). Permanent `NotImplementedError` on `qug_test(survey=...)`; rationale in REGISTRY § "QUG Null Test" Note (Phase 4.5 C0). | `had_pretests.py::qug_test` | Phase 4.5 C0 | Low | +| `HeterogeneousAdoptionDiD` Phase-4.5 weight-aware auto-bandwidth MSE-DPI selector (~300 LoC); users pass `h`/`b` explicitly today. Plus replicate-weight SurveyDesigns on the continuous-dose paths (Rao-Wu-style per-replicate weight-ratio rescaling for the local-linear intercept IF). 
| `_nprobust_port.py::lpbwselect_mse_dpi`, `had.py::_aggregate_unit_resolved_survey` | Phase 4.5 | Low | +| `HeterogeneousAdoptionDiD` Phase-4 Pierce-Schott (2016) replication harness — **waived (2026-05-20)**: R parity at `atol=1e-8` on the same 3 DGPs is a strictly stronger anchor than reproducing Fig 2's pointwise CIs on the LBD-restricted PNTR panel (paper §5.2 self-acknowledges NP estimators too noisy there). Re-open only on user demand. See REGISTRY HAD Deviations Notes #3/#4. | `benchmarks/`, `tests/` | Phase 2a | Low | +| `HeterogeneousAdoptionDiD` time-varying dose on event study: Phase 2b rejects panels where `D_{g,t}` varies within a unit for `t≥F` (constant-dose convention, App B.2). A time-varying-dose estimator is a future PR; current behavior is front-door rejection. | `had.py::_validate_had_panel_event_study` | Phase 2b | Low | +| `HeterogeneousAdoptionDiD` repeated-cross-section support: paper §2 allows panel OR RCS, but Phase 2a is panel-only (RCS inputs rejected by the balanced-panel validator). Needs an RCS identification path (pre/post cell means) with its own validator + `data_mode` surface. | `had.py::_validate_had_panel` | Phase 2a | Medium | +| `HeterogeneousAdoptionDiD` Phase-3 nprobust bandwidth for Stute variants on continuous regressors (currently OLS residuals from a 2-parameter linear fit, no bandwidth selection). Not in paper scope. | `had_pretests.py::stute_test` | Phase 3 | Low | +| `SpilloverDiD(ring_method="count")`: count-of-treated-in-ring (paper §3.2) is methodologically supported by Butts but re-introduces functional-form dependence; expose behind an explicit kwarg gate + warning. | `spillover.py` | follow-up | Low | +| `TwoStageDiD` paper-permitted estimand variants (Gardner 2022): the Eq.(5) P̄-period-average estimand and the fn.8 full-sample first-stage variant have no public parameter. Documented ⚠️ in `gardner-2022-review.md`; surface as `estimand=` / `first_stage=` if a use case arises. 
| `two_stage.py` | follow-up | Low | +| `bias_corrected_local_linear` multi-eval grid (`neval > 1`) with cross-covariance (`covgrid=TRUE`). Not needed for HAD; useful for multi-dose diagnostics. | `_nprobust_port.py::lprobust` | Phase 1c | Low | +| Rust local-method `estimate_model` → unify to `solve_wls_svd` (the global-method's SVD helper) for sub-1e-14 bootstrap-SE parity. The local-method bootstrap parity test passes at `atol=1e-5`; the residual ~1e-7 is roundoff, not a user-visible correctness bug. | `rust/src/trop.rs`, `rust/src/linalg.rs` | follow-up | Low | +| Validate the `.txt` AI guides (`llms-full.txt`, `llms-practitioner.txt`) as executable snippets — **not low-lift** (re-scoped 2026-06-01): only ~20% of ~112 fenced blocks are standalone-runnable; the rest are signature pseudo-code, context fragments, or data-shape-specific. Needs signature-block detection + a context/data skip-allowlist + per-snippet fixtures. | `tests/test_doc_snippets.py` | #239 | Low | +| `TestWorkflowDoesNotExecutePRHeadCode` (CodeQL #14 guard) doesn't model `bash/sh/./source