We treat CellRank output as a probabilistic fate surrogate and quantify the differential effect (DE) of CRISPRi-based gene knockdown on fate bias from a time-course Perturb-seq experiment applied to an iPSC differentiation system.
Install the repo in editable mode (from the repo root) so `src` imports resolve:

```bash
pip install -e .
```

1. Observed data (after filtering)
- Cells: $i=1,\dots,N$ after filtering to $k_i \le K_{\max}$, where $k_i$ is the number of detected guides in cell $i$ ($K_{\max}=20$ in current runs).
- Days: $d_i \in \{0,\dots,D-1\}$ and replicates $r_i \in \{0,\dots,R-1\}$ ($D=4$ for d0–d3; $R=2$ in current data).
- CellRank fate probabilities: $p_i=(p_{i,\mathrm{EC}},p_{i,\mathrm{MES}},p_{i,\mathrm{NEU}})$ with $\sum_f p_{i,f}=1$.
- Guides: $g=1,\dots,G$ (non-NTC), genes $\ell=1,\dots,L$, and map $\ell(g)$ (current data: $G=2000$, $L=300$).
- Embedding-sum representation per cell: define $g_{i,m}$ and $m_{i,m}$ for $m=1,\dots,K_{\max}$, where $g_{i,m}$ is `guide_ids[i,m]` and $m_{i,m}$ is `mask[i,m]`, with $m_{i,m}\in\{0,1\}$ indicating real vs. padding.
- Hard-zero convention: NTC guides map to $g=0$ (real entry, $m_{i,m}=1$); padding uses $g=0$ with $m_{i,m}=0$. Define `gene_of_guide[0] = 0` and `gene_of_guide[g]` $= \ell(g)$ for $g\ge 1$.
- Guide burden: $k_i = \sum_{m=1}^{K_{\max}} m_{i,m}$, the number of real guide entries in cell $i$.
- Centered guide burden: $\tilde{k}_i = k_i - \bar{k}$, where $\bar{k}$ is the mean burden across cells.
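As a concrete illustration of the padded `guide_ids`/`mask` arrays and the burden definitions above, here is a minimal NumPy sketch (toy values, not the real pipeline's data loader):

```python
import numpy as np

# Hypothetical toy example of the padded per-cell guide representation.
# Cell 0 carries guides 5 and 12; cell 1 carries a single NTC (index 0).
K_MAX = 4

guide_ids = np.array([
    [5, 12, 0, 0],   # real, real, padding, padding
    [0,  0, 0, 0],   # real NTC entry, then padding
])
mask = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
])

# Guide burden k_i = number of real entries (NTC entries count as real).
k = mask.sum(axis=1)        # -> [2, 1]
k_centered = k - k.mean()   # centered burden used by the burden slope
```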
2. Latent parameters and indexing
- Reference fate: EC. Non-reference fates: $\mathcal{F}^\star = \{\mathrm{MES}, \mathrm{NEU}\}$, indexed by $f^\star$.
- Gene-by-day effects $\theta_{\ell,f^\star,d}$ and guide deviations $\delta_{g,f^\star}$ (time-invariant, shared across days).
- Mean-zero constraint within each gene: for each gene $\ell$ and fate $f^\star$, $\frac{1}{|G_\ell|}\sum_{g\in G_\ell}\delta_{g,f^\star}=0$ (enforced by centering).
- Nuisance: day intercepts $\alpha_{f^\star,d}$, replicate effects $b_{f^\star,r}$, burden slope $\gamma_{f^\star}$.
- Hard-zero baseline rows: $\theta_{0,f^\star,d}=0$ and $\delta_{0,f^\star}=0$.
3. Guide-level effect decomposition
$$\beta_{g,f^\star,d} \;=\; \theta_{\ell(g),f^\star,d} + \delta_{g,f^\star},$$

with $\delta$ constrained to mean zero within each gene. Because $\theta_{0,f^\star,d}=0$ and $\delta_{0,f^\star}=0$, NTC and padding entries contribute exactly zero effect.
4. Cell-level linear predictor (MOI-aware)
Let $\tilde{k}_i$ denote the centered guide burden. For each non-reference fate $f^\star$,

$$\eta_{i,f^\star} \;=\; \alpha_{f^\star,d_i} + b_{f^\star,r_i} + \gamma_{f^\star}\,\tilde{k}_i + \sum_{m=1}^{K_{\max}} m_{i,m}\,\beta_{g_{i,m},f^\star,d_i},$$

with $\eta_{i,\mathrm{EC}}=0$ for the reference fate. This embedding-sum is the MOI correction: each guide's effect is estimated conditional on the other guides co-occurring in the same cell.
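The embedding-sum lookup can be sketched in NumPy as follows (toy sizes; the actual model uses Pyro/torch tensors, and names like `per_slot` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K_MAX, G, F_STAR, D = 3, 4, 10, 2, 4   # toy sizes (hypothetical)

beta = rng.normal(size=(G, F_STAR, D))
beta[0] = 0.0   # hard-zero row: NTC and padding entries have no effect

guide_ids = rng.integers(0, G, size=(N, K_MAX))
mask = (rng.random((N, K_MAX)) < 0.5).astype(float)
day = rng.integers(0, D, size=N)

# Embedding-sum: look up beta for each slot at the cell's day,
# zero out padding via the mask, then sum over slots.
per_slot = beta[guide_ids, :, day[:, None]]            # (N, K_MAX, F_STAR)
eta_guides = (mask[..., None] * per_slot).sum(axis=1)  # (N, F_STAR)
```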
5. Softmax mapping

$$\hat{p}_{i,f} \;=\; \frac{\exp(\eta_{i,f})}{\sum_{f'} \exp(\eta_{i,f'})}, \qquad f \in \{\mathrm{EC}, \mathrm{MES}, \mathrm{NEU}\}.$$
6. Likelihood (soft-label cross-entropy)

$$\log \mathcal{L} \;=\; \sum_{i=1}^{N} \sum_{f} p_{i,f} \,\log \hat{p}_{i,f},$$

i.e., the CellRank probabilities act as soft labels in a multinomial cross-entropy.
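A minimal NumPy sketch of the softmax mapping and the soft-label cross-entropy (all values are hypothetical):

```python
import numpy as np

def softmax(eta):
    z = eta - eta.max(axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# eta has a hard-zero column for the reference fate (EC).
eta = np.array([[0.0,  0.8, -0.3],
                [0.0, -0.1,  0.4]])
p_hat = softmax(eta)

# Soft-label cross-entropy against CellRank fate probabilities p.
p = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.2, 0.6]])
loglik = (p * np.log(p_hat)).sum()   # model log-likelihood (maximized in fitting)
```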
7. Priors (fixed-scale, weakly regularizing)

- Day/rep/burden effects: $\alpha_{f^\star,d} \sim \mathcal{N}(0,\sigma_\alpha^2)$, $b_{f^\star,r} \sim \mathcal{N}(0,\sigma_b^2)$, $\gamma_{f^\star} \sim \mathcal{N}(0,\sigma_\gamma^2)$ with fixed scales.
- Gene effects (random walk with per-interval scales): $\theta_{\ell,f^\star,0} \sim \mathcal{N}(0,\tau^2)$, and for intervals $d \to d+1$, $\theta_{\ell,f^\star,d+1} \sim \mathcal{N}(\theta_{\ell,f^\star,d},\, \sigma_{\mathrm{time},d}^2)$ with a learned scale $\sigma_{\mathrm{time},d}$ per interval.
- Guide deviations (mean-zero within gene): $\delta_{g,f^\star} \sim \mathcal{N}(0,\sigma_\delta^2)$, re-centered within each gene after sampling.
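Assuming the random-walk prior reads as described, forward-sampling gene effects looks like this (the scales are hypothetical placeholder values, not the fitted ones):

```python
import numpy as np

rng = np.random.default_rng(1)
L, F_STAR, D = 5, 2, 4
tau = 1.0                                # scale of the day-0 effect (hypothetical)
sigma_time = np.array([0.3, 0.3, 0.5])   # one scale per interval (length D-1)

theta = np.zeros((L, F_STAR, D))
theta[..., 0] = tau * rng.normal(size=(L, F_STAR))
for d in range(D - 1):
    # Random walk: each day's effect is centered on the previous day's.
    theta[..., d + 1] = theta[..., d] + sigma_time[d] * rng.normal(size=(L, F_STAR))
```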
8. Non-centered parameterization
We fit non-centered latents: each Gaussian latent $x \sim \mathcal{N}(\mu, \sigma^2)$ is sampled as $x = \mu + \sigma z$ with $z \sim \mathcal{N}(0, 1)$, which decouples the latent from its location/scale and improves SVI conditioning.
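A quick numerical check of the non-centered reparameterization (the `mu`/`sigma` values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

# Non-centered reparameterization: sample a standard-normal latent z,
# then shift/scale deterministically. The fitted latent is z, not x.
mu, sigma = 1.5, 0.4          # hypothetical prior location/scale
z = rng.normal(size=100_000)  # z ~ N(0, 1)
x = mu + sigma * z            # implies x ~ N(mu, sigma^2)
```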
9. SVI training objective with minibatching
For a minibatch $\mathcal{B}$ of size $B$, the per-cell log-likelihood terms are scaled by $N/B$ so that the minibatch ELBO is an unbiased estimate of the full-data ELBO.

Implementation uses `pyro.plate("cells", N, subsample_size=B)`, which applies the $N/B$ scaling automatically; optimization uses `ClippedAdam`, and `likelihood_weight` optionally reweights the likelihood term relative to the priors.
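The $N/B$ scaling can be sanity-checked outside Pyro; this NumPy sketch (hypothetical per-cell values) shows that the scaled minibatch sum is an unbiased estimate of the full-data log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(3)
N, B = 1000, 50
cell_loglik = rng.normal(-1.0, 0.3, size=N)   # hypothetical per-cell log-likelihoods
full = cell_loglik.sum()

# Scaling each minibatch sum by N/B makes it an unbiased estimate of `full`,
# which is what pyro.plate(..., subsample_size=B) does inside the ELBO.
estimates = []
for _ in range(2000):
    idx = rng.choice(N, size=B, replace=False)
    estimates.append((N / B) * cell_loglik[idx].sum())
```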
10. Primary contrast (MES-EC)

Because EC is the reference fate with $\eta_{i,\mathrm{EC}}=0$, the MES-EC contrast at day $d$ is read off directly: $\theta_{\ell,\mathrm{MES},d}$ at the gene level and $\beta_{g,\mathrm{MES},d}$ at the guide level.
11. Posterior summaries + exports (theta/beta)
Draw $S$ posterior samples $\{\theta^{(s)}, \delta^{(s)}\}_{s=1}^{S}$ from the fitted guide.

- Gene/day summaries (theta only): posterior mean $\widehat{\beta}_{\ell,d} = \frac{1}{S}\sum_s \theta^{(s)}_{\ell,\mathrm{MES},d}$ and posterior SD across draws as the standard error.
- Guide/day summaries (beta): $\beta^{(s)}_{g,d} = \theta^{(s)}_{\ell(g),\mathrm{MES},d} + \delta^{(s)}_{g,\mathrm{MES}}$, with mean and posterior SD across draws.
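Assuming draws are stored as arrays, computing the exported mean/SD summaries is straightforward (toy shapes; array names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
S, L, D = 500, 3, 4                       # draws, genes, days (toy sizes)
theta_draws = rng.normal(size=(S, L, D))  # posterior draws of theta[:, MES, :]

betahat = theta_draws.mean(axis=0)        # posterior mean per gene/day -> betahat_d*
se = theta_draws.std(axis=0, ddof=1)      # posterior SD across draws   -> se_d*
```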
Exports:

- `gene_daywise_for_mash.csv`: `gene, betahat_d*, se_d*` from $\theta$ only.
- `guide_daywise_for_mash.csv`: `guide, gene, betahat_d*, se_d*` from $\beta$.
- `theta_posterior_summary.npz` and `delta_posterior_summary.npz`.
- QC: `qc_delta_mean_by_gene.csv` (mean $\delta$ per gene) and `qc_theta_beta_offset_by_gene.csv` (mean over guides of $\beta$ minus $\theta$ across days); both are expected to be near 0.
Optional across-day summary (legacy ash): a single per-gene estimate obtained by combining the daywise estimates across days for use with `ash`.
12. Empirical Bayes shrinkage + hit calling (mashr two-mode + aggregation)
- Run mashr on `gene_daywise_for_mash.csv` and `guide_daywise_for_mash.csv` in two modes (conservative/enriched).
- Aggregate guide-level mash outputs to gene-level meta-fixed estimates (`gene_from_guide_mash_*.csv`) with daywise postmean/postsd/lfsr and `active_days`.
- Default discovery uses the gene-theta mash output (`mash_gene_*.csv`).

A daywise call at day $d$ requires `lfsr_d < mash_lfsr_thresh_day` and a sufficiently large probability that the effect exceeds the minimum effect size, where the tail probability is computed from a Normal(postmean, postsd). An any-day call uses the configured correction (`bonferroni` by default) and requires at least `min_active_days_for_hit_anyday` active days.
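A standard-library sketch of the daywise call; the threshold names mirror the config keys, but the exact combination logic shown here is an assumption, not the pipeline's implementation:

```python
import math

def normal_tail_prob(postmean, postsd, min_effect):
    """P(|effect| > min_effect) under Normal(postmean, postsd)."""
    def sf(t):  # upper-tail probability P(X > t) for X ~ N(postmean, postsd)
        return 0.5 * math.erfc((t - postmean) / (postsd * math.sqrt(2.0)))
    return sf(min_effect) + (1.0 - sf(-min_effect))

def daywise_hit(lfsr_d, postmean, postsd,
                lfsr_thresh=0.05, min_effect=0.1, tail_thresh=0.5):
    # Hypothetical combination: lfsr gate AND effect-size tail-probability gate.
    return (lfsr_d < lfsr_thresh
            and normal_tail_prob(postmean, postsd, min_effect) > tail_thresh)
```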
13. Minimal diagnostics (recommended)
- Held-out cross-entropy: full model vs nuisance-only.
- Negative control: permute guides within (day, rep, k-bin) and confirm hits collapse.
- Sanity checks: known regulators show expected MES-EC direction.
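The stratified permutation for the negative control can be sketched as follows (toy table; the real pipeline permutes guide assignments within `(day, rep, k-bin)` strata):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical cell table: permute guide assignments only within cells that
# share (day, rep, k-bin), so the marginal structure is preserved.
day   = np.array([0, 0, 0, 1, 1, 1])
rep   = np.array([0, 0, 0, 0, 0, 0])
kbin  = np.array([1, 1, 1, 2, 2, 2])
guide = np.array([10, 11, 12, 13, 14, 15])

permuted = guide.copy()
strata = np.stack([day, rep, kbin], axis=1)
for s in np.unique(strata, axis=0):
    idx = np.where((strata == s).all(axis=1))[0]
    permuted[idx] = rng.permutation(guide[idx])   # shuffle within the stratum
```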
Primary entrypoints:
- Fit + export:
scripts/fit_pyro_export.py - Shrinkage (mashr two-mode):
scripts/run_mashr_two_mode.R - Guide→gene aggregation:
scripts/aggregate_guides_to_genes.py - Mode comparison:
scripts/compare_mash_modes.py - Rank hits:
scripts/rank_hits.py - Diagnostics:
scripts/run_diagnostics.py - Simulation:
scripts/simulate_recovery.py - Snakemake:
Snakefile(real data and simulation targets)
Configuration:
- `config.yaml` controls input paths, model sizes, priors, and diagnostics.
- `time_scale` is deprecated in the current model (per-interval $\sigma_{\mathrm{time}}$ is learned instead).
Required inputs:
- `adata_path`: AnnData with `adata.obsm[fate_prob_key]` containing EC/MES/NEU probabilities, `adata.obsm[guide_key]` containing the guide-by-cell matrix, `adata.obsm[covar_key]` with `rep_key` for replicate labels, and `adata.obs[day_key]` with day labels (e.g., d0–d3).
- `guide_map_csv`: CSV with columns `guide_name`, `gene_name`, `is_ntc`.
End-to-end on real data (uses `config.yaml`):

```bash
snakemake --use-conda --cores 1
```

Using existing conda envs on an HPC (skip YAML env creation): edit the env paths at the top of `Snakefile`:

```python
PYRO_ENV = "/oak/stanford/groups/engreitz/Users/tri/envs/sc-dl-gpu"
R_ENV = "/oak/stanford/groups/engreitz/Users/tri/envs/scrnaR"
```

Then run Snakemake normally with `--use-conda` (it will activate those envs):

```bash
snakemake --use-conda --cores 1
```

To submit to Slurm and use the per-rule `mem_mb` values from the Snakefile:

```bash
snakemake --use-conda --jobs 50 \
    --cluster "sbatch -p {resources.partition} -t {resources.time} -c {threads} --mem={resources.mem_mb}"
```

For the newer Slurm executor, the Snakefile sets `slurm_partition` per rule. Run:

```bash
snakemake --use-conda --executor slurm --jobs 50
```

Alternatively, use the provided profile to run with the Slurm executor:

```bash
snakemake --profile profiles/slurm
```

Override resources for specific rules on the CLI (example: increase memory/time and threads for `fit_pyro_export`):

```bash
snakemake --use-conda --jobs 50 \
    --cluster "sbatch -p {resources.partition} -t {resources.time} -c {threads} --mem={resources.mem_mb}" \
    --set-resources fit_pyro_export:mem_mb=200000,fit_pyro_export:time=12:00:00 \
    --set-threads fit_pyro_export=8
```

GPU requests are taken from the rule resources (see `GPU_GPUS` at the top of `Snakefile`) and are mapped to `--gpus` by the slurm executor. For the slurm executor, override the partition key directly:

```bash
snakemake --use-conda --executor slurm --jobs 50 \
    --set-resources fit_pyro_export:slurm_partition=gpu
```

The end-to-end run executes: fit_pyro_export → mashr (gene + guide, both modes) → guide aggregation → rank_hits (default mode), and produces `out_fate_pipeline/hits_ranked.csv` plus `out_fate_pipeline/mash_mode_comparison.csv`.
Simulation (full chain, including mashr + aggregation + ranking):

```bash
snakemake --use-conda --cores 1 sim_all
```

The simulation settings live in `config.yaml` under `sim_*` keys. You can override scenarios by editing:

- `sim_cells`, `sim_genes`, `sim_guides`, `sim_days`, `sim_reps`, `sim_kmax`
- `sim_concentration`, `sim_num_steps`, `sim_num_draws`
- `sim_sweep_*` and `sim_tau_sweep_*` for prior sweeps

Each simulation run writes `out_fate_pipeline_sim/sim_metadata.yaml` with the exact inputs used (cells/genes/guides, concentration, seed, day counts, etc.). Override the path with `sim_metadata_path` in `config.yaml`.

To disable the forced rerun of simulation, set:

```yaml
sim_always_run: false
```
Run the simulation directly (no Snakemake):

```bash
python scripts/simulate_recovery.py \
    --cells 200 \
    --genes 30 \
    --guides 90 \
    --days 4 \
    --kmax 4 \
    --concentration 50 \
    --num-steps 500 \
    --s-time 0.3 \
    --s-guide 0.5 \
    --s-tau 1.0 \
    --time-scale 1.0,1.3,1.6 \
    --metadata-out out_fate_pipeline_sim/sim_metadata.yaml \
    --run-export \
    --write-anndata \
    --force
```

Common simulation knobs:

- `--concentration`: lower values increase fate-probability noise.
- `--cells` and `--num-steps`: control sample size and optimization effort.
- `--s-time`, `--s-guide`, `--s-tau`: prior scales.
- `--time-scale`: per-interval random-walk scaling (length `days - 1`).
Daywise recovery evaluation (sim outputs):
```bash
python scripts/eval_daywise_hits.py \
    --hits out_fate_pipeline_sim/hits_ranked.csv \
    --truth out_fate_pipeline_sim/sim_recovery.csv \
    --lfsr-thresh 0.05 \
    --truth-thresh 0.2
```

Unit tests (core helpers):

```bash
python -m unittest tests/test_pyro_model_helpers.py
```

Integration tests (fit + export flow):

```bash
python -m unittest tests/test_pyro_model_integration.py
```

Full test suite (environment check + all tests):

```bash
PYTHON_BIN=python scripts/run_tests.sh
```

Optional end-to-end logic check:

```bash
python scripts/test_pyro_model_logic.py
```

Run diagnostics on real data:

```bash
python scripts/run_diagnostics.py \
    --config config.yaml \
    --adata data/adata.h5ad \
    --guide-map data/guide_map.csv \
    --out out_fate_pipeline/diagnostics.json
```

On Sherlock (or any Slurm cluster), run Snakemake with a cluster submission wrapper. Example (adjust partition/account/time/memory to your environment):

```bash
snakemake --use-conda --jobs 50 --latency-wait 60 \
    --cluster "sbatch -p <partition> -c {threads} --mem {resources.mem_mb}"
```

Tips:
- Keep `--use-conda` to isolate Python/R dependencies per rule.
- Increase `--jobs` and `--cores` based on your allocation.
- If your filesystem is slow, add `--latency-wait`.
- GPU rules already declare `resources: gpu=1` in the Snakefile; adjust the `--cluster` string to request GPUs if needed.
├── Snakefile <- Snakemake workflow (real + simulation targets)
├── config.yaml <- Main config for real-data runs
├── envs/ <- Conda environments (pyro.yaml, ash.yaml, mash.yaml)
├── scripts/ <- Pipeline entrypoints + utilities
│ ├── fit_pyro_export.py
│ ├── run_ash.R
│ ├── run_mashr.R
│ ├── run_mashr_two_mode.R
│ ├── aggregate_guides_to_genes.py
│ ├── compare_mash_modes.py
│ ├── rank_hits.py
│ ├── run_diagnostics.py
│ ├── eval_daywise_hits.py
│ ├── simulate_recovery.py
│ ├── simulate_prior_sweep.py
│ ├── simulate_tau_sweep.py
│ ├── simulate_stress.py
│ └── run_tests.sh
├── src/
│ ├── models/ <- Pyro model + helpers
│ │ ├── pyro_model.py
│ │ └── pyro_pipeline.py
│ └── architecture/ <- Model/pipeline specs + notes
│ ├── pyro_context.md
│ ├── pyro_model.md
│ ├── pyro_export_skeleton.md
│ ├── pyro_pipeline_skeleton.md
│ ├── snakefile.md
│ └── memory.md
├── tests/ <- Unit + integration tests
├── data/ <- Input data (adata, guide map) + raw/interim/processed
├── out_fate_pipeline/ <- Real-data outputs
├── out_fate_pipeline_sim/ <- Simulation outputs (incl. sim_metadata.yaml)
├── docs/ <- Sphinx docs (optional)
├── notebooks/ <- Jupyter notebooks
├── references/ <- Data dictionaries and manuals
├── reports/ <- Figures + reports
├── requirements.txt <- Python deps for local use
├── setup.py
├── README.md
├── LICENSE
└── tox.ini