Size SLURM memory from input sequence length #44

Open
DimaMolod wants to merge 8 commits into main from feature/length-aware-memory-requests

DimaMolod (Collaborator) commented May 23, 2026

Summary

Sizes SLURM memory (and, optionally, GPUs) from the input sequence length instead of flat per-rule values, and skips folds too large to be worth running. Large complexes get enough resources on the first attempt; small jobs stop over-provisioning. Everything is config-driven and backward compatible.

What it adds

  1. Length-aware memory: host --mem is computed per stage at scheduling time.
    • create_features: safety·(base + per_residue·L) (linear; MSA/DB-bound)
    • structure_inference: safety·(base + per_token_sq·N²) (quadratic; AlphaFold's pair representation is O(N²))
    • Defaults differ by backend (AF2 is heavier than AF3). The first attempt carries a safety margin; OOM retries still escalate.
  2. Length filtering (closes #33, "Add filtering by total length in download_uniprot rule"): skip folds over a per-protein cap (max_protein_length) or a total-size cap (max_total_length_alphafold2: 5000 / max_total_length_alphafold3: 7000). Lengths are resolved from local FASTA → cache → UniProt (cached, fail-open); skipped folds are logged to skipped_folds.tsv.
  3. VRAM-based GPU routing (optional): structure_inference_gpu_tiers ({min_vram_gb, nodes}); each complex excludes GPUs that are too small, so it runs on any card with enough VRAM (uses the whole pool, not one pinned model).
  4. Robustness fixes: correct sizing when features are precomputed; never cache a missing-file length read (a real cluster submission caught this bug, which had collapsed sizing to the base allocation).
  5. CI: GitHub Actions runs the test suite (21 tests) on Python 3.10/3.12.

Compatibility

  • Backward compatible: every new behavior is a config knob; an explicit *_ram_bytes is still honored (now as the model's base).
  • Local/desktop runs are unaffected by the memory and GPU settings, which are SLURM resources that the local executor ignores. Length filtering is the only new behavior that also runs locally; it can be relaxed by raising or zeroing the caps and by disabling length_filter_fetch_uniprot.

Validation

  • 21 unit tests + CI green (3.10 / 3.12).
  • Live on the cluster (values read from the real sbatch calls):
    • memory scales with length — features: 141 aa → --mem 87050, 1210 aa → 140500; inference: N=800/3500/4600 → ~24 / 89 / 139 GB.
    • VRAM routing — N=800 → any GPU; N≥3500 → --exclude all <80 GB nodes.
    • full end-to-end run (download → features → inference → analysis) with the AF3 container produced a predicted structure + interfaces.csv.

Closes #33

🤖 Generated with Claude Code

DimaMolod and others added 8 commits May 23, 2026 13:50
Host RAM for both compute stages is now requested from the input sequence
length instead of a flat per-rule value, so large complexes get enough memory
on the first attempt rather than failing and climbing the OOM-retry ladder,
while small jobs are no longer over-provisioned.

Model (host RAM, MB), evaluated at scheduling time from the FASTAs the
pipeline already stages under <output_directory>/data/:

  create_features      mem = safety * (base + per_residue   * seq_len)
  structure_inference  mem = safety * (base + per_token_sq  * N**2)

N is the total residues of the complex (the AlphaFold token count, summed over
chains and copy numbers). AlphaFold's pair representation is O(N^2), hence the
quadratic inference term. The first attempt carries a safety margin
(mem_safety_factor, default 1.25) and OOM retries still escalate on top via
`*_ram_scaling ** (attempt - 1)`, so a mis-estimate self-heals.
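
The model above can be sketched as follows. This is an illustration, not the code in common.smk: the helper names, the retry scaling factor of 1.5, rounding to whole MB, and the way max_mem_mb clamps the request are assumptions. Only the formula shape, the 1.25 safety default, and the `scaling ** (attempt - 1)` escalation come from the description.

```python
def estimate_mem_mb(base_mb, coeff_mb, size_term, attempt=1,
                    safety=1.25, retry_scaling=1.5, max_mem_mb=None):
    """Hypothetical helper: safety * (base + coeff * size_term), escalated per retry."""
    first_attempt = safety * (base_mb + coeff_mb * size_term)
    # OOM retries multiply the first-attempt request (attempt is 1-based).
    mem = first_attempt * retry_scaling ** (attempt - 1)
    if max_mem_mb is not None:
        mem = min(mem, max_mem_mb)  # assumed: the cap simply clamps the request
    return round(mem)  # assumed: whole MB


def feature_mem_mb(seq_len, base_mb, per_residue_mb, **kw):
    # Linear in the sequence length.
    return estimate_mem_mb(base_mb, per_residue_mb, seq_len, **kw)


def inference_mem_mb(n_tokens, base_mb, per_token_sq_mb, **kw):
    # Quadratic in the total token count N.
    return estimate_mem_mb(base_mb, per_token_sq_mb, n_tokens ** 2, **kw)
```

With a base of 64000 MB and 40 MB per residue, `feature_mem_mb(141, 64000, 40)` gives 87050, which matches the `--mem` value reported for the 141 aa protein in the Validation section.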

Coefficients are conservative and anchor-calibrated (AF3 performance docs +
documented OOM/demand pairs from the AlphaJudge benchmark), not a dense
empirical fit; the quadratic passes through the observed ~25/82/100 GB pairs at
N=2066/4556/4836 with ~1.2-2.5x head-room. They can be tightened later without
changing the mechanism.

Implementation:
- common.smk: residue_count(), fold_total_tokens(), estimate_feature_mem_mb(),
  estimate_inference_mem_mb(); linear_resources() now forwards `input` to
  resource callbacks that declare it (legacy `(wc, attempt)` callbacks still
  work; wildcards stays positional per Snakemake's calling convention).
- Snakefile: create_features / structure_inference use the length-aware model.
- config.yaml: new knobs mem_safety_factor, max_mem_mb,
  feature_create_ram_per_residue_mb, structure_inference_ram_per_token_sq_mb,
  structure_inference_ram_scaling; the *_ram_bytes keys are now the model base.
  Setting the per-length terms to 0 reproduces the old length-blind behaviour.
- README: documents the model and knobs.
- test/test_memory_resources.py: covers the math, retry escalation, cap, the
  observed-OOM anchors, and the exact linear_resources calling convention.
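
One way to forward `input` only to callbacks that declare it is to inspect the callback's signature. The sketch below is a guess at the idea, not the actual linear_resources() implementation; `call_resource_callback` and the two example callbacks are hypothetical names.

```python
import inspect


def call_resource_callback(callback, wildcards, attempt, input=None):
    """Hypothetical dispatcher: pass `input` only to callbacks that declare it."""
    params = inspect.signature(callback).parameters
    if "input" in params:
        # wildcards stays positional; input is forwarded by keyword.
        return callback(wildcards, attempt, input=input)
    return callback(wildcards, attempt)


def legacy_mem(wc, attempt):
    # Legacy (wc, attempt) callback: never sees the rule input.
    return 1000 * attempt


def length_aware_mem(wc, attempt, input):
    # New-style callback: may size itself from the staged input files.
    return 1000 * attempt + len(input)
```

Both callback styles go through the same dispatcher, so existing rules do not need to change their signatures.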

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
AlphaFold-Multimer (AF2) and AlphaFold 3 have materially different memory and
runtime profiles, so the length-aware coefficients now default by backend
(selected from --data_pipeline for features and --fold_backend for inference;
an explicit config value still overrides).

Evidence (benchmark campaign, joined slurm-logs -> sacct -> sequence length):
- AF2 inference host RSS is ~4x higher than AF3 at the same complex size
  (e.g. N~2300: AF2 ~31 GB vs AF3 ~7 GB) and rises quadratically.
- AF2's feature stage runs HHblits, the dominant OOM source; the AF3 pipeline
  (jackhmmer/nhmmer, no HHblits) is lighter.

Defaults:
                       feature base  feature/res  infer base  infer/N^2
  alphafold2 (AF2):       64000 MB      40 MB       24000 MB    0.0055
  alphafold3 (AF3):       40000 MB      25 MB       16000 MB    0.0045
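
A sketch of how such a table and the "explicit config value overrides" rule could be expressed. The numbers are copied from the table above; the dict layout, the field names, the alias handling in normalize_backend(), and the `resolve_coefficient` helper are assumptions made for illustration.

```python
# Values from the defaults table; the dict layout is an assumption.
FEATURE_RAM_DEFAULTS = {
    "alphafold2": {"base_mb": 64000, "per_residue_mb": 40},
    "alphafold3": {"base_mb": 40000, "per_residue_mb": 25},
}
INFERENCE_RAM_DEFAULTS = {
    "alphafold2": {"base_mb": 24000, "per_token_sq_mb": 0.0055},
    "alphafold3": {"base_mb": 16000, "per_token_sq_mb": 0.0045},
}


def normalize_backend(name):
    """Assumed behaviour: map spellings such as 'AF3' to 'alphafold3'."""
    key = str(name).strip().lower()
    aliases = {"af2": "alphafold2", "af3": "alphafold3"}
    return aliases.get(key, key)


def resolve_coefficient(config, key, backend, defaults, field):
    """Hypothetical helper: explicit config value wins, else the backend default."""
    value = config.get(key)
    if value is None:  # an explicit 0 must be honoured, so do not test truthiness
        return defaults[normalize_backend(backend)][field]
    return value
```

Testing for `None` rather than falsiness matters here, because setting a per-length term to 0 is the documented way to reproduce the old length-blind behaviour.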

The AF3 inference quadratic is sized to the observed GPU-VRAM demand so that,
with unified memory, the host spill ceiling (host_mem/gpu_vram) covers large
complexes instead of OOM-ing. Safety margin and OOM-retry escalation are
unchanged. Runtime is now configurable per attempt
(structure_inference_runtime_minutes, default 1440 for both backends) but kept
generous because AF3 host-memory spilling can take many hours (measured ~8.5 h)
despite AF3's faster on-GPU compute.

- common.smk: FEATURE_RAM_DEFAULTS, INFERENCE_RAM_DEFAULTS, normalize_backend().
- Snakefile: resolve feature/inference backend; coefficients fall back to the
  backend default when the config key is unset; runtime knob wired in.
- config.yaml: per-length coefficient keys commented out so backend defaults
  apply out of the box; documents the AF2/AF3 default table.
- README: backend defaults table + override guidance.
- tests: backend default ordering (AF2 >= AF3) and AF2 measured-host-RSS anchors
  alongside the AF3 GPU-demand anchors (10 tests pass).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ures

With features supplied via `feature_directory`, a chain is provided by
`symlink_features` and its `download_uniprot`/`create_features` rules never run,
so `data/<chain>.fasta` does not exist. The inference estimator then read length
0 for that chain and silently undercounted N (the default config ships a
`feature_directory`, so this is a common path).

Fix: `chain_residue_count()` falls back to the precomputed
`<features_dir>/<chain>_af3_input.json` (via new `af3_input_residue_count()`)
when the FASTA is absent and the backend is AF3; `fold_total_tokens()` and the
structure_inference rule pass the features dir + backend through. AF2 precomputed
pickles can't be read cheaply, so they fall back to the base allocation plus
retry escalation (documented). Example: precomputed P0001(300)+P0002(2000) now
sizes from N=2300 instead of N=2000.

Adds a regression test for the fallback (11 tests pass).
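
A minimal sketch of the fallback, assuming the precomputed file follows the AlphaFold 3 input JSON layout (a top-level `sequences` list whose entries are keyed by `protein`, `rna`, `dna` or `ligand`). The function names here are illustrative and are not the ones in common.smk; multiplying by the number of ids is also an assumption.

```python
import json
from pathlib import Path


def fasta_residue_count(path):
    """Count residues in a FASTA file; 0 if the file is missing."""
    p = Path(path)
    if not p.is_file():
        return 0
    return sum(len(line.strip()) for line in p.read_text().splitlines()
               if line and not line.startswith(">"))


def af3_json_residue_count(path):
    """Sum sequence lengths in an AF3 input JSON; ligand entries count as 0."""
    p = Path(path)
    if not p.is_file():
        return 0
    total = 0
    for entry in json.loads(p.read_text()).get("sequences", []):
        for kind, body in entry.items():
            if kind == "ligand":
                continue  # no sequence, so no tokens are counted
            ids = body.get("id")
            copies = len(ids) if isinstance(ids, list) else 1
            total += copies * len(body.get("sequence", ""))
    return total


def chain_length(chain, data_dir, features_dir=None, backend="alphafold3"):
    """Prefer the staged FASTA; fall back to the precomputed AF3 JSON."""
    n = fasta_residue_count(Path(data_dir) / f"{chain}.fasta")
    if n == 0 and features_dir and backend == "alphafold3":
        n = af3_json_residue_count(Path(features_dir) / f"{chain}_af3_input.json")
    return n
```

Summing `chain_length()` over the chains of a fold then gives N even when some chains exist only as precomputed features.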

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ow-ups

Adds configurable length limits, a CI workflow, an end-to-end SLURM wiring test,
and closes the remaining review follow-ups.

Length filtering (issue #33 + requested total caps):
- Skip folds before any job is created when they exceed a length limit, so an
  oversized complex never wastes a feature/GPU allocation that would only OOM.
- max_total_length_alphafold2: 5000 / max_total_length_alphafold3: 7000
  (selected by --fold_backend; optional single `max_total_length` override).
- max_protein_length (0=off, issue #33): a protein over the limit drops every
  fold containing it, so it is never downloaded.
- Lengths resolved at parse time: local FASTA -> data/<id>.fasta -> persistent
  cache <output_directory>/.sequence_lengths.tsv -> UniProt REST API
  (length_filter_fetch_uniprot, default true). Skipped folds + reasons go to
  <output_directory>/skipped_folds.tsv; unknown lengths fail open (kept).
- Fixes copy-number parsing: AlphaPulldown spec is name:copies:region, so the
  copy count is the first ':' token after the name (A:2:1-100 = 2 copies), not
  the last.
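
The copy-number rule and the caps can be sketched as below. This is an illustration under stated assumptions, not the pipeline's parser: `parse_chain_spec` and `fold_skip_reason` are hypothetical names, regions are ignored when summing (full chain lengths are used), and a cap of 0 is treated as "off" for both limits.

```python
def parse_chain_spec(spec):
    """Parse 'name:copies:region' (e.g. 'A:2:1-100'); copies defaults to 1."""
    name, *rest = spec.split(":")
    copies = 1
    if rest and rest[0].isdigit():
        # The copy count is the first token after the name, not the last.
        copies = int(rest[0])
        rest = rest[1:]
    return name, copies, rest  # rest holds any region tokens


def fold_skip_reason(chains, lengths, max_total=0, max_protein=0):
    """Return a skip reason, or None to keep the fold (0 disables a cap)."""
    total = 0
    for spec in chains:
        name, copies, _ = parse_chain_spec(spec)
        length = lengths.get(name)
        if length is None:
            return None  # unknown length: fail open and keep the fold
        if max_protein and length > max_protein:
            return f"{name} length {length} > max_protein_length {max_protein}"
        total += copies * length
    if max_total and total > max_total:
        return f"total length {total} > cap {max_total}"
    return None
```

For example, a fold of two copies of a 3000-residue chain plus a 500-residue chain totals 6500 residues, so it would be skipped under the 5000 cap but kept under the 7000 cap.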

Review follow-ups:
- #1: integration test feeds our computed mem_mb through the real SLURM plugin's
  get_submit_command and asserts it becomes `sbatch --mem <value>` (skips if the
  plugin is absent).
- #2: GitHub Actions CI (.github/workflows/ci.yml) byte-compiles common.smk and
  runs the dependency-free unit suite on py3.10/3.12.
- #5: AF2 precomputed-feature inference sizing now recovers length from the
  parse-time length cache (no FASTA / no AF3 JSON needed).
- #6: AF3 ligand atoms are intentionally not counted (no sequence); documented,
  with a test asserting ligand entries contribute 0.

The parse-time length cache is shared with memory sizing, so precomputed-feature
runs get correct length-aware memory too. 18 unit tests pass; AF2/AF3/precomputed
dry-runs build clean.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Caught by a real SLURM submission: both create_features jobs requested
mem=80000 (= safety*base, i.e. length 0) despite the FASTAs being present.

Root cause: residue_count()/af3_input_residue_count() were lru_cached. Snakemake's
scheduler evaluates resource functions early — before the upstream download_uniprot
localrule has produced data/<id>.fasta — so a 0 was memoised for the not-yet-existing
file and then returned even after the file appeared, collapsing length-aware sizing
to the base allocation.

Fix:
- Cache only successful (>0) reads; a missing/unreadable file returns 0 without
  caching, so it is re-read once produced.
- create_features now sizes from the parse-time length cache (populated before any
  job runs) with the staged FASTA / rule input as fallback, so it no longer depends
  on when Snakemake evaluates the resource relative to the download.

Verified on the real cluster: P01258 (141 aa) -> sbatch --mem 87050,
P00533 (1210 aa) -> --mem 140500 (= 1.25*(64000+40*L)); two jobs, different
length-aware memory. Adds a regression test that a missing file is not cached.
19 unit tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Route each complex to the smallest GPU tier that fits it, instead of pinning a
single GPU model. Peak GPU VRAM is ~0.0045*N^2, so a small complex wastes a big
card and a large one OOMs a small card; this picks the right tier per job.

New config: structure_inference_gpu_model_by_tokens, a map of inclusive upper
total-token bound -> GPU model, e.g. {2000: "3090", 4000: "A100", 99999: "H100"}.
Each fold routes to the smallest tier it fits (larger than all -> last tier).
When set it takes precedence over structure_inference_gpu_model; when unset the
fixed model is used (unchanged behaviour). select_gpu_model() in common.smk; the
gpu_model resource is now a callable that derives N via fold_total_tokens (same
length source + cache as memory sizing) -> sbatch --gpus=<model>:<count>.
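
A sketch of what a function like select_gpu_model() might look like, given the map described above (inclusive upper token bound -> GPU model). The body is an assumption based on that description, not the code in common.smk.

```python
def select_gpu_model(n_tokens, model_by_tokens):
    """Pick the model of the smallest tier whose inclusive bound covers n_tokens."""
    tiers = sorted(model_by_tokens.items())
    for upper_bound, model in tiers:
        if n_tokens <= upper_bound:
            return model
    return tiers[-1][1]  # larger than every bound: use the last tier


# Example map from the commit message.
GPU_TIERS = {2000: "3090", 4000: "A100", 99999: "H100"}
```

With the example map, N=800 selects "3090", N=3500 selects "A100" and N=4600 selects "H100", in line with the sbatch calls reported below.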

Verified on the cluster (3 folds, AF3): N=800 -> --gpus=3090:1 (--mem 23600),
N=3500 -> --gpus=A100:1 (--mem 88906), N=4600 -> --gpus=H100:1 (--mem 139025);
GPU model and memory both scale with length in the real sbatch call. 20 unit
tests pass (added select_gpu_model coverage).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replaces the token->single-model GPU routing with VRAM-based node exclusion, so a
complex runs on ANY GPU with enough memory instead of one pinned model. This
matters where several GPU models share a VRAM tier (e.g. EMBL gpu-el8 has ~88
nodes at 48 GB across A40+L40s but only 2 at 80 GB) and mirrors the handoff's
"exclude all <80 GB nodes" rescue.

New config (cluster-agnostic; list your own tiers):
  structure_inference_gpu_tiers: [{min_vram_gb, nodes}, ...]
  structure_inference_gpu_vram_headroom: 1.0   # <1.0 tolerates that much host spill
Each complex's estimated peak VRAM (~ structure_inference_ram_per_token_sq_mb * N^2)
picks the smallest tier that fits; nodes of all smaller tiers are excluded (largest
tier if none fits -> spill via unified memory). Drives a per-job
slurm_extra=--exclude=..., merged with the static slurm_exclude_nodes, and overrides
structure_inference_gpu_model (the two would conflict). Stays within one partition
(EMBL's bigger gpu-training cards are out of scope; the tail spills to host).
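
The tier selection can be sketched as follows. The function names, the MB-to-GB conversion by 1024, the way the headroom factor multiplies the estimate, and the node names in the example are all assumptions; only the tier shape ({min_vram_gb, nodes}) and the "exclude every smaller tier, fall back to the largest" rule come from the description above.

```python
def required_vram_gb(n_tokens, per_token_sq_mb=0.0045, headroom=1.0):
    """Estimated peak VRAM in GB; headroom < 1.0 accepts some host spill."""
    return per_token_sq_mb * n_tokens ** 2 * headroom / 1024


def nodes_to_exclude(n_tokens, tiers, static_exclude=(), **kw):
    """Exclude nodes of every tier smaller than the smallest tier that fits."""
    need = required_vram_gb(n_tokens, **kw)
    ordered = sorted(tiers, key=lambda t: t["min_vram_gb"])
    fitting = [t for t in ordered if t["min_vram_gb"] >= need]
    chosen = fitting[0] if fitting else ordered[-1]  # none fits: largest tier
    excluded = list(static_exclude)  # statically excluded nodes always stay out
    for tier in ordered:
        if tier["min_vram_gb"] < chosen["min_vram_gb"]:
            excluded.extend(tier["nodes"])
    return excluded


# Placeholder tiers; node names are invented for the example.
EXAMPLE_TIERS = [
    {"min_vram_gb": 24, "nodes": ["gpuA1"]},
    {"min_vram_gb": 48, "nodes": ["gpuB1", "gpuB2"]},
    {"min_vram_gb": 80, "nodes": ["gpuC1"]},
]
```

The returned list would be joined with commas into the `--exclude=...` value. A small complex excludes nothing beyond the static list, while a complex that needs more than 48 GB excludes the 24 GB and 48 GB tiers.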

common.smk: required_gpu_vram_gb(), gpu_exclude_nodes() (replaces select_gpu_model).
config.yaml/README: EMBL gpu-el8 example (24/40/48/80 GB tiers) with notes that the
RTX PRO 6000 (ptxas-incompatible) stays in slurm_exclude_nodes and gpu-training is
separate; emphasises it is just an example.

Verified live (3 AF3 folds): N=800 -> --exclude=gpu50 (any GPU); N=3500 and N=4600
-> --exclude all 24+48 GB nodes + gpu50 (80 GB pool, spill for the tail); --mem
still 23600/88906/139025. 21 unit tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Clarify that the length filter (and only the length filter) runs during workflow
parsing, so it affects every profile including local execution; the memory and
GPU-routing settings are SLURM resources that local runs ignore. Document raising
or zeroing max_total_length_* (and disabling the UniProt fetch) to attempt very
large folds on a workstation.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>