KosinskiLab · DimaMolod · May 23, 2026 · May 23, 2026 · May 23, 2026 · May 23, 2026
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -0,0 +1,26 @@
+name: tests
+
+on:
+  push:
+    branches: [main]
+  pull_request:
+
+jobs:
+  unit-tests:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ["3.10", "3.12"]
+    steps:
+      - uses: actions/checkout@v4
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+      - name: Byte-compile common.smk
+        # common.smk is plain Python and carries the memory/length logic.
+        run: python -m py_compile workflow/rules/common.smk
+      - name: Run resource/length unit tests
+        # The suite loads common.smk directly and needs no third-party deps;
+        # the SLURM-plugin integration test self-skips when the plugin is absent.
+        run: python test/test_memory_resources.py
diff --git a/README.md b/README.md
@@ -241,6 +241,27 @@ you hit these.
 
 - **Restrict to one model** with `structure_inference_gpu_model` (e.g. `"A100"`) → the plugin emits
   `--gpus=<model>:<count>`. Accepts a single model name; leave `""` for any.
+- **Route by complex size (VRAM)** with `structure_inference_gpu_tiers` → list your GPU pool as
+  tiers of `{min_vram_gb, nodes}`. A complex's estimated peak VRAM (≈ `per_token_sq·N²`) selects the
+  smallest tier that fits and all *smaller*-GPU nodes are excluded, so the job runs on **any** GPU at
+  or above that tier — using the whole pool, not one pinned model. A complex larger than every tier
+  uses the biggest tier and spills to host RAM via unified memory.
+
+  ```yaml
+  # Example for EMBL gpu-el8 — replace nodes with your cluster's (nothing is hard-coded):
+  structure_inference_gpu_vram_headroom: 1.0   # <1.0 tolerates that fraction of host spill
+  structure_inference_gpu_tiers:
+    - {min_vram_gb: 24, nodes: "gpu21,gpu22,gpu29,gpu30,gpu31,gpu32,gpu33,gpu34,gpu35,gpu36,gpu37"}
+    - {min_vram_gb: 40, nodes: "gpu25,gpu26,gpu27,gpu28"}
+    - {min_vram_gb: 48, nodes: "gpu40,gpu41,gpu42,gpu43,gpu44,gpu45,gpu46,gpu47,gpu48"}
+    - {min_vram_gb: 80, nodes: "gpu38,gpu39"}
+  ```
+
+  When set this drives `--exclude` per job and **overrides** `structure_inference_gpu_model` (the two
+  would conflict). It's the practical "fit to GPU" lever: requested host RAM is a separate pool and
+  does not size GPU VRAM, but excluding too-small GPUs by length does. Use explicit comma node lists
+  (bracket ranges may be glob-expanded by the shell). Multi-partition routing (e.g. EMBL's bigger
+  `gpu-training` cards) is out of scope — keep one partition and let unified memory spill the tail.
 - **Exclude specific nodes** with `slurm_exclude_nodes` → passed verbatim to `sbatch --exclude`
   (e.g. `"gpu50,gpu51"`). Use it for nodes whose GPU the container can't use — e.g. a CUDA compute
   capability newer than the container's bundled `ptxas` (fails `ptxas too old` / `UNIMPLEMENTED`).
@@ -301,6 +322,93 @@ exactly what the fraction is sized against.
 
 </details>
 
+<details>
+<summary>Length-aware memory requests (sized automatically from the input sequences)</summary>
+
+Host RAM for both compute stages is requested **from the input sequence length**, so big
+complexes get enough memory on the first attempt instead of failing and climbing the retry
+ladder, while small jobs are not over-provisioned. The request is computed at scheduling
+time by reading the per-chain FASTA(s) the pipeline already stages under
+`<output_directory>/data/`:
+
+```
+create_features      mem = safety * (feature_create_ram_bytes + per_residue * seq_len)
+structure_inference  mem = safety * (structure_inference_ram_bytes + per_token_sq * N^2)
+```
+
+- `seq_len` is the query length; `N` is the **total residues of the complex** (the
+  AlphaFold token count, summed over chains and copy numbers). AlphaFold's pair
+  representation is `O(N^2)`, hence the quadratic inference term.
+- **The coefficients default by backend** (selected from `--data_pipeline` / `--fold_backend`).
+  AlphaFold-Multimer (AF2) is heavier than AlphaFold 3 — measured AF2 inference host RSS was
+  ~4× higher than AF3 at the same complex size, and AF2's feature stage runs HHblits (the
+  main OOM source), whereas the AF3 pipeline is lighter. Defaults:
+
+  | backend | feature base | feature /residue | inference base | inference /N² |
+  |---|---|---|---|---|
+  | `alphafold2` | 64000 MB | 40 MB | 24000 MB | 0.0055 |
+  | `alphafold3` | 40000 MB | 25 MB | 16000 MB | 0.0045 |
+
+  The AF3 inference quadratic is sized to the observed GPU-VRAM demand so that, with unified
+  memory, the host spill ceiling (`host_mem / gpu_vram`) covers large complexes instead of
+  OOM-ing.
+- The first attempt already includes `mem_safety_factor` (default `1.25`) of head-room.
+  **OOM retries still escalate** on top, multiplying by `..._ram_scaling ** (attempt - 1)`,
+  so a bad estimate self-heals.
+- Override any backend default by setting the matching key in `config/config.yaml`
+  (`feature_create_ram_bytes`, `feature_create_ram_per_residue_mb`,
+  `structure_inference_ram_bytes`, `structure_inference_ram_per_token_sq_mb`); an explicit
+  value applies to all backends. Also tune `mem_safety_factor`, the `..._ram_scaling`
+  factors, `structure_inference_runtime_minutes`, and `max_mem_mb` (set it to your largest
+  node's RAM where an over-estimate would otherwise never schedule; `0` = no cap).
+- The `..._ram_bytes` keys are the **fixed base** of each model rather than a flat request;
+  raising a base only raises the floor. Setting `per_residue`/`per_token_sq` to `0`
+  reproduces the old length-blind behaviour (a flat base × retry scaling).
+- **Precomputed features:** when a chain is supplied via `feature_directory`, no
+  `data/<chain>.fasta` is generated. Length is then recovered from the precomputed
+  `<chain>_af3_input.json` (AF3) or from the parse-time length cache written by the length
+  filter below (covers AF2 too). If neither is available the job falls back to the base
+  allocation plus retry escalation. AF3 ligand atoms are not counted (no sequence), a small
+  undercount absorbed by the safety margin.
+
+</details>
+
+<details>
+<summary>Skipping over-large complexes (length filtering)</summary>
+
+Folds that are too large to be worth submitting are **skipped before any job is created**,
+so a single oversized complex (or one giant chain) doesn't waste a GPU/feature allocation
+that will only OOM. Two configurable limits (in `config/config.yaml`):
+
+```yaml
+# Max TOTAL complex length (sum of all chains), per backend — selected by --fold_backend.
+max_total_length_alphafold2: 5000     # AF2-Multimer
+max_total_length_alphafold3: 7000     # AF3 handles larger inputs
+# max_total_length: 6000              # optional single override for both backends
+# Max length of any SINGLE protein; 0 = off (issue #33). A protein over this drops every
+# fold containing it, so it is never even downloaded.
+max_protein_length: 0
+length_filter_fetch_uniprot: true     # set false for fully offline runs
+```
+
+- Lengths are resolved at **parse time** from, in order: a local FASTA, an
+  already-downloaded `data/<id>.fasta`, the persistent cache
+  `<output_directory>/.sequence_lengths.tsv`, and finally the UniProt REST API (cached for
+  next time). Set a limit to `0` to disable it; if both are `0`, no resolution/fetching
+  happens at all.
+- Skipped folds are listed with reasons in `<output_directory>/skipped_folds.tsv` and logged
+  as a `[length-filter]` warning. **Unknown lengths fail open** (the fold is kept), so a
+  UniProt outage never silently drops work.
+- First parse of a large all-UniProt sheet will fetch each unique length once (cached
+  afterwards); already-downloaded inputs and local FASTAs are read without any network call.
+- **Applies to every profile, including local/workstation runs** (it runs during workflow
+  parsing, not in the executor). It's the only length-aware feature that does — the memory
+  and GPU-routing settings are SLURM resources that local runs ignore. To attempt a complex
+  larger than the caps on a big workstation, raise or zero the `max_total_length_*` values
+  (and set `length_filter_fetch_uniprot: false` for offline use).
+
+</details>
+
 ### Using precomputed features
 
 If you have precomputed protein features, specify the directory:

diff --git a/config/config.yaml b/config/config.yaml
@@ -52,10 +52,65 @@ analyze_structure_arguments:
 
 # Memory allocation for feature creation and structure inference.
 # NOTE: despite the "_bytes" suffix these values are in MEGABYTES (used directly as
-# the SLURM --mem request), so 64000 = 64000 MB ~= 64 GB. They scale with retries.
-feature_create_ram_bytes: 64000      # MB
-feature_create_ram_scaling: 1.1
-structure_inference_ram_bytes: 64000 # MB
+# the SLURM --mem request), so 64000 = 64000 MB ~= 64 GB.
+#
+# Host RAM for both stages is sized automatically from the input sequence length
+# with a safety margin, and still escalates on OOM retries. The requested memory is:
+#   create_features      : safety * (feature_create_ram_bytes + per_residue * seq_len)
+#   structure_inference  : safety * (structure_inference_ram_bytes + per_token_sq * N^2)
+# where seq_len is the query length, N is the total residues of the complex, and the
+# request is multiplied by (scaling ** (attempt - 1)) on each retry. AlphaFold's pair
+# representation is O(N^2), hence the quadratic term for inference.
+#
+# The base / per-length coefficients DEFAULT BY BACKEND (AF2 is heavier than AF3),
+# selected automatically from --data_pipeline / --fold_backend:
+#                              feature base   feature /res   infer base   infer /N^2
+#   alphafold2 (AF2 multimer):    64000 MB       40 MB         24000 MB     0.0055
+#   alphafold3 (AF3):             40000 MB       25 MB         16000 MB     0.0045
+# Leave the keys below commented to use those backend defaults. Uncomment any key
+# to override it for ALL backends.
+
+# Safety margin applied to the first-attempt estimate of every length-aware stage.
+mem_safety_factor: 1.25
+# Optional hard ceiling (MB) on any single memory request; 0 = no cap. Set this to
+# your largest node's RAM on clusters where an over-estimate would never schedule.
+max_mem_mb: 0
+
+# create_features (CPU/MSA) — base is database/MSA-tool dominated (AF2 HHblits is the
+# main OOM source), with a mild linear dependence on query length.
+# feature_create_ram_bytes: 64000        # MB, fixed base (default: backend-specific)
+# feature_create_ram_per_residue_mb: 30  # MB per query residue (default: backend-specific)
+feature_create_ram_scaling: 1.1          # per-retry escalation on OOM
+
+# structure_inference (GPU host RAM) — base plus a quadratic term in complex size.
+# With unified memory enabled the XLA spill fraction is derived from this host
+# allocation, so this also sizes the effective GPU memory ceiling.
+# structure_inference_ram_bytes: 24000           # MB, fixed base (default: backend-specific)
+# structure_inference_ram_per_token_sq_mb: 0.0045  # MB per residue^2 (default: backend-specific)
+structure_inference_ram_scaling: 1.1             # per-retry escalation on OOM
+# Wall-time minutes per attempt for structure_inference (capped by
+# structure_inference_max_runtime). Default 1440; AF3 on an adequate GPU finishes far
+# sooner, but host-memory spilling can take many hours, so the default stays generous.
+# structure_inference_runtime_minutes: 1440
+
+# Length filtering: skip folds that are too large to be worth submitting. Lengths
+# are resolved at parse time (local FASTA / already-downloaded data / cache / the
+# UniProt REST API) and cached in <output_directory>/.sequence_lengths.tsv; skipped
+# folds are listed in <output_directory>/skipped_folds.tsv. Unknown lengths fail
+# open (the fold is kept). Set a limit to 0 to disable it.
+#
+# Max TOTAL complex length (sum of all chains, residues), per backend; selected by
+# --fold_backend. AF3 handles larger inputs than AF2-Multimer.
+max_total_length_alphafold2: 5000
+max_total_length_alphafold3: 7000
+# Optional single override applied to BOTH backends (takes precedence if set):
+# max_total_length: 6000
+# Max length of any SINGLE protein (residues); 0 disables (issue #33). A protein
+# over this drops every fold containing it, so it is never downloaded/predicted.
+max_protein_length: 0
+# Whether parse-time length resolution may query the UniProt REST API for IDs that
+# have no local FASTA / cached length yet. Set false for fully offline runs.
+length_filter_fetch_uniprot: true
 
 # Number of threads for AlphaFold inference
 alphafold_inference_threads: 8
@@ -67,6 +122,23 @@ structure_inference_gpus_per_task: 1
 # Restrict structure_inference to one GPU model (sbatch --gpus=<model>:N), e.g. "3090".
 # Leave "" to let SLURM pick any GPU in the partition.
 structure_inference_gpu_model: "3090"
+# Optional length-based GPU routing by VRAM (within this partition). List your GPU
+# pool as tiers of {min_vram_gb, nodes}; a complex's estimated peak VRAM
+# (~ structure_inference_ram_per_token_sq * N^2) selects the smallest tier that fits
+# and all SMALLER-GPU nodes are excluded, so the job runs on ANY GPU >= that tier
+# (the whole pool, not one pinned model). A complex larger than every tier uses the
+# biggest tier and spills to host RAM via unified memory. When set, this drives
+# --exclude and OVERRIDES structure_inference_gpu_model. Use explicit comma node lists
+# (avoid bracket ranges, which the shell may glob). The example below is for the EMBL
+# gpu-el8 partition - replace nodes with your cluster's; nothing is hard-coded.
+# structure_inference_gpu_vram_headroom: 1.0   # <1.0 tolerates that fraction of host spill
+# structure_inference_gpu_tiers:
+#   - {min_vram_gb: 24, nodes: "gpu21,gpu22,gpu29,gpu30,gpu31,gpu32,gpu33,gpu34,gpu35,gpu36,gpu37"}  # RTX 3090
+#   - {min_vram_gb: 40, nodes: "gpu25,gpu26,gpu27,gpu28"}                                # A100 40GB
+#   - {min_vram_gb: 48, nodes: "gpu40,gpu41,gpu42,gpu43,gpu44,gpu45,gpu46,gpu47,gpu48"}  # L40s/A40 48GB
+#   - {min_vram_gb: 80, nodes: "gpu38,gpu39"}                                            # H100 PCIe 80GB
+# Note: RTX PRO 6000 (gpu50-53, 96GB) are ptxas-incompatible -> keep in slurm_exclude_nodes.
+# H100-SXM/H200/B200 live on the separate gpu-training partition (not routed here).
 # Optional: comma-separated nodes to keep structure_inference OFF, passed to sbatch
 # as --exclude. Useful for GPUs the prediction container cannot use (e.g. a CUDA
 # compute capability the bundled ptxas is too old for). Example: