diff --git a/README.md b/README.md
index 5d7a1a53..a060391e 100644
--- a/README.md
+++ b/README.md
@@ -12,7 +12,7 @@ Install required dependencies:
```bash
conda env create \
-n snake \
- -f https://raw.githubusercontent.com/KosinskiLab/AlphaPulldownSnakemake/2.1.8/workflow/envs/alphapulldown.yaml
+ -f https://raw.githubusercontent.com/KosinskiLab/AlphaPulldownSnakemake/2.4.0/workflow/envs/alphapulldown.yaml
conda activate snake
```
@@ -28,7 +28,7 @@ Create a new processing directory for your project:
snakedeploy deploy-workflow \
https://github.com/KosinskiLab/AlphaPulldownSnakemake \
AlphaPulldownSnakemake \
- --tag 2.1.8
+ --tag 2.4.0
cd AlphaPulldownSnakemake
```
@@ -223,6 +223,8 @@ slurm_qos: "normal" # optional QoS if your site uses it
structure_inference_gpus_per_task: 1 # number of GPUs each inference job needs
structure_inference_gpu_model: "3090" # optional GPU model constraint (remove to allow any)
structure_inference_tasks_per_gpu: 0 # <=0 keeps --ntasks-per-gpu unset in the plugin
+slurm_exclude_nodes: "" # optional comma-separated nodes to avoid (sbatch --exclude)
+structure_inference_max_runtime: 10080 # cap wall time (min) at the partition MaxTime
```
`structure_inference_gpus_per_task` and `structure_inference_gpu_model` are read by the
@@ -234,6 +236,75 @@ fields keeps the job submission consistent across clusters.
the default `0` prevents that flag, which avoids conflicting with the Tres-per-task request on many
systems. Set it to a positive integer only if your site explicitly requires `--ntasks-per-gpu`.
+The remaining optional fields help with two common cluster issues: keeping inference off GPUs it
+can't use, and large complexes running out of GPU memory. Defaults are sensible; expand below only if
+you hit these.
+
+
+Avoiding unsuitable GPUs (slurm_exclude_nodes, gpu_model) and the runtime cap
+
+- **Restrict to one model** with `structure_inference_gpu_model` (e.g. `"A100"`) → the plugin emits
+ `--gpus=:`. Accepts a single model name; leave `""` for any.
+- **Exclude specific nodes** with `slurm_exclude_nodes` → passed verbatim to `sbatch --exclude`
+ (e.g. `"gpu50,gpu51"`). Use it for nodes whose GPU the container can't use — e.g. a CUDA compute
+ capability newer than the container's bundled `ptxas` (fails `ptxas too old` / `UNIMPLEMENTED`).
+ `--exclude` is allowed in `slurm_extra` whereas `--constraint`/`--gres`/`--gpus` are not, so it is
+ the supported way to drop a few nodes while keeping the rest of the partition.
+- **`structure_inference_max_runtime`** caps per-job wall time (minutes). Wall time scales as
+ `1440 * attempt`, so without a cap enough retries exceed the partition `MaxTime` and SLURM rejects
+ the job with `Requested time limit is invalid`. Set it to your partition's `MaxTime`
+ (`scontrol show partition `); default 7 days (10080).
+
+
+
+
+Unified memory for large complexes (structure_inference_unified_memory)
+
+Large AlphaFold 3 inputs (or smaller-VRAM GPUs) can fail with `RESOURCE_EXHAUSTED` /
+`Allocator (GPU_0_bfc) ran out of memory`. Inference enables JAX/XLA **unified (managed) memory** by
+default so the model spills from GPU VRAM into host RAM instead of OOM-ing (slower while spilling, but
+it completes) — the
+[DeepMind-recommended setting](https://github.com/google-deepmind/alphafold3/blob/main/docs/performance.md)
+for large inputs. It is exported inside the prediction container as:
+
+```sh
+export TF_FORCE_UNIFIED_MEMORY=true
+export XLA_PYTHON_CLIENT_PREALLOCATE=false # don't grab a huge VRAM chunk up front
+export XLA_CLIENT_MEM_FRACTION=$FRACTION # how far past physical VRAM XLA may allocate
+export XLA_PYTHON_CLIENT_MEM_FRACTION=$FRACTION
+```
+
+`XLA_PYTHON_CLIENT_PREALLOCATE=false` is required: without it XLA reserves a large
+slice of VRAM immediately, which defeats the point of letting the allocator grow into
+host RAM on demand.
+
+```yaml
+structure_inference_unified_memory: true # set false to fail fast on OOM instead
+structure_inference_xla_mem_fraction: auto # "auto", or pin a number like 3.2
+```
+
+With the default `structure_inference_xla_mem_fraction: auto`, the fraction is computed
+**per job at run time** as `(allocated host RAM) / (physical GPU VRAM)`: the GPU VRAM is
+read with `nvidia-smi` once the job lands on a node, and the host RAM is the job's SLURM
+`--mem` allocation (which scales with retry attempts). This keeps the unified-memory
+ceiling within the SLURM allocation so XLA cannot oversubscribe host RAM beyond what the
+job requested — which would otherwise get the job OOM-killed. The chosen fraction is
+logged as a `[unified-memory]` line at the top of the job log. Pin a number instead if
+you want a fixed multiplier regardless of GPU/RAM (mirrors the EMBL `run_AF_multimer.sh`
+convention).
+
+> The fraction is computed in the job shell rather than via the SLURM executor: the
+> executor passes the submit environment through with `--export=ALL` but offers no
+> per-job env hook, and the value depends on which GPU the job lands on (only known at
+> run time). Computing it in the container shell also avoids the apptainer env-crossing
+> that submit-side env vars would need.
+
+Because spilling is slower, make sure the job also requests enough host RAM
+(`structure_inference_ram_bytes`, in MB) to hold the overflow — under `auto` that RAM is
+exactly what the fraction is sized against.
+
+
+
### Using Precomputed Features
If you have precomputed protein features, specify the directory: