AF_cache: fast inference of AlphaFold2 / AlphaFold3 predictions in large-scale studies

Setup

Install Nextflow:

curl -s https://get.nextflow.io | bash
mv nextflow ~/bin/

Clone this repository:

git clone https://github.com/clami66/AF_cache.git

Run the pipeline for the first time. This will automatically download and set up all the necessary DBs, tools and AF2 parameters (this could take a few hours).
```
cd AF_cache/
nextflow AF_cache.nf --fasta test_data/inputs/fasta/
```

The pipeline uses Docker containers to automatically install dependencies. Alternatively, apptainer or conda can also be used. See below to configure this behavior.
The pipeline automatically installs ColabFold MSA DBs, AlphaFold2 PDB template DBs and AlphaFold2 parameters. If these are already present on the system, this step can be skipped. See below to configure this behavior.

Pipeline inputs

The input to the pipeline is a directory of .fasta files containing all the sequences for a large-scale experiment:

nextflow AF_cache.nf --fasta fasta_dir/

If only one fasta file with multiple sequences is available, we provide a script to split it into multiple files, each containing a single sequence:

mkdir fasta_dir/; python bin/split_fasta.py all_seqs.fasta fasta_dir/
nextflow AF_cache.nf --fasta fasta_dir/
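
The source of bin/split_fasta.py is not reproduced here. As a rough illustration of what such a splitting step involves, the sketch below parses multi-sequence FASTA text and writes one file per record. The function names and the output naming scheme (`<first header token>.fasta`) are assumptions made for this example and may differ from the actual script.

```python
from pathlib import Path


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line, []])
        elif records:  # ignore sequence lines that precede the first header
            records[-1][1].append(line)
    return [(header, "".join(chunks)) for header, chunks in records]


def split_fasta(text, out_dir):
    """Write each record to its own file and return the file names written.

    Assumption: files are named after the first whitespace-delimited
    token of the header line.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for header, sequence in parse_fasta(text):
        tokens = header[1:].split()
        if not tokens:
            raise ValueError("FASTA header without an identifier")
        path = out / f"{tokens[0]}.fasta"
        path.write_text(f"{header}\n{sequence}\n")
        written.append(path.name)
    return written
```

Identifiers that contain path separators or other characters that are unsafe in file names would need sanitizing before use; the sketch leaves that out.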

Subsetting pairs, running multimers (trimers, etc.)

By default, the pipeline produces predictions for all-vs-all pairs of sequences in the input fasta.

To restrict the interactions to a list of prot1 prot2 pairs, use the parameter --pair_list. Although the pipeline is designed for dimer interactions, more than two partners can be specified per line:

$ head multimers_list 
YP00901869113 YP00901869113
YP00901869012 YP00901869012 YP00901869113
...

# will generate predictions for one homodimer and one heterotrimer:
nextflow AF_cache.nf --fasta test_data/inputs/fasta/ --pair_list test_data/inputs/fasta/pair_list.txt
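
To see why restricting the pairs matters at scale, the sketch below enumerates all-vs-all pairs for a set of sequence IDs and parses a whitespace-separated list like the one shown above. It is an illustration only: the function names are hypothetical, and whether the pipeline's default all-vs-all mode includes homodimers is not stated here, so the sketch exposes that choice as a flag.

```python
from itertools import combinations, combinations_with_replacement


def all_vs_all(ids, include_homomers=True):
    """Enumerate unordered pairs of sequence IDs.

    With n unique IDs this yields n*(n+1)/2 pairs when homodimers are
    included and n*(n-1)/2 otherwise.
    """
    unique = sorted(set(ids))
    combo = combinations_with_replacement if include_homomers else combinations
    return list(combo(unique, 2))


def parse_pair_list(text):
    """Parse one complex per line; each line holds two or more IDs."""
    jobs = []
    for line in text.splitlines():
        parts = line.split()
        if parts:  # skip blank lines
            jobs.append(tuple(parts))
    return jobs
```

For 1,000 input sequences, the all-vs-all enumeration with homodimers gives 500,500 jobs, which is the kind of growth a pair list avoids.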

Resuming runs

Runs are automatically cached by nextflow so that intermediate results can be reused in case of crashes, or if the user changes some settings. Just use the -resume flag to resume the latest run:

nextflow AF_cache.nf --fasta test_data/inputs/fasta/ -resume

To resume an older run, the user can find its job hash ID (the UUID in the second-to-last column) in .nextflow/history:

$ tail .nextflow/history 
2025-02-21 11:11:17	-	suspicious_moriondo	-	94cc730a68d28d281326bb5abc139f75	aa3ea913-abd3-43fb-a8c9-a1e0276f5bbd	nextflow AF_cache.nf --fasta test_data/inputs/fasta/
2025-02-21 11:12:16	-	prickly_feynman	-	433cbbabaa8870cd8cab34357bf44bbd	fc260aa7-583c-4fbe-baa1-de34a33fbf48	nextflow AF_cache.nf --fasta test_data/inputs/fasta/
2025-02-21 11:13:25	-	marvelous_wiles	-	ac318a13d87b2c7ebe5170cba82444ba	10c69218-e6c5-4fcb-bfdf-a240ca678ff5	nextflow AF_cache.nf --fasta test_data/inputs/fasta/

Then resume the desired run with `-resume <job_hash>`:

nextflow AF_cache.nf --fasta test_data/inputs/fasta/ -resume aa3ea913-abd3-43fb-a8c9-a1e0276f5bbd
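
When the history grows long, picking the right ID by eye becomes error-prone. The sketch below parses history text of the shape shown above into dictionaries. The column meanings (timestamp, duration, run name, status, revision hash, session ID, command) are inferred from the example lines, and the function names are hypothetical.

```python
def parse_history(text):
    """Parse tab-separated .nextflow/history text into a list of dicts.

    Assumed column order: timestamp, duration, run name, status,
    revision hash, session ID, command line.
    """
    runs = []
    for line in text.splitlines():
        fields = line.split("\t", 6)
        if len(fields) < 7:
            continue  # skip malformed or empty lines
        runs.append({
            "timestamp": fields[0],
            "run_name": fields[2],
            "session_id": fields[5],
            "command": fields[6],
        })
    return runs


def resume_command(run):
    """Build the command line that resumes a given run."""
    return f"{run['command']} -resume {run['session_id']}"
```

A typical use would be `parse_history(open(".nextflow/history").read())`, followed by selecting the entry with the desired run name or timestamp.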

Other configuration options

The pipeline's behavior can be customized depending on what is available on the host system. Most configuration is done within nextflow.config.

Choosing between Docker, apptainer/singularity and conda

The pipeline uses Docker containers to automatically get all the requirements. Alternatively, apptainer or conda can also be used by running different profiles:

Apptainer/singularity:

nextflow AF_cache.nf --fasta test_data/inputs/fasta/ -profile apptainer

Conda/Mamba:

nextflow AF_cache.nf --fasta test_data/inputs/fasta/ -profile conda

AlphaFold2/AlphaFold3 configuration

AlphaFold2 parameters will be downloaded automatically by the pipeline. If they are already on the system, simply point the pipeline to the parameter file locations inside nextflow.config:

af2_data_dir = '/path/to/alphafold2_data/'

AlphaFold3 parameters must be downloaded manually. The parameter location can be configured in the following line in nextflow.config:

af3_model_dir = '/path/to/af3_model_parameters/'

Other behaviors for AF2 and AF3 (number of recycles, number of seeds, etc.) should be set in the flag files flags/af2.flag and flags/af3.flag. For example, to change the number of recycles in AF2 and use two NN models, flags/af2.flag might look as follows:

--max_recycles=3
--models_to_use=model_1_multimer_v3,model_2_multimer_v3
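
A typo in a flag file is easy to miss, so it can help to read the file back before launching a large run. The sketch below parses text in the `--name=value` form shown above into a dictionary. It assumes one flag per line and `#` for comment lines; the function name is hypothetical and the pipeline itself may read these files differently.

```python
def parse_flagfile(text):
    """Parse '--name=value' lines into a dict.

    Assumptions: one flag per line, blank lines and lines starting
    with '#' are ignored, and a flag without '=' is treated as True.
    """
    flags = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if not line.startswith("--"):
            raise ValueError(f"not a flag line: {line!r}")
        name, sep, value = line[2:].partition("=")
        flags[name] = value if sep else True
    return flags
```

Comma-separated values such as the model list can then be inspected with `flags["models_to_use"].split(",")`.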

Running AlphaFold3

AlphaFold3 can be run by simply adding the --af3 flag:

nextflow AF_cache.nf --fasta test_data/inputs/fasta/ --af3

This will automatically install the necessary environment, according to the docker/apptainer/conda preferences described above.

Notice: the AF3 docker container is not maintained by us.

Notice: the AF3 parameters must be downloaded by the user according to the official docs.

Using structural templates

The template step in the pipeline can be enabled or skipped by setting skip_templates inside nextflow.config:

skip_templates = true

By default, templates are skipped, which also avoids downloading the template DBs the first time the pipeline is run.

Like all params options, this behavior can be changed at runtime:

nextflow AF_cache.nf --fasta test_data/inputs/fasta/ --skip_templates=false

Using ColabFold DBs that are already on the system

If the ColabFold DBs have been downloaded through the original ColabFold setup script, these can be reused so that the pipeline doesn't download an extra copy. This can be done by pointing the database directory inside nextflow.config to the right location:

mmseqs_db = "/path/to/ColabFold/DB"

The directory should contain the files DOWNLOADS_READY, UNIREF30_READY, COLABDB_READY. The pipeline will look for these files to skip the download step.
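
Before pointing the pipeline at an existing directory, it can be useful to confirm that the marker files are in place. The sketch below lists which of the three files named above are missing; the function name is hypothetical and the check only mirrors the description in this section, not the pipeline's own code.

```python
from pathlib import Path

# Marker files named in the text above.
COLABFOLD_MARKERS = ("DOWNLOADS_READY", "UNIREF30_READY", "COLABDB_READY")


def missing_markers(db_dir, markers=COLABFOLD_MARKERS):
    """Return the marker files that are absent from db_dir."""
    root = Path(db_dir)
    return [name for name in markers if not (root / name).is_file()]
```

An empty list means all three markers were found, so by the description above the download step should be skipped.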

Using AlphaFold2 template DBs that are already on the system

If template DBs (`pdb_mmcif`, `pdb_seqres`) are already on the system, these can be used to avoid downloading an extra copy.

Set the correct paths inside nextflow.config:

template_mmcif_dir = "/path/to/pdb_mmcif/mmcif_files"
obsolete_pdbs_path = "/path/to/pdb_mmcif/obsolete.dat"
pdb_seqres_database_path = "/path/to/pdb_seqres/pdb_seqres.txt"

Make sure you add an empty file called PDB_MMCIF_READY inside the directory of the ColabFold DBs (mmseqs_db = "/path/to/ColabFold/DB"). This will avoid triggering a new download of the DBs.
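
The two steps above (checking the configured paths, then adding the marker file) can be sketched as follows. The function names are hypothetical; the argument names follow the nextflow.config keys shown above.

```python
from pathlib import Path


def template_db_problems(template_mmcif_dir, obsolete_pdbs_path,
                         pdb_seqres_database_path):
    """Return human-readable problems with the configured template paths."""
    problems = []
    if not Path(template_mmcif_dir).is_dir():
        problems.append(f"template_mmcif_dir is not a directory: {template_mmcif_dir}")
    for key, value in (("obsolete_pdbs_path", obsolete_pdbs_path),
                       ("pdb_seqres_database_path", pdb_seqres_database_path)):
        if not Path(value).is_file():
            problems.append(f"{key} is not a file: {value}")
    return problems


def mark_templates_ready(mmseqs_db):
    """Create the empty PDB_MMCIF_READY marker inside the ColabFold DB directory."""
    marker = Path(mmseqs_db) / "PDB_MMCIF_READY"
    marker.touch(exist_ok=True)
    return marker
```

Creating the marker only after `template_db_problems` returns an empty list avoids telling the pipeline that the DBs are ready when a path is wrong.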

Using a local installation of MMseqs2

Compiling MMseqs2 from the source code is sometimes better than using the pre-compiled binaries included in AF_cache. If an installation of MMseqs2 is already present on the system, that can be used instead of the pre-compiled version by changing the path to the MMseqs2 binary in nextflow.config:

mmseqs_bin = "/path/to/bin/mmseqs"

Job scheduling and resource management

Depending on whether the pipeline runs on an HPC system or locally, some parameters can be varied to send jobs to different schedulers or to run them on local CPU/GPU resources.

For example: on a SLURM-based system, one could send the alignment job to a node with 8 GPUs and all AF inference jobs to single-GPU nodes. Other lighter tasks (e.g. parsing features, copying files) can be sent to CPU-only nodes, or run locally (on the front node). That would be accomplished with the following settings in nextflow.config:

withName:'mmseqs_align' {
    executor = 'slurm'
    clusterOptions = '--account xxx-yyy-zzz --gpus 8 --time 12:00:00'
}

withName:'run_af2_jobs|run_af3_jobs' {
    executor = 'slurm'
    clusterOptions = '--account xxx-yyy-zzz --gpus 1 --time 12:00:00'
}

withName:'ln_fasta|split_fasta|collect_pickles|collect_jsons' {
    executor = 'local'
}

withName:'convert_alignments_af2|convert_alignments_af3|parse_features_af2|parse_features_af3|format_jobs_af2|format_jobs_af3' {
    executor = 'local'
}

If all tasks run on a local machine, set all executors to local, then edit the executor block so that the job queue size is not larger than the number of GPUs available on that machine. For example, with four GPUs on a local machine:

executor{
    name = "local"
    queueSize = 4
    cpus = 32
}
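
One way to keep the queue size in step with the hardware is to derive it from the GPU count. The sketch below counts GPUs in the text printed by `nvidia-smi -L` (one `GPU <index>: ...` line per device) and renders an executor block like the one above. Both function names are hypothetical, and the helper only produces text to paste into nextflow.config.

```python
def count_gpus(nvidia_smi_list_output):
    """Count devices in the output of `nvidia-smi -L`.

    Only top-level lines starting with 'GPU ' are counted, so indented
    sub-device lines (if any) are ignored.
    """
    return sum(1 for line in nvidia_smi_list_output.splitlines()
               if line.startswith("GPU "))


def local_executor_block(queue_size, cpus):
    """Render a local executor block for nextflow.config."""
    return (
        "executor{\n"
        '    name = "local"\n'
        f"    queueSize = {queue_size}\n"
        f"    cpus = {cpus}\n"
        "}"
    )
```

The output of `nvidia-smi -L` could be captured with `subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True).stdout` on a machine where the NVIDIA driver is installed.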

Heavier CPU tasks to process alignments may be sent to a CPU node, e.g. through SLURM:

...

withName:'convert_alignments_af2|convert_alignments_af3|parse_features_af2|parse_features_af3|format_jobs_af2|format_jobs_af3' {
    executor = 'slurm'
    clusterOptions = '--account xxx-yyy-zzz -N 1 -n 32 --time 2:00:00'
}

Consult the Nextflow docs for more information about setting up different executors/schedulers.
