- Install Nextflow:
curl -s https://get.nextflow.io | bash
mv nextflow ~/bin/
- Clone this repository:
git clone https://github.com/clami66/AF_cache.git
- Run the pipeline for the first time. This automatically downloads and sets up all the necessary DBs, tools and AF2 parameters (this could take a few hours).
cd AF_cache/
nextflow AF_cache.nf --fasta test_data/inputs/fasta/
- The pipeline uses Docker containers to automatically install dependencies. Alternatively, apptainer or conda can also be used. See below to configure this behavior.
- The pipeline automatically installs ColabFold MSA DBs, AlphaFold2 PDB template DBs and AlphaFold2 parameters. If these are already present on the system, this step can be skipped. See below to configure this behavior.
The input to the pipeline is a directory of .fasta files containing all the sequences for a large-scale experiment:
nextflow AF_cache.nf --fasta fasta_dir/
If only a single fasta file with multiple sequences is available, the provided script splits it into multiple files, each containing a single sequence:
mkdir fasta_dir/; python bin/split_fasta.py all_seqs.fasta fasta_dir/
nextflow AF_cache.nf --fasta fasta_dir/
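Since the pipeline expects one sequence per file, it may help to check the result of the split before launching a run. The following is a minimal sketch, assuming that header lines start with `>` and that the files end in `.fasta`; the function name `check_single_seq` is hypothetical and not part of AF_cache:

```shell
# Hypothetical helper (not part of AF_cache): report every .fasta file in a
# directory that does not contain exactly one '>' header line.
check_single_seq() {
    local dir="$1" bad=0 f n
    for f in "$dir"/*.fasta; do
        [ -e "$f" ] || continue                # directory has no .fasta files
        n=$(grep -c '^>' "$f" || true)         # number of header lines
        if [ "$n" -ne 1 ]; then
            echo "$f: $n sequences"
            bad=1
        fi
    done
    return "$bad"                              # 0 if all files are fine
}
```

It could be chained with the pipeline call, e.g. `check_single_seq fasta_dir && nextflow AF_cache.nf --fasta fasta_dir/`.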
By default, the pipeline produces predictions for all-vs-all pairs of sequences in the input fasta.
To restrict the predictions to a list of prot1 prot2 pairs, use the parameter --pair_list. Although the pipeline is designed for dimer interactions, more than two partners can be specified per line:
$ head multimers_list
YP00901869113 YP00901869113
YP00901869012 YP00901869012 YP00901869113
...
# will generate predictions for one homodimer and one heterotrimer:
nextflow AF_cache.nf --fasta test_data/inputs/fasta/ --pair_list test_data/inputs/fasta/pair_list.txt
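A quick sanity check of the list file can catch malformed lines before a long run. The sketch below assumes that IDs are whitespace-separated and that a line with a single ID is a mistake (the pipeline itself may treat such lines differently); `check_pair_list` is a hypothetical name, not an AF_cache tool:

```shell
# Hypothetical helper (not part of AF_cache): flag lines of a pair list that
# contain only one ID. Blank lines are ignored.
check_pair_list() {
    awk 'NF == 1 { printf "line %d: only one ID (%s)\n", NR, $1; bad = 1 }
         END     { exit bad }' "$1"
}
```

For example, `check_pair_list multimers_list` prints nothing and returns 0 for the two-line list shown above.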
Runs are automatically cached by nextflow so that intermediate results can be reused in case of crashes, or if the user changes some settings. Just use the -resume flag to resume the latest run:
nextflow AF_cache.nf --fasta test_data/inputs/fasta/ -resume
To resume an older run, find its job hash (the UUID-formatted session ID near the end of each line) in .nextflow/history:
$ tail .nextflow/history
2025-02-21 11:11:17 - suspicious_moriondo - 94cc730a68d28d281326bb5abc139f75 aa3ea913-abd3-43fb-a8c9-a1e0276f5bbd nextflow AF_cache.nf --fasta test_data/inputs/fasta/
2025-02-21 11:12:16 - prickly_feynman - 433cbbabaa8870cd8cab34357bf44bbd fc260aa7-583c-4fbe-baa1-de34a33fbf48 nextflow AF_cache.nf --fasta test_data/inputs/fasta/
2025-02-21 11:13:25 - marvelous_wiles - ac318a13d87b2c7ebe5170cba82444ba 10c69218-e6c5-4fcb-bfdf-a240ca678ff5 nextflow AF_cache.nf --fasta test_data/inputs/fasta/
Then resume the desired run with `-resume <job_hash>`:
nextflow AF_cache.nf --fasta test_data/inputs/fasta/ -resume aa3ea913-abd3-43fb-a8c9-a1e0276f5bbd
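Looking up the ID by run name can be scripted. This is a minimal sketch, assuming that the session ID is the only UUID-shaped field on each history line (as in the listing above) and that `grep -o` is available; `session_id_for_run` is a hypothetical helper name:

```shell
# Hypothetical helper: print the session ID recorded for a given run name.
# Second argument is the history file (defaults to .nextflow/history).
session_id_for_run() {
    local run_name="$1" history="${2:-.nextflow/history}"
    grep -w -- "$run_name" "$history" |
        grep -oE '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' |
        tail -n 1    # keep the most recent match if the name appears twice
}
```

The result could then be passed on directly, e.g. `nextflow AF_cache.nf --fasta test_data/inputs/fasta/ -resume "$(session_id_for_run suspicious_moriondo)"`.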
The pipeline's behavior can be customized depending on what is available on the host system. Most configuration is done within nextflow.config:
Choosing between Docker, apptainer/singularity and conda
The pipeline uses Docker containers to automatically get all the requirements. Alternatively, apptainer or conda can also be used by running different profiles:
Apptainer/singularity:
nextflow AF_cache.nf --fasta test_data/inputs/fasta/ -profile apptainer
Conda/Mamba:
nextflow AF_cache.nf --fasta test_data/inputs/fasta/ -profile conda
AlphaFold2/AlphaFold3 configuration
AlphaFold2 parameters will be downloaded automatically by the pipeline. If they are already on the system, simply point the pipeline to the parameter file locations inside nextflow.config:
af2_data_dir = '/path/to/alphafold2_data/'
AlphaFold3 parameters must be downloaded manually. The parameter location can be configured in the following line in nextflow.config:
af3_model_dir = '/path/to/af3_model_parameters/'
Other settings for AF2 and AF3 (number of recycles, number of seeds, etc.) should be set in the provided flag files, flags/af2.flag and flags/af3.flag. For example, to change the number of recycles in AF2 and use two NN models, flags/af2.flag might look as follows:
--max_recycles=3
--models_to_use=model_1_multimer_v3,model_2_multimer_v3
Running AlphaFold3
AlphaFold3 can be run by simply adding the --af3 flag:
nextflow AF_cache.nf --fasta test_data/inputs/fasta/ --af3
This will automatically install the necessary environment, according to the docker/apptainer/conda preferences described above.
Notice: the AF3 docker container is not maintained by us.
Notice: the AF3 parameters must be downloaded by the user according to the official docs.
Using structural templates
The template step in the pipeline can be enabled or skipped by setting skip_templates inside nextflow.config:
skip_templates = true
Templates are skipped by default, which also avoids downloading the template DBs the first time the pipeline is run.
Like all params options, this behavior can be changed at runtime:
nextflow AF_cache.nf --fasta test_data/inputs/fasta/ --skip_templates=false
Using ColabFold DBs that are already on the system
If the ColabFold DBs have been downloaded through the original ColabFold setup script, these can be reused so that the pipeline doesn't download an extra copy. This can be done by pointing the database directory inside nextflow.config to the right location:
mmseqs_db = "/path/to/ColabFold/DB"
The directory should contain the files DOWNLOADS_READY, UNIREF30_READY, COLABDB_READY. The pipeline will look for these files to skip the download step.
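Before pointing mmseqs_db at an existing directory, the presence of the marker files can be checked by hand or with a short script. A minimal sketch, where `check_colabfold_db` is a hypothetical helper and the only facts assumed are the three file names listed above:

```shell
# Hypothetical helper: verify that the three marker files are present in an
# existing ColabFold DB directory.
check_colabfold_db() {
    local db_dir="$1" missing=0 marker
    for marker in DOWNLOADS_READY UNIREF30_READY COLABDB_READY; do
        if [ ! -e "$db_dir/$marker" ]; then
            echo "missing: $db_dir/$marker"
            missing=1
        fi
    done
    return "$missing"    # 0 only if all markers exist
}
```

If any marker is reported missing, the pipeline would presumably start the corresponding download again.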
Using AlphaFold2 template DBs that are already on the system
If template DBs (`pdb_mmcif`, `pdb_seqres`) are already on the system, these can be used to avoid downloading an extra copy.
- Set the correct paths inside nextflow.config:
template_mmcif_dir = "/path/to/pdb_mmcif/mmcif_files"
obsolete_pdbs_path = "/path/to/pdb_mmcif/obsolete.dat"
pdb_seqres_database_path = "/path/to/pdb_seqres/pdb_seqres.txt"
- Make sure you add an empty file called PDB_MMCIF_READY inside the directory of the ColabFold DBs (mmseqs_db = "/path/to/ColabFold/DB"). This will avoid triggering a new download of the DBs.
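The two steps above can be combined so that the marker file is only created when the configured template paths actually exist. This is a sketch under that assumption; `mark_templates_ready` is a hypothetical helper and the paths are the placeholders from the configuration lines above:

```shell
# Hypothetical helper: create PDB_MMCIF_READY in the ColabFold DB directory,
# but only if the three template DB paths exist.
# Arguments: <mmseqs_db dir> <mmcif dir> <obsolete.dat> <pdb_seqres.txt>
mark_templates_ready() {
    local db_dir="$1" mmcif_dir="$2" obsolete="$3" seqres="$4"
    [ -d "$mmcif_dir" ] && [ -f "$obsolete" ] && [ -f "$seqres" ] || {
        echo "template DB paths incomplete; marker not created" >&2
        return 1
    }
    touch "$db_dir/PDB_MMCIF_READY"    # empty marker file
}
```

For example: `mark_templates_ready /path/to/ColabFold/DB /path/to/pdb_mmcif/mmcif_files /path/to/pdb_mmcif/obsolete.dat /path/to/pdb_seqres/pdb_seqres.txt`.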
Using a local installation of MMseqs2
Compiling MMseqs2 from the source code is sometimes better than using the pre-compiled binaries included in AF_cache. If an installation of MMseqs2 is already present on the system, that can be used instead of the pre-compiled version by changing the path to the MMseqs2 binary in nextflow.config:
mmseqs_bin = "/path/to/bin/mmseqs"
Job scheduling and resource management
Depending on whether the pipeline runs on an HPC system or locally, some parameters can be changed to send jobs to different schedulers or to run them on local CPU/GPU resources.
For example: on a SLURM-based system, one could send the alignment job to a node with 8 GPUs and all AF inference jobs to single-GPU nodes. Other lighter tasks (e.g. parsing features, copying files) can be sent to CPU-only nodes, or run locally (on the front node). That would be accomplished with the following settings in nextflow.config:
withName:'mmseqs_align' {
    executor = 'slurm'
    clusterOptions = '--account xxx-yyy-zzz --gpus 8 --time 12:00:00'
}
withName:'run_af2_jobs|run_af3_jobs' {
    executor = 'slurm'
    clusterOptions = '--account xxx-yyy-zzz --gpus 1 --time 12:00:00'
}
withName:'ln_fasta|split_fasta|collect_pickles|collect_jsons' {
    executor = 'local'
}
withName:'convert_alignments_af2|parse_features_af2|parse_features_af3|format_jobs_af2|format_jobs_af3' {
    executor = 'local'
}
If all tasks run on a local machine, all executors may be set to local. In that case, edit the executor block so that the job queue size is not larger than the number of GPUs available on the machine. For example, on a local machine with four GPUs:
executor {
    name = "local"
    queueSize = 4
    cpus = 32
}
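To pick a value for queueSize, the number of GPUs on the machine can be counted from the command line. The sketch below assumes NVIDIA GPUs and that `nvidia-smi -L` prints one line starting with `GPU ` per device; `count_gpus` and the `NVIDIA_SMI` override variable are hypothetical names introduced here for illustration:

```shell
# Hypothetical helper: count GPUs by counting the "GPU N: ..." lines printed
# by `nvidia-smi -L`. Prints 0 if the command is missing or lists no devices.
# NVIDIA_SMI is a made-up override so a different command can be substituted.
count_gpus() {
    "${NVIDIA_SMI:-nvidia-smi}" -L 2>/dev/null | grep -c '^GPU '
}
```

Running `count_gpus` on the machine from the example above would be expected to print 4, matching the queueSize setting.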
Heavier CPU tasks to process alignments may be sent to a CPU node, e.g. through SLURM:
...
withName:'convert_alignments_af2|parse_features_af2|parse_features_af3|format_jobs_af2|format_jobs_af3' {
    executor = 'slurm'
    clusterOptions = '--account xxx-yyy-zzz -N 1 -n 32 --time 2:00:00'
}
Consult the Nextflow docs for more information about setting up different executors/schedulers.

