
Commit 44939e4

Format all files
1 parent 2c4d628 commit 44939e4

84 files changed (+11317, -1101 lines)


CHANGELOG.md

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@
 **Changes**:

 - Introducing **DockGen-E**, a new version of the DockGen benchmark dataset featuring enhanced biomolecular context for docking and co-folding predictions - namely, now all DockGen complexes represent the first (biologically relevant) bioassembly of the corresponding PDB structure
-- For the single-ligand datasets (i.e., Astex Diverse, PoseBusters Benchmark, and DockGen), now providing each baseline method with primary *and cofactor* ligand SMILES strings for prediction, to enhance the biomolecular context of these methods' predicted structures - as a result, for these single-ligand datasets, now the predicted ligand *most similar* to the primary ligand (in terms of both Tanimoto and structural similarity) is selected for scoring (which adds an additional layer of challenges for baseline methods)
+- For the single-ligand datasets (i.e., Astex Diverse, PoseBusters Benchmark, and DockGen), now providing each baseline method with primary _and cofactor_ ligand SMILES strings for prediction, to enhance the biomolecular context of these methods' predicted structures - as a result, for these single-ligand datasets, now the predicted ligand _most similar_ to the primary ligand (in terms of both Tanimoto and structural similarity) is selected for scoring (which adds an additional layer of challenges for baseline methods)
 - Updated Chai-1's inference code to commit `44375d5d4ea44c0b5b7204519e63f40b063e4a7c`, and ran it also with standardized (paired) MSAs
 - Replaced all AlphaFold 3 server predictions of each dataset's protein structures with predictions from AlphaFold 3's local inference code
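The "most similar to the primary ligand" selection described in the changelog entry above can be illustrated with a rough RDKit sketch. This is an assumption-laden example, not the benchmark's actual scoring code: it uses only Morgan-fingerprint Tanimoto similarity (the real procedure also considers structural similarity), and the function and variable names are made up.

```python
# Illustrative sketch only: pick, among a method's predicted ligands, the one
# whose Tanimoto similarity to the primary ligand SMILES is highest.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem


def select_primary_like_ligand(primary_smiles: str, predicted_smiles: list[str]) -> str | None:
    """Return the predicted SMILES most similar to the primary ligand
    (structural-similarity tie-breaking omitted in this sketch)."""
    primary = Chem.MolFromSmiles(primary_smiles)
    primary_fp = AllChem.GetMorganFingerprintAsBitVect(primary, 2, nBits=2048)

    best_smiles, best_sim = None, -1.0
    for smiles in predicted_smiles:
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:  # skip unparsable predictions
            continue
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
        sim = DataStructs.TanimotoSimilarity(primary_fp, fp)
        if sim > best_sim:
            best_smiles, best_sim = smiles, sim
    return best_smiles


# Example: the primary ligand is recovered even when cofactor-like predictions are present
print(select_primary_like_ligand("CCO", ["CCO", "c1ccccc1", "CC(=O)O"]))
```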

docs/.docs.environment.yaml

Lines changed: 34 additions & 34 deletions (a whitespace-only reformat of the pip dependency list; the entries themselves are unchanged)

@@ -22,37 +22,37 @@ dependencies:
  - pdbfixer=1.9=pyh1a96a4e_0
  - python=3.10.14=hd12c33a_0_cpython
  - pip:
      - beartype
      - biopandas
      - biopython
      - docutils==0.20.1 # NOTE: currently required due to an `m2r2` bug: https://github.com/CrossNox/m2r2/issues/68
      - furo
      - ipython
      - lxml_html_clean
      - m2r2
      - matplotlib
      - nbsphinx
      - nbsphinx-link
      - nbstripout
      - pandas
      - pandoc
      - pdb4amber
      - pip
      - posebusters
      - prody
      - pydocstyle
      - pypdb
      - rdkit
      - rdkit-pypi
      - rootutils
      - sphinx
      - sphinx-copybutton
      - sphinx-inline-tabs
      - sphinx_mdinclude
      - sphinxext-opengraph
      - sphinxcontrib-gtagjs
      - sphinxcontrib-jquery
      - sphinx_codeautolink
      - wrapt_timeout_decorator
      - tqdm
      - watermark

environments/chai_lab_environment.yaml

Lines changed: 1 addition & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -178,5 +178,4 @@ dependencies:
178178
- wcwidth==0.2.13
179179
- wrapt==1.16.0
180180
- yarl==1.12.1
181-
prefix:
182-
forks/chai-lab/chai-lab
181+
prefix: forks/chai-lab/chai-lab

environments/diffdock_environment.yaml

Lines changed: 2 additions & 4 deletions
@@ -1,5 +1,4 @@
-name:
-  DiffDock
+name: DiffDock
 channels:
   - pyg
   - pytorch
@@ -342,5 +341,4 @@ dependencies:
   - yarl==1.9.2
   - zope-event==5.0
   - zope-interface==6.0
-prefix:
-  forks/DiffDock/DiffDock
+prefix: forks/DiffDock/DiffDock

environments/dynamicbind_environment.yaml

Lines changed: 1 addition & 2 deletions
@@ -261,5 +261,4 @@ dependencies:
   - unicodedata2==15.0.0
   - urllib3==1.26.15
   - wheel==0.38.4
-prefix:
-  forks/DynamicBind/DynamicBind
+prefix: forks/DynamicBind/DynamicBind

environments/fabind_environment.yaml

Lines changed: 1 addition & 2 deletions
@@ -160,5 +160,4 @@ dependencies:
   - tzdata==2023.4
   - werkzeug==3.0.1
   - zipp==3.17.0
-prefix:
-  forks/FABind/FABind
+prefix: forks/FABind/FABind

forks/DiffDock/README.md

Lines changed: 40 additions & 39 deletions
Large diffs are not rendered by default.

forks/DiffDock/app/README.md

Lines changed: 4 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -9,17 +9,16 @@ app_file: main.py
99
pinned: false
1010
---
1111

12-
1312
## How to use this space
1413

1514
This is a simple app intended to showcase [DiffDock](https://github.com/gcorso/DiffDock).
1615
One can upload a protein and ligand, and calculate the predicted structure. The results are visualized in 3D and can be downloaded.
1716

18-
* This app is designed to take 1 protein (in PDB format) and 1 ligand (in SDF format) at a time. For bulk inference, use the [command line interface](https://github.com/gcorso/DiffDock).
17+
- This app is designed to take 1 protein (in PDB format) and 1 ligand (in SDF format) at a time. For bulk inference, use the [command line interface](https://github.com/gcorso/DiffDock).
1918

20-
* Our demonstration space uses a CPU, so it may take a few minutes to run. For faster results, use a GPU.
21-
One can duplicate this space (at their own expense) by selecting "⋮" -> "Duplicate this space" in the top right corner, and then selecting a GPU in the "Settings" tab.
19+
- Our demonstration space uses a CPU, so it may take a few minutes to run. For faster results, use a GPU.
20+
One can duplicate this space (at their own expense) by selecting "⋮" -> "Duplicate this space" in the top right corner, and then selecting a GPU in the "Settings" tab.
2221

23-
----------
22+
---
2423

2524
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

forks/DiffDock/environment.yml

Lines changed: 26 additions & 26 deletions (a whitespace-only reformat of the pip dependency lists; the entries themselves are unchanged)

@@ -10,31 +10,31 @@ dependencies:
  - pip
  # Need to install torch in order to build openfold, so install it first
  - pip:
      - --extra-index-url https://download.pytorch.org/whl/cu117
      - --find-links https://pytorch-geometric.com/whl/torch-1.13.1+cu117.html
      - torch==1.13.1+cu117
  - pip:
      - --extra-index-url https://download.pytorch.org/whl/cu117
      - --find-links https://pytorch-geometric.com/whl/torch-1.13.1+cu117.html
      - dllogger @ git+https://github.com/NVIDIA/dllogger.git
      - e3nn==0.5.0
      - fair-esm[esmfold]==2.0.0
      - networkx==2.8.4
      - openfold @ git+https://github.com/aqlaboratory/openfold.git@4b41059694619831a7db195b7e0988fc4ff3a307
      - pandas==1.5.1
      - prody==2.2.0
      - prody==2.2.0
      - pybind11==2.11.1
      - rdkit==2022.03.3
      - scikit-learn==1.1.0
      - scipy==1.12.0
      - torch==1.13.1+cu117
      - torch-cluster==1.6.0+pt113cu117
      - torch-geometric==2.2.0
      - torch-scatter==2.1.0+pt113cu117
      - torch-sparse==0.6.16+pt113cu117
      - torch-spline-conv==1.2.1+pt113cu117
      - torchmetrics==0.11.0
  - pip:
      - gradio==3.50.*
      - requests

forks/DiffDockv1/README.md

Lines changed: 21 additions & 16 deletions
@@ -1,28 +1,28 @@
 # DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking
+
 [![PWC](https://img.shields.io/endpoint.svg?url=https://paperswithcode.com/badge/diffdock-diffusion-steps-twists-and-turns-for/blind-docking-on-pdbbind)](https://paperswithcode.com/sota/blind-docking-on-pdbbind?p=diffdock-diffusion-steps-twists-and-turns-for)

 ### [Paper on arXiv](https://arxiv.org/abs/2210.01776)

-Implementation of DiffDock, state-of-the-art method for molecular docking, by Gabriele Corso*, Hannes Stark*, Bowen Jing*, Regina Barzilay and Tommi Jaakkola.
-This repository contains all code, instructions and model weights necessary to run the method or to retrain a model.
+Implementation of DiffDock, state-of-the-art method for molecular docking, by Gabriele Corso*, Hannes Stark*, Bowen Jing\*, Regina Barzilay and Tommi Jaakkola.
+This repository contains all code, instructions and model weights necessary to run the method or to retrain a model.
 If you have any question, feel free to open an issue or reach out to us: [gcorso@mit.edu](gcorso@mit.edu), [hstark@mit.edu](hstark@mit.edu), [bjing@mit.edu](bjing@mit.edu).

 ![Alt Text](visualizations/overview.png)

 The repository also contains all the scripts to run the baselines and generate the figures.
 Additionally, there are visualization videos in `visualizations`.

-You might also be interested in this [Google Colab notebook](https://colab.research.google.com/drive/1CTtUGg05-2MtlWmfJhqzLTtkDDaxCDOQ#scrollTo=zlPOKLIBsiPU) to run DiffDock by Brian Naughton.
+You might also be interested in this [Google Colab notebook](https://colab.research.google.com/drive/1CTtUGg05-2MtlWmfJhqzLTtkDDaxCDOQ#scrollTo=zlPOKLIBsiPU) to run DiffDock by Brian Naughton.

 # Dataset

 The files in `data` contain the names for the time-based data split.

-If you want to train one of our models with the data then:
-1. download it from [zenodo](https://zenodo.org/record/6408497)
-2. unzip the directory and place it into `data` such that you have the path `data/PDBBind_processed`
-
+If you want to train one of our models with the data then:

+1. download it from [zenodo](https://zenodo.org/record/6408497)
+2. unzip the directory and place it into `data` such that you have the path `data/PDBBind_processed`

 ## Setup Environment

@@ -45,27 +45,28 @@ Then you need to install ESM that we use both for protein sequence embeddings an
 pip install 'dllogger @ git+https://github.com/NVIDIA/dllogger.git'
 pip install 'openfold @ git+https://github.com/aqlaboratory/openfold.git@4b41059694619831a7db195b7e0988fc4ff3a307'

-
 # Running DiffDock on your own complexes
+
 We support multiple input formats depending on whether you only want to make predictions for a single complex or for many at once.\
 The protein inputs need to be `.pdb` files or sequences that will be folded with ESMFold. The ligand input can either be a SMILES string or a filetype that RDKit can read like `.sdf` or `.mol2`.

 For a single complex: specify the protein with `--protein_path protein.pdb` or `--protein_sequence GIQSYCTPPYSVLQDPPQPVV` and the ligand with `--ligand ligand.sdf` or `--ligand "COc(cc1)ccc1C#N"`

-For many complexes: create a csv file with paths to proteins and ligand files or SMILES. It contains as columns `complex_name` (name used to save predictions, can be left empty), `protein_path` (path to `.pdb` file, if empty uses sequence), `ligand_description` (SMILE or file path) and `protein_sequence` (to fold with ESMFold in case the protein_path is empty).
+For many complexes: create a csv file with paths to proteins and ligand files or SMILES. It contains as columns `complex_name` (name used to save predictions, can be left empty), `protein_path` (path to `.pdb` file, if empty uses sequence), `ligand_description` (SMILE or file path) and `protein_sequence` (to fold with ESMFold in case the protein_path is empty).
 An example .csv is at `data/protein_ligand_example_csv.csv` and you would use it with `--protein_ligand_csv protein_ligand_example_csv.csv`.

 And you are ready to run inference:

 python -m inference --protein_ligand_csv data/protein_ligand_example_csv.csv --out_dir results/user_predictions_small --inference_steps 20 --samples_per_complex 40 --batch_size 10 --actual_steps 18 --no_final_step_noise

-When providing the `.pdb` files you can run DiffDock also on CPU, however, if possible, we recommend using a GPU as the model runs significantly faster. Note that the first time you run DiffDock on a device the program will precompute and store in cache look-up tables for SO(2) and SO(3) distributions (typically takes a couple of minutes), this won't be repeated in following runs.
-
+When providing the `.pdb` files you can run DiffDock also on CPU, however, if possible, we recommend using a GPU as the model runs significantly faster. Note that the first time you run DiffDock on a device the program will precompute and store in cache look-up tables for SO(2) and SO(3) distributions (typically takes a couple of minutes), this won't be repeated in following runs.

 # Retraining DiffDock
+
 Download the data and place it as described in the "Dataset" section above.

 ### Generate the ESM2 embeddings for the proteins
+
 First run:

 python datasets/pdbbind_lm_embedding_preparation.py

@@ -80,10 +81,11 @@ Then run the command:
 python datasets/esm_embeddings_to_pt.py

 ### Using the provided model weights for evaluation
+
 We first generate the language model embeddings for the testset, then run inference with DiffDock, and then evaluate the files that DiffDock produced:

 python datasets/esm_embedding_preparation.py --protein_ligand_csv data/testset_csv.csv --out_file data/prepared_for_esm_testset.fasta
-git clone https://github.com/facebookresearch/esm
+git clone https://github.com/facebookresearch/esm
 cd esm
 pip install -e .
 cd ..

@@ -92,13 +94,15 @@ We first generate the language model embeddings for the testset, then run infere
 python evaluate_files.py --results_path results/user_predictions_testset --file_to_exclude rank1.sdf --num_predictions 40

 <!--
-To predict binding structures using the provided model weights run:
+To predict binding structures using the provided model weights run:

 python -m evaluate --model_dir workdir/paper_score_model --ckpt best_ema_inference_epoch_model.pt --confidence_ckpt best_model_epoch75.pt --confidence_model_dir workdir/paper_confidence_model --run_name DiffDockInference --inference_steps 20 --split_path data/splits/timesplit_test --samples_per_complex 40 --batch_size 10 --actual_steps 18 --no_final_step_noise

 To additionally save the .sdf files of the generated molecules, add the flag `--save_visualisation`
 -->
+
 ### Training a model yourself and using those weights
+
 Train the large score model:

 python -m train --run_name big_score_model --test_sigma_intervals --esm_embeddings_path data/esm2_3billion_embeddings.pt --log_dir workdir --lr 1e-3 --tr_sigma_min 0.1 --tr_sigma_max 19 --rot_sigma_min 0.03 --rot_sigma_max 1.55 --batch_size 16 --ns 48 --nv 10 --num_conv_layers 6 --dynamic_max_cross --scheduler plateau --scale_by_sigma --dropout 0.1 --remove_hs --c_alpha_max_neighbors 24 --receptor_radius 15 --num_dataloader_workers 1 --cudnn_benchmark --val_inference_freq 5 --num_inference_complexes 500 --use_ema --distance_embed_dim 64 --cross_distance_embed_dim 64 --sigma_embed_dim 64 --scheduler_patience 30 --n_epochs 850

@@ -122,22 +126,23 @@ Now everything is trained and you can run inference with:

 python -m evaluate --model_dir workdir/big_score_model --ckpt best_ema_inference_epoch_model.pt --confidence_ckpt best_model_epoch75.pt --confidence_model_dir workdir/confidence_model --run_name DiffDockInference --inference_steps 20 --split_path data/splits/timesplit_test --samples_per_complex 40 --batch_size 10 --actual_steps 18 --no_final_step_noise

-Note: the notebook `data/apo_alignment.ipynb` contains the code used to align the ESMFold-generated apo-structures to the holo-structures.
+Note: the notebook `data/apo_alignment.ipynb` contains the code used to align the ESMFold-generated apo-structures to the holo-structures.

 ## Citation
+
 @article{corso2023diffdock,
-title={DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking},
+title={DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking},
 author = {Corso, Gabriele and Stärk, Hannes and Jing, Bowen and Barzilay, Regina and Jaakkola, Tommi},
 journal={International Conference on Learning Representations (ICLR)},
 year={2023}
 }

 ## License
+
 MIT

 ## Acknowledgements

 We thank Wei Lu and Rachel Wu for pointing out some issues with the code.

-
 ![Alt Text](visualizations/example_6agt_symmetric.gif)
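As a side note on the batch-input format quoted in the README diff above (columns `complex_name`, `protein_path`, `ligand_description`, `protein_sequence`), a minimal sketch of building such a CSV with pandas might look like the following; the file name and entries are placeholders, not files shipped with the repository:

```python
# Hypothetical example of the batch-input CSV described in the DiffDock README above.
import pandas as pd

rows = [
    {
        "complex_name": "example_complex",            # optional; used to name the output folder
        "protein_path": "data/example_protein.pdb",   # leave empty to fold protein_sequence with ESMFold
        "ligand_description": "COc(cc1)ccc1C#N",      # a SMILES string or a path to an .sdf/.mol2 file
        "protein_sequence": "",                       # only used when protein_path is empty
    },
]
pd.DataFrame(rows).to_csv("my_protein_ligand.csv", index=False)

# Then, per the README: python -m inference --protein_ligand_csv my_protein_ligand.csv ...
```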
